[importer] Introduce new Importer module with separate Configs, API Endpoints, and Dependencies #4089

Merged
merged 29 commits into from
May 30, 2025
Commits
75defec
[importer] Add new component and API endpoint with new directory stru…
Harshg999 Apr 1, 2025
935ca5d
[importer] Implement file upload API for CSV and Excel formats with v…
Harshg999 Apr 8, 2025
1ce2856
Refactor importer API: remove unused import and delete obsolete templ…
Harshg999 Apr 8, 2025
c20623e
Refactors file format detection and metadata extraction
Harshg999 Apr 25, 2025
db2a4ec
Add file metadata detection and update dependencies
Harshg999 Apr 25, 2025
6ba570f
Refactors file upload API for better separation of concerns
Harshg999 Apr 29, 2025
5552e78
Refactor file metadata detection API and improve efficiency
Harshg999 Apr 29, 2025
b901255
Improves file metadata extraction and error handling
Harshg999 Apr 30, 2025
48f07aa
Improves file type detection with graceful magic lib fallback
Harshg999 Apr 30, 2025
c7a2a11
Adds file preview API for data import functionality
Harshg999 May 5, 2025
4810c36
Add coerce_bool handling for has_header parameter
Harshg999 May 13, 2025
c9a2760
Enhance field separator handling in preview_file API with unicode dec…
Harshg999 May 20, 2025
880fbb5
Refactor file metadata handling by introducing GuessFileMetadataSeria…
Harshg999 May 21, 2025
e1af698
Refactor file preview functionality by introducing PreviewFileSeriali…
Harshg999 May 21, 2025
70e7fd7
Add has_header field to PreviewFileSerializer for explicit header det…
Harshg999 May 21, 2025
307fe3f
Uncomment old code
Harshg999 May 21, 2025
c312738
Change variable name to uploaded_file
Harshg999 May 21, 2025
3732baf
fix docstring
Harshg999 May 21, 2025
4d163db
Remove unnecessary blank line before api_error_handler function
Harshg999 May 21, 2025
06b5d0a
Refactor comments
Harshg999 May 22, 2025
323fd19
Add configurable importer restrictions and settings
Harshg999 May 27, 2025
1d4ff77
Improves Excel sheet name extraction performance
Harshg999 May 27, 2025
6baa522
Add API and logic for file header row detection
Harshg999 May 28, 2025
590f772
Add API for mapping Polars types to SQL types
Harshg999 May 28, 2025
708e75a
Sort req packages
Harshg999 May 28, 2025
5a9832b
Add unit tests
Harshg999 May 29, 2025
e235b65
Refactor import statements for better organization and clarity
Harshg999 May 29, 2025
dba5f8f
Remove redundant validate methods from GuessFileMetadataSerializer an…
Harshg999 May 30, 2025
a023370
Change "data file importer" to just "importer"
Harshg999 May 30, 2025
13 changes: 13 additions & 0 deletions desktop/conf.dist/hue.ini
@@ -1064,6 +1064,19 @@ tls=no
## Enable integration with Google Storage for RAZ
# is_raz_gs_enabled=false

## Configuration options for the importer
# ------------------------------------------------------------------------
[[importer]]
# Turns on the data importer functionality
## is_enabled=false

# A limit on the local file size (bytes) that can be uploaded through the importer. The default is 157286400 bytes (150 MiB).
## max_local_file_size_upload_limit=157286400

# Security setting to specify local file extensions that are not allowed to be uploaded through the importer.
# Provide a comma-separated list of extensions including the dot (e.g., ".exe, .zip, .rar, .tar, .gz").
## restrict_local_file_extensions=.exe, .zip, .rar, .tar, .gz

###########################################################################
# Settings to configure the snippets available in the Notebook
###########################################################################
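To make the two settings above concrete, here is a minimal sketch of how a comma-separated value like the one shown for `restrict_local_file_extensions` could be normalized, together with the arithmetic behind the 157286400 default. The helper name `parse_extensions` is hypothetical and not part of this PR; Hue's own parsing (the option is declared with `coerce_csv`) may treat whitespace differently.

```python
# Hypothetical helper, not Hue's own parser: normalize a comma-separated
# extension list such as the example given for restrict_local_file_extensions.
def parse_extensions(raw):
    if not raw:
        return []
    # Strip surrounding whitespace and lowercase each entry; skip empty parts.
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


# 150 MiB expressed in bytes, matching the documented default.
DEFAULT_LIMIT_BYTES = 150 * 1024 * 1024

extensions = parse_extensions(".exe, .zip, .rar, .tar, .gz")
print(extensions)
print(DEFAULT_LIMIT_BYTES)
```

Without the `strip()` call, every entry after the first would keep its leading space, so a consumer of the raw split would need to account for that.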
13 changes: 13 additions & 0 deletions desktop/conf/pseudo-distributed.ini.tmpl
@@ -1049,6 +1049,19 @@
## Enable integration with Google Storage for RAZ
# is_raz_gs_enabled=false

## Configuration options for the importer
# ------------------------------------------------------------------------
[[importer]]
# Turns on the data importer functionality
## is_enabled=false

# A limit on the local file size (bytes) that can be uploaded through the importer. The default is 157286400 bytes (150 MiB).
## max_local_file_size_upload_limit=157286400

# Security setting to specify local file extensions that are not allowed to be uploaded through the importer.
# Provide a comma-separated list of extensions including the dot (e.g., ".exe, .zip, .rar, .tar, .gz").
## restrict_local_file_extensions=.exe, .zip, .rar, .tar, .gz

###########################################################################
# Settings to configure the snippets available in the Notebook
###########################################################################
2 changes: 2 additions & 0 deletions desktop/core/base_requirements.txt
@@ -41,13 +41,15 @@ Mako==1.2.3
Markdown==3.7
openpyxl==3.0.9
phoenixdb==1.2.1
polars[calamine]==1.8.2 # Python >= 3.8
prompt-toolkit==3.0.39
protobuf==3.20.3
pyarrow==17.0.0
pyformance==0.3.2
python-dateutil==2.8.2
python-daemon==2.2.4
python-ldap==3.4.3
python-magic==0.4.27
python-oauth2==1.1.0
python-pam==2.0.2
pytidylib==0.3.2
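Commit 48f07aa above mentions a "graceful magic lib fallback" for file type detection, and this hunk adds `python-magic` as a dependency. The implementation itself is not shown in this section, so the following is only a sketch of what such a fallback might look like. The function name `guess_mime_type` and the choice of the standard-library `mimetypes` module as the fallback are assumptions, not the PR's code.

```python
import mimetypes

try:
    # python-magic needs the libmagic system library; the import fails
    # with ImportError when that library cannot be found.
    import magic
except ImportError:
    magic = None


def guess_mime_type(filename, head):
    """Hypothetical helper: prefer content sniffing, fall back to the file name."""
    if magic is not None:
        try:
            return magic.from_buffer(head, mime=True)
        except Exception:
            pass  # fall through to the extension-based guess
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


print(guess_mime_type("data.csv", b"id,name\n1,alice\n"))
```

The printed value depends on whether libmagic is available and on its version, so no particular output is claimed here.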
4 changes: 3 additions & 1 deletion desktop/core/generate_requirements.py
@@ -44,7 +44,6 @@ def __init__(self):
]

self.requirements = [
"setuptools==70.0.0",
"apache-ranger==0.0.3",
"asn1crypto==0.24.0",
"avro-python3==1.8.2",
@@ -88,6 +87,7 @@ def __init__(self):
"Mako==1.2.3",
"openpyxl==3.0.9",
"phoenixdb==1.2.1",
"polars[calamine]==1.8.2", # Python >= 3.8
"prompt-toolkit==3.0.39",
"protobuf==3.20.3",
"psutil==5.8.0",
@@ -97,6 +97,7 @@ def __init__(self):
"python-daemon==2.2.4",
"python-dateutil==2.8.2",
"python-ldap==3.4.3",
"python-magic==0.4.27",
"python-oauth2==1.1.0",
"python-pam==2.0.2",
"pytidylib==0.3.2",
@@ -107,6 +108,7 @@ def __init__(self):
"requests-kerberos==0.14.0",
"rsa==4.7.2",
"ruff==0.11.10",
"setuptools==70.0.0",
"six==1.16.0",
"slack-sdk==3.31.0",
"SQLAlchemy==1.3.8",
75 changes: 49 additions & 26 deletions desktop/core/src/desktop/api2.py
@@ -15,20 +15,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import re
import json
import logging
import zipfile
import os
import re
import tempfile
import zipfile
from builtins import map
from datetime import datetime
from io import StringIO as string_io

from celery.app.control import Control
from django.core import management
from django.db import transaction
from django.http import HttpResponse, JsonResponse
from django.http import HttpResponse
from django.shortcuts import redirect
from django.utils.html import escape
from django.utils.translation import gettext as _
@@ -48,10 +48,11 @@
ENABLE_NEW_STORAGE_BROWSER,
ENABLE_SHARING,
ENABLE_WORKFLOW_CREATION_ACTION,
TASK_SERVER_V2,
get_clusters,
IMPORTER,
TASK_SERVER_V2,
)
from desktop.lib.conf import GLOBAL_CONFIG, BoundContainer, is_anonymous
from desktop.lib.conf import BoundContainer, GLOBAL_CONFIG, is_anonymous
from desktop.lib.connectors.models import Connector
from desktop.lib.django_util import JsonResponse, login_notrequired, render
from desktop.lib.exceptions_renderable import PopupException
@@ -60,16 +61,16 @@
from desktop.lib.paths import get_desktop_root
from desktop.log import DEFAULT_LOG_DIR
from desktop.models import (
__paginate,
_get_gist_document,
Directory,
Document,
Document2,
FilesystemException,
UserPreferences,
__paginate,
_get_gist_document,
get_cluster_config,
get_user_preferences,
set_user_preferences,
UserPreferences,
uuid_default,
)
from desktop.views import _get_config_errors, get_banner_message, serve_403_error
@@ -91,7 +92,7 @@
search_entities_interactive as metadata_search_entities_interactive,
)
from metadata.conf import has_catalog
from notebook.connectors.base import Notebook, get_interpreter
from notebook.connectors.base import get_interpreter, Notebook
from notebook.management.commands import notebook_setup
from pig.management.commands import pig_setup
from search.management.commands import search_setup
@@ -132,19 +133,41 @@ def get_banners(request):

@api_error_handler
def get_config(request):
"""
Returns Hue application's config information.
Includes settings for various components like storage, task server, importer, etc.
"""
# Get base cluster configuration
config = get_cluster_config(request.user)
config['hue_config']['is_admin'] = is_admin(request.user)
config['hue_config']['is_yarn_enabled'] = is_yarn()
config['hue_config']['enable_task_server'] = TASK_SERVER_V2.ENABLED.get()
config['hue_config']['enable_workflow_creation_action'] = ENABLE_WORKFLOW_CREATION_ACTION.get()
config['storage_browser']['enable_chunked_file_upload'] = ENABLE_CHUNKED_FILE_UPLOADER.get()
config['storage_browser']['enable_new_storage_browser'] = ENABLE_NEW_STORAGE_BROWSER.get()
config['storage_browser']['restrict_file_extensions'] = RESTRICT_FILE_EXTENSIONS.get()
config['storage_browser']['concurrent_max_connection'] = CONCURRENT_MAX_CONNECTIONS.get()
config['storage_browser']['file_upload_chunk_size'] = FILE_UPLOAD_CHUNK_SIZE.get()
config['storage_browser']['enable_file_download_button'] = SHOW_DOWNLOAD_BUTTON.get()
config['storage_browser']['max_file_editor_size'] = MAX_FILEEDITOR_SIZE
config['storage_browser']['enable_extract_uploaded_archive'] = ENABLE_EXTRACT_UPLOADED_ARCHIVE.get()

# Core application configuration
config['hue_config'] = {
'is_admin': is_admin(request.user),
'is_yarn_enabled': is_yarn(),
'enable_task_server': TASK_SERVER_V2.ENABLED.get(),
'enable_workflow_creation_action': ENABLE_WORKFLOW_CREATION_ACTION.get(),
}

# Storage browser configuration
config['storage_browser'] = {
'enable_chunked_file_upload': ENABLE_CHUNKED_FILE_UPLOADER.get(),
'enable_new_storage_browser': ENABLE_NEW_STORAGE_BROWSER.get(),
'restrict_file_extensions': RESTRICT_FILE_EXTENSIONS.get(),
'concurrent_max_connection': CONCURRENT_MAX_CONNECTIONS.get(),
'file_upload_chunk_size': FILE_UPLOAD_CHUNK_SIZE.get(),
'enable_file_download_button': SHOW_DOWNLOAD_BUTTON.get(),
'max_file_editor_size': MAX_FILEEDITOR_SIZE,
'enable_extract_uploaded_archive': ENABLE_EXTRACT_UPLOADED_ARCHIVE.get(),
}

# Importer configuration
config['importer'] = {
'is_enabled': IMPORTER.IS_ENABLED.get(),
'restrict_local_file_extensions': IMPORTER.RESTRICT_LOCAL_FILE_EXTENSIONS.get(),
'max_local_file_size_upload_limit': IMPORTER.MAX_LOCAL_FILE_SIZE_UPLOAD_LIMIT.get(),
}

# Other general configuration
config['clusters'] = list(get_clusters(request.user).values())
config['documents'] = {'types': list(Document2.objects.documents(user=request.user).order_by().values_list('type', flat=True).distinct())}
config['status'] = 0
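One behavioral difference in this hunk is worth spelling out. The removed lines set individual keys on the existing `config['hue_config']` and `config['storage_browser']` dicts, while the added lines assign a fresh dict to each key. The sketch below illustrates the difference with an invented stand-in for `get_cluster_config()`. The diff does not show what that function returns, so whether any real keys are affected is not established here.

```python
# Stand-in for get_cluster_config(); the key "preexisting_key" is invented.
def fake_cluster_config():
    return {"hue_config": {"preexisting_key": True}}


# Pattern A (removed lines): set keys on the existing nested dict.
config_a = fake_cluster_config()
config_a["hue_config"]["is_admin"] = False

# Pattern B (added lines): assign a fresh dict literal to the key.
config_b = fake_cluster_config()
config_b["hue_config"] = {"is_admin": False}

# Pattern C (alternative): keep the literal for readability, merge via update().
config_c = fake_cluster_config()
config_c["hue_config"].update({"is_admin": False})

print(sorted(config_a["hue_config"]))  # ['is_admin', 'preexisting_key']
print(sorted(config_b["hue_config"]))  # ['is_admin']
print(sorted(config_c["hue_config"]))  # ['is_admin', 'preexisting_key']
```

If the base configuration carries no other keys under these sections, patterns A and B are equivalent. If it does, pattern C keeps the grouped layout without discarding them.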
@@ -624,7 +647,7 @@ def copy_document(request):

# Import workspace for all oozie jobs
if document.type == 'oozie-workflow2' or document.type == 'oozie-bundle2' or document.type == 'oozie-coordinator2':
from oozie.models2 import Bundle, Coordinator, Workflow, _import_workspace
from oozie.models2 import _import_workspace, Bundle, Coordinator, Workflow
# Update the name field in the json 'data' field
if document.type == 'oozie-workflow2':
workflow = Workflow(document=document)
@@ -998,7 +1021,7 @@ def is_reserved_directory(doc):
documents = json.loads(request.POST.get('documents'))

documents = json.loads(documents)
except ValueError as e:
except ValueError:
raise PopupException(_('Failed to import documents, the file does not contain valid JSON.'))

# Validate documents
@@ -1112,7 +1135,7 @@ def gist_create(request):
statement = request.POST.get('statement', '')
gist_type = request.POST.get('doc_type', 'hive')
name = request.POST.get('name', '')
description = request.POST.get('description', '')
_ = request.POST.get('description', '')

response = _gist_create(request.get_host(), request.is_secure(), request.user, statement, gist_type, name)

@@ -1333,7 +1356,7 @@ def _create_or_update_document_with_owner(doc, owner, uuids_map):
doc['pk'] = existing_doc.pk
else:
create_new = True
except FilesystemException as e:
except FilesystemException:
create_new = True

if create_new:
9 changes: 9 additions & 0 deletions desktop/core/src/desktop/api_public_urls_v1.py
@@ -19,6 +19,7 @@

from desktop import api_public
from desktop.lib.botserver import api as botserver_api
from desktop.lib.importer import api as importer_api

# "New" query API (i.e. connector based, lean arguments).
# e.g. https://demo.gethue.com/api/query/execute/hive
@@ -158,6 +159,14 @@
re_path(r'^indexer/importer/submit', api_public.importer_submit, name='indexer_importer_submit'),
]

urlpatterns += [
re_path(r'^importer/upload/file/?$', importer_api.local_file_upload, name='importer_local_file_upload'),
re_path(r'^importer/file/guess_metadata/?$', importer_api.guess_file_metadata, name='importer_guess_file_metadata'),
re_path(r'^importer/file/guess_header/?$', importer_api.guess_file_header, name='importer_guess_file_header'),
re_path(r'^importer/file/preview/?$', importer_api.preview_file, name='importer_preview_file'),
re_path(r'^importer/sql_type_mapping/?$', importer_api.get_sql_type_mapping, name='importer_get_sql_type_mapping'),
]

urlpatterns += [
re_path(r'^connector/types/?$', api_public.get_connector_types, name='connector_get_types'),
re_path(r'^connector/instances/?$', api_public.get_connectors_instances, name='connector_get_instances'),
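Each of the five importer routes added above ends in `/?$`, so it accepts the path with or without a trailing slash and rejects anything longer. The sketch below checks this with the standard `re` module, using the patterns and route names copied from the hunk. The `resolve` function is a hypothetical, simplified stand-in for Django's URL resolver, and the prefix under which these patterns are mounted is not shown in this section.

```python
import re

# Patterns and names copied from the urlpatterns block above.
IMPORTER_PATTERNS = {
    "importer_local_file_upload": r"^importer/upload/file/?$",
    "importer_guess_file_metadata": r"^importer/file/guess_metadata/?$",
    "importer_guess_file_header": r"^importer/file/guess_header/?$",
    "importer_preview_file": r"^importer/file/preview/?$",
    "importer_get_sql_type_mapping": r"^importer/sql_type_mapping/?$",
}


def resolve(path):
    """Hypothetical resolver: return the first route name whose pattern matches."""
    for name, pattern in IMPORTER_PATTERNS.items():
        if re.search(pattern, path):
            return name
    return None


print(resolve("importer/file/preview"))        # importer_preview_file
print(resolve("importer/file/preview/"))       # importer_preview_file
print(resolve("importer/file/preview/extra"))  # None
```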
47 changes: 38 additions & 9 deletions desktop/core/src/desktop/conf.py
@@ -16,30 +16,30 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
import datetime
import glob
import stat
import socket
import logging
import datetime
import os
import socket
import stat
import sys
from collections import OrderedDict

from django.db import connection
from django.utils.translation import gettext_lazy as _

from desktop import appmanager
from desktop.lib.conf import (
Config,
ConfigSection,
UnspecifiedConfigSection,
coerce_bool,
coerce_csv,
coerce_json_dict,
coerce_password_from_script,
coerce_str_lowercase,
coerce_string,
Config,
ConfigSection,
list_of_compiled_res,
UnspecifiedConfigSection,
validate_path,
)
from desktop.lib.i18n import force_unicode
@@ -106,7 +106,7 @@ def get_dn(fqdn=None):
else:
LOG.warning("allowed_hosts value to '*'. It is a security risk")
val.append('*')
except Exception as e:
except Exception:
LOG.warning("allowed_hosts value to '*'. It is a security risk")
val.append('*')
return val
@@ -2952,3 +2952,32 @@ def is_ofs_enabled():
def has_ofs_access(user):
from desktop.auth.backend import is_admin
return user.is_authenticated and user.is_active and (is_admin(user) or user.has_hue_permission(action="ofs_access", app="filebrowser"))


IMPORTER = ConfigSection(
key='importer',
help=_("""Configuration options for the importer."""),
members=dict(
IS_ENABLED=Config(
key='is_enabled',
help=_('Enable or disable the new importer functionality'),
type=coerce_bool,
default=False,
),
RESTRICT_LOCAL_FILE_EXTENSIONS=Config(
key='restrict_local_file_extensions',
default=None,
type=coerce_csv,
help=_(
'Security setting to specify local file extensions that are not allowed to be uploaded through the importer. '
'Provide a comma-separated list of extensions including the dot (e.g., ".exe, .zip, .rar, .tar, .gz").'
),
),
MAX_LOCAL_FILE_SIZE_UPLOAD_LIMIT=Config(
key="max_local_file_size_upload_limit",
default=157286400, # 150 MiB
type=int,
help=_('Maximum local file size (in bytes) that users can upload through the importer. The default is 157286400 bytes (150 MiB).'),
),
),
)
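Note that `RESTRICT_LOCAL_FILE_EXTENSIONS` defaults to `None`, so no extension is blocked unless an administrator configures the list. The sketch below shows how an upload check could consume the two limits defined above. The function `check_local_upload` is hypothetical, and treating a file of exactly the limit size as allowed is an assumption. The importer's actual validation code is not shown in this section.

```python
import os

DEFAULT_MAX_BYTES = 157286400  # 150 MiB, the default declared above


def check_local_upload(filename, size_bytes, restricted_extensions=None,
                       max_size_bytes=DEFAULT_MAX_BYTES):
    """Hypothetical check: return an error message, or None if the upload is allowed."""
    # Only the final extension is considered, e.g. ".gz" for "dump.tar.gz".
    _, ext = os.path.splitext(filename)
    blocked = {e.strip().lower() for e in (restricted_extensions or [])}
    if ext.lower() in blocked:
        return f"File type '{ext}' is not allowed."
    # Assumption: a file of exactly max_size_bytes is still accepted.
    if size_bytes > max_size_bytes:
        return f"File exceeds the {max_size_bytes}-byte upload limit."
    return None


print(check_local_upload("sales.csv", 1024))                      # None
print(check_local_upload("tool.EXE", 1024, [".exe", " .zip"]))    # File type '.EXE' is not allowed.
print(check_local_upload("big.csv", DEFAULT_MAX_BYTES + 1))       # File exceeds the 157286400-byte upload limit.
```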
16 changes: 16 additions & 0 deletions desktop/core/src/desktop/lib/importer/__init__.py
@@ -0,0 +1,16 @@
#!/usr/bin/env python
# Licensed to Cloudera, Inc. under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. Cloudera, Inc. licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.