[ui-importer] Public API integration #4137


Open
wants to merge 54 commits into base: master
Changes from 1 commit
cb8883d
[importer] Add new component and API endpoint with new directory stru…
Harshg999 Apr 1, 2025
e763d4c
[importer] Implement file upload API for CSV and Excel formats with v…
Harshg999 Apr 8, 2025
da13af9
Refactor importer API: remove unused import and delete obsolete templ…
Harshg999 Apr 8, 2025
25527a7
Refactors file format detection and metadata extraction
Harshg999 Apr 25, 2025
0b53a51
Add file metadata detection and update dependencies
Harshg999 Apr 25, 2025
355b7a4
Refactors file upload API for better separation of concerns
Harshg999 Apr 29, 2025
3ec9c7b
Refactor file metadata detection API and improve efficiency
Harshg999 Apr 29, 2025
5298263
Improves file metadata extraction and error handling
Harshg999 Apr 30, 2025
4749bc8
Improves file type detection with graceful magic lib fallback
Harshg999 Apr 30, 2025
92bb7d1
Adds file preview API for data import functionality
Harshg999 May 5, 2025
eb1e590
Merge branch 'master' of github.com:cloudera/hue into new-importer-wo…
ramprasadagarwal May 6, 2025
58a40d7
[ui-importer] Public API integration
ramprasadagarwal May 6, 2025
fb34811
[importer] Add new component and API endpoint with new directory stru…
Harshg999 Apr 1, 2025
bad25d1
[importer] Implement file upload API for CSV and Excel formats with v…
Harshg999 Apr 8, 2025
45cb6b7
Refactor importer API: remove unused import and delete obsolete templ…
Harshg999 Apr 8, 2025
5000db1
Refactors file format detection and metadata extraction
Harshg999 Apr 25, 2025
fd931d2
Add file metadata detection and update dependencies
Harshg999 Apr 25, 2025
8ddcea0
Refactors file upload API for better separation of concerns
Harshg999 Apr 29, 2025
a72b3b8
Refactor file metadata detection API and improve efficiency
Harshg999 Apr 29, 2025
30235e4
Improves file metadata extraction and error handling
Harshg999 Apr 30, 2025
d2552ad
Improves file type detection with graceful magic lib fallback
Harshg999 Apr 30, 2025
5ed73c2
Adds file preview API for data import functionality
Harshg999 May 5, 2025
649b789
Merge branch 'new-importer-working-dir' of github.com:cloudera/hue in…
ramprasadagarwal May 6, 2025
7b35fc8
[importer] Add new component and API endpoint with new directory stru…
Harshg999 Apr 1, 2025
f562dc9
[importer] Implement file upload API for CSV and Excel formats with v…
Harshg999 Apr 8, 2025
d7a3037
Refactor importer API: remove unused import and delete obsolete templ…
Harshg999 Apr 8, 2025
faf36c5
Refactors file format detection and metadata extraction
Harshg999 Apr 25, 2025
d2d3c81
Add file metadata detection and update dependencies
Harshg999 Apr 25, 2025
ad8ba89
Refactors file upload API for better separation of concerns
Harshg999 Apr 29, 2025
31a75f1
Refactor file metadata detection API and improve efficiency
Harshg999 Apr 29, 2025
938cee7
Improves file metadata extraction and error handling
Harshg999 Apr 30, 2025
6fceede
Improves file type detection with graceful magic lib fallback
Harshg999 Apr 30, 2025
04bcf00
Adds file preview API for data import functionality
Harshg999 May 5, 2025
f6438be
fix the api integration
ramprasadagarwal May 12, 2025
e316224
Merge branch 'new-importer-working-dir' of github.com:cloudera/hue in…
ramprasadagarwal May 12, 2025
22a4865
Merge branch 'master' of github.com:cloudera/hue into feat/importer-6
ramprasadagarwal Jun 6, 2025
8f843fb
revert extra changes
ramprasadagarwal Jun 6, 2025
cb8e35c
[importer] Refactor file format handling and add support for guessing…
ramprasadagarwal Jun 7, 2025
a37558e
[importer] Update API constants for file guessing and preview URLs
ramprasadagarwal Jun 10, 2025
68a2df2
Merge branch 'master' of github.com:cloudera/hue into feat/importer-6
ramprasadagarwal Jun 10, 2025
e1e2aa8
[importer] Enhance file format handling and update tests for EXCEL su…
ramprasadagarwal Jun 12, 2025
1ab4200
[test] Update test description for non-EXCEL file type in SourceConfi…
ramprasadagarwal Jun 12, 2025
55ba6fa
[test] Enhance tests for ImporterFilePreview and SourceConfiguration …
ramprasadagarwal Jun 24, 2025
324cd43
Merge branch 'master' of github.com:cloudera/hue into feat/importer-6
ramprasadagarwal Jun 24, 2025
457ec6d
[importer] fix the getDefaultTableName function
ramprasadagarwal Jun 25, 2025
a6b28be
[test] Refactor ImporterFilePreview tests to use act for rendering an…
ramprasadagarwal Jun 25, 2025
2725311
Merge branch 'master' of github.com:cloudera/hue into feat/importer-6
ramprasadagarwal Jun 25, 2025
38e1083
lint fix
ramprasadagarwal Jun 25, 2025
bfacaee
remove hardcoded defaultDialect
ramprasadagarwal Jun 26, 2025
5cf00fb
fix the tests mocked url
ramprasadagarwal Jun 26, 2025
e897946
Merge branch 'master' of github.com:cloudera/hue into feat/importer-6
ramprasadagarwal Jun 26, 2025
6884980
Merge branch 'master' of github.com:cloudera/hue into feat/importer-6
ramprasadagarwal Jun 27, 2025
44ae88d
fix test
ramprasadagarwal Jun 27, 2025
14f9248
Merge branch 'master' of github.com:cloudera/hue into feat/importer-6
ramprasadagarwal Jun 27, 2025
[importer] Implement file upload API for CSV and Excel formats with validation
Harshg999 committed Apr 25, 2025
commit e763d4ccf6474d72e7b9e89d7f9bf601f7853f9e
1 change: 1 addition & 0 deletions desktop/core/base_requirements.txt
Original file line number Diff line number Diff line change
@@ -40,6 +40,7 @@ Mako==1.2.3
Markdown==3.7
openpyxl==3.0.9
phoenixdb==1.2.1
polars[calamine]==1.8.2 # Python >= 3.8
prompt-toolkit==3.0.39
protobuf==3.20.3
pyarrow==17.0.0
5 changes: 0 additions & 5 deletions desktop/core/src/desktop/api_public.py
@@ -445,11 +445,6 @@ def taskserver_get_available_space_for_upload(request):

# Importer

@api_view(["GET"])
def render_new_importer(request):
  django_request = get_django_request(request)
  return importer_api.render_new_importer(django_request)


@api_view(["POST"])
def guess_format(request):
3 changes: 2 additions & 1 deletion desktop/core/src/desktop/api_public_urls_v1.py
@@ -19,6 +19,7 @@

from desktop import api_public
from desktop.lib.botserver import api as botserver_api
from desktop.lib.importer import api as importer_api

# "New" query API (i.e. connector based, lean arguments).
# e.g. https://demo.gethue.com/api/query/execute/hive
@@ -159,7 +160,7 @@
]

urlpatterns += [
  re_path(r'^importer/new/?$', api_public.render_new_importer, name='importer_render_new_component'),
  re_path(r'^importer/upload/file', importer_api.upload_local_file, name='importer_upload_local_file'),
]

urlpatterns += [
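One detail worth noting in the routing change: the removed `importer/new` route was anchored with `/?$`, while the new upload route has no terminating anchor. A standalone sketch with Python's `re` module (not Hue code, just the pattern from this hunk) shows the consequence:

```python
import re

# Pattern as added in api_public_urls_v1.py in this commit.
upload_pattern = re.compile(r'^importer/upload/file')

# Matches the intended path...
assert upload_pattern.match('importer/upload/file') is not None
# ...but, lacking a trailing '/?$', it also matches any suffix.
assert upload_pattern.match('importer/upload/file/extra') is not None
# Unrelated paths still fail to match.
assert upload_pattern.match('importer/upload') is None
```

Whether the looser match is intentional here is unclear; adding `/?$` would mirror the convention of the route being removed.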
250 changes: 247 additions & 3 deletions desktop/core/src/desktop/lib/importer/api.py
@@ -14,12 +14,256 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import csv
import uuid
import logging
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union

from desktop.lib.django_util import render
import polars as pl
from rest_framework import status
from rest_framework.decorators import api_view, parser_classes
from rest_framework.parsers import JSONParser, MultiPartParser
from rest_framework.request import Request
from rest_framework.response import Response

from desktop.lib.importer.serializers import LocalFileUploadSerializer

LOG = logging.getLogger()


def render_new_importer(request):
  return render('new_importer.mako', request, None)
# @dataclass
# class FileFormat:
#   """Data class representing file format configuration"""

#   type: str
#   has_header: bool
#   quote_char: str = '"'
#   record_separator: str = '\\n'
#   field_separator: str = ','


# class FormatDetector:
#   """Service class for detecting file formats"""

#   SAMPLE_SIZE = 16384  # 16KB for sampling
#   MIN_LINES = 5

#   def __init__(self, content: bytes, filename: str):
#     self.content = content
#     self.filename = filename
#     self._sample = self._get_sample()

#   def _get_sample(self) -> str:
#     """Get a sample of file content for format detection"""
#     try:
#       return self.content[: self.SAMPLE_SIZE].decode('utf-8')
#     except UnicodeDecodeError:
#       return self.content[: self.SAMPLE_SIZE].decode('latin-1')

#   def detect(self) -> FileFormat:
#     """Detect file format based on content and filename"""
#     if self._is_excel_file():
#       return self._get_excel_format()

#     return self._detect_delimited_format()

#   def _is_excel_file(self) -> bool:
#     """Check if file is Excel based on extension"""
#     return self.filename.lower().endswith(('.xlsx', '.xls'))

#   def _get_excel_format(self) -> FileFormat:
#     """Return Excel file format configuration"""
#     return FileFormat(type='excel', has_header=True)

#   def _detect_delimited_format(self) -> FileFormat:
#     """Detect format for delimited files like CSV"""
#     dialect = self._sniff_csv_dialect()

#     return FileFormat(
#       type='csv',
#       has_header=self._detect_header(),
#       quote_char=dialect.quotechar,
#       record_separator='\\n',  # Using standard newline
#       field_separator=dialect.delimiter,
#     )

#   def _sniff_csv_dialect(self) -> csv.Dialect:
#     """Detect CSV dialect using csv.Sniffer"""
#     try:
#       return csv.Sniffer().sniff(self._sample)
#     except csv.Error:
#       # Fallback to standard CSV format
#       return csv.excel

#   def _detect_header(self) -> bool:
#     """Detect if file has headers"""
#     try:
#       return csv.Sniffer().has_header(self._sample)
#     except csv.Error:
#       # Default to True if detection fails
#       return True


# @api_view(['POST'])
# @parser_classes([JSONParser, MultiPartParser])
# def detect_format(request: Request) -> Response:
#   """
#   Detects and returns the format configuration for input files/data sources.

#   Args:
#     request: REST framework Request object containing either:
#       - fileFormat: Dict with file details for HDFS files
#       - file: Uploaded file for local files

#   Returns:
#     Response with format configuration:
#       - type: Detected format type (csv, excel)
#       - hasHeader: Boolean indicating header presence
#       - fieldSeparator: Field delimiter for CSV
#       - recordSeparator: Record separator
#       - quoteChar: Quote character
#       - status: Operation status code

#   Raises:
#     400: Bad Request if file format/content cannot be processed
#   """
#   try:
#     if 'fileFormat' in request.data:
#       return _handle_hdfs_file(request)
#     elif 'file' in request.FILES:
#       return _handle_uploaded_file(request.FILES['file'])
#     else:
#       return Response({'error': 'No file or file format provided'}, status=status.HTTP_400_BAD_REQUEST)
#   except Exception as e:
#     return Response({'error': str(e)}, status=status.HTTP_400_BAD_REQUEST)


# def _handle_uploaded_file(file) -> Response:
#   """Handle format detection for uploaded files"""
#   detector = FormatDetector(content=file.read(), filename=file.name)
#   file_format = detector.detect()

#   return Response(
#     {
#       'type': file_format.type,
#       'hasHeader': file_format.has_header,
#       'quoteChar': file_format.quote_char,
#       'recordSeparator': file_format.record_separator,
#       'fieldSeparator': file_format.field_separator,
#       'status': 0,
#     }
#   )


# def _handle_hdfs_file(request: Request) -> Response:
#   """Handle format detection for HDFS files"""
#   file_format = request.data.get('fileFormat', {})
#   path = file_format.get('path')

#   if not path:
#     return Response({'error': 'No path provided'}, status=status.HTTP_400_BAD_REQUEST)

#   if not request.fs.isfile(path):
#     return Response({'error': f'Path {path} is not a file'}, status=status.HTTP_400_BAD_REQUEST)

#   with request.fs.open(path) as stream:
#     detector = FormatDetector(content=stream.read(FormatDetector.SAMPLE_SIZE), filename=path)
#     file_format = detector.detect()

#   return Response(
#     {
#       'type': file_format.type,
#       'hasHeader': file_format.has_header,
#       'quoteChar': file_format.quote_char,
#       'recordSeparator': file_format.record_separator,
#       'fieldSeparator': file_format.field_separator,
#       'status': 0,
#     }
#   )

@api_view(['POST'])
@parser_classes([MultiPartParser])
def upload_local_file(request: Request) -> Response:
  """
  Upload and process a CSV or Excel file, converting it to CSV format if needed.

  Returns the stored file path and metadata.
  """
  # Validate the request data using the serializer
  serializer = LocalFileUploadSerializer(data=request.data)

  if not serializer.is_valid():
    return Response(serializer.errors, status=status.HTTP_400_BAD_REQUEST)

  try:
    upload_file = serializer.validated_data['file']
    file_extension = Path(upload_file.name).suffix.lower()[1:]

    # Generate a unique filename
    username = request.user.username
    safe_original_name = re.sub(r'[^0-9a-zA-Z]+', '_', upload_file.name)
    unique_id = uuid.uuid4().hex[:8]

    filename = f"{username}_{unique_id}_{safe_original_name}"

    # Process the file based on its type
    result = process_uploaded_file(upload_file, filename, file_extension)

    return Response(result, status=status.HTTP_201_CREATED)

  except Exception as e:
    return Response({"error": f"Error processing file: {str(e)}"}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)


def process_uploaded_file(upload_file, filename: str, file_extension: str) -> Dict[str, Any]:
  """
  Process the uploaded file and convert to CSV if needed.

  Args:
    upload_file: The uploaded file object
    filename: The base filename to use
    file_extension: The file extension (csv, xlsx, xls)

  Returns:
    Dict containing file metadata
  """
  file_type = 'csv' if file_extension == 'csv' else 'excel'

  # Create a temporary file with our generated filename
  temp_dir = tempfile.gettempdir()
  output_path = os.path.join(temp_dir, f"{filename}.csv")

  try:
    if file_extension == 'csv':
      df = pl.read_csv(upload_file.read())
    else:
      # For Excel files, use Polars and its default Calamine engine to read and convert to CSV.
      # TODO: Currently reads the first sheet. Check if we need to support multiple sheets or specific sheets as input.
      df = pl.read_excel(upload_file.read())

    df.write_csv(output_path)

    # Return metadata about the processed file
    file_stats = os.stat(output_path)

    # TODO: Verify response fields
    return {
      'filename': os.path.basename(output_path),
      'file_path': output_path,
      'row_count': len(df),
      'column_count': len(df.columns),
      'file_size_bytes': file_stats.st_size,
      # 'file_type': file_type,
    }

  except Exception as e:
    # Clean up the file if there was an error
    if os.path.exists(output_path):
      os.remove(output_path)
    raise
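The unique-filename scheme in `upload_local_file` can be exercised on its own. A minimal sketch (the helper name `build_unique_filename` is illustrative, not part of this PR):

```python
import re
import uuid


def build_unique_filename(username: str, original_name: str) -> str:
  # Mirror upload_local_file: collapse non-alphanumerics to underscores,
  # then prefix the owner and a short uuid to avoid collisions in the
  # shared temp directory.
  safe_original_name = re.sub(r'[^0-9a-zA-Z]+', '_', original_name)
  unique_id = uuid.uuid4().hex[:8]
  return f"{username}_{unique_id}_{safe_original_name}"


name = build_unique_filename('demo', 'sales report.xlsx')
# e.g. 'demo_1a2b3c4d_sales_report_xlsx' -- note the original extension
# is flattened into the name; the real '.csv' suffix is appended later.
```

Because the extension's dot is also flattened to an underscore, the stored file always ends in exactly one `.csv`, regardless of the source format.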
43 changes: 43 additions & 0 deletions desktop/core/src/desktop/lib/importer/serializers.py
@@ -0,0 +1,43 @@
#!/usr/bin/env python
# Licensed to Cloudera, Inc. under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. Cloudera, Inc. licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from rest_framework import serializers


class LocalFileUploadSerializer(serializers.Serializer):
  """Serializer for file upload validation.

  This serializer validates that the uploaded file is present and has an
  acceptable file format and size.

  Attributes:
    file: File field that must be included in the request
  """

  file = serializers.FileField(required=True, help_text="CSV or Excel file to upload and process")

  def validate_file(self, value):
    # Add file format validation
    extension = value.name.split('.')[-1].lower()
    if extension not in ['csv', 'xlsx', 'xls']:
      raise serializers.ValidationError("Unsupported file format. Please upload a CSV or Excel file.")

    # TODO: Check upper limit for file size
    # Add file size validation (e.g., limit to 150 MiB)
    if value.size > 150 * 1024 * 1024:  # 150 MiB in bytes
      raise serializers.ValidationError("File too large. Maximum file size is 150 MiB.")

    return value
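Outside a DRF request cycle, the two checks `validate_file` performs can be sketched as a plain function (illustrative only; the real validation belongs in `LocalFileUploadSerializer`):

```python
MAX_SIZE = 150 * 1024 * 1024  # 150 MiB, matching the serializer


def check_upload(name: str, size: int) -> None:
  # Extension check mirrors validate_file: lowercase the last dot-separated
  # segment, then whitelist the three supported formats.
  extension = name.split('.')[-1].lower()
  if extension not in ('csv', 'xlsx', 'xls'):
    raise ValueError("Unsupported file format. Please upload a CSV or Excel file.")
  # Size check mirrors the 150 MiB cap.
  if size > MAX_SIZE:
    raise ValueError("File too large. Maximum file size is 150 MiB.")


check_upload('data.csv', 1024)          # passes
check_upload('report.XLSX', 10 * 1024)  # extension check is case-insensitive
```

One behavior worth flagging: because only the last segment is inspected, a name like `archive.csv.exe` is rejected while `archive.exe.csv` is accepted, which is why the upload path also re-sanitizes the filename before writing.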