[importer] Introduce new Importer module with separate Configs, API Endpoints, and Dependencies #4089

Merged
Harshg999 merged 29 commits into master from new-importer-working-dir on May 30, 2025

Conversation

Collaborator

@Harshg999 Harshg999 commented Apr 1, 2025

This pull request introduces a new "importer" feature, along with its configuration, API endpoints, and dependencies. The changes include adding the necessary configuration options, implementing the importer functionality, and updating related files to integrate the feature.

Importer Feature Implementation:

  • Configuration Options:

    • Added a new importer section in configuration files (desktop/conf.dist/hue.ini, desktop/conf/pseudo-distributed.ini.tmpl) to enable the importer, set file size limits, and restrict file extensions for uploads.
    • Introduced the IMPORTER configuration section in desktop/core/src/desktop/conf.py to manage importer settings programmatically.
  • API Endpoints:

    • Added several new endpoints in desktop/core/src/desktop/api_public_urls_v1.py to support importer operations, including file upload, metadata guessing, header detection, file preview, and SQL type mapping.
    • Implemented the corresponding API logic in desktop/core/src/desktop/lib/importer/api.py for handling these operations.

Dependency Updates:

  • Requirements:
    • Added polars[calamine]==1.8.2 and python-magic==0.4.27 to desktop/core/base_requirements.txt and desktop/core/generate_requirements.py to support file processing and type detection.

Integration and Refactoring:

  • API Integration:

    • Updated desktop/core/src/desktop/api2.py to include importer settings in the application's configuration response.
    • Imported the new importer API in desktop/core/src/desktop/api_public_urls_v1.py.
  • Codebase Organization:

    • Added __init__.py to desktop/core/src/desktop/lib/importer to initialize the importer module.

@Harshg999 Harshg999 self-assigned this Apr 1, 2025

github-actions bot commented Apr 1, 2025

✅ Test files were modified. Ensure that the tests cover all relevant changes. ✅

@Harshg999 Harshg999 marked this pull request as draft April 8, 2025 08:21
@Harshg999 Harshg999 force-pushed the new-importer-working-dir branch from 85a9c35 to 49f1b25 Compare April 8, 2025 08:23
@Harshg999 Harshg999 changed the title from "[importer] Add new component and API endpoint with new directory structure" to "[importer] Add new directory structure and public APIs" Apr 8, 2025
@Harshg999 Harshg999 marked this pull request as ready for review April 8, 2025 10:05
@Harshg999 Harshg999 force-pushed the new-importer-working-dir branch from 0923520 to 25527a7 Compare April 25, 2025 13:15
@Harshg999 Harshg999 force-pushed the new-importer-working-dir branch 2 times, most recently from 5ed73c2 to 04bcf00 Compare May 12, 2025 11:48
@Harshg999 Harshg999 force-pushed the new-importer-working-dir branch from 7771b36 to f0d3ecb Compare May 19, 2025 15:49
@Harshg999 Harshg999 requested a review from Copilot May 21, 2025 15:55
@Copilot Copilot AI left a comment

Pull Request Overview

Adds a new importer module with REST endpoints for file upload, metadata guessing, and previewing, while disabling legacy indexer format logic and updating dependencies.

  • Comments out deprecated table, query, rdbms, stream, and connector handling in api3.py.
  • Introduces LocalFileUploadSerializer, GuessFileMetadataSerializer, and PreviewFileSerializer plus corresponding endpoints and URL routes.
  • Pins new requirements polars[calamine] and python-magic.

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| desktop/libs/indexer/src/indexer/api3.py | Commented out large blocks of legacy format-handling logic |
| desktop/core/src/desktop/lib/importer/serializers.py | Added serializers for upload, metadata guess, and preview |
| desktop/core/src/desktop/lib/importer/api.py | Added upload_file, guess_file_metadata, and preview_file APIs |
| desktop/core/src/desktop/lib/importer/__init__.py | Initialized new importer package |
| desktop/core/src/desktop/api_public_urls_v1.py | Registered new importer API routes |
| desktop/core/generate_requirements.py | Added polars[calamine] and python-magic to requirements |
Files not reviewed (1)
  • desktop/core/base_requirements.txt: Language not supported
Comments suppressed due to low confidence (2)

desktop/core/src/desktop/lib/importer/api.py:77

  • [nitpick] The variable name upload_file shadows the function name and may be confusing. Consider renaming it to something like file_obj or uploaded_file.
upload_file = serializer.validated_data['file']

desktop/core/src/desktop/lib/importer/api.py:51

  • New endpoints (upload_file, guess_file_metadata, preview_file) currently lack automated unit or integration tests. Adding tests will help catch regressions and validate behavior.
@api_view(['POST'])

@Copilot Copilot AI left a comment

Pull Request Overview

Adds a new importer package with serializers, API endpoints, and URL routes to support local file upload, metadata guessing, and content preview. Key changes include:

  • New serializers (LocalFileUploadSerializer, GuessFileMetadataSerializer, PreviewFileSerializer) for validating file operations.
  • API views (upload_file, guess_file_metadata, preview_file) with error handling and routing.
  • Updated generate_requirements.py to include polars[calamine] and python-magic.

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| desktop/core/src/desktop/lib/importer/serializers.py | Adds serializers for upload, metadata guess, preview |
| desktop/core/src/desktop/lib/importer/api.py | Implements API endpoints and common error handler |
| desktop/core/src/desktop/lib/importer/__init__.py | Initializes importer package |
| desktop/core/src/desktop/api_public_urls_v1.py | Registers new importer routes |
| desktop/core/generate_requirements.py | Adds new dependencies (polars[calamine], python-magic) |
Files not reviewed (1)
  • desktop/core/base_requirements.txt: Language not supported
Comments suppressed due to low confidence (2)

desktop/core/src/desktop/lib/importer/serializers.py:20

  • There are no automated tests covering these new serializers. Consider adding unit tests for file-size validation, metadata guessing, and preview logic to ensure behavior remains correct.
class LocalFileUploadSerializer(serializers.Serializer):

desktop/core/src/desktop/lib/importer/api.py:1

  • [nitpick] Consider adding a module-level docstring below the license header to describe the purpose of this API module.
#!/usr/bin/env python

@Harshg999 Harshg999 changed the title from "[importer] Add new directory structure and public APIs" to "[importer] Introduce new Importer module with Local File Upload, Metadata Detection, and Preview APIs" May 21, 2025
Contributor

@JohanAhlen JohanAhlen left a comment

LGTM!

@Harshg999
Collaborator Author

Unit tests are work in progress

@Harshg999 Harshg999 requested a review from athithyaaselvam May 22, 2025 06:40
Collaborator

@bjornalm bjornalm left a comment

Nice work, see questions and comments. Can you also add unit tests to this PR?

Harshg999 added 17 commits May 29, 2025 01:03
Enhances file type detection reliability by:
- Adding proper error handling for python-magic import
- Removing dependency on file extensions for type detection
- Simplifying delimiter-based file format detection logic
- Improving error messaging when magic lib is unavailable

Makes file type detection more robust and maintainable by focusing on content analysis rather than extensions.
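
A minimal sketch of content-based detection, as described in the commit message above. The PR relies on python-magic; this stand-in uses only the standard library, and the helper name `guess_file_type`, the returned keys, and the candidate delimiter set are assumptions made for illustration, not the PR's actual code.

```python
import csv

ZIP_MAGIC = b"PK\x03\x04"  # .xlsx files are ZIP containers and start with this signature
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls compound-document signature


def guess_file_type(sample: bytes) -> dict:
    """Hypothetical helper: classify a file from its leading bytes, ignoring its name."""
    if sample.startswith(ZIP_MAGIC) or sample.startswith(OLE2_MAGIC):
        return {"type": "excel"}
    try:
        text = sample.decode("utf-8")
        # Let csv.Sniffer pick the most consistent delimiter among a few candidates.
        dialect = csv.Sniffer().sniff(text, delimiters=",\t;|")
    except (UnicodeDecodeError, csv.Error):
        return {"type": "unknown"}
    return {"type": "delimited", "field_separator": dialect.delimiter}
```

Because only the bytes are inspected, a spreadsheet renamed to `report.txt` would still be classified as Excel in this sketch.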
Implements new preview endpoint to support file imports with:
- Excel file preview with sheet selection
- Delimited file preview (CSV, TSV) with configurable separators
- SQL type mapping for multiple dialects (Hive, Impala, Trino, Phoenix, SparkSQL)
- Automatic header detection
- Error handling improvements for file processing

Enhances the importer API with robust data preview capabilities before import.
Ensures has_header parameter is properly coerced to boolean when explicitly provided for both Excel and delimited file previews.

Previously, boolean coercion was only done when auto-detecting headers through csv.Sniffer.
This change provides consistent boolean handling across all file preview scenarios.
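
The coercion described above could look roughly like the following sketch. The function name `resolve_has_header` and the set of accepted truthy strings are assumptions; only `csv.Sniffer` is named in the commit message.

```python
import csv


def resolve_has_header(sample: str, has_header=None) -> bool:
    """Hypothetical helper: always return a real bool for the header flag."""
    if has_header is None:
        # No explicit value was given, so fall back to auto-detection.
        try:
            return bool(csv.Sniffer().has_header(sample))
        except csv.Error:
            return False
    if isinstance(has_header, str):
        # Request parameters often arrive as strings such as "true" or "false".
        return has_header.strip().lower() in ("true", "1", "yes")
    return bool(has_header)
```

The point of the single helper is that an explicitly supplied value and an auto-detected one leave through the same boolean-returning path.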
…lizer and updating operations for improved validation and error handling
…zer for improved validation and error handling, and enhance operations for better file type detection and previewing.
 - Introduces a new configuration section for the data file importer, allowing admins to enable or disable the importer, restrict certain file extensions, and set a maximum upload size.
 - Updates the API and serializer logic to enforce these settings, improving security and flexibility for file uploads.
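
As an illustration of how such settings might be enforced at upload time, here is a sketch in plain Python. The constant names, the default values, and `validate_upload` are hypothetical; the real option names are defined in the PR's importer config section and its serializers.

```python
import os

# Hypothetical stand-ins for the admin-controlled importer settings.
MAX_UPLOAD_BYTES = 150 * 1024 * 1024
RESTRICTED_EXTENSIONS = {".exe", ".zip", ".rar"}


def validate_upload(filename: str, size_bytes: int, enabled: bool = True) -> None:
    """Raise ValueError when an upload violates the configured limits."""
    if not enabled:
        raise ValueError("The importer is disabled.")
    extension = os.path.splitext(filename)[1].lower()
    if extension in RESTRICTED_EXTENSIONS:
        raise ValueError(f"Files with extension {extension} are not allowed.")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"File exceeds the {MAX_UPLOAD_BYTES}-byte upload limit.")
```

In the PR itself the equivalent checks run inside the upload serializer, so invalid uploads would be rejected before any file processing starts.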
Uses a lightweight stdlib-based approach to list sheet names from .xlsx files by reading workbook metadata directly, reducing memory usage and avoiding unnecessary data loading. Falls back to the previous method for non-standard formats, improving robustness and efficiency when handling Excel files.
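
One way to read sheet names with the standard library alone is sketched below. It assumes a standard (transitional) OOXML workbook whose metadata lives in `xl/workbook.xml`; the function name is hypothetical and this is not necessarily how the PR implements it.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by standard (transitional) OOXML spreadsheets.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def list_xlsx_sheet_names(source) -> list:
    """Hypothetical helper: read sheet names from workbook metadata only.

    `source` may be a path or a binary file-like object. No cell data is
    loaded, because only xl/workbook.xml is opened inside the archive.
    """
    with zipfile.ZipFile(source) as archive:
        with archive.open("xl/workbook.xml") as workbook:
            root = ET.parse(workbook).getroot()
    return [sheet.attrib["name"] for sheet in root.iter(f"{MAIN_NS}sheet")]
```

A workbook that uses a different namespace or is not a ZIP archive would yield an empty list or raise `zipfile.BadZipFile` here, which is where a fallback to a full reader would come in.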
- Introduces a new endpoint and supporting logic to determine if uploaded files (including Excel and delimited formats) contain a header row.
- Refactors file preview APIs to require explicit header row indication, improving reliability and user control.
- Enhances error handling and cleans up temporary files on failure to ensure robustness.
- Introduces a new endpoint to retrieve mappings from Polars data types to SQL types for various SQL dialects, improving support for type-aware file imports and downstream table creation.
- Refactors type mapping logic for reusability and maintainability, and adds serializer validation for dialect selection.
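
A dialect-aware mapping of this kind can be sketched as a base table plus per-dialect overrides. The type names below are illustrative choices, not the table shipped in the PR, and `get_sql_type_mapping` is a hypothetical name; only the list of dialects comes from the commit messages above.

```python
SUPPORTED_DIALECTS = ("hive", "impala", "trino", "phoenix", "sparksql")

# Illustrative defaults keyed by Polars dtype name.
BASE_MAPPING = {
    "Int32": "INT",
    "Int64": "BIGINT",
    "Float64": "DOUBLE",
    "Boolean": "BOOLEAN",
    "String": "STRING",
    "Date": "DATE",
}

# Only the entries that differ from the defaults are listed per dialect.
DIALECT_OVERRIDES = {
    "trino": {"Int32": "INTEGER", "String": "VARCHAR"},
    "phoenix": {"Int32": "INTEGER", "String": "VARCHAR"},
}


def get_sql_type_mapping(dialect: str) -> dict:
    """Return the Polars-to-SQL type mapping for one supported dialect."""
    key = dialect.strip().lower()
    if key not in SUPPORTED_DIALECTS:
        raise ValueError(f"Unsupported SQL dialect: {dialect}")
    return {**BASE_MAPPING, **DIALECT_OVERRIDES.get(key, {})}
```

Keeping the overrides separate from the defaults means adding a dialect only requires listing the types that differ.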
@Harshg999 Harshg999 force-pushed the new-importer-working-dir branch from 583a86d to 708e75a Compare May 28, 2025 19:42
@Harshg999 Harshg999 requested a review from Copilot May 29, 2025 20:51
@Copilot Copilot AI left a comment

Pull Request Overview

Introduces a new Importer module to support local file uploads, metadata/header detection, previews, and SQL type mappings.

  • Adds importer configuration (IMPORTER) to hue.ini, pseudo-distributed template, and conf.py.
  • Implements serializers, unit tests, and API endpoints (upload, guess_metadata, guess_header, preview, sql_type_mapping).
  • Updates requirements to include polars[calamine] and python-magic, and integrates the importer into URL routing and the get_config response.

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| desktop/core/src/desktop/lib/importer/serializers.py | New serializers for file upload, metadata guessing, preview, header guessing, and SQL type mapping |
| desktop/core/src/desktop/lib/importer/serializers_tests.py | Unit tests covering valid/invalid cases for all importer serializers |
| desktop/core/src/desktop/lib/importer/api.py | API views for importer operations with error handling |
| desktop/core/src/desktop/lib/importer/__init__.py | Initializer for the importer module |
| desktop/core/src/desktop/conf.py | Programmatic IMPORTER config section definition |
| desktop/core/src/desktop/api_public_urls_v1.py | Routes for new importer endpoints |
| desktop/core/src/desktop/api2.py | Includes importer settings in the application config response |
| desktop/core/generate_requirements.py | Added polars[calamine], python-magic, and moved setuptools |
| desktop/core/base_requirements.txt | Added new dependencies for file processing and type detection |
| desktop/conf/pseudo-distributed.ini.tmpl | Documentation for importer config options |
| desktop/conf.dist/hue.ini | Documentation for importer config options |

Contributor

@JohanAhlen JohanAhlen left a comment

Nice and clean! Great work!

Collaborator

@bjornalm bjornalm left a comment

LGTM, nice work

@Harshg999 Harshg999 changed the title from "[importer] Introduce new Importer module with Local File Upload, Metadata Detection, and Preview APIs" to "[importer] Introduce new Importer module with separate Configs, API Endpoints, and Dependencies" May 30, 2025
@Harshg999 Harshg999 merged commit a615096 into master May 30, 2025
8 checks passed
@Harshg999 Harshg999 deleted the new-importer-working-dir branch May 30, 2025 10:32