refactor: pdf extractor #18

a-klos · 2025-06-12T14:00:23Z

This pull request updates several libraries and modules, covering dependency upgrades, feature enhancements, and code refactoring. The most significant changes are the langfuse upgrade, new sources and tools for text and table extraction, and reworked PDF extraction logic.

…imports

- Updated langfuse version in pyproject.toml and poetry.lock files.
- Modified import statements in langfuse_ragas_evaluator.py to reflect new package structure.
- Adjusted langfuse_manager.py to use labels instead of is_active for prompt management.
- Refactored langfuse_traced_chain.py to utilize the new CallbackHandler import.
- Enhanced traced_chain.py to initialize langfuse client and update tracing logic.

- Introduced test suite for enhanced PDF extraction capabilities in `test_enhanced_pdfs.py`.
- Created new test files for various PDF types including text-based, mixed content, and scanned documents.
- Implemented detailed tests for PDFExtractor's classification, extraction, and linking functionalities in `test_pdf_extractorv2_new.py`.
- Added quick functionality verification tests in `test_pdf_functionality.py` to ensure correct operation with real PDF files.
- Established mock classes and fixtures to facilitate unit testing of PDF extraction methods.

- Added a new source for PyTorch and its related packages with CPU support in pyproject.toml.
- Included additional dependencies: camelot-py, tabula, and easyocr.
- Changed the import statement for PDFExtractor to use the new version (pdf_extractorv2) in dependency_container.py.

…imports

- Updated langfuse dependency in pyproject.toml and poetry.lock files to version 3.0.0.
- Changed import statement for DatasetClient in langfuse_ragas_evaluator.py to reflect new package structure.
- Modified langfuse_manager.py to use labels instead of is_active for prompt management.
- Updated tracing logic in langfuse_traced_chain.py to utilize the new CallbackHandler and langfuse client.
- Refactored traced_chain.py to integrate langfuse client for better tracing capabilities.
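As a rough illustration of the `is_active` to labels change, the sketch below rewrites legacy prompt arguments so that an active prompt carries the `production` label, which is the label Langfuse serves by default. The helper name `migrate_prompt_kwargs` and the argument dictionary are hypothetical and are not taken from langfuse_manager.py; the actual change in this PR may look different.

```python
from typing import Any

# Label that Langfuse treats as the default-served prompt version.
PRODUCTION_LABEL = "production"


def migrate_prompt_kwargs(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: express a legacy ``is_active`` flag as labels.

    Returns a new dict; the input is left unmodified.
    """
    migrated = dict(kwargs)
    # Drop the legacy flag if present and remember its value.
    is_active = migrated.pop("is_active", None)
    labels = list(migrated.get("labels", []))
    # An active prompt maps to the production label; avoid duplicates.
    if is_active and PRODUCTION_LABEL not in labels:
        labels.append(PRODUCTION_LABEL)
    migrated["labels"] = labels
    return migrated


# Usage: arguments written for the old flag are translated before use.
legacy = {"name": "answer-prompt", "prompt": "Answer: {{question}}", "is_active": True}
current = migrate_prompt_kwargs(legacy)
```

An inactive prompt simply ends up with an empty label list, so it is stored without being served by default.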

…prehensive test suite for PDFExtractor class

- Deleted outdated test files: test_pdf_extractorv2.py, test_pdf_extractorv2_new.py, and test_pdf_functionality.py.
- Introduced a new comprehensive test suite for the PDFExtractor class, covering various functionalities including content extraction from different PDF types, error handling, and performance testing.
- Added mock dependencies and fixtures to streamline testing processes.
- Implemented tests for text extraction, table extraction, language detection, and related ID mapping.
- Ensured compatibility with multiple PDF formats and validated metadata completeness in extracted content.
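The mock-dependency approach described above might look roughly like the following. This is a minimal sketch using only the standard library: `SketchPDFExtractor`, `ExtractedPiece`, `read_pages`, and `make_backend` are all hypothetical names, and the real PDFExtractor interface (including whether it is asynchronous) is an assumption here. In the PR's suite, pytest fixtures and pytest-asyncio would presumably take the place of the helper function and `asyncio.run`.

```python
import asyncio
from dataclasses import dataclass
from unittest.mock import AsyncMock


@dataclass
class ExtractedPiece:
    """Hypothetical stand-in for one extracted content item."""
    kind: str      # "text" or "table"
    content: str
    page: int      # 1-based page number


class SketchPDFExtractor:
    """Hypothetical extractor; the real PDFExtractor may differ."""

    def __init__(self, backend):
        self._backend = backend  # injected dependency, mocked in tests

    async def extract(self, path: str) -> list[ExtractedPiece]:
        raw_pages = await self._backend.read_pages(path)
        return [
            ExtractedPiece(kind=raw["kind"], content=raw["content"], page=index + 1)
            for index, raw in enumerate(raw_pages)
        ]


def make_backend(pages):
    """Plays the role of a fixture: a mocked backend returning canned pages."""
    backend = AsyncMock()
    backend.read_pages.return_value = pages
    return backend


async def check_text_and_table_extraction():
    backend = make_backend([
        {"kind": "text", "content": "Hello"},
        {"kind": "table", "content": "a|b"},
    ])
    pieces = await SketchPDFExtractor(backend).extract("sample.pdf")
    # The extractor should read the file exactly once through its backend.
    backend.read_pages.assert_awaited_once_with("sample.pdf")
    return pieces


pieces = asyncio.run(check_text_and_table_extraction())
```

Injecting the backend keeps the test independent of real PDF parsing libraries, so it stays fast and deterministic.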


a-klos added 15 commits June 10, 2025 07:58

Merge branch 'main' into fix/orophaned-threads-issue

e6042ec

feat: add pytest-asyncio support for asynchronous testing

1a9d814

refactor: Moved tests from test_pdf_extractor.py to pdf_extractor_tes…

ef51597

…t.py, ensuring comprehensive coverage and maintaining functionality. Removed old test file to streamline the testing structure.

refactor: update flake8 exclusions and clean up PDFExtractor tests fo…

5442463

…r improved readability and maintainability

chore: add pdf files using git lfs

8da09dd

refactor: update parameter names in PDFExtractor class for clarity an…

4b08f1b

…d consistency; enhance test suite with additional logging and assertions

Merge branch 'main' into refactor/pdf-extractor

6b55ef7

chore: remove PyTorch and related dependencies from pyproject.toml

1d9d71d

refactor: remove unused text-based PDF document from test data

8a19347

chore: add sample PDF document for testing in extractor-api-lib

7cbe521

a-klos marked this pull request as ready for review June 13, 2025 06:22

a-klos added 3 commits June 13, 2025 08:25

refactor: remove unused test methods and main execution block from pd…

5b63ab2

…f_extractor_test.py

chore: add pytest-asyncio as a development dependency

d966a0f

Remove unused dependencies: tabula and easyocr from pyproject.toml

7d3fa64

a-klos requested review from MirUlr June 13, 2025 08:48
