refactor: pdf extractor #18

Open. Wants to merge 18 commits into base: main

a-klos (Member) commented Jun 12, 2025

This pull request updates several libraries and modules. The main changes are an upgrade of langfuse to 3.0.0, a new package source for CPU-only PyTorch builds, new tools for text and table extraction (camelot-py, tabula, easyocr), and a refactored PDF extractor with a new test suite.

a-klos added 15 commits June 10, 2025 07:58
…imports

- Updated langfuse version in pyproject.toml and poetry.lock files.
- Modified import statements in langfuse_ragas_evaluator.py to reflect new package structure.
- Adjusted langfuse_manager.py to use labels instead of is_active for prompt management.
- Refactored langfuse_traced_chain.py to utilize the new CallbackHandler import.
- Enhanced traced_chain.py to initialize langfuse client and update tracing logic.
- Introduced test suite for enhanced PDF extraction capabilities in `test_enhanced_pdfs.py`.
- Created new test files for various PDF types including text-based, mixed content, and scanned documents.
- Implemented detailed tests for PDFExtractor's classification, extraction, and linking functionalities in `test_pdf_extractorv2_new.py`.
- Added quick functionality verification tests in `test_pdf_functionality.py` to ensure correct operation with real PDF files.
- Established mock classes and fixtures to facilitate unit testing of PDF extraction methods.
- Added a new source for PyTorch and its related packages with CPU support in pyproject.toml.
- Included additional dependencies: camelot-py, tabula, and easyocr.
- Changed the import statement for PDFExtractor to use the new version (pdf_extractorv2) in dependency_container.py.
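
The test files above cover text-based, mixed-content, and scanned PDFs. As a rough illustration of how such a classification step might work, here is a minimal sketch. The function name `classify_pdf`, the per-page character counts, and the 50-character threshold are all assumptions made for this example and are not taken from the PR's `PDFExtractor`.

```python
def classify_pdf(chars_per_page: list[int], min_chars: int = 50) -> str:
    """Classify a document as 'text', 'scanned', or 'mixed'.

    `chars_per_page` holds the number of characters found in each page's
    embedded text layer (hypothetical input). A page counts as textual when
    it has at least `min_chars` characters; the threshold is arbitrary.
    """
    if not chars_per_page:
        raise ValueError("document has no pages")

    textual_pages = sum(1 for n in chars_per_page if n >= min_chars)

    if textual_pages == len(chars_per_page):
        return "text"      # every page has a usable text layer
    if textual_pages == 0:
        return "scanned"   # no page has a usable text layer, OCR is needed
    return "mixed"         # some pages need OCR, others do not
```

A caller could use the result to decide whether to run OCR on the whole document, on individual pages only, or not at all.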
…imports

- Updated langfuse dependency in pyproject.toml and poetry.lock files to version 3.0.0.
- Changed import statement for DatasetClient in langfuse_ragas_evaluator.py to reflect new package structure.
- Modified langfuse_manager.py to use labels instead of is_active for prompt management.
- Updated tracing logic in langfuse_traced_chain.py to utilize the new CallbackHandler and langfuse client.
- Refactored traced_chain.py to integrate langfuse client for better tracing capabilities.
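
The switch from `is_active` to labels in prompt management can be pictured as a small mapping from the old boolean to a label list. The sketch below is an assumption about how a caller might translate the flag, not the code in `langfuse_manager.py`. The helper name `labels_from_is_active` is hypothetical, and the use of `"production"` as the label for the actively served prompt version is assumed here.

```python
def labels_from_is_active(is_active: bool, extra: tuple[str, ...] = ()) -> list[str]:
    """Translate a legacy is_active flag into a list of prompt labels.

    Assumption: an active prompt corresponds to the "production" label.
    Additional labels are appended in order, without duplicates.
    """
    labels = ["production"] if is_active else []
    for label in extra:
        if label not in labels:
            labels.append(label)
    return labels
```

Keeping the translation in one place means call sites that still carry a boolean do not each need to know the label convention.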
…prehensive test suite for PDFExtractor class

- Deleted outdated test files: test_pdf_extractorv2.py, test_pdf_extractorv2_new.py, and test_pdf_functionality.py.
- Introduced a new comprehensive test suite for the PDFExtractor class, covering various functionalities including content extraction from different PDF types, error handling, and performance testing.
- Added mock dependencies and fixtures to streamline testing processes.
- Implemented tests for text extraction, table extraction, language detection, and related ID mapping.
- Ensured compatibility with multiple PDF formats and validated metadata completeness in extracted content.
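
Mock dependencies like those mentioned above usually replace slow or external components, such as an OCR engine, so that extraction logic can be tested in isolation. The following sketch shows the general pattern with `unittest.mock`. `SketchExtractor`, its `extract_text` method, and the `read` method on the OCR engine are hypothetical stand-ins, not the interfaces used in this PR.

```python
from unittest.mock import Mock


class SketchExtractor:
    """Hypothetical stand-in for an extractor with an injected OCR engine."""

    def __init__(self, ocr_engine):
        self._ocr = ocr_engine

    def extract_text(self, page_text: str, page_image: bytes) -> str:
        # Prefer the embedded text layer; fall back to OCR only when it is empty.
        if page_text.strip():
            return page_text
        return self._ocr.read(page_image)


def make_extractor_with_mock_ocr(ocr_result: str = "scanned text"):
    """Build an extractor whose OCR engine is a Mock returning a fixed string."""
    ocr = Mock()
    ocr.read.return_value = ocr_result
    return SketchExtractor(ocr), ocr
```

Returning the mock alongside the extractor lets a test assert both on the extracted text and on whether the OCR engine was called at all.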
…t.py, ensuring comprehensive coverage and maintaining functionality. Removed old test file to streamline the testing structure.
…d consistency; enhance test suite with additional logging and assertions
a-klos marked this pull request as ready for review June 13, 2025 06:22
a-klos requested a review from MirUlr June 13, 2025 08:48