Prepares scanned documents and images for AI processing. This tool automatically detects quality issues (blurriness, skew, poor contrast, noise) in PDFs and images, then identifies which preprocessing steps are needed to improve accuracy before feeding documents to AI systems.
Problem it solves: Scanned documents and images often have quality issues that hurt AI accuracy. This tool detects those issues automatically, so you know exactly what corrections to apply before processing documents with AI/machine learning systems.
Intelligent image preprocessing detection system for RAG applications. Automatically analyzes documents (PDFs, images) and detects required preprocessing steps before vector database ingestion.
- Multi-Stage Pipeline Architecture: Text detection gate routes documents to specialized processing paths
- Hybrid IQA: Classical CV + ML for image quality assessment (noise, blur, skew, contrast, orientation)
- Document Element Detection: YOLOv8-based detection of tables, images, handwriting, formulas
- Quality Assessment per Element: IQA on embedded images within text documents
- Structured JSON Output: COCO-aligned metadata with confidence scores and transform history
- Production-Ready Targets: Designed for 50-150 ms latency per page and 6+ pages/sec throughput per GPU worker (these are targets; the project is currently in Phase 0)
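As an illustration of the classical side of the hybrid IQA, blur is commonly estimated with the variance of the Laplacian: sharp images have strong edge responses, blurred ones do not. The following is a minimal, dependency-free sketch of that general technique, not this project's detector. The function names and the threshold value of `100.0` are hypothetical, and the threshold would need tuning on real scans.

```python
from statistics import pvariance


def laplacian_variance(gray: list[list[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a row-major grayscale image (hypothetical input format).
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")

    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Discrete Laplacian: sum of the four neighbours minus 4x the centre.
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def is_blurry(gray: list[list[float]], threshold: float = 100.0) -> bool:
    """Flag the image as blurry when edge energy falls below the threshold.

    The default threshold is an assumption chosen for illustration only.
    """
    return laplacian_variance(gray) < threshold


if __name__ == "__main__":
    flat = [[128] * 8 for _ in range(8)]  # no edges at all
    checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
    print(is_blurry(flat), is_blurry(checker))
```

A real detector would typically run this on a downscaled grayscale page using an image library, and would pair it with an ML score as the feature list describes.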
          PDF/Image Input
                 ↓
   [Ingestion & Standardization]
                 ↓
        [Text Detection Gate]
          ↓              ↓
     [NO TEXT]     [TEXT DETECTED]
          ↓              ↓
    Classical CV   YOLOv8 Layout Detection
    + ML (IQA)     + Hybrid IQA on Images
          ↓              ↓
    [Corrections & JSON Output]
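The routing in the diagram above can be sketched as a small function that maps the gate's decision to an ordered list of stages. This is only an illustration of the control flow: `RoutingDecision`, `route_page`, and the stage names are hypothetical and are not taken from the package.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingDecision:
    """Which branch a page takes and the stages it passes through."""

    path: str
    stages: list[str] = field(default_factory=list)


def route_page(text_detected: bool) -> RoutingDecision:
    """Pick the processing branch based on the text detection gate."""
    # Both branches converge on the same final stages.
    shared_tail = ["corrections", "json_output"]

    if text_detected:
        # Text documents: find layout elements, then assess embedded images.
        return RoutingDecision(
            path="text",
            stages=["yolov8_layout_detection", "hybrid_iqa_on_images", *shared_tail],
        )

    # Image-only documents: assess the whole page with classical CV + ML.
    return RoutingDecision(
        path="no_text",
        stages=["classical_cv_plus_ml_iqa", *shared_tail],
    )


if __name__ == "__main__":
    print(route_page(text_detected=True).stages)
    print(route_page(text_detected=False).stages)
```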
See ARCHITECTURE_SUMMARY.md for detailed architecture and PROJECT_PLAN.md for complete implementation plan.
Phase 0: Foundation & Scaffolding (IN PROGRESS)
- Project structure with Poetry (Python 3.12)
- JSON schema with Pydantic v2 validation
- Structured logging (structlog + rich)
- Pre-commit hooks (Black, Ruff, MyPy, Bandit)
- CI/CD pipeline (GitHub Actions)
- Evaluation framework
- Ground-truth test set (500 pages)
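To make the schema item above more concrete, here is a rough sketch of a validated per-page record carrying confidence scores and a transform history. It deliberately uses standard-library dataclasses instead of Pydantic v2 so that it runs without dependencies. `DetectedIssue`, `PageRecord`, and all field names are assumptions for illustration and do not reflect the actual `DocumentMetadata` schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DetectedIssue:
    """One detected quality issue (hypothetical shape)."""

    label: str         # e.g. "blur", "skew", "contrast"
    confidence: float  # expected to lie in [0, 1]

    def __post_init__(self) -> None:
        # Minimal validation, standing in for what Pydantic would enforce.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


@dataclass
class PageRecord:
    """Per-page result with issues and the corrections applied so far."""

    page_number: int
    issues: list[DetectedIssue] = field(default_factory=list)
    transform_history: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    record = PageRecord(
        page_number=1,
        issues=[DetectedIssue("skew", 0.93)],
        transform_history=["deskew"],
    )
    print(record.to_json())
```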
Next: Phase 1 - MVP with Classical Methods (Weeks 4-7)
- Python 3.12+
- Poetry 1.7+
- (Optional) GPU with CUDA for ML models (Phase 2+)
# Clone repository
git clone https://github.com/williaby/image-preprocessing-detector.git
cd image-preprocessing-detector
# Install with Poetry
poetry install
# Install with dev dependencies
poetry install --with dev
# Install with ML dependencies (Phase 2+)
poetry install --with dev,ml

from image_preprocessing_detector import DocumentMetadata
from image_preprocessing_detector.utils import setup_logging, get_logger
# Setup logging
setup_logging(level="INFO", json_logs=False)
logger = get_logger(__name__)
# Process document (Phase 1+ implementation)
# from image_preprocessing_detector.pipeline import process_document
# metadata = process_document("document.pdf")
# metadata.to_json_file("output.json")
# Load an existing result file and validate it against the JSON schema
metadata = DocumentMetadata.from_json_file("output.json")
logger.info("Processed document", pages=metadata.num_pages)

# Process single file
poetry run imgprep process input.pdf --output result.json
# Batch processing
poetry run imgprep batch input_dir/ --output-dir results/
# With quality threshold tuning
poetry run imgprep process input.pdf --blur-threshold 0.85 --skew-threshold 0.90

# Install all dependencies including dev tools
poetry install --with dev
# Setup pre-commit hooks
poetry run pre-commit install
# Run tests
poetry run pytest -v
# Run with coverage
poetry run pytest --cov=src/image_preprocessing_detector --cov-report=html
# Lint code
poetry run black src tests
poetry run ruff check --fix src tests
poetry run mypy src

image_detection/
├── src/
│   └── image_preprocessing_detector/
│       ├── __init__.py
│       ├── schema.py            # JSON schema (Pydantic models)
│       ├── ingestion/           # PDF/image loading (Phase 1)
│       ├── detection/           # Detection modules (Phase 1-3)
│       ├── correction/          # Image corrections (Phase 1)
│       ├── output/              # JSON generation (Phase 1)
│       └── utils/               # Logging, telemetry
├── tests/
│   ├── unit/                    # Unit tests
│   └── integration/             # Integration tests
├── scripts/                     # Training & evaluation scripts (Phase 2-3)
├── configs/                     # Model configurations
├── data/                        # Datasets (managed by DVC)
├── models/                      # Trained models
├── docs/                        # Documentation
├── pyproject.toml               # Dependencies & tool config
├── README.md                    # This file
├── PROJECT_PLAN.md              # Complete implementation plan
├── ARCHITECTURE_SUMMARY.md      # Architecture quick reference
└── DECISION_MATRIX.md           # Critical decisions tracking
# Run all tests
poetry run pytest -v
# Run specific test categories
poetry run pytest -v -m unit # Unit tests only
poetry run pytest -v -m integration # Integration tests only
poetry run pytest -v -m "not slow" # Exclude slow tests
# Run with coverage requirements
poetry run pytest --cov=src --cov-fail-under=80
# Run tests in parallel
poetry run pytest -n auto

Found a bug? Please report it via GitHub Issues:
- Check existing issues: https://github.com/williaby/image-preprocessing-detector/issues
- Create new issue: https://github.com/williaby/image-preprocessing-detector/issues/new
- Include:
  - Python version and OS
  - Steps to reproduce
  - Expected vs actual behavior
  - Error messages and logs
Have an idea? We welcome enhancement proposals via GitHub Issues. Please describe:
- Use case and motivation
- Proposed solution (if any)
- Alternatives considered
Please do not report security vulnerabilities through public issues.
See SECURITY.md for responsible disclosure process.
This project uses Semantic Versioning:
- MAJOR version: Incompatible API changes
- MINOR version: Backwards-compatible functionality additions
- PATCH version: Backwards-compatible bug fixes
Current version: 0.1.0 (pre-release, API may change)
See CHANGELOG.md for release history.
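The MAJOR/MINOR/PATCH rules above can be illustrated with a small version-bump helper. This is a simplified sketch that handles plain `X.Y.Z` strings only and ignores pre-release and build metadata. `parse_version` and `bump` are hypothetical names, not part of this project's tooling.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain 'MAJOR.MINOR.PATCH' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse_version(version)
    if change == "major":   # incompatible API change: reset minor and patch
        return f"{major + 1}.0.0"
    if change == "minor":   # backwards-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


if __name__ == "__main__":
    print(bump("0.1.0", "patch"))
    print(bump("0.1.0", "minor"))
    print(bump("0.1.0", "major"))
```

Note that under Semantic Versioning a `0.y.z` version signals initial development, which is why the API may still change at 0.1.0.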
- PROJECT_PLAN.md: Complete 50+ page implementation plan with phased roadmap
- ARCHITECTURE_SUMMARY.md: Quick reference for architecture and design decisions
- DECISION_MATRIX.md: Critical decisions tracking and stakeholder requirements
- ARCHITECTURE_CORRECTION.md: Hybrid IQA approach for embedded images
- docs/WTD-Runbook.md: What The Diff integration guide for automated PR summaries
- docs/api-reference.md: API and CLI reference documentation
- SECURITY.md: Security policy and vulnerability reporting
- PDF ingestion and text detection gate
- Classical IQA detectors (skew, blur, contrast)
- Correction pipeline with guardrails
- JSON output generation
- IQA dataset generation (50k synthetic + real images)
- Train MobileNetV3/EfficientNet multi-label classifier
- ONNX optimization for CPU inference
- Integration with classical methods
- Document element dataset (PubLayNet + custom)
- Train YOLOv8n/s for layout detection
- Active learning for rare classes
- INT8 quantization for production
- FastAPI service with Docker
- Performance optimization (batching, quantization)
- Monitoring and telemetry
- Comprehensive testing (80%+ coverage)
- Drift detection and alerting
- Active learning pipeline
- Quarterly retraining and recalibration
| Metric | Target | Notes |
|---|---|---|
| IQA mAP | > 0.88 | Multi-label classification |
| Layout mAP | > 0.82 | Object detection |
| JSON Accuracy | > 0.85 | End-to-end pipeline |
| Latency (GPU) | < 150ms/page | With T4 GPU |
| Throughput | > 6 pages/sec | Per GPU worker |
| Test Coverage | > 80% | Unit + integration |
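The targets in the table can be checked mechanically against measured values, for example in an evaluation script. The sketch below is one possible shape for such a check. The metric keys, the `unmet_targets` helper, and the choice to express coverage as a fraction are assumptions, not part of the evaluation framework.

```python
# Target bounds copied from the table; ">" means higher is better.
TARGETS: dict[str, tuple[str, float]] = {
    "iqa_map": (">", 0.88),
    "layout_map": (">", 0.82),
    "json_accuracy": (">", 0.85),
    "latency_ms_per_page": ("<", 150.0),
    "throughput_pages_per_sec": (">", 6.0),
    "test_coverage": (">", 0.80),
}


def unmet_targets(measured: dict[str, float]) -> list[str]:
    """Return the names of targets that are missing or not met."""
    failures = []
    for name, (direction, bound) in TARGETS.items():
        value = measured.get(name)
        if value is None:
            failures.append(name)  # a missing measurement counts as unmet
            continue
        met = value > bound if direction == ">" else value < bound
        if not met:
            failures.append(name)
    return failures


def sequential_throughput(latency_ms: float) -> float:
    """Pages per second when pages are processed one at a time."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    print(round(sequential_throughput(150.0), 2))
```

As a quick consistency check on the table itself: at the 150 ms latency ceiling, one-page-at-a-time processing yields about 6.67 pages/sec, so the latency and throughput targets agree even before batching.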
See CONTRIBUTING.md for comprehensive contribution guidelines, development workflow, and code quality standards.
- Formatting: Black (88 chars)
- Linting: Ruff (comprehensive rules)
- Type Checking: MyPy (strict mode)
- Testing: Pytest with 80%+ coverage
- Security: Bandit + dependency scanning
- Commits: Conventional Commits, GPG-signed
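For readers unfamiliar with Conventional Commits, a commit header has the form `type(scope)!: description`, where the scope and the `!` breaking-change marker are optional. The sketch below checks only that header shape with a regular expression. It is an illustration, not the project's hook configuration: `is_conventional` is a hypothetical helper, and it accepts any lowercase type rather than a project-specific allow-list.

```python
import re

# type, optional (scope), optional "!" for breaking changes, then ": description"
HEADER_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional(header: str) -> bool:
    """Return True when the first line of a commit message matches the format."""
    return HEADER_PATTERN.match(header) is not None


if __name__ == "__main__":
    for line in (
        "feat(detection): add skew detector",
        "fix!: reject empty PDFs",
        "Added some stuff",
    ):
        print(f"{line!r}: {is_conventional(line)}")
```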
All commits must pass:
- Black formatting
- Ruff linting
- MyPy type checking (src/ only)
- Bandit security scanning
- YAML/Markdown linting
MIT License - see LICENSE for details.
@software{image_preprocessing_detector,
  title = {Image Preprocessing Detector for RAG Applications},
  author = {Byron Williams},
  year = {2025},
  version = {0.1.0},
  url = {https://github.com/williaby/image-preprocessing-detector}
}

Architecture designed with multi-model consensus analysis:
- Gemini 2.5 Pro: Pipeline design and phased roadmap
- GPT-5: Risk assessment and optimization strategies
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Status: Phase 0 (Foundation) - Weeks 2-3 of the 24-week development timeline