Image Preprocessing Detector

What Does This Do?

Prepares scanned documents and images for AI processing. This tool automatically detects quality issues (blurriness, skew, poor contrast, noise) in PDFs and images, then identifies which preprocessing steps are needed to improve accuracy before feeding documents to AI systems.

Problem it solves: Scanned documents and images often have quality issues that hurt AI accuracy. This tool detects those issues automatically, so you know exactly what corrections to apply before processing documents with AI/machine learning systems.

The primary target is RAG applications: the system analyzes documents (PDFs, images) and reports the preprocessing steps they need before ingestion into a vector database.

Features

Multi-Stage Pipeline Architecture: Text detection gate routes documents to specialized processing paths
Hybrid IQA: Classical CV + ML for image quality assessment (noise, blur, skew, contrast, orientation)
Document Element Detection: YOLOv8-based detection of tables, images, handwriting, formulas
Quality Assessment per Element: IQA on embedded images within text documents
Structured JSON Output: COCO-aligned metadata with confidence scores and transform history
Production-Oriented: Targets 50-150 ms latency per page and a throughput of 6+ pages/sec per GPU worker
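To make the structured output feature more concrete, here is a minimal sketch of what a per-element record could look like. The class name `ElementRecord` and every field name are assumptions made for illustration; the project's actual Pydantic models live in schema.py and may differ. The only convention taken as given is the COCO bounding-box layout of `[x, y, width, height]`.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ElementRecord:
    """Hypothetical per-element record; all field names are illustrative."""

    category: str        # e.g. "table", "image", "handwriting", "formula"
    bbox: list[float]    # COCO convention: [x, y, width, height] in pixels
    confidence: float    # detector confidence in [0, 1]
    issues: dict[str, float] = field(default_factory=dict)  # issue -> confidence
    transforms: list[str] = field(default_factory=list)     # corrections, in order


record = ElementRecord(
    category="image",
    bbox=[40.0, 120.0, 300.0, 200.0],
    confidence=0.93,
    issues={"blur": 0.81},
    transforms=["deskew"],
)

# Serialize to JSON, as a downstream ingestion step might consume it.
payload = json.dumps(asdict(record), indent=2)
print(payload)
```

Keeping the transform history as an ordered list means a consumer can tell which corrections were already applied to an element, and in what sequence.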

Architecture Overview

PDF/Image Input
    ↓
[Ingestion & Standardization]
    ↓
[Text Detection Gate]
    ↓              ↓
[NO TEXT]      [TEXT DETECTED]
    ↓              ↓
Classical CV   YOLOv8 Layout Detection
+ ML (IQA)     + Hybrid IQA on Images
    ↓              ↓
[Corrections & JSON Output]
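The routing step in the diagram above can be sketched as a single decision function. This is only an illustration: the `Page` class, the `text_coverage` measure, the 5% threshold, and the path names are all assumptions, not part of the project's API.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a standardized page image."""

    page_number: int
    text_coverage: float  # fraction of page area covered by detected text, 0..1


def route_page(page: Page, min_text_coverage: float = 0.05) -> str:
    """Return the name of the processing path for a page.

    The coverage threshold is an assumed heuristic, not a project value.
    """
    if page.text_coverage >= min_text_coverage:
        return "layout_detection_with_hybrid_iqa"  # TEXT DETECTED branch
    return "classical_cv_plus_ml_iqa"              # NO TEXT branch


pages = [Page(1, 0.42), Page(2, 0.0)]
routes = {p.page_number: route_page(p) for p in pages}
print(routes)
```

Both branches end in the same corrections and JSON output stage, so the gate only decides which detectors run, not the shape of the result.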

See ARCHITECTURE_SUMMARY.md for detailed architecture and PROJECT_PLAN.md for complete implementation plan.

Project Status

Phase 0: Foundation & Scaffolding (IN PROGRESS)

Project structure with Poetry (Python 3.12)
JSON schema with Pydantic v2 validation
Structured logging (structlog + rich)
Pre-commit hooks (Black, Ruff, MyPy, Bandit)
CI/CD pipeline (GitHub Actions)
Evaluation framework
Ground-truth test set (500 pages)

Next: Phase 1 - MVP with Classical Methods (Weeks 4-7)

Quick Start

Prerequisites

Python 3.12+
Poetry 1.7+
(Optional) GPU with CUDA for ML models (Phase 2+)

Installation

# Clone repository
git clone https://github.com/williaby/image-preprocessing-detector.git
cd image-preprocessing-detector

# Install with Poetry
poetry install

# Install with dev dependencies
poetry install --with dev

# Install with ML dependencies (Phase 2+)
poetry install --with dev,ml

Usage

from image_preprocessing_detector import DocumentMetadata
from image_preprocessing_detector.utils import setup_logging, get_logger

# Setup logging
setup_logging(level="INFO", json_logs=False)
logger = get_logger(__name__)

# Process document (Phase 1+ implementation)
# from image_preprocessing_detector.pipeline import process_document
# metadata = process_document("document.pdf")
# metadata.to_json_file("output.json")

# Load an existing output file and validate it against the JSON schema
metadata = DocumentMetadata.from_json_file("output.json")
logger.info("Processed document", pages=metadata.num_pages)

CLI Usage (Phase 1+)

# Process single file
poetry run imgprep process input.pdf --output result.json

# Batch processing
poetry run imgprep batch input_dir/ --output-dir results/

# With quality threshold tuning
poetry run imgprep process input.pdf --blur-threshold 0.85 --skew-threshold 0.90
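The README does not define what the threshold flags mean. One plausible reading, assumed here, is that each threshold is the minimum detection confidence at which an issue is reported. Under that assumption, the decision could be sketched as follows; `required_steps` and the score dictionary are hypothetical names, not part of the CLI.

```python
def required_steps(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the issues whose confidence meets or exceeds their threshold.

    Issues without a configured threshold are never flagged.
    """
    return sorted(
        issue
        for issue, score in scores.items()
        if score >= thresholds.get(issue, float("inf"))
    )


# Values mirror the CLI example: --blur-threshold 0.85 --skew-threshold 0.90
thresholds = {"blur": 0.85, "skew": 0.90}
scores = {"blur": 0.91, "skew": 0.40, "noise": 0.99}  # made-up detector output

print(required_steps(scores, thresholds))  # ['blur']
```

Raising a threshold under this reading makes the tool more conservative: fewer pages are flagged, at the cost of missing borderline cases.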

Development

Setup Development Environment

# Install all dependencies including dev tools
poetry install --with dev

# Setup pre-commit hooks
poetry run pre-commit install

# Run tests
poetry run pytest -v

# Run with coverage
poetry run pytest --cov=src/image_preprocessing_detector --cov-report=html

# Format, lint, and type-check code
poetry run black src tests
poetry run ruff check --fix src tests
poetry run mypy src

Project Structure

image_detection/
├── src/
│   └── image_preprocessing_detector/
│       ├── __init__.py
│       ├── schema.py              # JSON schema (Pydantic models)
│       ├── ingestion/             # PDF/image loading (Phase 1)
│       ├── detection/             # Detection modules (Phase 1-3)
│       ├── correction/            # Image corrections (Phase 1)
│       ├── output/                # JSON generation (Phase 1)
│       └── utils/                 # Logging, telemetry
├── tests/
│   ├── unit/                      # Unit tests
│   └── integration/               # Integration tests
├── scripts/                       # Training & evaluation scripts (Phase 2-3)
├── configs/                       # Model configurations
├── data/                          # Datasets (managed by DVC)
├── models/                        # Trained models
├── docs/                          # Documentation
├── pyproject.toml                 # Dependencies & tool config
├── README.md                      # This file
├── PROJECT_PLAN.md                # Complete implementation plan
├── ARCHITECTURE_SUMMARY.md        # Architecture quick reference
└── DECISION_MATRIX.md             # Critical decisions tracking

Testing

# Run all tests
poetry run pytest -v

# Run specific test categories
poetry run pytest -v -m unit          # Unit tests only
poetry run pytest -v -m integration   # Integration tests only
poetry run pytest -v -m "not slow"    # Exclude slow tests

# Run with coverage requirements
poetry run pytest --cov=src --cov-fail-under=80

# Run tests in parallel
poetry run pytest -n auto

Reporting Issues

Bug Reports

Found a bug? Please report it via GitHub Issues:

Check existing issues: https://github.com/williaby/image-preprocessing-detector/issues
Create new issue: https://github.com/williaby/image-preprocessing-detector/issues/new
Include:
- Python version and OS
- Steps to reproduce
- Expected vs actual behavior
- Error messages and logs

Feature Requests

Have an idea? We welcome enhancement proposals via GitHub Issues. Please describe:

Use case and motivation
Proposed solution (if any)
Alternatives considered

Security Vulnerabilities

Please do not report security vulnerabilities through public issues.

See SECURITY.md for responsible disclosure process.

Versioning

This project uses Semantic Versioning:

MAJOR version: Incompatible API changes
MINOR version: Backwards-compatible functionality additions
PATCH version: Backwards-compatible bug fixes
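As a quick illustration of the three rules above, the following sketch bumps a plain `MAJOR.MINOR.PATCH` string. The `bump` helper is hypothetical and ignores pre-release and build metadata suffixes.

```python
def bump(version: str, part: str) -> str:
    """Return the next version after bumping 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"          # incompatible API change
    if part == "minor":
        return f"{major}.{minor + 1}.0"    # backwards-compatible feature
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"  # backwards-compatible fix
    raise ValueError(f"unknown version part: {part!r}")


print(bump("0.1.0", "patch"))  # 0.1.1
print(bump("0.1.0", "minor"))  # 0.2.0
print(bump("0.1.0", "major"))  # 1.0.0
```

Note that lower-order components reset to zero whenever a higher-order component is incremented.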

Current version: 0.1.0 (pre-release, API may change)

See CHANGELOG.md for release history.

Documentation

PROJECT_PLAN.md: Complete 50+ page implementation plan with phased roadmap
ARCHITECTURE_SUMMARY.md: Quick reference for architecture and design decisions
DECISION_MATRIX.md: Critical decisions tracking and stakeholder requirements
ARCHITECTURE_CORRECTION.md: Hybrid IQA approach for embedded images
docs/WTD-Runbook.md: What The Diff integration guide for automated PR summaries
docs/api-reference.md: API and CLI reference documentation
SECURITY.md: Security policy and vulnerability reporting

Roadmap

Phase 1: MVP with Classical Methods (Weeks 4-7)

PDF ingestion and text detection gate
Classical IQA detectors (skew, blur, contrast)
Correction pipeline with guardrails
JSON output generation
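For a sense of what classical IQA detectors involve, here is a dependency-free sketch of two common measures: variance of the Laplacian as a sharpness indicator, and RMS contrast. These are standard techniques chosen for illustration; the README does not say which methods the project will use, and a real implementation would more likely rely on OpenCV or NumPy than on nested lists.

```python
from statistics import pstdev, pvariance


def laplacian_variance(gray: list[list[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values suggest blur. Requires an image of at least 3x3 pixels.
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def rms_contrast(gray: list[list[float]]) -> float:
    """Standard deviation of pixel intensities (RMS contrast)."""
    return pstdev([value for row in gray for value in row])


# Synthetic 8x8 images: a sharp checkerboard and a featureless grey patch.
sharp = [[255.0 if (x + y) % 2 == 0 else 0.0 for x in range(8)] for y in range(8)]
flat = [[128.0] * 8 for _ in range(8)]

print(laplacian_variance(sharp) > laplacian_variance(flat))  # True
print(rms_contrast(flat))  # 0.0
```

Any cut-off separating "sharp" from "blurry" depends on resolution and content, so thresholds for measures like these would need calibration against the ground-truth test set.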

Phase 2: ML for Image Quality (Weeks 8-11)

IQA dataset generation (50k synthetic + real images)
Train MobileNetV3/EfficientNet multi-label classifier
ONNX optimization for CPU inference
Integration with classical methods

Phase 3: ML for Document Layout (Weeks 12-16)

Document element dataset (PubLayNet + custom)
Train YOLOv8n/s for layout detection
Active learning for rare classes
INT8 quantization for production
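The idea behind INT8 quantization can be shown with a small symmetric, per-tensor example. This is a conceptual sketch only: the function names are hypothetical, and production toolchains add calibration data, per-channel scales, and operator fusion that are not shown here.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map int8 codes back to approximate float values."""
    return [code * scale for code in quantized]


weights = [0.5, -1.27, 0.0, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 0, 127]
```

Each restored value differs from the original by at most half a quantization step, which is the accuracy cost traded for smaller models and faster integer arithmetic.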

Phase 4: Production Hardening (Weeks 17-20)

FastAPI service with Docker
Performance optimization (batching, quantization)
Monitoring and telemetry
Comprehensive testing (80%+ coverage)

Phase 5: Continuous Improvement (Ongoing)

Drift detection and alerting
Active learning pipeline
Quarterly retraining and recalibration
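Drift detection can take many forms; the README does not specify one. As a minimal sketch under that caveat, the check below compares the mean of a recent window of confidence scores against a baseline and flags a shift larger than a chosen number of standard errors. The function name, the choice of statistic, and the sample values are all assumptions.

```python
from math import sqrt
from statistics import mean, pstdev


def drift_detected(
    baseline: list[float], recent: list[float], z_limit: float = 3.0
) -> bool:
    """Flag drift when the recent mean is more than z_limit standard errors
    away from the baseline mean."""
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    standard_error = sigma / sqrt(len(recent))
    return abs(mean(recent) - mu) / standard_error > z_limit


# Made-up confidence scores for illustration.
baseline = [0.90, 0.92, 0.88, 0.91, 0.89, 0.90, 0.93, 0.87]
stable = [0.91, 0.89, 0.90, 0.90]
shifted = [0.70, 0.72, 0.68, 0.71]

print(drift_detected(baseline, stable))   # False
print(drift_detected(baseline, shifted))  # True
```

A check like this catches shifts in average confidence only; changes in the shape of the score distribution would need a different test.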

Performance Targets

Metric	Target	Notes
IQA mAP	> 0.88	Multi-label classification
Layout mAP@0.5	> 0.82	Object detection
JSON Accuracy	> 0.85	End-to-end pipeline
Latency (GPU)	< 150ms/page	With T4 GPU
Throughput	> 6 pages/sec	Per GPU worker
Test Coverage	> 80%	Unit + integration
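The latency and throughput targets can be checked against each other with simple arithmetic. The sketch below assumes a worker that processes pages strictly one at a time unless `concurrent_pages` says otherwise; the helper name is illustrative.

```python
def pages_per_second(latency_ms: float, concurrent_pages: int = 1) -> float:
    """Throughput implied by a per-page latency, for a given concurrency."""
    return concurrent_pages * 1000.0 / latency_ms


worst_case = pages_per_second(150)  # latency at the 150 ms upper bound
best_case = pages_per_second(50)    # latency at the 50 ms lower bound

print(f"{worst_case:.2f} pages/sec at 150 ms/page")  # 6.67 pages/sec at 150 ms/page
print(f"{best_case:.2f} pages/sec at 50 ms/page")    # 20.00 pages/sec at 50 ms/page
```

At exactly 150 ms per page, sequential processing yields about 6.67 pages/sec, so the throughput target of more than 6 pages/sec is consistent with the latency target, with little headroom unless pages are batched.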

Contributing

See CONTRIBUTING.md for comprehensive contribution guidelines, development workflow, and code quality standards.

Code Quality Standards

Formatting: Black (88 chars)
Linting: Ruff (comprehensive rules)
Type Checking: MyPy (strict mode)
Testing: Pytest with 80%+ coverage
Security: Bandit + dependency scanning
Commits: Conventional Commits, GPG-signed

Pre-commit Checks

All commits must pass:

Black formatting
Ruff linting
MyPy type checking (src/ only)
Bandit security scanning
YAML/Markdown linting

License

MIT License - see LICENSE for details.

Citation

@software{image_preprocessing_detector,
  title = {Image Preprocessing Detector for RAG Applications},
  author = {Byron Williams},
  year = {2025},
  version = {0.1.0},
  url = {https://github.com/williaby/image-preprocessing-detector}
}

Acknowledgments

Architecture designed with multi-model consensus analysis:

Gemini 2.5 Pro: Pipeline design and phased roadmap
GPT-5: Risk assessment and optimization strategies

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: [email protected]

Status: Phase 0 (Foundation) - Weeks 2-3 of a 24-week development timeline
