Document Converter API

A FastAPI service that converts PDF files to text.

Features

Convert PDF files to text
Support for both standard PDFs and scanned PDFs (via OCR)
Advanced document understanding with SmolDocling model (v2 API)
RESTful API with OpenAPI documentation
Proper error handling and validation
Configurable via environment variables
Versioned API for better compatibility tracking

Requirements

Python 3.8+
uv for dependency management (recommended)
Tesseract OCR (optional, for OCR support)
PyTorch and Transformers (for v2 API with SmolDocling model)
pdf2image (for v2 API to convert PDFs to images)

Installation

Using the setup script (recommended)

The easiest way to set up the development environment is to use the provided setup script:

# Clone the repository
git clone https://github.com/yourusername/document-converter.git
cd document-converter

# Run the setup script
./setup.sh

The setup script will:

Check if uv is installed and use it if available
Create a virtual environment
Install all dependencies (including development and example dependencies)
Create a .env file from the template if it doesn't exist
Create data directories for uploads and logs

Manual installation

Using uv

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/yourusername/document-converter.git
cd document-converter

# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync --dev

Configuration

The application can be configured using environment variables or a .env file:

| Variable | Description | Default |
| --- | --- | --- |
| `DEBUG` | Enable debug mode | `False` |
| `MAX_UPLOAD_SIZE` | Maximum upload size in bytes | `10485760` (10 MB) |
| `ALLOWED_EXTENSIONS` | Comma-separated list of allowed file extensions | `pdf` |
| `OCR_ENABLED` | Enable OCR for scanned PDFs | `True` |
| `BACKEND_CORS_ORIGINS` | Comma-separated list of allowed CORS origins | `[]` |
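
As a rough illustration of how the variables above could be read into typed values, here is a minimal sketch that uses only the standard library. The `Settings` class and the `load_settings` function are hypothetical names chosen for this example; the project itself may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings such as 'True', '1', or 'yes'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _as_list(value: str) -> List[str]:
    """Split a comma-separated string, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented variables."""
    debug: bool
    max_upload_size: int
    allowed_extensions: List[str]
    ocr_enabled: bool
    backend_cors_origins: List[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        debug=_as_bool(env.get("DEBUG", "False")),
        max_upload_size=int(env.get("MAX_UPLOAD_SIZE", "10485760")),
        allowed_extensions=_as_list(env.get("ALLOWED_EXTENSIONS", "pdf")),
        ocr_enabled=_as_bool(env.get("OCR_ENABLED", "True")),
        backend_cors_origins=_as_list(env.get("BACKEND_CORS_ORIGINS", "")),
    )
```

Calling `load_settings()` with no argument reads from `os.environ`; passing a dictionary makes the function easy to exercise in tests.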

Setting up environment variables

You can create a .env file in the project root directory to set environment variables. A template is provided in .env.example:

# Copy the example file
cp .env.example .env

# Edit the file with your preferred settings
nano .env  # or use any text editor

Example .env file:

DEBUG=True
MAX_UPLOAD_SIZE=20971520
ALLOWED_EXTENSIONS=pdf
OCR_ENABLED=True
BACKEND_CORS_ORIGINS=http://localhost:3000,http://localhost:8080
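
The `MAX_UPLOAD_SIZE` and `ALLOWED_EXTENSIONS` values above imply a check on every uploaded file. The sketch below shows one way such a check might look, using the values from the example .env file as defaults. `validate_upload` and `UploadRejected` are hypothetical names and are not taken from the project's source.

```python
from pathlib import PurePath
from typing import Iterable


class UploadRejected(ValueError):
    """Raised when an upload violates the configured limits (hypothetical)."""


def validate_upload(
    filename: str,
    size_bytes: int,
    max_upload_size: int = 20971520,  # value from the example .env above
    allowed_extensions: Iterable[str] = ("pdf",),
) -> None:
    """Raise UploadRejected if the extension or the size is not acceptable."""
    # Compare extensions case-insensitively and without the leading dot.
    extension = PurePath(filename).suffix.lstrip(".").lower()
    allowed = {ext.strip().lower() for ext in allowed_extensions}
    if extension not in allowed:
        raise UploadRejected(f"extension {extension!r} is not allowed")
    if size_bytes > max_upload_size:
        raise UploadRejected(
            f"file is {size_bytes} bytes, limit is {max_upload_size} bytes"
        )
```

In a FastAPI handler, a failure like this would typically be translated into an HTTP error response; how the project actually reports it is not shown here.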

Usage

Running the server

Using the run script

# Using the run script (recommended)
./run.py

# With custom options
./run.py --host 127.0.0.1 --port 8080 --reload --workers 4

# Development mode (alternative)
cd src
python -m app.main

# Production mode (using uvicorn directly)
uvicorn src.app.main:app --host 0.0.0.0 --port 8000
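
For readers curious how a launcher with the options shown above might parse its arguments, here is a minimal sketch. It is not the contents of run.py: the function names and the default values (host `0.0.0.0`, port `8000`, one worker) are assumptions based on the commands in this section.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Create a parser for the four options used in the examples above."""
    parser = argparse.ArgumentParser(description="Run the Document Converter API")
    parser.add_argument("--host", default="0.0.0.0", help="interface to bind to")
    parser.add_argument("--port", type=int, default=8000, help="port to listen on")
    parser.add_argument("--reload", action="store_true", help="restart on code changes")
    parser.add_argument("--workers", type=int, default=1, help="number of worker processes")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

The parsed values could then be handed to `uvicorn.run`, which accepts `host`, `port`, `reload`, and `workers` keyword arguments.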

Using Docker

# Build and run the Docker image
docker build -t document-converter .
docker run -p 8000:8000 document-converter

Using Docker Compose

# Start the service
docker-compose up -d

# View logs
docker-compose logs -f

# Stop the service
docker-compose down

The API will be available at http://localhost:8000.

Docker Configuration

You can configure the Docker container by modifying the environment variables in the docker-compose.yml file:

environment:
  - DEBUG=False
  - MAX_UPLOAD_SIZE=10485760
  - ALLOWED_EXTENSIONS=pdf
  - OCR_ENABLED=True
  - BACKEND_CORS_ORIGINS=

A data volume is also mounted to persist data between container restarts:

volumes:
  - ./data:/app/data

API Documentation

The API documentation is available at:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

API Endpoints

Root Endpoint

GET /

Response:

{
  "message": "Welcome to the Document Converter API",
  "version": "0.1.0",
  "docs": "/docs",
  "redoc": "/redoc"
}

This endpoint provides basic information about the API, including the current version.
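
A client can read the version from this response before calling other endpoints. The following sketch uses only the standard library; `parse_root_response` and `fetch_api_version` are hypothetical helper names, and the base URL assumes the server is running locally on port 8000.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen

# Keys shown in the documented response of GET /.
REQUIRED_KEYS = {"message", "version", "docs", "redoc"}


def parse_root_response(body: str) -> Dict[str, Any]:
    """Decode the JSON body and check that the documented keys are present."""
    data = json.loads(body)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"root response is missing keys: {sorted(missing)}")
    return data


def fetch_api_version(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> str:
    """Send GET / to a running server and return its reported version."""
    with urlopen(f"{base_url}/", timeout=timeout) as response:
        return parse_root_response(response.read().decode("utf-8"))["version"]
```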

Convert PDF to Text (v1 API)

POST /api/v1/pdf/convert

Request:

file: PDF file to convert (multipart/form-data)

Response:

{
  "text": "Extracted text from the PDF...",
  "filename": "example.pdf",
  "page_count": 5,
  "ocr_used": false
}
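
As a sketch of how the request above could be sent without third-party packages, the following code builds the multipart/form-data body by hand and posts it with `urllib`. The helper names `build_multipart` and `convert_pdf` are hypothetical, and the default URL assumes a local server. The bundled example client uses the requests library instead, which handles the encoding for you.

```python
import json
import os
import uuid
from typing import Any, Dict, Tuple
from urllib.request import Request, urlopen


def build_multipart(filename: str, content: bytes, field_name: str = "file") -> Tuple[bytes, str]:
    """Return (body, content_type) for a single-file multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(path: str, url: str = "http://localhost:8000/api/v1/pdf/convert") -> Dict[str, Any]:
    """Upload the PDF at `path` and return the decoded JSON response."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(os.path.basename(path), handle.read())
    request = Request(url, data=body, headers={"Content-Type": content_type}, method="POST")
    with urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `convert_pdf("example.pdf")["text"]` would give the extracted text, following the response shape documented above.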

Convert PDF to Text with SmolDocling (v2 API)

POST /api/v1/v2/pdf/convert

This endpoint uses the SmolDocling-256M-preview model from Hugging Face for advanced document understanding. It's particularly effective for complex document layouts and can handle a wide variety of document formats.

Request:

file: PDF file to convert (multipart/form-data)

Response:

{
  "text": "Extracted text from the PDF using SmolDocling...",
  "filename": "example.pdf",
  "page_count": 5,
  "ocr_used": false
}

Note: The v2 API requires additional dependencies (PyTorch, Transformers, pdf2image) and may have higher computational requirements due to the machine learning model used.

Example Client

An example client script is provided in the examples directory to demonstrate how to use the API programmatically:

# Basic usage
./examples/client_example.py path/to/your/document.pdf

# Save output to a file
./examples/client_example.py path/to/your/document.pdf --output extracted_text.txt

# Use a different API endpoint
./examples/client_example.py path/to/your/document.pdf --api-url http://api.example.com/api/v1/pdf/convert

The example client requires the requests library, which you can install using uv:

# Install requests directly
uv pip install requests

# Or install as an optional dependency group
uv pip install -e ".[examples]"

Versioning

The Document Converter API uses semantic versioning (SemVer) for version management. The version format is MAJOR.MINOR.PATCH:

MAJOR version changes indicate incompatible API changes
MINOR version changes add functionality in a backward-compatible manner
PATCH version changes make backward-compatible bug fixes
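
The rules above can be sketched as a small compatibility check. `Version`, `parse_version`, and `is_backward_compatible` are hypothetical names for this example, and the parser deliberately handles only plain MAJOR.MINOR.PATCH strings, not pre-release or build suffixes.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into a Version tuple."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(part) for part in parts))


def is_backward_compatible(current: str, candidate: str) -> bool:
    """True when candidate shares the MAJOR version and is not older than current."""
    old, new = parse_version(current), parse_version(candidate)
    # Tuples compare field by field, so this orders by major, minor, then patch.
    return new.major == old.major and new >= old
```

Note that the SemVer specification treats major version zero as initial development in which anything may change, so a check like this is mainly meaningful from 1.0.0 onward.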

Checking the Version

You can check the current version of the API in several ways:

API Root Endpoint: Send a GET request to the root endpoint (/) to see the version in the response.
OpenAPI Documentation: The version is displayed in the Swagger UI and ReDoc pages.
Startup Logs: The version is logged when the application starts.

Version Management

The version is centrally managed in the codebase:

The canonical version is defined in pyproject.toml
The version is exposed through the app.core.version module
All parts of the application reference this central version
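
One common way to expose a version declared in pyproject.toml is to read the installed package metadata at runtime. The sketch below illustrates that pattern; it is not the actual contents of app.core.version, and both the distribution name `document-converter` and the fallback string are assumptions.

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(distribution: str = "document-converter", fallback: str = "0.0.0+unknown") -> str:
    """Return the installed distribution's version, or a fallback if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # Happens when running from a source checkout that was never installed.
        return fallback
```

Reading the metadata this way keeps pyproject.toml as the single place where the version is written down.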

Development

Running Tests

pytest

Code Formatting

# Format code
black src tests

# Sort imports
isort src tests

# Lint code
ruff check src tests

License

MIT

mustafa-zidan/document-converter
