A FastAPI service that converts PDF files to text.
- Convert PDF files to text
- Support for both standard PDFs and scanned PDFs (via OCR)
- Advanced document understanding with SmolDocling model (v2 API)
- RESTful API with OpenAPI documentation
- Proper error handling and validation
- Configurable via environment variables
- Versioned API for better compatibility tracking
- Python 3.8+
- uv for dependency management (recommended)
- Tesseract OCR (optional, for OCR support)
- PyTorch and Transformers (for v2 API with SmolDocling model)
- pdf2image (for v2 API to convert PDFs to images)
The easiest way to set up the development environment is to use the provided setup script:
# Clone the repository
git clone https://github.com/yourusername/document-converter.git
cd document-converter
# Run the setup script
./setup.sh

The setup script will:
- Check if uv is installed and use it if available
- Create a virtual environment
- Install all dependencies (including development and example dependencies)
- Create a .env file from the template if it doesn't exist
- Create data directories for uploads and logs
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/yourusername/document-converter.git
cd document-converter
# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv sync --dev

The application can be configured using environment variables or a .env file:
| Variable | Description | Default |
|---|---|---|
| DEBUG | Enable debug mode | False |
| MAX_UPLOAD_SIZE | Maximum upload size in bytes | 10485760 (10 MB) |
| ALLOWED_EXTENSIONS | Comma-separated list of allowed file extensions | pdf |
| OCR_ENABLED | Enable OCR for scanned PDFs | True |
| BACKEND_CORS_ORIGINS | Comma-separated list of allowed CORS origins | [] |
You can create a .env file in the project root directory to set environment variables. A template is provided in .env.example:
# Copy the example file
cp .env.example .env
# Edit the file with your preferred settings
nano .env # or use any text editor

Example .env file:
DEBUG=True
MAX_UPLOAD_SIZE=20971520
ALLOWED_EXTENSIONS=pdf
OCR_ENABLED=True
BACKEND_CORS_ORIGINS=http://localhost:3000,http://localhost:8080
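
The variables listed above could be read into a typed settings object along the following lines. This is a minimal standard-library sketch, not the service's actual configuration code: the `Settings` class, the `load_settings` function, and the accepted boolean spellings are assumptions made for illustration.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping, Optional


def _parse_bool(value: str) -> bool:
    # Assumption: common truthy spellings are accepted, case-insensitively.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _parse_list(value: str) -> List[str]:
    # Split a comma-separated value and drop empty entries.
    return [item.strip() for item in value.split(",") if item.strip()]


@dataclass
class Settings:
    # Hypothetical container; the defaults mirror the documented ones.
    debug: bool = False
    max_upload_size: int = 10485760
    allowed_extensions: List[str] = field(default_factory=lambda: ["pdf"])
    ocr_enabled: bool = True
    backend_cors_origins: List[str] = field(default_factory=list)


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    # Read from the given mapping, or from the process environment.
    env = os.environ if env is None else env
    settings = Settings()
    if "DEBUG" in env:
        settings.debug = _parse_bool(env["DEBUG"])
    if "MAX_UPLOAD_SIZE" in env:
        settings.max_upload_size = int(env["MAX_UPLOAD_SIZE"])
    if "ALLOWED_EXTENSIONS" in env:
        settings.allowed_extensions = _parse_list(env["ALLOWED_EXTENSIONS"])
    if "OCR_ENABLED" in env:
        settings.ocr_enabled = _parse_bool(env["OCR_ENABLED"])
    if "BACKEND_CORS_ORIGINS" in env:
        settings.backend_cors_origins = _parse_list(env["BACKEND_CORS_ORIGINS"])
    return settings
```

Passing a plain dictionary instead of relying on `os.environ` keeps the parsing logic easy to exercise in tests.
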
# Using the run script (recommended)
./run.py
# With custom options
./run.py --host 127.0.0.1 --port 8080 --reload --workers 4
# Development mode (alternative)
cd src
python -m app.main
# Production mode (using uvicorn directly)
uvicorn src.app.main:app --host 0.0.0.0 --port 8000

# Build and run the Docker image
docker build -t document-converter .
docker run -p 8000:8000 document-converter

# Start the service
docker-compose up -d
# View logs
docker-compose logs -f
# Stop the service
docker-compose down

The API will be available at http://localhost:8000.
You can configure the Docker container by modifying the environment variables in the docker-compose.yml file:
environment:
  - DEBUG=False
  - MAX_UPLOAD_SIZE=10485760
  - ALLOWED_EXTENSIONS=pdf
  - OCR_ENABLED=True
  - BACKEND_CORS_ORIGINS=

A data volume is also mounted to persist data between container restarts:

volumes:
  - ./data:/app/data

The API documentation is available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
GET /
Response:
{
  "message": "Welcome to the Document Converter API",
  "version": "0.1.0",
  "docs": "/docs",
  "redoc": "/redoc"
}

This endpoint provides basic information about the API, including the current version.
POST /api/v1/pdf/convert
Request:
file: PDF file to convert (multipart/form-data)
Response:
{
  "text": "Extracted text from the PDF...",
  "filename": "example.pdf",
  "page_count": 5,
  "ocr_used": false
}

POST /api/v1/v2/pdf/convert
This endpoint uses the SmolDocling-256M-preview model from Hugging Face for advanced document understanding. It's particularly effective for complex document layouts and can handle a wide variety of document formats.
Request:
file: PDF file to convert (multipart/form-data)
Response:
{
  "text": "Extracted text from the PDF using SmolDocling...",
  "filename": "example.pdf",
  "page_count": 5,
  "ocr_used": false
}

Note: The v2 API requires additional dependencies (PyTorch, Transformers, pdf2image) and may have higher computational requirements due to the machine learning model used.
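
Both conversion endpoints take the PDF as a multipart/form-data field named file and return JSON. As a rough illustration, the request can be built with only the Python standard library. This is a sketch under assumptions, not the bundled example client: the helper names `build_multipart` and `convert_pdf` are hypothetical, and the default URL assumes the service is running locally on port 8000.

```python
import json
import uuid
from pathlib import Path
from typing import Tuple
from urllib import request


def build_multipart(filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one PDF as a multipart/form-data body with a "file" field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(
    path: str,
    url: str = "http://localhost:8000/api/v1/pdf/convert",  # assumed local instance
) -> dict:
    """POST a PDF to the service and return the decoded JSON response."""
    pdf = Path(path)
    body, content_type = build_multipart(pdf.name, pdf.read_bytes())
    req = request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": content_type},
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against a running instance, `convert_pdf("example.pdf")` should return a dictionary with the text, filename, page_count, and ocr_used keys shown in the responses above. Passing the v2 URL as the second argument targets the SmolDocling endpoint instead.
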
An example client script is provided in the examples directory to demonstrate how to use the API programmatically:
# Basic usage
./examples/client_example.py path/to/your/document.pdf
# Save output to a file
./examples/client_example.py path/to/your/document.pdf --output extracted_text.txt
# Use a different API endpoint
./examples/client_example.py path/to/your/document.pdf --api-url http://api.example.com/api/v1/pdf/convert

The example client requires the requests library, which you can install using uv:
# Install requests directly
uv pip install requests
# Or install as an optional dependency group
uv pip install -e ".[examples]"

The Document Converter API uses semantic versioning (SemVer) for version management. The version format is MAJOR.MINOR.PATCH:
- MAJOR version changes indicate incompatible API changes
- MINOR version changes add functionality in a backward-compatible manner
- PATCH version changes make backward-compatible bug fixes
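
To make the MAJOR.MINOR.PATCH rules concrete, a client could compare the version it was built against with the version the server reports. The functions below are a hypothetical sketch, not part of the project. They ignore pre-release and build suffixes, and they do not special-case 0.y.z versions, which SemVer treats as unstable.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split "MAJOR.MINOR.PATCH" into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(client_built_for: str, server: str) -> bool:
    """Return True if a client built for one version can use the server."""
    client = parse_version(client_built_for)
    current = parse_version(server)
    # Same MAJOR, and the server is at least as new as the client expects.
    return client[0] == current[0] and current >= client
```

For example, a client written against 1.2.0 would accept a 1.3.1 server (new backward-compatible features) but reject 2.0.0 (incompatible changes) and 1.1.0 (features it relies on may be missing).
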
You can check the current version of the API in several ways:
- API Root Endpoint: Send a GET request to the root endpoint (/) to see the version in the response.
- OpenAPI Documentation: The version is displayed in the Swagger UI and ReDoc pages.
- Startup Logs: The version is logged when the application starts.
The version is centrally managed in the codebase:
- The canonical version is defined in pyproject.toml
- The version is exposed through the app.core.version module
- All parts of the application reference this central version
pytest

# Format code
black src tests
# Sort imports
isort src tests
# Lint code
ruff check src tests

MIT