99.156% Accuracy on DocVQA Validation Set
This repository is our DocVQA benchmark implementation guide, using Agentic Document Extraction (ADE) with the DPT-2 Parse API.
- Accuracy: 99.156% (5,286/5,331 correct, excluding 18 dataset issues)
- Remaining errors: 45 (the 18 dataset issues are excluded from scoring but remain visible in the gallery)
View Interactive Error Gallery →
The gallery includes:
- 63 success cases with answer-focused visual grounding
- All 45 error cases with detailed analysis
- 18 dataset issues (excluded from accuracy but shown for transparency)
- Interactive category filtering
- `gallery.html` - Interactive visualization of results and remaining errors
- `prompt.md` - Final hybrid prompt (recommended for best results)
- `evaluate.py` - Simple evaluation script
- `images/` - All 1,286 DocVQA validation images
- `parsed/` - Pre-parsed document JSONs (ADE DPT-2 output)
- `results/` - Prediction results
- `extra/` - Alternative prompts, test scripts, analysis reports, and utilities
```bash
# Python 3.8+
python3 --version

# Install dependencies
pip install -r requirements.txt
```

Download the DocVQA validation annotations from HuggingFace:
```bash
# Create data directory if it doesn't exist
mkdir -p data/val

# Download val annotations
wget https://huggingface.co/datasets/lmms-lab/DocVQA/resolve/main/val_v1.0.json -O data/val/val_v1.0.json
```

Alternatively, visit DocVQA on HuggingFace and download manually.
Set your API key:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

Get your API key from the Anthropic Console.
Run the evaluation:

```bash
python3 evaluate.py
```

This will:

- Load the hybrid prompt from `prompt.md`
- Process all 5,349 questions in the validation set
- Use pre-parsed documents from `parsed/`
- Save predictions to `results/predictions.jsonl`
- Report final accuracy
Note: Full evaluation takes ~1-2 hours with Claude Sonnet 4.5.
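For reference, here is a minimal sketch of what such an evaluation loop might look like. The annotation field names follow the standard DocVQA format, and the assumption that `parsed/` is keyed by image basename is ours for illustration; `evaluate.py` is the authoritative implementation:

```python
import json
import os

import anthropic  # pip install anthropic

client = anthropic.Anthropic()      # reads ANTHROPIC_API_KEY from the environment
PROMPT = open("prompt.md").read()   # the final hybrid prompt

def answer_question(question, parsed_doc):
    # Join the parsed chunks' markdown (with <bbox:...> annotations)
    # into a single document context for the model.
    context = "\n\n".join(chunk["markdown"] for chunk in parsed_doc["chunks"])
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # model ID used in this repo
        max_tokens=4096,
        temperature=0.0,
        system=PROMPT,
        messages=[{"role": "user",
                   "content": f"Document:\n{context}\n\nQuestion: {question}"}],
    )
    return message.content[0].text.strip()

annotations = json.load(open("data/val/val_v1.0.json"))["data"]
with open("results/predictions.jsonl", "w") as out:
    for item in annotations:
        # Assumed layout: one parsed JSON per document image, keyed by basename.
        doc_id = os.path.splitext(os.path.basename(item["image"]))[0]
        parsed = json.load(open(f"parsed/{doc_id}.json"))
        record = {
            "question_id": item["questionId"],
            "question": item["question"],
            "answer": item["answers"],
            "pred": answer_question(item["question"], parsed),
        }
        out.write(json.dumps(record) + "\n")
```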
Each JSON in `parsed/` contains ADE DPT-2 parsing output:

```json
{
"chunks": [
{
"id": "chunk_0",
"type": "text",
"markdown": "Text content with <bbox:[x1,y1,x2,y2]> annotations",
"bbox": [x1, y1, x2, y2]
}
]
}
```
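The chunk markdown carries inline grounding markers. If you want plain text, a small sketch for stripping them (the regex is inferred from the example above, not from an ADE specification):

```python
import json
import re

# Matches inline grounding annotations like <bbox:[x1,y1,x2,y2]>
BBOX_PATTERN = re.compile(r"<bbox:\[[^\]]*\]>")

def plain_text(parsed_path):
    """Join all chunk markdown and drop the <bbox:...> markers."""
    doc = json.load(open(parsed_path))
    text = "\n\n".join(chunk["markdown"] for chunk in doc["chunks"])
    return BBOX_PATTERN.sub("", text).strip()
```

Each line of `results/predictions.jsonl` is a JSON object:

```json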
{
"question_id": 12345,
"question": "What is the date?",
"answer": ["January 1, 2020"],
"pred": "January 1, 2020",
"sources": ["chunk_0", "chunk_1"],
"correct": true
}
```

The pipeline has three stages:

- Document Parsing: Use ADE DPT-2 to extract structured content with spatial grounding
- Question Answering: Apply hybrid prompt with Claude Sonnet 4.5
- Evaluation: Exact match scoring (case-insensitive)
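For the evaluation stage, a minimal sketch of case-insensitive exact-match scoring over `results/predictions.jsonl` (the whitespace normalization is an assumption; `evaluate.py` may normalize differently):

```python
import json

def exact_match(pred, answers):
    # Case-insensitive exact match against any ground-truth alias
    p = pred.strip().lower()
    return any(p == a.strip().lower() for a in answers)

with open("results/predictions.jsonl") as f:
    records = [json.loads(line) for line in f]

correct = sum(exact_match(r["pred"], r["answer"]) for r in records)
print(f"Accuracy: {correct / len(records):.3%}")
```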
Our final prompt (`prompt.md`) uses a two-strategy approach:
- Direct Extraction - For simple factual queries (names, dates, numbers)
- Structured Analysis - For complex spatial/hierarchical questions
Key improvements:
- Footer vs content distinction
- Page number handling (Arabic numerals only)
- Logo text extraction
- Handwritten text detection
- Positional reasoning (last line, under, above, etc.)
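To make that structure concrete, here is a hypothetical skeleton of such a hybrid prompt; it is an illustration only, not the contents of `prompt.md`:

```python
# Hypothetical skeleton -- the actual prompt lives in prompt.md.
HYBRID_PROMPT_SKELETON = """\
You answer questions about a parsed document.

Strategy 1 - Direct Extraction (names, dates, numbers):
  Quote the value exactly as it appears in the document.

Strategy 2 - Structured Analysis (spatial/hierarchical questions):
  Reason over chunk order and bounding boxes before answering.

Rules:
- Distinguish footer text from body content.
- Report page numbers as Arabic numerals only.
- Read text embedded in logos.
- Note when the relevant text is handwritten.
- Resolve positional phrases (last line, under, above) against layout.
"""
```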
| Category | Count | % of Errors | Description |
|---|---|---|---|
| Prompt/LLM Misses | 18 | 40.0% | Reasoning or interpretation failures |
| Incorrect Parse | 13 | 28.9% | OCR/parsing errors (character confusion, misreads) |
| Not ADE Focus | 9 | 20.0% | Spatial layout questions outside ADE's core strength |
| Missed Parse | 5 | 11.1% | Information not extracted during parsing |
| Dataset Issues | 18 | — | Questionable ground truth (excluded from count) |
- Incorrect Parse: OCR/parsing mistakes like character confusion (O/0, I/l/1), table misreads, or parsing artifacts
- Prompt/LLM Misses: Claude gives a wrong answer despite having the correct parsed data - a reasoning or instruction-following issue
- Not ADE Focus: Questions requiring visual layout analysis, spatial reasoning, or document structure understanding beyond text extraction
- Missed Parse: Information exists in document but wasn't extracted by the parser
- Dataset Issues: Questionable annotations, ambiguous questions, or debatable ground truth (excluded from accuracy calculation)
Recommended configuration (used for the 99.156% result):

- Model: `claude-sonnet-4-20250514` (Sonnet 4.5)
- Temperature: 0.0
- Max tokens: 4096
To reproduce the result:
- Use the provided `parsed/` documents (same parsing output)
- Use `prompt.md` (final hybrid prompt)
- Use Claude Sonnet 4.5 (`claude-sonnet-4-20250514`)
- Temperature 0.0 (deterministic)
- Exclude 18 dataset issues from accuracy calculation (as documented in gallery)
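As a sanity check, the headline accuracy follows directly from the counts documented above:

```python
total = 5349           # questions in the validation set
dataset_issues = 18    # excluded from scoring
errors = 45            # remaining real errors

scored = total - dataset_issues      # 5331
correct = scored - errors            # 5286
print(f"{correct}/{scored} = {correct / scored:.3%}")  # 5286/5331 = 99.156%
```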