ADE DocVQA Benchmark

99.156% Accuracy on DocVQA Validation Set

This repository contains our DocVQA benchmark implementation using Agentic Document Extraction (ADE) with the DPT-2 Parse API.

🎯 Results

  • Accuracy: 99.156% (5,286/5,331 correct, excluding 18 dataset issues)
  • Remaining errors: 45 real errors (the 18 excluded dataset issues remain visible in the error gallery)

View Interactive Error Gallery →

The gallery includes:

  • 63 success cases with answer-focused visual grounding
  • All 45 error cases with detailed analysis
  • 18 dataset issues (excluded from accuracy but shown for transparency)
  • Interactive category filtering

📁 Repository Contents

Main Files

  • gallery.html - Interactive visualization of results and remaining errors
  • prompt.md - Final hybrid prompt (recommended for best results)
  • evaluate.py - Simple evaluation script
  • images/ - All 1,286 DocVQA validation images
  • parsed/ - Pre-parsed document JSONs (ADE DPT-2 output)
  • results/ - Prediction results

Additional Materials

  • extra/ - Alternative prompts, test scripts, analysis reports, and utilities

🚀 Quick Start

1. Prerequisites

# Python 3.8+
python3 --version

# Install dependencies
pip install -r requirements.txt

2. Get DocVQA Annotations

Download the DocVQA validation annotations from HuggingFace:

# Create data directory if it doesn't exist
mkdir -p data/val

# Download val annotations
wget https://huggingface.co/datasets/lmms-lab/DocVQA/resolve/main/val_v1.0.json -O data/val/val_v1.0.json

Alternatively, visit DocVQA on HuggingFace and download manually.
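To sanity-check the download, you can load the annotations in Python. A minimal sketch, assuming the standard DocVQA layout with a top-level "data" list whose entries include questionId, question, image, and answers:

import json

with open("data/val/val_v1.0.json") as f:
    val = json.load(f)

questions = val["data"]
print(len(questions), "questions")  # expected: 5,349

sample = questions[0]
# Typical fields: questionId, question, image (document path), answers (accepted strings)
print(sample["questionId"], sample["question"], sample["answers"])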

3. Set API Key (for use with Claude)

export ANTHROPIC_API_KEY='your-api-key-here'

Get your API key from Anthropic Console.
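You can confirm the key is picked up before starting a long run; the Anthropic Python SDK reads ANTHROPIC_API_KEY from the environment:

import os
import anthropic

assert os.environ.get("ANTHROPIC_API_KEY"), "ANTHROPIC_API_KEY is not set"
client = anthropic.Anthropic()  # uses the key from the environment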

4. Run Evaluation

python3 evaluate.py

This will:

  • Load the hybrid prompt from prompt.md
  • Process all 5,349 questions in the validation set
  • Use pre-parsed documents from parsed/
  • Save predictions to results/predictions.jsonl
  • Report final accuracy

Note: Full evaluation takes ~1-2 hours with Claude Sonnet 4.5.
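The sketch below illustrates roughly what one pass over a question looks like. It is not a copy of evaluate.py: the helper names, the parsed/ filename convention, and the answer normalization are assumptions based on the data formats documented below, and the sources field and error handling are omitted.

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
system_prompt = open("prompt.md").read()

val = json.load(open("data/val/val_v1.0.json"))["data"]

def doc_context(image_name):
    # Assumption: parsed/ files are named after the source document image
    doc_id = image_name.split("/")[-1].rsplit(".", 1)[0]
    parsed = json.load(open(f"parsed/{doc_id}.json"))
    return "\n".join(chunk["markdown"] for chunk in parsed["chunks"])

def normalize(s):
    # Case-insensitive comparison; whitespace collapsing is an extra assumption
    return " ".join(s.lower().split())

with open("results/predictions.jsonl", "w") as out:
    for item in val:
        context = doc_context(item["image"])
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            temperature=0.0,
            system=system_prompt,
            messages=[{"role": "user", "content": f"{context}\n\nQuestion: {item['question']}"}],
        )
        pred = response.content[0].text.strip()
        correct = any(normalize(pred) == normalize(a) for a in item["answers"])
        out.write(json.dumps({
            "question_id": item["questionId"],
            "question": item["question"],
            "answer": item["answers"],
            "pred": pred,
            "correct": correct,
        }) + "\n")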

📊 Data Format

Parsed Documents (parsed/*.json)

Each JSON contains ADE DPT-2 parsing output:

{
  "chunks": [
    {
      "id": "chunk_0",
      "type": "text",
      "markdown": "Text content with <bbox:[x1,y1,x2,y2]> annotations",
      "bbox": [x1, y1, x2, y2]
    }
  ]
}
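To work with a parsed document directly, iterate over its chunks. A small sketch (the glob over parsed/ is simply a way to pick any document):

import json
from pathlib import Path

path = next(Path("parsed").glob("*.json"))  # pick any pre-parsed document
doc = json.loads(path.read_text())

for chunk in doc["chunks"]:
    # Each chunk carries an id, a type, markdown text with inline <bbox:...> annotations,
    # and a chunk-level bounding box
    print(chunk["id"], chunk["type"], chunk["bbox"])

# Joining the chunk markdown yields the text context handed to the model
context = "\n".join(chunk["markdown"] for chunk in doc["chunks"])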

Predictions (results/predictions.jsonl)

Each line is a JSON object:

{
  "question_id": 12345,
  "question": "What is the date?",
  "answer": ["January 1, 2020"],
  "pred": "January 1, 2020",
  "sources": ["chunk_0", "chunk_1"],
  "correct": true
}
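Since each record carries a correct flag, accuracy can be recomputed directly from this file:

import json

with open("results/predictions.jsonl") as f:
    records = [json.loads(line) for line in f]

correct = sum(r["correct"] for r in records)
print(f"{correct}/{len(records)} = {correct / len(records):.3%}")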

🔍 Methodology

Approach

  1. Document Parsing: Use ADE DPT-2 to extract structured content with spatial grounding
  2. Question Answering: Apply hybrid prompt with Claude Sonnet 4.5
  3. Evaluation: Exact match scoring (case-insensitive)

Hybrid Prompt Strategy

Our final prompt (prompt.md) uses a two-strategy approach:

  1. Direct Extraction - For simple factual queries (names, dates, numbers)
  2. Structured Analysis - For complex spatial/hierarchical questions

Key improvements:

  • Footer vs content distinction
  • Page number handling (Arabic numerals only)
  • Logo text extraction
  • Handwritten text detection
  • Positional reasoning (last line, under, above, etc.)

📈 Performance Breakdown

By Error Category (45 real errors)

Category            Count   % of Errors   Description
Prompt/LLM Misses   18      40.0%         Reasoning or interpretation failures
Incorrect Parse     13      28.9%         OCR/parsing errors (character confusion, misreads)
Not ADE Focus       9       20.0%         Spatial layout questions outside ADE's core strength
Missed Parse        5       11.1%         Information not extracted during parsing
Dataset Issues      18      n/a           Questionable ground truth (excluded from count)

Error Categories Explained

  • Incorrect Parse: OCR/parsing mistakes like character confusion (O/0, I/l/1), table misreads, or parsing artifacts
  • Prompt/LLM Misses: Claude gives the wrong answer despite having the correct parsed data (reasoning or instruction-following issues)
  • Not ADE Focus: Questions requiring visual layout analysis, spatial reasoning, or document structure understanding beyond text extraction
  • Missed Parse: Information exists in document but wasn't extracted by the parser
  • Dataset Issues: Questionable annotations, ambiguous questions, or debatable ground truth (excluded from accuracy calculation)

🛠️ Model Configuration

Recommended (used for 99.156% result):

  • Model: claude-sonnet-4-20250514 (Sonnet 4.5)
  • Temperature: 0.0
  • Max tokens: 4096

🔬 Reproducing Results

To reproduce the result:

  1. Use the provided parsed/ documents (same parsing output)
  2. Use prompt.md (final hybrid prompt)
  3. Use Claude Sonnet 4.5 (claude-sonnet-4-20250514)
  4. Temperature 0.0 (deterministic)
  5. Exclude 18 dataset issues from accuracy calculation (as documented in gallery)
