Commit b050332

Merge pull request NanoNets#24 from NanoNets/dev/markdown
Add image & pdf 2 markdown support
2 parents a742a3b + 48f437a commit b050332

16 files changed: +679 additions, −21 deletions
EXT_README.md — 24 additions, 4 deletions

@@ -52,7 +52,7 @@ python -m docext.app.app
 python -m docext.app.app --model_name "hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct-AWQ" --max_img_size 1024 # `--help` for more options
 ```

-The interface will be available at `http://localhost:7860` with default credentials: (You can change the port by using `--ui_port` flag)
+The interface will be available at `http://localhost:7860` with default credentials: (You can change the port by using `--server_port` flag)

 - Username: `admin`
 - Password: `admin`
@@ -71,6 +71,7 @@ python -m docext.app.app --concurrency_limit 10
 import pandas as pd
 import concurrent.futures
 from gradio_client import Client, handle_file
+from docext.core.file_converters.pdf_converter import PDFConverter


 def dataframe_to_custom_dict(df: pd.DataFrame) -> dict:
@@ -110,6 +111,12 @@ fields_and_tables = dataframe_to_custom_dict(pd.DataFrame([
     {"name": "item_description", "type": "table", "description": "Item/Product description"}
     # add more fields and table columns as needed
 ]))
+# client url can be the local host or the public url like `https://6986bdd23daef6f7eb.gradio.live/`
+CLIENT_URL = "http://localhost:7860"
+
+
+
+## ======= Image Inputs =======

 file_inputs = [
     {
@@ -119,21 +126,34 @@ file_inputs = [
 ]

 ## send single request
-### client url can be the local host or the public url like `https://6986bdd23daef6f7eb.gradio.live/`
 fields_df, tables_df = get_extracted_fields_and_tables(
-    "http://localhost:7860", "admin", "admin", "hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct-AWQ", fields_and_tables, file_inputs
+    CLIENT_URL, "admin", "admin", "hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct-AWQ", fields_and_tables, file_inputs
 )
 print("========Fields:=========")
 print(fields_df)
 print("========Tables:=========")
 print(tables_df)


+## ======= PDF Inputs =======
+
+pdf_converter = PDFConverter()
+document_pages = pdf_converter.convert_and_save_images("assets/invoice_test.pdf")
+file_inputs = [{"image": handle_file(page)} for page in document_pages]
+
+fields_df, tables_df = get_extracted_fields_and_tables(
+    CLIENT_URL, "admin", "admin", "hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct-AWQ", fields_and_tables, file_inputs
+)
+print("========Fields:=========")
+print(fields_df)
+print("========Tables:=========")
+print(tables_df)
+
 ## send multiple requests in parallel
 # Define a wrapper function for parallel execution
 def run_request():
     return get_extracted_fields_and_tables(
-        "http://localhost:7860", "admin", "admin", "hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct-AWQ", fields_and_tables, file_inputs
+        CLIENT_URL, "admin", "admin", "hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct-AWQ", fields_and_tables, file_inputs
     )

 # Use ThreadPoolExecutor to send 10 requests in parallel
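The EXT_README excerpt above is truncated at the ThreadPoolExecutor step. A minimal, self-contained sketch of that fan-out pattern follows; `run_request` here is a stand-in for the README's real function (which calls the docext server via `gradio_client` and returns `(fields_df, tables_df)`), since running it requires a live server:

```python
import concurrent.futures

# Stand-in for run_request() from the README; the real version calls
# get_extracted_fields_and_tables(...) against a running docext server.
def run_request():
    return ("fields_df", "tables_df")

# Send 10 requests in parallel, mirroring the README's ThreadPoolExecutor step.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(run_request) for _ in range(10)]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]

print(len(results))  # 10
```

Swapping the stand-in for the real `run_request` keeps the same structure; `as_completed` yields results in finish order, not submission order.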

PDF2MD_README.md — 145 additions, 0 deletions

@@ -0,0 +1,145 @@
+# PDF to Markdown API Documentation
+
+Convert PDF documents and images to high-quality markdown format using vision-language models.
+
+## Table of Contents
+- [Features](#features)
+- [Getting Started](#getting-started)
+  - [Quickstart](#quickstart)
+  - [Installation](#installation)
+  - [Web Interface](#web-interface)
+  - [API Access](#api-access)
+- [Requirements](#requirements)
+- [Supported Models & Platforms](#supported-models--platforms)
+  - [Models with vLLM (Linux)](#models-with-vllm-linux)
+
+## Features
+
+- **LaTeX Equation Recognition**: Convert both inline and block LaTeX equations in images to markdown.
+- **Intelligent Image Description**: Generate a detailed description of every image in the document within `<img></img>` tags.
+- **Signature Detection**: Detect and mark signatures in the document. Signature text is extracted within `<signature></signature>` tags.
+- **Watermark Detection**: Detect and mark watermarks in the document. Watermark text is extracted within `<watermark></watermark>` tags.
+- **Page Number Detection**: Detect and mark page numbers in the document. Page numbers are extracted within `<page_number></page_number>` tags.
+- **Checkboxes and Radio Buttons**: Convert form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒).
+- **Table Detection**: Convert complex tables into HTML tables.
+
+## Getting Started
+### Quickstart
+- [Colab notebook for on-prem deployment](https://colab.research.google.com/drive/1uKO70sctH8G59yYH_rLW6CPK4Vj2YmI6?usp=sharing)
+
+### Installation
+```bash
+# create a virtual environment
+## install uv if not installed
+curl -LsSf https://astral.sh/uv/install.sh | sh
+## create a virtual environment with python 3.11
+uv venv --python=3.11
+source .venv/bin/activate
+
+# Install from PyPI
+uv pip install docext
+
+# Or install from source
+git clone https://github.com/nanonets/docext.git
+cd docext
+uv pip install -e .
+```
+
+### Web Interface
+
+docext includes a Gradio-based web interface for easy document processing:
+
+```bash
+# Start the web interface with default configs
+python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s
+
+# Start the web interface with custom configs
+python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s --max_img_size 1024 --concurrency_limit 16 # `--help` for more options
+```
+
+The interface will be available at `http://localhost:7860` with default credentials (you can change the port with the `--server_port` flag):
+
+- Username: `admin`
+- Password: `admin`
+
+Check the [Supported Models & Platforms](#supported-models--platforms) section for more model options.
+
+### API Access
+
+```python
+import time
+from gradio_client import Client, handle_file
+
+def convert_pdf_to_markdown(
+    client_url: str,
+    username: str,
+    password: str,
+    file_paths: list[str],
+    model_name: str = "hosted_vllm/nanonets/Nanonets-OCR-s"
+):
+    """
+    Convert PDF/images to markdown using the API
+
+    Args:
+        client_url: URL of the docext server
+        username: Authentication username
+        password: Authentication password
+        file_paths: List of file paths to convert
+        model_name: Model to use for conversion
+
+    Returns:
+        str: Converted markdown content
+    """
+    client = Client(client_url, auth=(username, password))
+
+    # Prepare file inputs
+    file_inputs = [{"image": handle_file(file_path)} for file_path in file_paths]
+
+    # Convert to markdown (non-streaming)
+    result = client.predict(
+        images=file_inputs,
+        api_name="/process_markdown_streaming"
+    )
+
+    return result
+
+# Example usage
+# client url can be the local host or the public url like `https://6986bdd23daef6f7eb.gradio.live/`
+CLIENT_URL = "http://localhost:7860"
+
+# Single image conversion
+markdown_content = convert_pdf_to_markdown(
+    CLIENT_URL,
+    "admin",
+    "admin",
+    ["assets/invoice_test.pdf"]
+)
+print(markdown_content)
+
+# Multiple files conversion
+markdown_content = convert_pdf_to_markdown(
+    CLIENT_URL,
+    "admin",
+    "admin",
+    ["assets/invoice_test.jpeg", "assets/invoice_test.pdf"]
+)
+print(markdown_content)
+```
+
+## Requirements
+
+- Python 3.11+
+- CUDA-compatible GPU (for optimal performance). Use Google Colab for free GPU access.
+- Dependencies listed in requirements.txt
+
+## Supported Models & Platforms
+### Models with vLLM (Linux)
+
+We recommend the `hosted_vllm/nanonets/Nanonets-OCR-s` model for best performance; it is trained to do OCR with semantic tagging. You can also use any other VLM supported by vLLM. Since Nanonets-OCR-s is a 3B model, it can run on GPUs with small VRAM.
+
+Examples:
+| Model | `--model_name` |
+|-------|--------------|
+| Nanonets-OCR-s | `hosted_vllm/nanonets/Nanonets-OCR-s` |
+| Qwen/Qwen2.5-VL-7B-Instruct-AWQ | `hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct-AWQ` |
+| Qwen/Qwen2.5-VL-7B-Instruct | `hosted_vllm/Qwen/Qwen2.5-VL-7B-Instruct` |
+| Qwen/Qwen2.5-VL-32B-Instruct | `hosted_vllm/Qwen/Qwen2.5-VL-32B-Instruct` |

README.md — 29 additions, 4 deletions

@@ -18,18 +18,42 @@
 </a>
 </p>

-![Demo Docext](https://raw.githubusercontent.com/NanoNets/docext/main/assets/demo.jpg)
+<!-- ![Demo Docext](https://raw.githubusercontent.com/NanoNets/docext/main/assets/pdf2markdown.jpg) -->
+![Demo Docext](assets/pdf2markdown.png)


+## New Model Release: Nanonets-OCR-s
+
+**We're excited to announce the release of Nanonets-OCR-s, a compact 3B-parameter model trained for efficient image-to-markdown conversion with semantic understanding of images, signatures, watermarks, and more!**
+
+📢 [Read the full announcement](https://nanonets.com/research/nanonets-ocr-s) | 🤗 [Hugging Face model](https://huggingface.co/nanonets/Nanonets-OCR-s)

 ## Overview

-docext is an OCR-free tool for extracting structured information from documents such as invoices, passports, and other documents. It leverages vision-language models (VLMs) to accurately identify and extract both field data and tabular information from document images.
+docext is a comprehensive on-premises document intelligence toolkit powered by vision-language models (VLMs). It provides three core capabilities:
+
+**📄 PDF & Image to Markdown Conversion**: Transform documents into structured markdown with intelligent content recognition, including LaTeX equations, signatures, watermarks, tables, and semantic tagging.

-The [Intelligent Document Processing Leaderboard](https://idp-leaderboard.org/) tracks and evaluates performance vision-language models across OCR, Key Information Extraction (KIE), document classification, table extraction, and other intelligent document processing tasks.
+**🔍 Document Information Extraction**: OCR-free extraction of structured information (fields, tables, etc.) from documents such as invoices, passports, and other document types, with confidence scoring.
+
+**📊 Intelligent Document Processing Leaderboard**: A comprehensive benchmarking platform that tracks and evaluates vision-language model performance across OCR, Key Information Extraction (KIE), document classification, table extraction, and other intelligent document processing tasks.


 ## Features
+### PDF and Image to Markdown
+Convert both PDFs and images to markdown with content recognition and semantic tagging.
+- **LaTeX Equation Recognition**: Convert both inline and block LaTeX equations in images to markdown.
+- **Intelligent Image Description**: Generate a detailed description of every image in the document within `<img></img>` tags.
+- **Signature Detection**: Detect and mark signatures in the document. Signature text is extracted within `<signature></signature>` tags.
+- **Watermark Detection**: Detect and mark watermarks in the document. Watermark text is extracted within `<watermark></watermark>` tags.
+- **Page Number Detection**: Detect and mark page numbers in the document. Page numbers are extracted within `<page_number></page_number>` tags.
+- **Checkboxes and Radio Buttons**: Convert form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒).
+- **Table Detection**: Convert complex tables into HTML tables.
+
+🔍 For in-depth information, see the [release blog](https://github.com/NanoNets/docext/tree/main/docext/benchmark).
+
+For setup instructions and additional details, check out the full feature guide for the [pdf to markdown](https://github.com/NanoNets/docext/blob/main/PDF2MD_README.md).
+
 ### Intelligent Document Processing Leaderboard
 This benchmark evaluates performance across seven key document intelligence challenges:

@@ -64,13 +88,14 @@ For more details (Installation, Usage, and so on), please check out the [feature
 ## Change Log

 ### Latest Updates
+- **12-06-2025** – Added PDF and image to markdown support.
 - **06-06-2025** – Added `gemini-2.5-pro-preview-06-05` evaluation metrics to the leaderboard.
 - **04-06-2025** – Added support for PDF and multiple documents in `docext` extraction.
-- **23-05-2025** – Added `gemini-2.5-pro-preview-03-25`, `claude-sonnet-4` evaluation metrics to the leaderboard.

 <details>
 <summary>Older Changes</summary>

+- **23-05-2025** – Added `gemini-2.5-pro-preview-03-25`, `claude-sonnet-4` evaluation metrics to the leaderboard.
 - **17-05-2025** – Added `InternVL3-38B-Instruct`, `qwen2.5-vl-32b-instruct` evaluation metrics to the leaderboard.
 - **16-05-2025** – Added `gemma-3-27b-it` evaluation metrics to the leaderboard.
 - **12-05-2025** – Added `Claude 3.7 sonnet`, `mistral-medium-3` evaluation metrics to the leaderboard.

Troubleshooting.md — 3 additions, 0 deletions

@@ -13,3 +13,6 @@
 ### 3. `RuntimeError: Failed to infer device type`
 - This error occurs when CUDA drivers are not installed, affecting vLLM.
 - Follow the troubleshooting guide [here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#failed-to-infer-device-type).
+
+### 4. `ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the dtype flag in CLI, for example: --dtype=half.`
+- Use `--dtype=float16` instead of `--dtype=bfloat16`.
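The error above is driven by the GPU's compute capability: bfloat16 requires 8.0+ (Ampere or newer), while the Tesla T4 is 7.5. An illustrative sketch of that check, assuming PyTorch-style `(major, minor)` capability tuples:

```python
# Illustrative sketch: bfloat16 needs compute capability >= 8.0; older GPUs
# like the Tesla T4 (7.5) must fall back to float16.
def pick_dtype(major: int, minor: int) -> str:
    return "bfloat16" if (major, minor) >= (8, 0) else "float16"

print(pick_dtype(7, 5))  # Tesla T4 -> float16
print(pick_dtype(8, 0))  # A100 -> bfloat16
```

With PyTorch available, the real tuple comes from `torch.cuda.get_device_capability()`.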

assets/invoice_test.pdf — 57.9 KB (binary file not shown)

assets/pdf2markdown.png — 2.81 MB (binary file not shown)
