Fix page_range stopping at page 32 when start >= 30 #2658

Copilot · 2025-11-19T13:10:27Z

The page_range parameter stops prematurely at page 32 when the range starts from page 30 or higher. For example, page_range=(30, 35) extracts only pages 30-32 instead of 30-35.

Root Cause

The drain loop uses a hardcoded batch_size = 32 to pull processed pages from the output queue. This creates an effective ceiling at page_no 31 (page 32 in 1-indexed terms) when combined with the page range filtering logic.
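As an illustration only, the toy model below shows one way a fixed batch size can combine with range filtering to produce the reported symptoms. It is not docling's code: `drain_pages` and its early-exit rule (a batch that yields fewer than `batch_size` pages is read as end of input) are assumptions made for this sketch.

```python
def drain_pages(total_pages: int, page_range: tuple[int, int], batch_size: int) -> list[int]:
    """Toy drain loop: return the 0-indexed page_no values that survive the range filter."""
    start, end = page_range  # 1-indexed, inclusive
    collected: list[int] = []
    for batch_start in range(0, total_pages, batch_size):
        batch = range(batch_start, min(batch_start + batch_size, total_pages))
        # Keep only pages inside the requested range.
        kept = [page_no for page_no in batch if start <= page_no + 1 <= end]
        collected.extend(kept)
        # Assumed early exit: a short batch is treated as "nothing left to drain".
        if len(kept) < batch_size:
            break
    return collected


truncated = drain_pages(total_pages=50, page_range=(30, 35), batch_size=32)
full_range = drain_pages(total_pages=50, page_range=(1, 45), batch_size=32)
fixed = drain_pages(total_pages=50, page_range=(30, 35), batch_size=50)
```

In this model, `batch_size=32` with `page_range=(30, 35)` keeps only page_no 29 to 31, while `(1, 45)` fills the first batch and continues, which matches the pattern in the issue. Setting `batch_size` to `total_pages` puts every page in a single batch, which is the shape of the fix in this PR.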

Changes

Core fix:

Changed batch_size: int = 32 to batch_size: int = total_pages in the drain loops
- docling/pipeline/standard_pdf_pipeline.py:548
- docling/experimental/pipeline/threaded_layout_vlm_pipeline.py:255

Tests:

Added tests/test_page_range_bug.py with coverage for:
- Ranges starting at page 30+
- Ranges entirely beyond page 32
- Single-page extraction at boundaries (pages 32, 33)
- Range crossing the page 32 boundary

# Previously failed - only got pages 30-32
result = converter.convert(pdf, page_range=(30, 35))
assert len(result.pages) == 6  # Now correctly extracts all 6 pages

# Verify page_no values are correct (0-indexed)
assert [p.page_no for p in result.pages] == [29, 30, 31, 32, 33, 34]
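
The index arithmetic behind these assertions can be sketched as follows, assuming `page_range` is 1-indexed and inclusive while `page_no` is 0-indexed, as the comments above indicate. `expected_page_nos` is a hypothetical helper, not part of docling or of the test file.

```python
def expected_page_nos(page_range: tuple[int, int]) -> list[int]:
    """Map a 1-indexed, inclusive page_range to the matching 0-indexed page_no values."""
    start, end = page_range
    return list(range(start - 1, end))


print(expected_page_nos((30, 35)))
```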

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

astral.sh
- Triggering command: curl -LsSf REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

- Configure Actions setup steps to set up my environment, which run before the firewall is enabled
- Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

This section details the original issue you should resolve.

<issue_title>page_range parameter stops prematurely at page 32 when starting from page 30+</issue_title>
<issue_description>### Bug
The page_range parameter in DocumentConverter.convert() does not properly extract the full requested range when the range spans into the 30s-40s region. The conversion prematurely stops at page 32 instead of continuing to the specified end page.

Pattern observed:

- page_range=(1, 45) → Works correctly, extracts pages 1-45
- page_range=(30, 35) → Stops at page 32, extracts pages 30-32 instead of 30-35
- page_range=(30, 45) → Stops at page 32, extracts pages 30-32 instead of 30-45

This suggests the issue occurs when:

- Start page is 30 or higher, AND
- The range is requested to go beyond page 32

Or it may be related to the number of pages.

I suspected a sys.maxsize issue, but that does not appear to be the cause.

The conversion appears to stop at the end of page 32, regardless of the requested end page.

Steps to reproduce

Use a PDF with at least 50 pages

Test these page ranges:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Works fine
result1 = converter.convert("test.pdf", page_range=(1, 45)).document.export_to_markdown()
print(result1)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result2 = converter.convert("test.pdf", page_range=(30, 35)).document.export_to_markdown()
print(result2)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result3 = converter.convert("test.pdf", page_range=(30, 45)).document.export_to_markdown()
print(result3)
# Manually checking the final lines' content in the PDF file

Environment

Python 3.11.13
UV for package management
docling=2.56.1

Don't hesitate to ask for extra information/context :)</issue_description>

<agent_instructions>

You're a senior software engineer who prefers code readability and simplicity over clever solutions. You want to help me write the highest quality solution, so you're not afraid of contradicting me or suggesting I change my approach when that's the best way to help me.

Be concise, but answer questions fully. Avoid preambles or warnings and give answers directly.

Keep responses unique and free of repetition.

Avoid any language constructs that could be interpreted as expressing remorse, apology, regret or sugarcoating. This includes any phrases containing words like 'sorry', 'apologies', 'regret', 'you're absolutely right.' etc.

Code formatting must be in black style, with 120 char max line width

Adhere to widely popular PEP standards for coding guidelines

Code must never speculate or check on existence of fields, attributes, methods or arguments by introspection (hasattr, if "fieldname" in dict and related). Either use type annotations or tools to query API documentation to figure out proper usage. If context is missing to create confident code, ask for input.

Always apply proper type annotation on classes, method arguments, return types. Make sure it can pass mypy checks. Wherever needed, create data models with fields, do not use raw python dictionaries.

Use Optional[...] annotation only where appropriate. Guide mypy type inference with if isinstance(..., type) or assert isinstance(..., type). Move such assertions to the outermost scope possible to ensure it is not repetitive

If it would be useful for a data model to be serializable in any potential context, create pydantic v2 types. For transient data structures, plain python data classes are acceptable.

Path and file operations should be using pathlib, and Path type. Avoid using os.path functions.

Any tool call to run a python-based command or CLI (e.g. pytest) must be prefixed with uv run

You must not write code that checks if a dependency is available by writing code that tries to import it, and otherwise fallback to a different solution. Regard the added dependency as mandatory.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes page_range parameter stops prematurely at page 32 when starting from page 30+ #2655


mergify · 2025-11-19T13:11:01Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Enforce conventional commit

This rule is failing.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
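
To see why the rule fails, the title pattern can be checked locally. This is a minimal sketch that assumes Python's `re.match` behaves like Mergify's `~=` operator for this anchored pattern. `is_conventional` is a hypothetical helper and `candidate_title` is only one possible rewording, not the title the maintainers chose.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


current_title = "Fix page_range stopping at page 32 when start >= 30"
candidate_title = "fix: page_range stopping at page 32 when start >= 30"  # hypothetical rewording

print(is_conventional(current_title), is_conventional(candidate_title))
```

The current title does not match because the pattern requires a lowercase type followed by a colon.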

Co-authored-by: cau-git <[email protected]>

Initial plan

18f705b

Copilot AI assigned Copilot and cau-git Nov 19, 2025

Copilot started work on behalf of cau-git November 19, 2025 13:12

Copilot AI and others added 3 commits November 19, 2025 13:25

Fix page_range stopping at page 32 by using dynamic batch_size

58fc6cc

Co-authored-by: cau-git <[email protected]>

Add comprehensive tests for page_range bug fix

0788e71

Co-authored-by: cau-git <[email protected]>

Apply ruff formatting to test file

6fe6aae

Co-authored-by: cau-git <[email protected]>

Copilot AI changed the title ~~[WIP] Fix page_range parameter to extract full range~~ Fix page_range stopping at page 32 when start >= 30 Nov 19, 2025

Copilot finished work on behalf of cau-git November 21, 2025 15:00
