Copilot AI commented Nov 19, 2025

The page_range parameter stops prematurely at page 32 when the range starts from page 30 or higher. For example, page_range=(30, 35) extracts only pages 30-32 instead of 30-35.

Root Cause

The drain loop uses a hardcoded batch_size = 32 to pull processed pages from the output queue. This creates an effective ceiling at page_no 31 (page 32 in 1-indexed terms) when combined with the page range filtering logic.

Changes

Core fix:

  • Changed batch_size: int = 32 to batch_size: int = total_pages in the drain loops
    • docling/pipeline/standard_pdf_pipeline.py:548
    • docling/experimental/pipeline/threaded_layout_vlm_pipeline.py:255

Tests:

  • Added tests/test_page_range_bug.py with coverage for:
    • Ranges starting at page 30+
    • Ranges entirely beyond page 32
    • Single-page extraction at boundaries (pages 32, 33)
    • Range crossing the page 32 boundary
# Previously failed - only got pages 30-32
result = converter.convert(pdf, page_range=(30, 35))
assert len(result.pages) == 6  # Now correctly extracts all 6 pages

# Verify page_no values are correct (0-indexed)
assert [p.page_no for p in result.pages] == [29, 30, 31, 32, 33, 34]
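The assertions above rely on the mapping from a 1-indexed, inclusive page_range to 0-indexed page_no values. A minimal sketch of that mapping follows. The helper name expected_page_nos and the concrete ranges are illustrative assumptions, not code from tests/test_page_range_bug.py.

```python
def expected_page_nos(page_range: tuple[int, int]) -> list[int]:
    """Map a 1-indexed, inclusive page_range to the 0-indexed page_no values it should yield."""
    start, end = page_range
    if start < 1 or end < start:
        raise ValueError(f"invalid page_range: {page_range}")
    return list(range(start - 1, end))


# Hypothetical ranges mirroring the scenarios listed above: a start at page 30+,
# a range entirely beyond page 32, single pages at the boundary, and a crossing range.
CASES: list[tuple[int, int]] = [(30, 35), (33, 40), (32, 32), (33, 33), (31, 34)]

for case in CASES:
    print(case, expected_page_nos(case))
```

In a real test, the list returned by this helper would be compared against [p.page_no for p in result.pages].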

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • astral.sh
    • Triggering command: curl -LsSf REDACTED (dns block)

Original prompt

This section details the original issue you should resolve

<issue_title>page_range parameter stops prematurely at page 32 when starting from page 30+</issue_title>
<issue_description>### Bug
The page_range parameter in DocumentConverter.convert() does not properly extract the full requested range when the range spans into the 30s-40s region. The conversion prematurely stops at page 32 instead of continuing to the specified end page.

Pattern observed:

  • page_range=(1, 45) → Works correctly, extracts pages 1-45
  • page_range=(30, 35) → Stops at page 32, extracts pages 30-32 instead of 30-35
  • page_range=(30, 45) → Stops at page 32, extracts pages 30-32 instead of 30-45

This suggests the issue occurs when:

  1. Start page is 30 or higher, AND
  2. The range is requested to go beyond page 32
  3. Alternatively, it may be related to the number of pages
  4. I suspected a sys.maxsize issue, but that does not appear to be the cause

The conversion appears to stop at the end of page 32, regardless of the requested end page.

Steps to reproduce

  1. Use a PDF with at least 50 pages
  2. Test these page ranges:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Works fine
result1 = converter.convert("test.pdf", page_range=(1, 45)).document.export_to_markdown()
print(result1)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result2 = converter.convert("test.pdf", page_range=(30, 35)).document.export_to_markdown()
print(result2)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result3 = converter.convert("test.pdf", page_range=(30, 45)).document.export_to_markdown()
print(result3)
# Manually checking the final lines' content in the PDF file
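Instead of manually inspecting the exported Markdown, the gap can be reported programmatically. The sketch below is a hypothetical helper (missing_pages is not part of docling). It assumes the conversion result exposes a pages list whose page_no values are 0-indexed, as the PR description's assertions indicate.

```python
def missing_pages(page_range: tuple[int, int], extracted_page_nos: list[int]) -> list[int]:
    """Return the 1-indexed pages requested by page_range that are absent from the extraction."""
    start, end = page_range
    # Convert the 0-indexed page_no values to 1-indexed page numbers.
    extracted = {page_no + 1 for page_no in extracted_page_nos}
    return [page for page in range(start, end + 1) if page not in extracted]


# Symptom reported above: page_range=(30, 35) yields only pages 30-32.
print(missing_pages((30, 35), [29, 30, 31]))  # [33, 34, 35]

# Assumed usage with a real conversion result:
#   result = converter.convert("test.pdf", page_range=(30, 35))
#   missing_pages((30, 35), [p.page_no for p in result.pages])
```

An empty list means every requested page was extracted.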

Environment

Python 3.11.13
UV for package management
docling=2.56.1

Don't hesitate to ask for extra information/context :)</issue_description>

<agent_instructions>

  • You're a senior software engineer that prefers code readability and simplicity over clever solutions. You want to help me write the highest quality solution, so you're not afraid of contradicting me or suggesting I change my approach when that's the best way to help me.

  • Be concise, but answer questions fully. Avoid preambles or warnings and give answers directly.

  • Keep responses unique and free of repetition.

  • Avoid any language constructs that could be interpreted as expressing remorse, apology, regret or sugarcoating. This includes any phrases containing words like 'sorry', 'apologies', 'regret', 'you're absolutely right.' etc.

  • Code formatting must be in black style, with 120 char max line width

  • Adhere to widely popular PEP standards for coding guidelines

  • Code must never speculate or check on existence of fields, attributes, methods or arguments by introspection (hasattr, if "fieldname" in dict and related). Either use type annotations or tools to query API documentation to figure out proper usage. If context is missing to create confident code, ask for input.

  • Always apply proper type annotation on classes, method arguments, return types. Make sure it can pass mypy checks. Wherever needed, create data models with fields, do not use raw python dictionaries.

  • Use Optional[...] annotation only where appropriate. Guide mypy type inference with if isinstance(..., type) or assert isinstance(..., type). Move such assertions to the outermost scope possible to ensure it is not repetitive

  • If it would be useful for a data model to be serializable in any potential context, create pydantic v2 types. For transient data structures, plain python data classes are acceptable.

  • Path and file operations should be using pathlib, and Path type. Avoid using os.path functions.

  • Any tool call to run a python-based command or CLI (e.g. pytest) must be prefixed with uv run

  • You must not write code that checks if a dependency is available by writing code that tries to import it, and otherwise fallback to a different solution. Regard the added dependency as mandatory.</agent_instructions>

Comments on the Issue (you are @copilot in this section)


mergify bot commented Nov 19, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Enforce conventional commit

This rule is failing.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
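A title can be checked against this pattern locally before pushing. The following is a minimal sketch using Python's re module with the regular expression copied from the rule above. The function name is_conventional and the "pipeline" scope in the usage lines are illustrative assumptions.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# Capitalized type and no colon, so the rule fails.
print(is_conventional("Fix page_range stopping at page 32 when start >= 30"))  # False
# Lowercase type followed by a colon satisfies the rule.
print(is_conventional("fix: page_range stopping at page 32 when start >= 30"))  # True
# An optional scope and breaking-change marker are also accepted.
print(is_conventional("fix(pipeline)!: page_range stopping at page 32"))  # True
```

The pattern only constrains the prefix, so any text may follow the colon.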

Copilot AI changed the title from "[WIP] Fix page_range parameter to extract full range" to "Fix page_range stopping at page 32 when start >= 30" on Nov 19, 2025
Copilot finished work on behalf of cau-git November 21, 2025 15:00
Development

Successfully merging this pull request may close these issues.

page_range parameter stops prematurely at page 32 when starting from page 30+
