Duplicated Information When Extracting Tables Inside Table Cells #276

Closed as not planned
@ricdurvin

Description

Bug

When extracting tables from the attached PDF (table_inside_cell.pdf) with pymupdf4llm, content is duplicated in the output (pymupdf4llm-table_inside_cell.md). Specifically, when a table is nested inside a cell of another table, the nested table's content appears multiple times in the extracted Markdown.
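
To check whether a page really contains a table inside another table, one option is to compare table bounding boxes. The sketch below is only an illustration: `contains` and `nested_pairs` are hypothetical helpers, the boxes are plain `(x0, y0, x1, y1)` tuples, and the 1-point tolerance is an arbitrary assumption.

```python
# Hypothetical helpers, not part of pymupdf4llm or PyMuPDF.
Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(outer: Rect, inner: Rect, tol: float = 1.0) -> bool:
    """Return True if `inner` lies inside `outer`, allowing `tol` points of slack."""
    return (
        inner[0] >= outer[0] - tol
        and inner[1] >= outer[1] - tol
        and inner[2] <= outer[2] + tol
        and inner[3] <= outer[3] + tol
    )


def nested_pairs(boxes: list[Rect]) -> list[tuple[int, int]]:
    """Return (outer_index, inner_index) pairs where one box is nested in another."""
    pairs = []
    for i, outer in enumerate(boxes):
        for j, inner in enumerate(boxes):
            # Require strict nesting so near-identical boxes are not reported twice.
            if i != j and contains(outer, inner) and not contains(inner, outer):
                pairs.append((i, j))
    return pairs
```

With PyMuPDF, the boxes could be collected as `[t.bbox for t in page.find_tables().tables]`. Whether the nested table in table_inside_cell.pdf is actually detected as a separate table is an assumption that has not been verified here.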

Steps to reproduce

  1. Run pymupdf4llm on the attached table_inside_cell.pdf.
  2. Review the output.
  3. Observe that the content from the nested table is duplicated in the output.

import pymupdf4llm


def extract_with_pymupdf4llm(file_name):
    # Convert the whole PDF to Markdown text.
    text = pymupdf4llm.to_markdown(file_name)
    return text


if __name__ == "__main__":
    file = "PDF_PATH"  # path to table_inside_cell.pdf
    text = extract_with_pymupdf4llm(file)
    print(text)
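
Rather than checking step 3 by eye, the duplication can be counted roughly in the Markdown output. This is a minimal sketch, not part of pymupdf4llm: `find_repeated_cells` is a hypothetical helper, and it assumes table rows are emitted as pipe-delimited Markdown lines. Values that legitimately repeat in a table will also be reported, so treat the result as a hint only.

```python
from collections import Counter


def find_repeated_cells(markdown: str, min_len: int = 4) -> dict[str, int]:
    """Return table-cell texts that occur more than once in the Markdown."""
    counts = Counter()
    for line in markdown.splitlines():
        stripped = line.strip()
        # Only look at Markdown table rows, which start with a pipe.
        if not stripped.startswith("|"):
            continue
        for cell in stripped.strip("|").split("|"):
            cell = cell.strip()
            # Skip empty or very short cells and separator cells such as "---".
            if len(cell) < min_len or set(cell) <= set("-: "):
                continue
            counts[cell] += 1
    return {cell: n for cell, n in counts.items() if n > 1}
```

For example, `find_repeated_cells(text)` can be called on the string returned by `extract_with_pymupdf4llm`; a non-empty result lists the cell texts that appear more than once.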

pymupdf4llm-table_inside_cell.md
table_inside_cell.pdf

Metadata

Labels: wontfix (This will not be worked on)
