Exception reading PDF files in PagePdfDocumentReader #3054

Closed
GrantGochnauer opened this issue May 8, 2025 · 11 comments · Fixed by #3271
@GrantGochnauer

Bug description
Parsing a publicly available PDF file (https://www.novo-pi.com/ozempic.pdf) results in an exception:


Failed to ingest PDF file ozempic-pi.pdf
java.lang.RuntimeException: Failed to ingest PDF file ozempic-pi.pdf
	at com.vodori.platform.ai.advisor.service.DocumentIngestionService.ingestSupportingDocuments(DocumentIngestionService.java:103)
	at com.vodori.platform.ai.advisor.DocumentIngestionTest.setUp(DocumentIngestionTest.java:56)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
Caused by: java.lang.StringIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
	at java.base/java.lang.String.checkIndex(String.java:4832)
	at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:46)
	at java.base/java.lang.String.charAt(String.java:1555)
	at org.springframework.ai.reader.pdf.layout.CharacterFactory.getCharacterFromTextPosition(CharacterFactory.java:97)
	at org.springframework.ai.reader.pdf.layout.CharacterFactory.createCharacterFromTextPosition(CharacterFactory.java:46)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writeLine(ForkPDFLayoutTextStripper.java:114)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writeTextPositionList(ForkPDFLayoutTextStripper.java:148)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.iterateThroughTextList(ForkPDFLayoutTextStripper.java:136)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writePage(ForkPDFLayoutTextStripper.java:85)
	at org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea.writePage(PDFLayoutTextStripperByArea.java:150)
	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.processPage(ForkPDFLayoutTextStripper.java:68)
	at org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea.extractRegions(PDFLayoutTextStripperByArea.java:123)
	at org.springframework.ai.reader.pdf.PagePdfDocumentReader.get(PagePdfDocumentReader.java:141)
	at org.springframework.ai.reader.pdf.PagePdfDocumentReader.get(PagePdfDocumentReader.java:48)
	at org.springframework.ai.document.DocumentReader.read(DocumentReader.java:25)
	at com.vodori.platform.ai.advisor.service.DocumentIngestionService.ingestSupportingDocuments(DocumentIngestionService.java:79)
	... 3 more

Environment

  • Spring AI 1.0.0-M8
  • Spring Boot 3.4.5
  • macOS
  • Java 21

Steps to reproduce
Pass the PDF linked above to the following call (which I pulled from the Spring AI docs):

List<Document> pages = new PagePdfDocumentReader(resource,
        PdfDocumentReaderConfig.builder()
            .withPageTopMargin(0)
            .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                .withNumberOfTopTextLinesToDelete(0)
                .build())
            .withPagesPerDocument(1)
            .build())
    .read();

Expected behavior
List of pages.

@GrantGochnauer
Author

OpenAI's Codex suggested modifying:

src/main/java/org/springframework/ai/reader/pdf/layout/CharacterFactory.java

"Override of CharacterFactory to guard against empty unicode strings."

@sunyuhan1998
Contributor

sunyuhan1998 commented May 9, 2025

I followed the reproduction code you provided, but I could not reproduce the issue in my environment. Below is my code and the log obtained from executing it.

Code:
List<Document> pages = new PagePdfDocumentReader("https://www.novo-pi.com/ozempic.pdf",
        PdfDocumentReaderConfig.builder()
            .withPageTopMargin(0)
            .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                .withNumberOfTopTextLinesToDelete(0)
                .build())
            .withPagesPerDocument(1)
            .build())
    .read();

Logs:

11:31:00.277 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 1
11:31:00.335 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 2
11:31:00.371 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 3
11:31:00.397 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 4
11:31:00.425 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 5
11:31:00.445 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 6
11:31:00.464 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 7
11:31:00.487 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 8
11:31:00.502 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 9
11:31:00.514 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 10
11:31:00.572 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.572 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.573 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.573 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.573 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.675 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing 16 pages

@GrantGochnauer
Author

GrantGochnauer commented May 9, 2025

Interesting, I made a separate test case and used your line of code but still get the same issue, even when I pull directly from the web URL. Are you testing against 1.0.0-M8 or a different nightly build?

This is my Gradle build file:

plugins {
    id 'java'
    id 'org.springframework.boot' version '3.4.5'
    id 'io.spring.dependency-management' version '1.1.7'
}

group = 'com.vodori.platform.ai'
version = '0.0.1-SNAPSHOT'

java {
    toolchain {
        languageVersion = JavaLanguageVersion.of(21)
    }
}

configurations {
    compileOnly {
        extendsFrom annotationProcessor
    }
}

repositories {
    mavenCentral()
}

ext {
    set('springAiVersion', "1.0.0-M8")
    set('testcontainersVersion', "1.21.0")
}

dependencies {

    //Spring Boot
    implementation 'org.springframework.boot:spring-boot-starter-security'
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.springframework.boot:spring-boot-starter-webflux'
    implementation 'org.springframework.boot:spring-boot-starter-data-neo4j'
    implementation 'org.springframework.boot:spring-boot-starter-graphql'

    // Spring AI
    implementation 'org.springframework.ai:spring-ai-starter-model-openai'
    implementation 'org.springframework.ai:spring-ai-starter-vector-store-neo4j'

    //Spring AI MCP
    implementation 'org.springframework.ai:spring-ai-starter-mcp-client-webflux'
    implementation 'org.springframework.ai:spring-ai-starter-mcp-server-webflux'
    implementation 'org.springframework.ai:spring-ai-starter-mcp-client'
    implementation 'org.springframework.ai:spring-ai-starter-mcp-server'

    //Spring AI Document Readers
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
    implementation 'org.springframework.ai:spring-ai-markdown-document-reader'


    compileOnly 'org.projectlombok:lombok'
    developmentOnly 'org.springframework.boot:spring-boot-devtools'
    annotationProcessor 'org.projectlombok:lombok'

    //Tests
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
    testImplementation 'org.springframework.security:spring-security-test'
    testImplementation("org.wiremock:wiremock-standalone:3.4.2")
    testRuntimeOnly 'org.junit.platform:junit-platform-launcher'
    testImplementation "org.testcontainers:junit-jupiter:${testcontainersVersion}"
    testImplementation "org.testcontainers:neo4j:${testcontainersVersion}"

}

dependencyManagement {
    imports {
        mavenBom "org.springframework.ai:spring-ai-bom:${springAiVersion}"
    }
}

tasks.named('test') {
    useJUnitPlatform()
}

@GrantGochnauer
Author

I also tried 1.0.0-SNAPSHOT and that failed as well. I wonder what is different with your test? No matter what I try, it always fails.

@sunyuhan1998
Contributor

> I also tried 1.0.0-SNAPSHOT and that failed as well. I wonder what is different with your test? No matter what I try, it always fails.

Actually, I compiled and tested based on the latest version of the Spring AI source code, so I think this might be equivalent to 1.0.0-SNAPSHOT. I suspect that the issue might not be with the Spring AI code itself, but rather with other external environmental factors.

@GrantGochnauer
Author

It is strange. Are you using macOS? I've tried various JDK versions and SNAPSHOT releases, and I get the same exception in the Spring AI codebase every time. My project dependencies are pretty simple.

@markpollack added this to the 1.0.x milestone May 12, 2025

@sunyuhan1998
Contributor

> It is strange - are you using MacOS? I've tried various JDK versions, SNAPSHOT releases, same exception in spring AI codebases. My project dependencies are pretty simple.

Yes, I'm using macOS, and the JDK version is 17.

@GrantGochnauer
Author

Using RC1, I have confirmed the issue only happens when you add this to your Gradle build:

implementation 'org.springframework.ai:spring-ai-tika-document-reader'

If you remove tika, PDF parsing works. If you add tika, it fails every time.

@sunyuhan1998
Contributor

sunyuhan1998 commented May 15, 2025

> Using RC1, I have confirmed the issue only happens when you add this to your gradle:
>
> implementation 'org.springframework.ai:spring-ai-tika-document-reader'
>
> If you remove tika, PDF parsing works. If you add tika, it fails every time.

Oh, I see! I tried to write a demo to reproduce the scenario you mentioned, but I still couldn't replicate the issue. I'm using Maven as my build tool, and here are my dependencies:

<dependencies>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-classic</artifactId>
        <version>1.4.14</version>
    </dependency>
</dependencies>

The test code is still the same one we used earlier:

public static void main(String[] args) {
    PdfDocumentReaderConfig readerConfig = PdfDocumentReaderConfig.builder()
            .withPageTopMargin(0)
            .withPageExtractedTextFormatter(
                    ExtractedTextFormatter.builder()
                            .withNumberOfTopTextLinesToDelete(0)
                            .build()
            )
            .withPagesPerDocument(1)
            .build();
    List<Document> pages = new PagePdfDocumentReader("https://www.novo-pi.com/ozempic.pdf", readerConfig).read();
}

yet the PDF conversion continues to work properly. Could it be that you've enabled some additional configuration? This is really strange—I'm completely out of ideas.

@GrantGochnauer
Author

My build is straight from start.spring.io using Gradle, nothing else. I'm attaching a sample project that demonstrates that the one line fails with no changes to what Spring provides.

spring-ai-pdf-test.zip

@dafriz
Contributor

dafriz commented May 22, 2025

Checking that the Unicode string is not empty before accessing its first character avoids the StringIndexOutOfBoundsException.
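
For illustration only, here is a minimal standalone sketch of that kind of guard. It is not the code from the linked PR: the class `EmptyUnicodeGuard`, the method `firstCharOrFallback`, and the choice of a space as the fallback character are all assumptions made for this example. The only thing taken from the stack trace above is that the failure comes from `String.charAt(0)` on an empty string.

```java
// Hypothetical sketch; names and fallback behavior are assumptions, not Spring AI code.
final class EmptyUnicodeGuard {

    // Returns the first character of the text reported for a glyph,
    // or the given fallback when that text is null or empty.
    static char firstCharOrFallback(String unicode, char fallback) {
        if (unicode == null || unicode.isEmpty()) {
            return fallback;
        }
        return unicode.charAt(0);
    }

    public static void main(String[] args) {
        // Unguarded access on an empty string fails the same way as in the report.
        try {
            "".charAt(0);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
        // Guarded access falls back instead of throwing.
        System.out.println("guarded: [" + firstCharOrFallback("", ' ') + "]");
        System.out.println("guarded: [" + firstCharOrFallback("A", ' ') + "]");
    }
}
```

Whether an empty glyph string should map to a space, be skipped, or be handled some other way is a design choice for the reader implementation; the sketch only shows the emptiness check itself.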

I encountered this issue after updating to PDFBox 3.0.4, both with the PDF in the test case and with your linked ozempic-pi.pdf. The issue is resolved for both files with the change in the linked PR.
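
Since the behavior seems to depend on the PDFBox version, it may help to check which PDFBox JAR a given build actually resolves. The sketch below is a generic, hypothetical helper (the class `ClasspathOrigin` and method `describe` are names made up for this example) that uses only JDK reflection calls to report where a class was loaded from and, if the JAR manifest declares it, its implementation version. That a particular build ends up on PDFBox 3.0.4 is an assumption to verify, not something this snippet proves.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Spring AI or PDFBox.
final class ClasspathOrigin {

    // Describes the JAR (or directory) a class came from and the
    // Implementation-Version from its manifest, when one is declared.
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(bootstrap or unknown)"
                : source.getLocation().toString();
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return type.getName() + " from " + location
                + ", version " + (version == null ? "(not declared)" : version);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to a JDK class.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(describe(Class.forName(name)));
    }
}
```

Running it inside the failing project with `org.apache.pdfbox.text.PDFTextStripper` as the argument (the class that appears in the stack trace) should show which PDFBox JAR is in use, which can then be compared between a working and a failing setup.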
