Exception reading PDF files in PagePdfDocumentReader #3054

GrantGochnauer · 2025-05-08T17:18:51Z

Bug description
Parsing a publicly available PDF file (https://www.novo-pi.com/ozempic.pdf) results in an exception:


Failed to ingest PDF file ozempic-pi.pdf
java.lang.RuntimeException: Failed to ingest PDF file ozempic-pi.pdf
	at com.vodori.platform.ai.advisor.service.DocumentIngestionService.ingestSupportingDocuments(DocumentIngestionService.java:103)
	at com.vodori.platform.ai.advisor.DocumentIngestionTest.setUp(DocumentIngestionTest.java:56)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
Caused by: java.lang.StringIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
	at java.base/java.lang.String.checkIndex(String.java:4832)
	at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:46)
	at java.base/java.lang.String.charAt(String.java:1555)
	at org.springframework.ai.reader.pdf.layout.CharacterFactory.getCharacterFromTextPosition(CharacterFactory.java:97)
	at org.springframework.ai.reader.pdf.layout.CharacterFactory.createCharacterFromTextPosition(CharacterFactory.java:46)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writeLine(ForkPDFLayoutTextStripper.java:114)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writeTextPositionList(ForkPDFLayoutTextStripper.java:148)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.iterateThroughTextList(ForkPDFLayoutTextStripper.java:136)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writePage(ForkPDFLayoutTextStripper.java:85)
	at org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea.writePage(PDFLayoutTextStripperByArea.java:150)
	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)
	at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.processPage(ForkPDFLayoutTextStripper.java:68)
	at org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea.extractRegions(PDFLayoutTextStripperByArea.java:123)
	at org.springframework.ai.reader.pdf.PagePdfDocumentReader.get(PagePdfDocumentReader.java:141)
	at org.springframework.ai.reader.pdf.PagePdfDocumentReader.get(PagePdfDocumentReader.java:48)
	at org.springframework.ai.document.DocumentReader.read(DocumentReader.java:25)
	at com.vodori.platform.ai.advisor.service.DocumentIngestionService.ingestSupportingDocuments(DocumentIngestionService.java:79)
	... 3 more

Environment

Spring AI 1.0.0-M8
Spring Boot 3.4.5
macOS
Java 21

Steps to reproduce
Pass in the PDF linked above and call the following (taken from the Spring AI docs):

List<Document> pages = new PagePdfDocumentReader(resource,
            PdfDocumentReaderConfig.builder().withPageTopMargin(0).withPageExtractedTextFormatter(ExtractedTextFormatter.builder().withNumberOfTopTextLinesToDelete(0).build())
                .withPagesPerDocument(1).build()).read();

Expected behavior
List of pages.


GrantGochnauer · 2025-05-08T17:20:15Z

OpenAI's Codex suggested modifying:

src/main/java/org/springframework/ai/reader/pdf/layout/CharacterFactory.java

"Override of CharacterFactory to guard against empty unicode strings."

sunyuhan1998 · 2025-05-09T03:33:31Z

I followed the reproduction code you provided, but it seems unable to reproduce the issue in my environment. Below is my code and the log obtained from executing it.

Code:
List<Document> pages = new PagePdfDocumentReader("https://www.novo-pi.com/ozempic.pdf",
        PdfDocumentReaderConfig.builder()
                .withPageTopMargin(0)
                .withPageExtractedTextFormatter(
                        ExtractedTextFormatter.builder()
                                .withNumberOfTopTextLinesToDelete(0)
                                .build())
                .withPagesPerDocument(1)
                .build())
        .read();

Logs:

11:31:00.277 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 1
11:31:00.335 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 2
11:31:00.371 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 3
11:31:00.397 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 4
11:31:00.425 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 5
11:31:00.445 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 6
11:31:00.464 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 7
11:31:00.487 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 8
11:31:00.502 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 9
11:31:00.514 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing PDF page: 10
11:31:00.572 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.572 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.573 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.573 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.573 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Bold
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.666 [main] WARN org.apache.fontbox.cff.Type1CharString -- Unknown charstring command in glyph space of font LZZKUG+FrutigerLTPro-Black
11:31:00.675 [main] INFO org.springframework.ai.reader.pdf.PagePdfDocumentReader -- Processing 16 pages

GrantGochnauer · 2025-05-09T09:17:40Z

Interesting. I made a separate test case and used your line of code, but I still get the same issue, even when I pull directly from the web URL. Are you testing against 1.0.0-M8 or a different nightly build?

This is my gradle file:

plugins {
    id 'java'
    id 'org.springframework.boot' version '3.4.5'
    id 'io.spring.dependency-management' version '1.1.7'
}

group = 'com.vodori.platform.ai'
version = '0.0.1-SNAPSHOT'

java {
    toolchain {
        languageVersion = JavaLanguageVersion.of(21)
    }
}

configurations {
    compileOnly {
        extendsFrom annotationProcessor
    }
}

repositories {
    mavenCentral()
}

ext {
    set('springAiVersion', "1.0.0-M8")
    set('testcontainersVersion', "1.21.0")
}

dependencies {

    //Spring Boot
    implementation 'org.springframework.boot:spring-boot-starter-security'
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.springframework.boot:spring-boot-starter-webflux'
    implementation 'org.springframework.boot:spring-boot-starter-data-neo4j'
    implementation 'org.springframework.boot:spring-boot-starter-graphql'

    // Spring AI
    implementation 'org.springframework.ai:spring-ai-starter-model-openai'
    implementation 'org.springframework.ai:spring-ai-starter-vector-store-neo4j'

    //Spring AI MCP
    implementation 'org.springframework.ai:spring-ai-starter-mcp-client-webflux'
    implementation 'org.springframework.ai:spring-ai-starter-mcp-server-webflux'
    implementation 'org.springframework.ai:spring-ai-starter-mcp-client'
    implementation 'org.springframework.ai:spring-ai-starter-mcp-server'

    //Spring AI Document Readers
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
    implementation 'org.springframework.ai:spring-ai-markdown-document-reader'


    compileOnly 'org.projectlombok:lombok'
    developmentOnly 'org.springframework.boot:spring-boot-devtools'
    annotationProcessor 'org.projectlombok:lombok'

    //Tests
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
    testImplementation 'org.springframework.security:spring-security-test'
    testImplementation("org.wiremock:wiremock-standalone:3.4.2")
    testRuntimeOnly 'org.junit.platform:junit-platform-launcher'
    testImplementation "org.testcontainers:junit-jupiter:${testcontainersVersion}"
    testImplementation "org.testcontainers:neo4j:${testcontainersVersion}"

}

dependencyManagement {
    imports {
        mavenBom "org.springframework.ai:spring-ai-bom:${springAiVersion}"
    }
}

tasks.named('test') {
    useJUnitPlatform()
}

GrantGochnauer · 2025-05-09T14:18:03Z

I also tried 1.0.0-SNAPSHOT and that failed as well. I wonder what is different with your test? No matter what I try, it always fails.

sunyuhan1998 · 2025-05-12T01:35:10Z

> I also tried 1.0.0-SNAPSHOT and that failed as well. I wonder what is different with your test? No matter what I try, it always fails.

Actually, I compiled and tested based on the latest version of the Spring AI source code, so I think this might be equivalent to 1.0.0-SNAPSHOT. I suspect that the issue might not be with the Spring AI code itself, but rather with other external environmental factors.

GrantGochnauer · 2025-05-12T13:41:04Z

It is strange. Are you using macOS? I've tried various JDK versions and SNAPSHOT releases, and I get the same exception in Spring AI codebases. My project dependencies are pretty simple.

sunyuhan1998 · 2025-05-14T15:51:24Z

> It is strange. Are you using macOS? I've tried various JDK versions and SNAPSHOT releases, and I get the same exception in Spring AI codebases. My project dependencies are pretty simple.

Yes, I'm using macOS, and the JDK version is 17.

GrantGochnauer · 2025-05-14T16:27:16Z

Using RC1, I have confirmed the issue only happens when you add this to your Gradle build:

implementation 'org.springframework.ai:spring-ai-tika-document-reader'

If you remove Tika, PDF parsing works. If you add Tika, it fails every time.
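
The thread does not establish why adding the Tika reader changes the outcome. One possibility, stated here only as an assumption, is that it changes which PDFBox jar ends up on the runtime classpath. A minimal sketch for checking this prints the location each class was loaded from; `ClasspathOrigin` and `locate` are hypothetical names, and the two class names are taken from the stack trace above. Run it once with and once without the Tika dependency and compare the jar paths.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Spring AI or PDFBox):
// reports which jar or directory a class would be loaded from.
final class ClasspathOrigin {

    private ClasspathOrigin() {
    }

    static String locate(String className) {
        try {
            // 'false' avoids running the class's static initializers.
            Class<?> type = Class.forName(className, false, ClasspathOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            URL location = (source == null) ? null : source.getLocation();
            // JDK classes have no code source, so report that explicitly.
            return (location == null) ? "(JDK or bootstrap class)" : location.toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        String[] classNames = {
            "org.apache.pdfbox.text.PDFTextStripper",
            "org.springframework.ai.reader.pdf.layout.CharacterFactory"
        };
        for (String name : classNames) {
            System.out.println(name + " <- " + locate(name));
        }
    }
}
```

If the PDFBox jar path or version in the output differs between the two runs, that would support the assumption; if it is identical, the cause lies elsewhere.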

sunyuhan1998 · 2025-05-15T01:57:45Z

> Using RC1, I have confirmed the issue only happens when you add this to your Gradle build:
>
> implementation 'org.springframework.ai:spring-ai-tika-document-reader'
>
> If you remove Tika, PDF parsing works. If you add Tika, it fails every time.

Oh, I see! I tried to write a demo to reproduce the scenario you mentioned, but I still couldn't replicate the issue. I'm using Maven as my build tool, and here are my dependencies:

<dependencies>
     <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-classic</artifactId>
        <version>1.4.14</version>
    </dependency>
</dependencies>

The test code is still the same one we used earlier:

public static void main(String[] args) {
    PdfDocumentReaderConfig readerConfig = PdfDocumentReaderConfig.builder()
            .withPageTopMargin(0)
            .withPageExtractedTextFormatter(
                    ExtractedTextFormatter.builder()
                            .withNumberOfTopTextLinesToDelete(0)
                            .build()
            )
            .withPagesPerDocument(1)
            .build();
    List<Document> pages = new PagePdfDocumentReader("https://www.novo-pi.com/ozempic.pdf", readerConfig).read();
}

Yet the PDF conversion still works properly. Could it be that you've enabled some additional configuration? This is really strange; I'm completely out of ideas.

GrantGochnauer · 2025-05-15T09:08:41Z

My build is straight from start.spring.io using Gradle, nothing else. I'm attaching a sample project that demonstrates that this one line fails with no changes to what Spring provides.

spring-ai-pdf-test.zip

dafriz · 2025-05-22T11:47:53Z

Checking that the Unicode string is not empty before accessing its first character avoids the StringIndexOutOfBoundsException.

I encountered this issue after updating to PDFBox 3.0.4, both with the PDF in the test case and with your linked ozempic-pi.pdf. The issue is resolved for both files with the change in the linked PR.
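
As a rough illustration of that kind of guard (a sketch only, not the code from the linked PR), the helper below returns a fallback character when the text is null or empty instead of calling `charAt(0)` unconditionally. The class name `EmptyUnicodeGuard`, the method `firstCharOrFallback`, and the choice of a space as the fallback are all hypothetical.

```java
import java.util.List;

// Hypothetical helper, not the actual Spring AI patch: it only illustrates
// checking for an empty string before calling charAt(0).
final class EmptyUnicodeGuard {

    private EmptyUnicodeGuard() {
    }

    // Returns the first character of the text, or the fallback when the text
    // is null or empty (the empty case is what made charAt(0) throw in the
    // stack trace at the top of this issue).
    static char firstCharOrFallback(String unicode, char fallback) {
        if (unicode == null || unicode.isEmpty()) {
            return fallback;
        }
        return unicode.charAt(0);
    }

    public static void main(String[] args) {
        // The middle sample is the problematic empty string.
        for (String sample : List.of("A", "", "fi")) {
            char c = firstCharOrFallback(sample, ' ');
            System.out.println("[" + sample + "] -> [" + c + "]");
        }
    }
}
```

Whether an empty glyph should map to a space, be skipped, or be handled some other way is a design decision for the library; the sketch only shows the bounds check itself.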

markpollack added this to the 1.0.x milestone May 12, 2025

dafriz mentioned this issue May 21, 2025

Bump org.apache.pdfbox to 3.0.4 and guard against empty unicode strings #3271

Merged

sobychacko closed this as completed in #3271 May 30, 2025
