Exception reading PDF files in PagePdfDocumentReader #3054
Comments
OpenAI's codex suggested to modify:
"Override of CharacterFactory to guard against empty unicode strings." |
I followed the reproduction code you provided, but I was unable to reproduce the issue in my environment. Below is my code and the log obtained from executing it. Code: Logs:
|
Interesting, I made a separate test case and used your line of code but still get the same issue - even when I pull directly from the web url. Are you testing against 1.0M8 or a different nightly build? This is my gradle file:
|
I also tried 1.0.0-SNAPSHOT and that failed as well. I wonder what is different with your test? No matter what I try, it always fails. |
Actually, I compiled and tested based on the latest version of the Spring AI source code, so I think this might be equivalent to 1.0.0-SNAPSHOT. I suspect that the issue might not be with the Spring AI code itself, but rather with other external environmental factors. |
It is strange - are you using macOS? I've tried various JDK versions and SNAPSHOT releases, and I get the same exception in the Spring AI codebases. My project dependencies are pretty simple. |
Yes, I'm using macOS, and the JDK version is 17. |
Using RC1, I have confirmed the issue only happens when you add this to your gradle:
If you remove tika, PDF parsing works. If you add tika, it fails every time. |
Oh, I see! I tried to write a demo to reproduce the scenario you mentioned, but I still couldn't replicate the issue. I'm using Maven as my build tool, and here are my dependencies:
The test code is still the same one we used earlier:
yet the PDF conversion continues to work properly. Could it be that you've enabled some additional configuration? This is really strange—I'm completely out of ideas. |
My build is straight from start.spring.io using Gradle - nothing else. I'm attaching my sample project, which demonstrates that the one line fails with no changes from what Spring provides. |
Checking that the Unicode string is not empty before accessing its first character avoids the exception. I encountered this issue after updating to PDFBox 3.0.4, both with the PDF in the test case and with your linked ozempic-pi.pdf. The issue is resolved for both files with the change in the linked PR. |
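For anyone following along, here is a minimal sketch of the kind of guard described above. It is not the code from the linked PR: the class `EmptyUnicodeGuard` and the method `firstCharOrDefault` are hypothetical names made up for illustration, and the fallback character is an assumption. It only shows why reading the first character of an empty string fails and how an emptiness check avoids it.

```java
public class EmptyUnicodeGuard {

    // Hypothetical helper (not part of PDFBox or Spring AI): returns the first
    // character of the glyph's Unicode string, or a fallback when the string
    // is null or empty.
    static char firstCharOrDefault(String unicode, char fallback) {
        if (unicode == null || unicode.isEmpty()) {
            return fallback;
        }
        return unicode.charAt(0);
    }

    public static void main(String[] args) {
        // Unguarded access on an empty string throws at runtime.
        try {
            "".charAt(0);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded access failed: " + e.getClass().getSimpleName());
        }

        // Guarded access falls back instead of throwing.
        System.out.println("guarded (empty): '" + firstCharOrDefault("", ' ') + "'");
        System.out.println("guarded (text):  '" + firstCharOrDefault("A", ' ') + "'");
    }
}
```

Whether a space is the right fallback depends on how the surrounding layout code treats the character, so treat that choice as a placeholder.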
Bug description
Parsing a publicly available PDF file (https://www.novo-pi.com/ozempic.pdf) results in an exception:
Environment
Steps to reproduce
Pass in the PDF linked above and call (which I pulled from the Spring AI docs)
Expected behavior
List of pages.