-
Notifications
You must be signed in to change notification settings - Fork 89
feat: Thumbnail and PDF for new column type #3193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @lhoestq @severo, I'd like to confirm a few details regarding UI integration. For that purpose, I’ve drafted a new class: class PDFSource(TypedDict):
src: str # path to the PDF file
thumbnail: str # path to a preview image of the PDF (first page) in PNG format The idea is that Do you think this structure would work on the UI side? |
yes, sure. We would need the width and height of the thumbnail too, so that we can reserve the space while loading the image |
Also, I wonder if we could return the size of the PDF in bytes? It might be a useful information to show in the dataset viewer (maybe for all the binary cells, btw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice ! yes +1 on thumbnail width and height and it should be all good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds support for processing PDF columns as a new “document” modality in both the worker and viewer utilities, including thumbnail generation and dependency updates.
- Introduce
Pdf
feature handling in dataset-runner modalities and DTOS - Implement
create_pdf_file
+ integrate PDF inget_cell_value
- Add tests, fixtures, and update
pyproject.toml
forpdfplumber
Reviewed Changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
services/worker/tests/job_runners/config/test_parquet_and_info.py | Include Pdf in imports and expected info parsing tests |
services/worker/src/worker/job_runners/dataset/modalities.py | Import Pdf , classify as "document" , add .pdf extension |
services/worker/src/worker/dtos.py | Extend DatasetModality literal with "document" |
libs/libcommon/tests/viewer_utils/test_rows.py | Add pdf cases to first-rows truncation tests |
libs/libcommon/tests/viewer_utils/test_features.py | Include Pdf in supported-columns test |
libs/libcommon/tests/viewer_utils/test_assets.py | New test_create_pdf_file for PDF+thumbnail generation |
libs/libcommon/tests/fixtures/datasets.py | Add pdf fixture with expected asset urls |
libs/libcommon/tests/constants.py | Add "pdf" to constant lists |
libs/libcommon/src/libcommon/viewer_utils/rows.py | Treat Pdf like Image /Audio for untruncated columns |
libs/libcommon/src/libcommon/viewer_utils/features.py | Import pdfplumber , define pdf() handler, integrate in get_cell_value |
libs/libcommon/src/libcommon/viewer_utils/asset.py | Define PDFSource and implement create_pdf_file |
libs/libcommon/src/libcommon/url_preparator.py | Support Pdf in AssetUrlPath and URL preparation logic |
libs/libcommon/pyproject.toml | Add pdfplumber dependency |
Comments suppressed due to low confidence (2)
services/worker/src/worker/job_runners/dataset/modalities.py:47
- Add unit tests for
classify_modality
anddetect_modalities*
to ensure thatPdf
features and.pdf
extensions correctly map to the "document" modality.
elif isinstance(feature, Pdf):
libs/libcommon/src/libcommon/viewer_utils/features.py:279
- Add unit tests for the
pdf()
cell handler underget_cell_value
to validate handling of bytes payloads, encoded dicts, and filesystem paths.
def pdf(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ! Looks mostly good to me :)
Feel free to also edit transform_rows()
in libapi to enable multithreading when preparing the rows containing PDFs (right now multithreading is only enabled for audio and image) and we can merge !
I also added a comment on filtering
@@ -320,6 +324,8 @@ def detect_modalities_from_filetypes(dataset: str) -> set[DatasetModality]: | |||
modalities.add("3d") | |||
elif filetype["extension"] in TEXT_EXTENSIONS: | |||
modalities.add("text") | |||
elif filetype["extension"] in DOCUMENT_EXTENSIONS: | |||
modalities.add("document") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not sure, but maybe we need to update the openapi spec with this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cooool!
I saw that you want to update the doc in another PR, but I think that the OpenAPI spec should be updated here, to reflect the current state.
As a follow-up: should we support lists of PDFs files as we do for images and audio?
@@ -277,7 +277,7 @@ class DatasetCompatibleLibrariesResponse(TypedDict): | |||
formats: list[DatasetFormat] | |||
|
|||
|
|||
DatasetModality = Literal["image", "audio", "text", "video", "geospatial", "3d", "tabular", "timeseries"] | |||
DatasetModality = Literal["image", "audio", "text", "video", "geospatial", "3d", "tabular", "timeseries", "document"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems right, but ping @julien-c @gary149: the proposal here is to add a new "Modality" (https://huggingface.co/datasets) for documents:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
I’ve applied the final modifications based on your suggestions:
In upcoming PRs, I plan to address:
With these changes, is it safe to proceed with merging, or could this risk breaking something in the Hub? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM !
With these changes, is it safe to proceed with merging, or could this risk breaking something in the Hub?
let's see :p
I think we need to enable CORS for these assets to serve the PDFs right? @AndreaFrancis |
@cfahlgren1 Not sure if any other configuration is needed from moonlanding/dataset-viewer infra as it already consumes the sources from https://datasets-server.huggingface.co/assets/* same as for images and audios. Maybe @severo? |
Audio and images send as a |
Part of #2991
Given that huggingface/datasets#7325 has been implemented, we can now process PDF columns in dataset-viewer.
TODOs:
Note:
Doc will go in an upcoming PR once this one is deployed and confirm it works as expected and ready to share with the community.