Skip to content

feat: Thumbnail and PDF for new column type #3193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
May 29, 2025
Merged

feat: Thumbnail and PDF for new column type #3193

merged 22 commits into from
May 29, 2025

Conversation

AndreaFrancis
Copy link
Contributor

@AndreaFrancis AndreaFrancis commented May 13, 2025

Part of #2991
Given that huggingface/datasets#7325 has been implemented, we can now process PDF columns in dataset-viewer.
TODOs:

  • Implement PDF processing in get_cell_value for /first-rows and /rows
  • Add the "document" modality using the same logic as image/audio
  • Fix dependencies (poetry.lock)
  • Add unit tests in libcommon
  • Add unit tests for modalities in worker
  • Add e2e for new type

Note:
Doc will go in an upcoming PR once this one is deployed and confirm it works as expected and ready to share with the community.

@AndreaFrancis
Copy link
Contributor Author

Hi @lhoestq @severo,
This is a very early draft of how the PDF implementation could work.

I'd like to confirm a few details regarding UI integration. For that purpose, I’ve drafted a new class:

class PDFSource(TypedDict):
    src: str  # path to the PDF file
    thumbnail: str  # path to a preview image of the PDF (first page) in PNG format

The idea is that src would point to the actual PDF file, while thumbnail would be used to display a preview in the UI.

Do you think this structure would work on the UI side?

@severo
Copy link
Collaborator

severo commented May 13, 2025

yes, sure. We would need the width and height of the thumbnail too, so that we can reserve the space while loading the image

@severo
Copy link
Collaborator

severo commented May 14, 2025

Also, I wonder if we could return the size of the PDF in bytes? It might be a useful information to show in the dataset viewer (maybe for all the binary cells, btw)

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice ! yes +1 on thumbnail width and height and it should be all good

@AndreaFrancis AndreaFrancis requested a review from Copilot May 17, 2025 01:48
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds support for processing PDF columns as a new “document” modality in both the worker and viewer utilities, including thumbnail generation and dependency updates.

  • Introduce Pdf feature handling in dataset-runner modalities and DTOS
  • Implement create_pdf_file + integrate PDF in get_cell_value
  • Add tests, fixtures, and update pyproject.toml for pdfplumber

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
services/worker/tests/job_runners/config/test_parquet_and_info.py Include Pdf in imports and expected info parsing tests
services/worker/src/worker/job_runners/dataset/modalities.py Import Pdf, classify as "document", add .pdf extension
services/worker/src/worker/dtos.py Extend DatasetModality literal with "document"
libs/libcommon/tests/viewer_utils/test_rows.py Add pdf cases to first-rows truncation tests
libs/libcommon/tests/viewer_utils/test_features.py Include Pdf in supported-columns test
libs/libcommon/tests/viewer_utils/test_assets.py New test_create_pdf_file for PDF+thumbnail generation
libs/libcommon/tests/fixtures/datasets.py Add pdf fixture with expected asset urls
libs/libcommon/tests/constants.py Add "pdf" to constant lists
libs/libcommon/src/libcommon/viewer_utils/rows.py Treat Pdf like Image/Audio for untruncated columns
libs/libcommon/src/libcommon/viewer_utils/features.py Import pdfplumber, define pdf() handler, integrate in get_cell_value
libs/libcommon/src/libcommon/viewer_utils/asset.py Define PDFSource and implement create_pdf_file
libs/libcommon/src/libcommon/url_preparator.py Support Pdf in AssetUrlPath and URL preparation logic
libs/libcommon/pyproject.toml Add pdfplumber dependency
Comments suppressed due to low confidence (2)

services/worker/src/worker/job_runners/dataset/modalities.py:47

  • Add unit tests for classify_modality and detect_modalities* to ensure that Pdf features and .pdf extensions correctly map to the "document" modality.
elif isinstance(feature, Pdf):

libs/libcommon/src/libcommon/viewer_utils/features.py:279

  • Add unit tests for the pdf() cell handler under get_cell_value to validate handling of bytes payloads, encoded dicts, and filesystem paths.
def pdf(

@AndreaFrancis AndreaFrancis marked this pull request as ready for review May 26, 2025 21:58
@AndreaFrancis AndreaFrancis requested review from lhoestq and severo May 26, 2025 21:59
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice ! Looks mostly good to me :)

Feel free to also edit transform_rows() in libapi to enable multithreading when preparing the rows containing PDFs (right now multithreading is only enabled for audio and image) and we can merge !

I also added a comment on filtering

@@ -320,6 +324,8 @@ def detect_modalities_from_filetypes(dataset: str) -> set[DatasetModality]:
modalities.add("3d")
elif filetype["extension"] in TEXT_EXTENSIONS:
modalities.add("text")
elif filetype["extension"] in DOCUMENT_EXTENSIONS:
modalities.add("document")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not sure, but maybe we need to update the openapi spec with this)

Copy link
Collaborator

@severo severo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cooool!

I saw that you want to update the doc in another PR, but I think that the OpenAPI spec should be updated here, to reflect the current state.

As a follow-up: should we support lists of PDFs files as we do for images and audio?

@@ -277,7 +277,7 @@ class DatasetCompatibleLibrariesResponse(TypedDict):
formats: list[DatasetFormat]


DatasetModality = Literal["image", "audio", "text", "video", "geospatial", "3d", "tabular", "timeseries"]
DatasetModality = Literal["image", "audio", "text", "video", "geospatial", "3d", "tabular", "timeseries", "document"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems right, but ping @julien-c @gary149: the proposal here is to add a new "Modality" (https://huggingface.co/datasets) for documents:

Screenshot From 2025-05-27 14-14-10

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@AndreaFrancis
Copy link
Contributor Author

Hi @lhoestq and @severo,

I’ve applied the final modifications based on your suggestions:

  • Added transform_rows to libapi
  • Clarified that filtering is not yet supported for transformed data
  • Updated OpenAPI schemas to reflect the new response types
  • Switched to using thumbnail as an ImageSource for better alignment with existing structures

In upcoming PRs, I plan to address:

  • Including the number of pages in the transformed data (for stats and filtering)
  • Supporting lists of PDF files, similar to how we handle images and audio
  • Adding relevant documentation content

With these changes, is it safe to proceed with merging, or could this risk breaking something in the Hub?

@AndreaFrancis AndreaFrancis requested a review from lhoestq May 27, 2025 21:25
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

With these changes, is it safe to proceed with merging, or could this risk breaking something in the Hub?

let's see :p

@AndreaFrancis AndreaFrancis merged commit e785a0f into main May 29, 2025
30 checks passed
@AndreaFrancis AndreaFrancis deleted the pdf-type branch May 29, 2025 14:19
@cfahlgren1
Copy link
Contributor

I think we need to enable CORS for these assets to serve the PDFs right? @AndreaFrancis

@AndreaFrancis
Copy link
Contributor Author

@cfahlgren1 Not sure if any other configuration is needed from moonlanding/dataset-viewer infra as it already consumes the sources from https://datasets-server.huggingface.co/assets/* same as for images and audios. Maybe @severo?

@cfahlgren1
Copy link
Contributor

same as for images and audios.

Audio and images send as a no-cors fetch mode whereas the PDF would be cors. I was getting a CORS error when trying it out yesterday. Not sure if it has to do with /cached-assets/ route or maybe something else https://datasets-server.huggingface.co/cached-assets/...🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants