feat: Thumbnail and PDF for new column type #3193

AndreaFrancis · 2025-05-13T02:35:11Z

Part of #2991
Given that huggingface/datasets#7325 has been implemented, we can now process PDF columns in dataset-viewer.
TODOs:

Implement PDF processing in get_cell_value for /first-rows and /rows
Add the "document" modality using the same logic as image/audio
Fix dependencies (poetry.lock)
Add unit tests in libcommon
Add unit tests for modalities in worker
Add e2e for new type

Note:
Doc will go in an upcoming PR once this one is deployed and confirm it works as expected and ready to share with the community.

AndreaFrancis · 2025-05-13T15:04:56Z

Hi @lhoestq @severo,
This is a very early draft of how the PDF implementation could work.

I'd like to confirm a few details regarding UI integration. For that purpose, I’ve drafted a new class:

class PDFSource(TypedDict):
    src: str  # path to the PDF file
    thumbnail: str  # path to a preview image of the PDF (first page) in PNG format

The idea is that src would point to the actual PDF file, while thumbnail would be used to display a preview in the UI.

Do you think this structure would work on the UI side?

severo · 2025-05-13T15:23:30Z

yes, sure. We would need the width and height of the thumbnail too, so that we can reserve the space while loading the image

severo · 2025-05-14T07:35:41Z

Also, I wonder if we could return the size of the PDF in bytes? It might be a useful information to show in the dataset viewer (maybe for all the binary cells, btw)

lhoestq

nice ! yes +1 on thumbnail width and height and it should be all good

libs/libcommon/src/libcommon/viewer_utils/features.py

Copilot

Pull Request Overview

Adds support for processing PDF columns as a new “document” modality in both the worker and viewer utilities, including thumbnail generation and dependency updates.

Introduce Pdf feature handling in dataset-runner modalities and DTOS
Implement create_pdf_file + integrate PDF in get_cell_value
Add tests, fixtures, and update pyproject.toml for pdfplumber

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
services/worker/tests/job_runners/config/test_parquet_and_info.py	Include `Pdf` in imports and expected info parsing tests
services/worker/src/worker/job_runners/dataset/modalities.py	Import `Pdf`, classify as `"document"`, add `.pdf` extension
services/worker/src/worker/dtos.py	Extend `DatasetModality` literal with `"document"`
libs/libcommon/tests/viewer_utils/test_rows.py	Add `pdf` cases to first-rows truncation tests
libs/libcommon/tests/viewer_utils/test_features.py	Include `Pdf` in supported-columns test
libs/libcommon/tests/viewer_utils/test_assets.py	New `test_create_pdf_file` for PDF+thumbnail generation
libs/libcommon/tests/fixtures/datasets.py	Add `pdf` fixture with expected asset urls
libs/libcommon/tests/constants.py	Add `"pdf"` to constant lists
libs/libcommon/src/libcommon/viewer_utils/rows.py	Treat `Pdf` like `Image`/`Audio` for untruncated columns
libs/libcommon/src/libcommon/viewer_utils/features.py	Import `pdfplumber`, define `pdf()` handler, integrate in `get_cell_value`
libs/libcommon/src/libcommon/viewer_utils/asset.py	Define `PDFSource` and implement `create_pdf_file`
libs/libcommon/src/libcommon/url_preparator.py	Support `Pdf` in `AssetUrlPath` and URL preparation logic
libs/libcommon/pyproject.toml	Add `pdfplumber` dependency

Comments suppressed due to low confidence (2)

services/worker/src/worker/job_runners/dataset/modalities.py:47

Add unit tests for classify_modality and detect_modalities* to ensure that Pdf features and .pdf extensions correctly map to the "document" modality.

elif isinstance(feature, Pdf):

libs/libcommon/src/libcommon/viewer_utils/features.py:279

Add unit tests for the pdf() cell handler under get_cell_value to validate handling of bytes payloads, encoded dicts, and filesystem paths.

def pdf(

libs/libcommon/src/libcommon/viewer_utils/asset.py

libs/libcommon/src/libcommon/viewer_utils/features.py

libs/libcommon/src/libcommon/viewer_utils/asset.py

lhoestq

Nice ! Looks mostly good to me :)

Feel free to also edit transform_rows() in libapi to enable multithreading when preparing the rows containing PDFs (right now multithreading is only enabled for audio and image) and we can merge !

I also added a comment on filtering

e2e/tests/test_53_filter.py

lhoestq · 2025-05-27T12:00:17Z

services/worker/src/worker/job_runners/dataset/modalities.py

@@ -320,6 +324,8 @@ def detect_modalities_from_filetypes(dataset: str) -> set[DatasetModality]:
                modalities.add("3d")
            elif filetype["extension"] in TEXT_EXTENSIONS:
                modalities.add("text")
+            elif filetype["extension"] in DOCUMENT_EXTENSIONS:
+                modalities.add("document")


(not sure, but maybe we need to update the openapi spec with this)

severo

Cooool!

I saw that you want to update the doc in another PR, but I think that the OpenAPI spec should be updated here, to reflect the current state.

As a follow-up: should we support lists of PDFs files as we do for images and audio?

libs/libcommon/pyproject.toml

libs/libcommon/src/libcommon/url_preparator.py

libs/libcommon/src/libcommon/viewer_utils/asset.py

severo · 2025-05-27T12:14:32Z

services/worker/src/worker/dtos.py

@@ -277,7 +277,7 @@ class DatasetCompatibleLibrariesResponse(TypedDict):
    formats: list[DatasetFormat]


-DatasetModality = Literal["image", "audio", "text", "video", "geospatial", "3d", "tabular", "timeseries"]
+DatasetModality = Literal["image", "audio", "text", "video", "geospatial", "3d", "tabular", "timeseries", "document"]


seems right, but ping @julien-c @gary149: the proposal here is to add a new "Modality" (https://huggingface.co/datasets) for documents:

HuggingFaceDocBuilderDev · 2025-05-27T21:15:36Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

AndreaFrancis · 2025-05-27T21:23:45Z

Hi @lhoestq and @severo,

I’ve applied the final modifications based on your suggestions:

Added transform_rows to libapi
Clarified that filtering is not yet supported for transformed data
Updated OpenAPI schemas to reflect the new response types
Switched to using thumbnail as an ImageSource for better alignment with existing structures

In upcoming PRs, I plan to address:

Including the number of pages in the transformed data (for stats and filtering)
Supporting lists of PDF files, similar to how we handle images and audio
Adding relevant documentation content

With these changes, is it safe to proceed with merging, or could this risk breaking something in the Hub?

lhoestq

LGTM !

With these changes, is it safe to proceed with merging, or could this risk breaking something in the Hub?

let's see :p

cfahlgren1 · 2025-05-29T23:35:39Z

I think we need to enable CORS for these assets to serve the PDFs right? @AndreaFrancis

AndreaFrancis · 2025-05-30T14:42:58Z

@cfahlgren1 Not sure if any other configuration is needed from moonlanding/dataset-viewer infra as it already consumes the sources from https://datasets-server.huggingface.co/assets/* same as for images and audios. Maybe @severo?

cfahlgren1 · 2025-05-30T14:52:31Z

same as for images and audios.

Audio and images send as a no-cors fetch mode whereas the PDF would be cors. I was getting a CORS error when trying it out yesterday. Not sure if it has to do with /cached-assets/ route or maybe something else https://datasets-server.huggingface.co/cached-assets/...🤔

AndreaFrancis added 4 commits May 12, 2025 11:34

draft: adding pdf cell type

41e5c15

draft: adding pdf functions for rows processing

c9e528f

Merge branch 'main' into pdf-type

54e3b81

draft: detect modalities

4597d29

AndreaFrancis added 3 commits May 13, 2025 21:59

Rollback

a731a6a

Fix dependencies

638d229

Adding width and height

1e66e38

lhoestq reviewed May 14, 2025

View reviewed changes

libs/libcommon/src/libcommon/viewer_utils/features.py Outdated Show resolved Hide resolved

AndreaFrancis added 4 commits May 14, 2025 10:13

Adding test in assets

1d288cb

Adding pdf size y bytes

fb5a0cb

Add test for get_cell_value_value

02fbc9e

Add more tests

d1f03e2

AndreaFrancis requested a review from Copilot May 17, 2025 01:48

Copilot AI reviewed May 17, 2025

View reviewed changes

libs/libcommon/src/libcommon/viewer_utils/asset.py Show resolved Hide resolved

libs/libcommon/src/libcommon/viewer_utils/features.py Show resolved Hide resolved

libs/libcommon/src/libcommon/viewer_utils/asset.py Show resolved Hide resolved

AndreaFrancis added 9 commits May 20, 2025 13:03

Fix test parquet and info

d03d60f

FIx typo

4ea8d21

Update datasets library and add test for modalities

4e1c7df

Add e2e

8ec5a7a

Add missing test for first-rows and filter

04822d9

Debug e2e

58d0308

Fix copy/paste bug

eead4ee

Fix documents name

db2c7ee

Fix typo

7026927

AndreaFrancis marked this pull request as ready for review May 26, 2025 21:58

AndreaFrancis requested review from lhoestq and severo May 26, 2025 21:59

lhoestq reviewed May 27, 2025

View reviewed changes

severo approved these changes May 27, 2025

View reviewed changes

AndreaFrancis mentioned this pull request May 27, 2025

Create a special column type when it contains PDF bytes or PDF URL #2991

Open

AndreaFrancis added 2 commits May 27, 2025 15:51

Address code review suggestions

57cac3f

Add Pdf spec in OpenApi

605ef62

AndreaFrancis requested a review from lhoestq May 27, 2025 21:25

lhoestq approved these changes May 28, 2025

View reviewed changes

AndreaFrancis merged commit e785a0f into main May 29, 2025
30 checks passed

AndreaFrancis deleted the pdf-type branch May 29, 2025 14:19

feat: Thumbnail and PDF for new column type #3193

feat: Thumbnail and PDF for new column type #3193

Uh oh!

Conversation

AndreaFrancis commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndreaFrancis commented May 13, 2025

Uh oh!

severo commented May 13, 2025

Uh oh!

severo commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lhoestq left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lhoestq left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lhoestq May 27, 2025

Choose a reason for hiding this comment

Uh oh!

severo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

severo May 27, 2025

Choose a reason for hiding this comment

Uh oh!

HuggingFaceDocBuilderDev commented May 27, 2025

Uh oh!

AndreaFrancis commented May 27, 2025

Uh oh!

lhoestq left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cfahlgren1 commented May 29, 2025

Uh oh!

AndreaFrancis commented May 30, 2025

Uh oh!

cfahlgren1 commented May 30, 2025

Uh oh!

Uh oh!

AndreaFrancis commented May 13, 2025 •

edited

Loading

severo commented May 14, 2025 •

edited

Loading

lhoestq left a comment •

edited

Loading