Refactor the download and publication of the datasets and convert it into CLI script #95

AdilZouitine · 2024-04-23T16:27:32Z

What does this PR do?

Following discussions with @Cadene, this PR has two main objectives:

Refactor the process of downloading and publishing datasets, and add components to the library that let users publish their own datasets.
Introduce a CLI that lets users publish new datasets (as well as the current ones), provided they comply with the Aloha, Umi, Pusht, or Xarm formats.
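
To illustrate how per-format components could let users plug in their own raw data, here is a minimal sketch of a processor protocol. The method names (is_valid, preprocess), the ToyProcessor class, and the run_processor helper are hypothetical and chosen only for illustration; the actual interface added by this PR may differ.

```python
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DatasetProcessor(Protocol):
    """Hypothetical interface that each raw-format processor would satisfy."""

    def is_valid(self) -> bool:
        """Return True if the raw folder looks like this format."""
        ...

    def preprocess(self) -> dict[str, Any]:
        """Convert the raw data into a common in-memory representation."""
        ...


class ToyProcessor:
    """Illustrative processor: it only counts the files in a folder."""

    def __init__(self, folder_path: Path, fps: int = 10) -> None:
        self.folder_path = Path(folder_path)
        self.fps = fps

    def is_valid(self) -> bool:
        return self.folder_path.is_dir()

    def preprocess(self) -> dict[str, Any]:
        num_files = sum(1 for p in self.folder_path.iterdir() if p.is_file())
        return {"fps": self.fps, "num_files": num_files}


def run_processor(processor: DatasetProcessor) -> dict[str, Any]:
    """Validate the raw data first, then preprocess it."""
    if not processor.is_valid():
        raise ValueError("Raw data does not match the expected format.")
    return processor.preprocess()
```

With this shape, supporting a new format only requires writing one class with the same two methods; the publishing code does not need to know which format it is handling.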

I tested this PR with tests/test_datasets.py::test_backward_compatibility, run both directly and with DATA_DIR set:
DATA_DIR="tmp/data/path/save/to/disk" pytest -sx tests/test_datasets.py::test_backward_compatibility

Cadene

Sorry for the avalanche of comments and suggestions, but they are mostly minor.
This PR is so important. Huge congrats for reaching this state. It's almost done.

lerobot/scripts/push_dataset_to_hub.py

lerobot/common/datasets/push_dataset_to_hub/_preprocessor.py

lerobot/scripts/push_dataset_to_hub.py

Cadene · 2024-04-28T18:36:40Z

lerobot/scripts/push_dataset_to_hub.py

+        dry_run (bool, optional): If True, performs a dry run without actually pushing the dataset. Defaults to False.
+        revision (str, optional): The revision of the dataset. Defaults to "v1.0".
+        community_id (str, optional): The ID of the community. Defaults to "lerobot".
+        preprocess (bool, optional): If True, preprocesses the dataset. Defaults to True.


I am not sure we should keep the preprocess argument. It could be worth simplifying.

Suggested change

preprocess (bool, optional): If True, preprocesses the dataset. Defaults to True.

I've asked myself the same question. We could rename this flag to --no-preprocess, so that preprocessing is done by default.
The flag is only useful for users who want the raw dataset, which will be a small percentage.

@AdilZouitine those users can comment out the preprocess code lines ^^
I think it's safe to remove the preprocess argument.
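
For reference, the opt-out variant discussed above can be sketched with argparse. This is a minimal illustration, not the script's actual argument parser: the build_parser and should_preprocess helpers are hypothetical, and only the --no-preprocess flag name comes from the discussion.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser where preprocessing is on unless explicitly disabled."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of an opt-out preprocessing flag."
    )
    # store_true makes the attribute default to False, so preprocessing
    # happens by default and the flag only switches it off.
    parser.add_argument(
        "--no-preprocess",
        action="store_true",
        help="Keep the raw dataset and skip preprocessing.",
    )
    return parser


def should_preprocess(argv: Optional[Sequence[str]] = None) -> bool:
    """Return True unless --no-preprocess was passed."""
    args = build_parser().parse_args(argv)
    return not args.no_preprocess
```

The advantage over a preprocess argument that defaults to True is that the common case needs no flag at all, and the rare raw-data case is a single explicit switch.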

lerobot/scripts/push_dataset_to_hub.py

lerobot/__init__.py

Cadene · 2024-04-28T20:11:14Z

lerobot/__init__.py

+available_datasets_without_env = ["lerobot/umi_cup_in_the_wild"]
+
+available_datasets = list(
+    itertools.chain(*available_datasets_per_env.values(), available_datasets_without_env)
+)
+


For now, a TODO should be fine
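
The flattening in the diff above can be tried in isolation. In this sketch the contents of available_datasets_per_env are placeholders invented for illustration; only the umi_cup_in_the_wild entry and the itertools.chain expression come from the diff.

```python
import itertools

# Placeholder mapping from environment name to dataset ids (hypothetical values).
available_datasets_per_env = {
    "aloha": ["lerobot/aloha_placeholder_a", "lerobot/aloha_placeholder_b"],
    "pusht": ["lerobot/pusht_placeholder"],
}

# Datasets that have no matching simulation environment (taken from the diff).
available_datasets_without_env = ["lerobot/umi_cup_in_the_wild"]

# chain() walks each per-env list in order, then the env-less list,
# which yields a single flat list of dataset ids.
available_datasets = list(
    itertools.chain(*available_datasets_per_env.values(), available_datasets_without_env)
)
```

Because dicts preserve insertion order, the flat list keeps the per-environment order and ends with the datasets that have no environment.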

lerobot/common/datasets/push_dataset_to_hub/aloha_processor.py

lerobot/common/datasets/push_dataset_to_hub/umi_processor.py

lerobot/common/datasets/push_dataset_to_hub/xarm_processor.py

lerobot/common/datasets/push_dataset_to_hub/pusht_processor.py

AdilZouitine changed the title ~~[do not review] Refactor the download and publication of the dataset and convert it into CLI tool~~ [do not review] Refactor the download and publication of the datasets and convert it into CLI tool Apr 23, 2024

AdilZouitine changed the title ~~[do not review] Refactor the download and publication of the datasets and convert it into CLI tool~~ [do not review] Refactor the download and publication of the datasets and convert it into CLI script Apr 23, 2024

aliberts added the dataset Issues regarding data inputs, processing, or datasets label Apr 24, 2024

AdilZouitine force-pushed the user/azouitine/2024_04_22_refactor_download_upload branch from 28d02c6 to 77863e9 Compare April 27, 2024 19:07

AdilZouitine mentioned this pull request Apr 28, 2024

Add UMI-gripper dataset #83

Merged

AdilZouitine changed the title ~~[do not review] Refactor the download and publication of the datasets and convert it into CLI script~~ Refactor the download and publication of the datasets and convert it into CLI script Apr 28, 2024

AdilZouitine marked this pull request as ready for review April 28, 2024 15:39

Cadene assigned AdilZouitine Apr 28, 2024

Cadene self-requested a review April 28, 2024 16:12

AdilZouitine added 5 commits April 28, 2024 18:52

Implements Umi dataset

b697b99

Include preprocess option to push_dataset_to_hub script

8c77f55

Refactor dataset processors to use private fps attribute

cb8e44d

Update UmiProcessor class to include comments and modify revision in …

d7aca8d

…push dataset to hub

Add preprocess option to the example of push_dataset_to_hub script

112f9e6

AdilZouitine force-pushed the user/azouitine/2024_04_22_refactor_download_upload branch from fea1a60 to 112f9e6 Compare April 28, 2024 16:55

Delete preprocessor.py

da04501

Cadene suggested changes Apr 28, 2024

AdilZouitine and others added 13 commits April 28, 2024 21:23

Update lerobot/scripts/push_dataset_to_hub.py

d03cd67

Co-authored-by: Remi <[email protected]>

Update lerobot/scripts/push_dataset_to_hub.py

592683f

Co-authored-by: Remi <[email protected]>

Update lerobot/scripts/push_dataset_to_hub.py

c7e0e3b

Co-authored-by: Remi <[email protected]>

Update lerobot/scripts/push_dataset_to_hub.py

653a36e

Co-authored-by: Remi <[email protected]>

Update lerobot/scripts/push_dataset_to_hub.py

997fcee

Co-authored-by: Remi <[email protected]>

Update lerobot/scripts/push_dataset_to_hub.py

f0a7e1c

Co-authored-by: Remi <[email protected]>

Add docstring for DatasetProcessor protocol

11bb5e2

Update lerobot/scripts/push_dataset_to_hub.py

9d645b4

Co-authored-by: Remi <[email protected]>

Apply suggestions from code review

c2050c7

Co-authored-by: Remi <[email protected]>

Apply suggestions from code review

9eeea88

Co-authored-by: Remi <[email protected]>

Apply suggestions from code review

8f5b0e5

Co-authored-by: Remi <[email protected]>

Remove download datasets, rename processor files and move download fu…

cff2608

…nctions inside push_dataset_to_hub

Rename push_to_hub to push_lerobot_dataset_to_hub

056ae54

AdilZouitine added 2 commits April 28, 2024 22:08

Add no_preprocess flag

64da76e

Fix preprocess flag in push_dataset_to_hub.py

d5b81df

Cadene reviewed Apr 28, 2024

AdilZouitine and others added 7 commits April 28, 2024 22:25

Apply suggestions from code review

ba93790

Co-authored-by: Remi <[email protected]>

format

a6c5276

Move _download_raw inside push_to_hub

9ab0f45

Runs on all available datasets

c115744

Fix dry-run flag

914f195

Fix saving stats locally

0377a5e

Update tests

87cd109

Cadene approved these changes Apr 28, 2024

Add TODO comments for missing dataset artifacts

bab7d28

AdilZouitine merged commit 55dc9f7 into main Apr 28, 2024

AdilZouitine deleted the user/azouitine/2024_04_22_refactor_download_upload branch April 28, 2024 22:08

AdilZouitine mentioned this pull request Apr 29, 2024

Update UmiProcessor default fps to 10 #116

Merged

menhguin pushed a commit to menhguin/lerobot that referenced this pull request Feb 9, 2025

Refactor the download and publication of the datasets and convert it …

feac375

…into CLI script (huggingface#95) Co-authored-by: Remi <[email protected]>

Kalcy-U referenced this pull request in Kalcy-U/lerobot May 13, 2025

Refactor the download and publication of the datasets and convert it …

af58a4c

…into CLI script (#95) Co-authored-by: Remi <[email protected]>

ZoreAnuj pushed a commit to luckyrobots/lerobot that referenced this pull request Jul 29, 2025

Refactor the download and publication of the datasets and convert it …

650d976

…into CLI script (huggingface#95) Co-authored-by: Remi <[email protected]>

Ricci084 pushed a commit to JeffWang987/lerobot that referenced this pull request Sep 5, 2025

Refactor the download and publication of the datasets and convert it …

d46b5b5

…into CLI script (huggingface#95) Co-authored-by: Remi <[email protected]>
