AdilZouitine
Collaborator

@AdilZouitine AdilZouitine commented Apr 23, 2024

What does this PR do?

Following discussions with @Cadene, this PR has two main objectives:

  1. Refactors the process of downloading and publishing datasets and adds components to the library that enable users to publish their own datasets.
  2. Introduces a CLI that allows users to publish new datasets (and the current datasets), provided they comply with the Aloha, Umi, Pusht, or Xarm formats.

I tested this PR by running tests/test_datasets.py::test_backward_compatibility, both on its own and with a local data directory: DATA_DIR="tmp/data/path/save/to/disk" pytest -sx tests/test_datasets.py::test_backward_compatibility

@AdilZouitine AdilZouitine changed the title from "[do not review] Refactor the download and publication of the dataset and convert it into CLI tool" to "[do not review] Refactor the download and publication of the datasets and convert it into CLI tool" Apr 23, 2024
@AdilZouitine AdilZouitine changed the title from "[do not review] Refactor the download and publication of the datasets and convert it into CLI tool" to "[do not review] Refactor the download and publication of the datasets and convert it into CLI script" Apr 23, 2024
@aliberts aliberts added the dataset label (issues regarding data inputs, processing, or datasets) Apr 24, 2024
@AdilZouitine AdilZouitine force-pushed the user/azouitine/2024_04_22_refactor_download_upload branch from 28d02c6 to 77863e9 Compare April 27, 2024 19:07
@AdilZouitine AdilZouitine changed the title from "[do not review] Refactor the download and publication of the datasets and convert it into CLI script" to "Refactor the download and publication of the datasets and convert it into CLI script" Apr 28, 2024
@AdilZouitine AdilZouitine marked this pull request as ready for review April 28, 2024 15:39
@Cadene Cadene self-requested a review April 28, 2024 16:12
@AdilZouitine AdilZouitine force-pushed the user/azouitine/2024_04_22_refactor_download_upload branch from fea1a60 to 112f9e6 Compare April 28, 2024 16:55
Collaborator

@Cadene Cadene left a comment

Sorry for the avalanche of comments and suggestions, but they are mostly minor.
This PR is so important. Huge congrats for reaching this state. It's almost done.

dry_run (bool, optional): If True, performs a dry run without actually pushing the dataset. Defaults to False.
revision (str, optional): The revision of the dataset. Defaults to "v1.0".
community_id (str, optional): The ID of the community. Defaults to "lerobot".
preprocess (bool, optional): If True, preprocesses the dataset. Defaults to True.
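
To make the documented arguments concrete, here is a minimal sketch of how `dry_run`, `revision`, and `community_id` could interact. Only those three argument names and their defaults come from the docstring above. The function name `push_dataset_sketch`, the `dataset_id` and `upload` parameters, and the `PushPlan` type are hypothetical and are not taken from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PushPlan:
    """Hypothetical summary of a push request (not part of the PR)."""

    repo_id: str
    revision: str
    uploaded: bool


def push_dataset_sketch(
    dataset_id: str,
    upload: Optional[Callable[[str, str], None]] = None,
    dry_run: bool = False,
    revision: str = "v1.0",
    community_id: str = "lerobot",
) -> PushPlan:
    """Build the target repo id and call `upload` unless this is a dry run."""
    # Assumption: the repo id is "<community_id>/<dataset_id>".
    repo_id = f"{community_id}/{dataset_id}"
    if dry_run or upload is None:
        # Dry run: report what would be pushed without uploading anything.
        return PushPlan(repo_id, revision, uploaded=False)
    upload(repo_id, revision)
    return PushPlan(repo_id, revision, uploaded=True)
```

Passing the upload step in as a callable keeps the sketch free of any real hub API and makes the dry-run path easy to check.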
Collaborator

@Cadene Cadene Apr 28, 2024

I am not sure we should keep the preprocess argument. Removing it could be a worthwhile simplification.

Suggested change
preprocess (bool, optional): If True, preprocesses the dataset. Defaults to True.

Collaborator Author

@AdilZouitine AdilZouitine Apr 28, 2024

I've asked myself the same question. One option is to rename this flag to --no-preprocess, so that preprocessing is done by default.
The flag is only useful for users who want the raw dataset, which will be a small percentage.
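
A minimal sketch of the `--no-preprocess` idea using the standard `argparse` module. Only the flag name comes from the comment above. The parser description, the `build_parser` helper, and the `preprocess` destination name are assumptions for illustration, not the PR's actual CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser with an opt-out preprocessing flag."""
    parser = argparse.ArgumentParser(description="Hypothetical dataset publishing CLI")
    # store_false: args.preprocess is True unless --no-preprocess is passed.
    parser.add_argument(
        "--no-preprocess",
        dest="preprocess",
        action="store_false",
        help="Skip preprocessing and keep the raw dataset.",
    )
    return parser


default_args = build_parser().parse_args([])
raw_args = build_parser().parse_args(["--no-preprocess"])
print(default_args.preprocess)  # True
print(raw_args.preprocess)      # False
```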

Collaborator

@AdilZouitine those users can comment out the preprocessing lines ^^
I think it's safe to remove the preprocess argument.

Comment on lines +64 to +69
available_datasets_without_env = ["lerobot/umi_cup_in_the_wild"]

available_datasets = list(
    itertools.chain(*available_datasets_per_env.values(), available_datasets_without_env)
)
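
For readers unfamiliar with the pattern in the lines above, here is a small self-contained sketch of how `itertools.chain` flattens the per-environment lists and appends the environment-less one. The contents of `available_datasets_per_env` below are placeholders, not the library's real values. Only `lerobot/umi_cup_in_the_wild` appears in the diff.

```python
import itertools

# Placeholder mapping: the real per-environment lists live in the library.
available_datasets_per_env = {
    "env_a": ["org/dataset_1", "org/dataset_2"],
    "env_b": ["org/dataset_3"],
}
available_datasets_without_env = ["lerobot/umi_cup_in_the_wild"]

# Unpack each per-environment list as its own iterable, then chain the
# environment-less list on at the end to get one flat list.
available_datasets = list(
    itertools.chain(*available_datasets_per_env.values(), available_datasets_without_env)
)
print(available_datasets)
# ['org/dataset_1', 'org/dataset_2', 'org/dataset_3', 'lerobot/umi_cup_in_the_wild']
```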

Collaborator

For now, a TODO should be fine

@AdilZouitine AdilZouitine merged commit 55dc9f7 into main Apr 28, 2024
@AdilZouitine AdilZouitine deleted the user/azouitine/2024_04_22_refactor_download_upload branch April 28, 2024 22:08
menhguin pushed a commit to menhguin/lerobot that referenced this pull request Feb 9, 2025
Kalcy-U referenced this pull request in Kalcy-U/lerobot May 13, 2025
ZoreAnuj pushed a commit to luckyrobots/lerobot that referenced this pull request Jul 29, 2025
Ricci084 pushed a commit to JeffWang987/lerobot that referenced this pull request Sep 5, 2025