@Cadene Cadene commented Mar 14, 2024

We can quickly download and load our datasets with:

data_dir = snapshot_download(repo_id=f"cadene/{self.dataset_id}", repo_type="dataset")
storage = TensorStorage(TensorDict.load_memmap(data_dir))

Datasets will be stored in .cache by default:

$ ls $HOME/.cache/huggingface/hub/datasets--cadene--pusht/snapshots/a7ee4130aea55af096033347464d92bf54c72867/
README.md  action.memmap  episode.memmap  frame_id.memmap  meta.json  next  observation  stats.pth
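
The path in the listing above follows the hub cache layout: the repository type plus `s`, then the repo id with `/` replaced by `--`, then `snapshots/<commit hash>`. A minimal sketch of that naming, assuming the default cache root (no `HF_HOME` or `HF_HUB_CACHE` override); the helper names `hub_repo_cache_dir` and `snapshot_dir` are hypothetical and not part of this PR or of `huggingface_hub`:

```python
from pathlib import Path
from typing import Optional


def hub_repo_cache_dir(repo_id: str, repo_type: str = "dataset",
                       cache_root: Optional[Path] = None) -> Path:
    """Return the folder where the hub cache keeps one repository."""
    if cache_root is None:
        # Assumed default location when no cache environment variable is set.
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    # "cadene/pusht" with repo_type "dataset" becomes "datasets--cadene--pusht".
    folder = "--".join([f"{repo_type}s", *repo_id.split("/")])
    return cache_root / folder


def snapshot_dir(repo_id: str, revision_hash: str,
                 cache_root: Optional[Path] = None) -> Path:
    """Return the folder of one downloaded revision inside the cache."""
    return hub_repo_cache_dir(repo_id, cache_root=cache_root) / "snapshots" / revision_hash


# Reproduces the directory listed above, relative to the current user's home.
print(snapshot_dir("cadene/pusht", "a7ee4130aea55af096033347464d92bf54c72867"))
```

In practice the return value of `snapshot_download` already points at this snapshot folder, so the helpers are only useful for inspecting or cleaning the cache by hand.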

Uploading / updating is easy. I added some info in README.

Datasets added:

Future work:

  • update dataset format: video/mp4 instead of uint8
  • add datasets card
  • add simxarm datasets

@Cadene Cadene changed the title from "[WIP] Add pusht on hf dataset" to "Add pusht and aloha to hugging face dataset hub" Mar 15, 2024
@Cadene Cadene changed the title from "Add pusht and aloha to hugging face dataset hub" to "Add pusht on hf dataset" Mar 15, 2024
@Cadene Cadene changed the title from "Add pusht on hf dataset" to "Download datasets from hugging face" Mar 15, 2024

@aliberts aliberts left a comment

Left some comments, otherwise LGTM

torchvision = "^0.17.1"
h5py = "^3.10.0"
dm-control = "1.0.14"
huggingface-hub = {extras = ["hf-transfer"], version = "^0.21.4"}

@aliberts aliberts Mar 15, 2024

You should also add it to the CI env:

cd .github/poetry/cpu && poetry add "huggingface-hub[hf-transfer]"

(I know, this is going to be a pain until we have better CI pipelines)

Collaborator Author

Good catch!!!! @aliberts if you have cycles, might be good to add a test for this in the meantime (if possible).

@aliberts

Come to think of it, the CI passed because it uses the test artifacts stored by git lfs from the repo and so never needs to download the dataset.

Should we have a light/mock version of the dataset to download for testing? Should we test this at all? Not sure what the best course of action would be here.

…t in the pyproject.toml and will be skipped:

  - huggingface-hub

If you want to update it to the latest compatible version, you can use `poetry update package`.
If you prefer to upgrade it to the latest available version, you can use `poetry add package@latest`.

Nothing to add.

Cadene commented Mar 15, 2024

> Come to think of it, the CI passed because it uses the test artifacts stored by git lfs from the repo and so never needs to download the dataset.
>
> Should we have a light/mock version of the dataset to download for testing? Should we test this at all? Not sure what the best course of action would be here.

There might be a utility in huggingface-cli but YOLO (for now)
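
One possible way to cover the download path without touching the network is to inject the download function and substitute a fake in tests. This is only a sketch of that idea, not what the PR does: `resolve_data_dir` and `make_fake_download` are hypothetical names, and the fake merely mimics the keyword arguments passed to `snapshot_download` in the snippet at the top of this thread.

```python
import tempfile
from pathlib import Path
from typing import Callable


def resolve_data_dir(dataset_id: str, download: Callable[..., str]) -> Path:
    """Hypothetical helper: fetch a dataset snapshot and return its local folder."""
    return Path(download(repo_id=f"cadene/{dataset_id}", repo_type="dataset"))


def make_fake_download(tmp_root: Path, calls: list) -> Callable[..., str]:
    """Build a stand-in for the real download that writes a tiny placeholder dataset."""

    def fake_download(repo_id: str, repo_type: str) -> str:
        # Record the call so the test can check which repo was requested.
        calls.append((repo_id, repo_type))
        target = tmp_root / repo_id.replace("/", "--")
        target.mkdir(parents=True, exist_ok=True)
        (target / "meta.json").write_text("{}")
        return str(target)

    return fake_download


def test_download_path_is_exercised() -> None:
    calls = []
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = resolve_data_dir("pusht", make_fake_download(Path(tmp), calls))
        assert (data_dir / "meta.json").exists()
    assert calls == [("cadene/pusht", "dataset")]


test_download_path_is_exercised()
```

A test like this checks the wiring (repo id, repo type, returned path) but not the real hub; a small dataset repo dedicated to CI would be the complementary option raised in the question above.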

@Cadene Cadene requested a review from aliberts March 15, 2024 10:58

@aliberts aliberts left a comment

LGTM!

self.data_dir = self.root / self.dataset_id

storage = self._download_or_load_storage()
self.root = root if root is None else Path(root)

@aliberts aliberts Mar 15, 2024

I would then suggest removing the type hint root: Path = None from the signature above, but that could wait for a dedicated PR on type hinting later on; it's not urgent at the moment.

EDIT: Never mind! I didn't see your reply above when writing this.
(root: Path | None = None would be the way to go then)

@Cadene Cadene merged commit 9c88071 into main Mar 15, 2024

@alexander-soare alexander-soare left a comment

lgtm

@aliberts aliberts deleted the user/rcadene/2024_03_14_hf_dataset branch April 27, 2024 08:16
menhguin pushed a commit to menhguin/lerobot that referenced this pull request Feb 9, 2025
…_hf_dataset

Download datasets from hugging face
Kalcy-U referenced this pull request in Kalcy-U/lerobot May 13, 2025