new fsdp dataset: nvidia-deeplearningexamples #706


Closed
wants to merge 7 commits

Conversation

mvinci12
Contributor

Issue #, if available:

Description of changes:
Due to the allenai/c4 dataset error **[rank3]: FileNotFoundError: gzip://c4-validation.00003-of-00008.json::hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz**, we are changing the dataset to https://huggingface.co/datasets/BEE-spoke-data/Nvidia-DeepLearningExamples/viewer/default/train?views%5B%5D=train

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@KeitaW left a comment


@KeitaW
Contributor

KeitaW commented Jun 3, 2025

Discussed the motivation internally with @mvinci12; however, Nvidia-DeepLearningExamples seems to be a coding dataset, while the C4 dataset is a more general corpus. Can we see if another general corpus, such as FineWeb, could be used as an alternative?

@KeitaW
Contributor

KeitaW commented Jun 3, 2025

Tip: the FineWeb dataset does not have a train/validation split, but you can always split the data manually, e.g.:

```python
from datasets import load_dataset

streaming_train = load_dataset("HuggingFaceFW/fineweb", split="train[:90%]", streaming=True)
streaming_val = load_dataset("HuggingFaceFW/fineweb", split="train[90%:]", streaming=True)
```

cf. https://huggingface.co/docs/datasets/v1.11.0/splits.html
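The same 90/10 split can also be expressed with take/skip semantics, which is how streaming (iterable) datasets are usually carved up. A minimal stdlib sketch of that slicing logic, using a plain iterator in place of the streamed dataset (the 1000-example stream and the 90% boundary are illustrative):

```python
from itertools import islice

def take(stream, n):
    """Yield the first n examples, mirroring IterableDataset.take(n)."""
    return islice(stream, n)

def skip(stream, n):
    """Skip the first n examples, mirroring IterableDataset.skip(n)."""
    return islice(stream, n, None)

# Illustrative 1000-example "stream" split 90/10, as in the comment above.
stream = list(range(1000))
n_train = int(len(stream) * 0.9)
train = list(take(iter(stream), n_train))
val = list(skip(iter(stream), n_train))
assert len(train) == 900 and len(val) == 100
```

Note that take/skip only walk the stream once each, so the two halves never overlap as long as both use the same boundary.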

@mvinci12
Contributor Author

mvinci12 commented Jun 6, 2025

The new commit uses the wikimedia/wikipedia dataset (https://huggingface.co/datasets/wikimedia/wikipedia). I also added new arguments for users:

--train_split : the specific HF split you'd like to use for training
--val_split : the split to use for validation
--n_samples_train : the number of samples you'd like to use for training
--n_samples_val : the number of samples you'd like to use for validation

The training and validation splits were previously hardcoded to "train" and "validation", but many newer popular datasets only have a "train" split.
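The four flags above can be sketched with argparse; the defaults here are hypothetical, not taken from the PR:

```python
import argparse

parser = argparse.ArgumentParser(description="FSDP training data options")
parser.add_argument("--train_split", type=str, default="train",
                    help="HF split to use for training")
parser.add_argument("--val_split", type=str, default="train",
                    help="HF split to use for validation")
parser.add_argument("--n_samples_train", type=int, default=6150,
                    help="number of streamed samples to take for training")
parser.add_argument("--n_samples_val", type=int, default=6150,
                    help="number of streamed samples to take for validation")

# Parse an illustrative command line.
args = parser.parse_args(["--train_split", "train", "--n_samples_val", "500"])
assert args.train_split == "train" and args.n_samples_val == 500
```

Defaulting `--val_split` to "train" (rather than "validation") matches the motivation in the comment: datasets with only a "train" split then work out of the box.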

Contributor

@KeitaW left a comment


added comments

```diff
@@ -474,10 +474,12 @@ def create_streaming_dataloader(dataset,
                                 batch_size=1,
                                 max_context_width=4096,
                                 workers=4,
-                                split=None):
+                                split=None,
+                                n_samples=6150):
     print(f"dataset={dataset}, name={name}")
     tokenizer = AutoTokenizer.from_pretrained(tokenizer)
     data = load_dataset(dataset, name=name, streaming=True, split=split, trust_remote_code=True).shuffle(42+global_rank)
```
Contributor


Aren't we getting a different shuffle per GPU, and hence loading different data per rank?
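The concern can be illustrated with plain stdlib shuffles: seeding with 42 + global_rank gives every rank its own ordering, so a subsequent take(n) draws a different (and possibly overlapping) subset on each rank. A small sketch of that effect (ranks, data, and sample size are illustrative):

```python
import random

def ranked_sample(rank, data, n):
    """Shuffle a copy of data with seed 42 + rank, then take the first n,
    mimicking shuffle(42 + global_rank) followed by take(n)."""
    order = list(data)
    random.Random(42 + rank).shuffle(order)
    return order[:n]

data = list(range(100))
subsets = [ranked_sample(rank, data, 10) for rank in range(4)]
# Each rank gets 10 samples, but the per-rank orderings differ,
# which is exactly the reviewer's point.
assert all(len(s) == 10 for s in subsets)
assert subsets[0] != subsets[1]
```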

```python
print(f"dataset={dataset}, name={name}")
tokenizer = AutoTokenizer.from_pretrained(tokenizer)
data = load_dataset(dataset, name=name, streaming=True, split=split, trust_remote_code=True).shuffle(42+global_rank)
data = data.take(n_samples)
```
Contributor


Why do we need to sample here? The resultant dataset seems to be very small.

Contributor Author


Yeah, it's not needed for this dataset, but I'm thinking it should be configurable for larger/custom datasets.
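For reference, take(n) caps an epoch at n examples, or fewer if the stream runs out first; a quick sketch of that behavior with a plain iterator (the helper below is illustrative, not the 🤗 Datasets implementation):

```python
from itertools import islice

def take_stream(stream, n):
    """Return at most n examples from an iterable, like IterableDataset.take(n)."""
    return list(islice(stream, n))

# A long stream is capped at n...
assert len(take_stream(iter(range(1_000_000)), 6150)) == 6150
# ...while a short stream just yields everything it has.
assert len(take_stream(iter(range(100)), 6150)) == 100
```

This is why a configurable n_samples matters: the default of 6150 silently shrinks any larger dataset to 6150 examples per epoch.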

@mhuguesaws
Contributor

mhuguesaws commented Jun 7, 2025

Run failure due to:

```
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/datasets/wikimedia/wikipedia/resolve/b04c8d1ceb2f5cd4588862100d08de323dccfbaa/20231101.en/train-00032-of-00041.parquet
```

Check results in CI https://github.com/aws-samples/awsome-distributed-training/actions/runs/15500434999/job/43646863450

Let's keep the C4 dataset and ask users to have an HF_TOKEN created.

@nghtm
Contributor

nghtm commented Jun 7, 2025

@mvinci12 You can find readme instructions for creating HF Token here: https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/picotron#build-environment

@mvinci12
Contributor Author

mvinci12 commented Jun 9, 2025

The newest commit is back to the c4 dataset, and I added an argument to input the Hugging Face access token. The regression test will probably fail because no Hugging Face token is passed in the arguments, but I'm working with @amanshanbhag on this tomorrow.

@mhuguesaws
Contributor

The HF token was already in the doc for the Mistral models, since it is mandatory there. Setting HF_TOKEN as an environment variable is sufficient.
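The two approaches under discussion (a CLI token argument vs. the HF_TOKEN environment variable) can coexist with a simple precedence rule: an explicit token wins, otherwise fall back to the environment. A sketch with a hypothetical helper name, values illustrative:

```python
import os

def resolve_hf_token(cli_token=None):
    """Prefer an explicit CLI token; otherwise fall back to the
    HF_TOKEN environment variable (None if neither is set)."""
    return cli_token or os.environ.get("HF_TOKEN")

os.environ["HF_TOKEN"] = "hf_dummy_env_token"  # illustrative value
assert resolve_hf_token("hf_dummy_cli_token") == "hf_dummy_cli_token"
assert resolve_hf_token() == "hf_dummy_env_token"
```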

@mhuguesaws
Contributor

Let's keep the C4 dataset, since the cause was unauthenticated requests to HF. A new PR will probably be better and we can close this one. Thoughts?

4 participants