new fsdp dataset: nvidia-deeplearningexamples #706


Closed
wants to merge 7 commits

Conversation

mvinci12
Contributor

Issue #, if available:

Description of changes:
Due to the allenai/c4 dataset error **[rank3]: FileNotFoundError: gzip://c4-validation.00003-of-00008.json::hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz**, we are changing the dataset to https://huggingface.co/datasets/BEE-spoke-data/Nvidia-DeepLearningExamples/viewer/default/train?views%5B%5D=train

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@KeitaW left a comment


@KeitaW
Contributor

KeitaW commented Jun 3, 2025

Discussed the motivation internally with @mvinci12; however, Nvidia-DeepLearningExamples seems to be a coding dataset, while the C4 dataset is a more general corpus. Can we see if another general corpus, such as FineWeb, could be used as an alternative?

@KeitaW
Contributor

KeitaW commented Jun 3, 2025

Tip: the FineWeb dataset does not have a train/validation split, but you can always split the data manually, e.g.:

```python
from datasets import load_dataset

streaming_train = load_dataset("HuggingFaceFW/fineweb", split="train[:90%]", streaming=True)
streaming_val = load_dataset("HuggingFaceFW/fineweb", split="train[90%:]", streaming=True)
```

cf. https://huggingface.co/docs/datasets/v1.11.0/splits.html
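The same 90/10 split can also be expressed with take/skip semantics, which is how streaming (iterable) datasets are usually carved up. A minimal stdlib sketch of that slicing logic, using a plain iterator in place of the streamed dataset (the 1000-example stream and the 90% boundary are illustrative):

```python
from itertools import islice

def take(stream, n):
    """Yield the first n examples, mirroring IterableDataset.take(n)."""
    return islice(stream, n)

def skip(stream, n):
    """Skip the first n examples, mirroring IterableDataset.skip(n)."""
    return islice(stream, n, None)

# Illustrative 1000-example "stream" split 90/10, as in the comment above.
stream = list(range(1000))
n_train = int(len(stream) * 0.9)
train = list(take(iter(stream), n_train))
val = list(skip(iter(stream), n_train))
assert len(train) == 900 and len(val) == 100
```

Note that take/skip only walk the stream once each, so the two halves never overlap as long as both use the same boundary.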

@mvinci12
Contributor Author

mvinci12 commented Jun 6, 2025

The new commit uses the wikimedia/wikipedia dataset (https://huggingface.co/datasets/wikimedia/wikipedia). I also added new arguments for users:

--train_split : the specific HF split you'd like to use for training
--val_split : the split to use for validation
--n_samples_train : the number of samples you'd like to use for training
--n_samples_val : the number of samples you'd like to use for validation

The training and validation splits were previously hardcoded to "train" and "validation", but many newer popular datasets only have a "train" split.
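The four flags above can be sketched with argparse; the defaults here are hypothetical, not taken from the PR:

```python
import argparse

parser = argparse.ArgumentParser(description="FSDP training data options")
parser.add_argument("--train_split", type=str, default="train",
                    help="HF split to use for training")
parser.add_argument("--val_split", type=str, default="train",
                    help="HF split to use for validation")
parser.add_argument("--n_samples_train", type=int, default=6150,
                    help="number of streamed samples to take for training")
parser.add_argument("--n_samples_val", type=int, default=6150,
                    help="number of streamed samples to take for validation")

# Parse an illustrative command line.
args = parser.parse_args(["--train_split", "train", "--n_samples_val", "500"])
assert args.train_split == "train" and args.n_samples_val == 500
```

Defaulting `--val_split` to "train" (rather than "validation") matches the motivation in the comment: datasets with only a "train" split then work out of the box.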

Contributor

@KeitaW left a comment


added comments

```diff
@@ -474,10 +474,12 @@ def create_streaming_dataloader(dataset,
                                 batch_size=1,
                                 max_context_width=4096,
                                 workers=4,
-                                split=None):
+                                split=None,
+                                n_samples=6150):
     print(f"dataset={dataset}, name={name}")
     tokenizer = AutoTokenizer.from_pretrained(tokenizer)
     data = load_dataset(dataset, name=name, streaming=True, split=split, trust_remote_code=True).shuffle(42+global_rank)
```
Contributor


Aren't we getting a different shuffle per GPU, and hence loading different data per rank?
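The concern can be illustrated with plain stdlib shuffles: seeding with 42 + global_rank gives every rank its own ordering, so a subsequent take(n) draws a different (and possibly overlapping) subset on each rank. A small sketch of that effect (ranks, data, and sample size are illustrative):

```python
import random

def ranked_sample(rank, data, n):
    """Shuffle a copy of data with seed 42 + rank, then take the first n,
    mimicking shuffle(42 + global_rank) followed by take(n)."""
    order = list(data)
    random.Random(42 + rank).shuffle(order)
    return order[:n]

data = list(range(100))
subsets = [ranked_sample(rank, data, 10) for rank in range(4)]
# Each rank gets 10 samples, but the per-rank orderings differ,
# which is exactly the reviewer's point.
assert all(len(s) == 10 for s in subsets)
assert subsets[0] != subsets[1]
```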

```python
print(f"dataset={dataset}, name={name}")
tokenizer = AutoTokenizer.from_pretrained(tokenizer)
data = load_dataset(dataset, name=name, streaming=True, split=split, trust_remote_code=True).shuffle(42+global_rank)
data = data.take(n_samples)
```
Contributor


Why do we need to sample here? The resultant dataset seems to be very small.

Contributor Author


Yeah, it's not needed for this dataset, but I'm thinking it should be configurable for larger/custom datasets.
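For reference, take(n) caps an epoch at n examples, or fewer if the stream runs out first; a quick sketch of that behavior with a plain iterator (the helper below is illustrative, not the 🤗 Datasets implementation):

```python
from itertools import islice

def take_stream(stream, n):
    """Return at most n examples from an iterable, like IterableDataset.take(n)."""
    return list(islice(stream, n))

# A long stream is capped at n...
assert len(take_stream(iter(range(1_000_000)), 6150)) == 6150
# ...while a short stream just yields everything it has.
assert len(take_stream(iter(range(100)), 6150)) == 100
```

This is why a configurable n_samples matters: the default of 6150 silently shrinks any larger dataset to 6150 examples per epoch.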

@mhuguesaws
Contributor

mhuguesaws commented Jun 7, 2025

Run failure due to:

```
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/datasets/wikimedia/wikipedia/resolve/b04c8d1ceb2f5cd4588862100d08de323dccfbaa/20231101.en/train-00032-of-00041.parquet
```

Check results in CI https://github.com/aws-samples/awsome-distributed-training/actions/runs/15500434999/job/43646863450

Let's keep the C4 dataset and ask users to have an HF_TOKEN created.

@nghtm
Contributor

nghtm commented Jun 7, 2025

@mvinci12 You can find readme instructions for creating HF Token here: https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/picotron#build-environment

@mvinci12
Contributor Author

mvinci12 commented Jun 9, 2025

The newest commit is back to the c4 dataset, and I added an argument to input the Hugging Face access token. The regression test will probably fail because no Hugging Face token is passed in the arguments, but I'm working with @amanshanbhag on this tomorrow.

@mhuguesaws
Contributor

The HF token was already in the doc for the Mistral models, since it is mandatory there. Setting HF_TOKEN as an environment variable is sufficient.
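The two approaches under discussion (a CLI token argument vs. the HF_TOKEN environment variable) can coexist with a simple precedence rule: an explicit token wins, otherwise fall back to the environment. A sketch with a hypothetical helper name, values illustrative:

```python
import os

def resolve_hf_token(cli_token=None):
    """Prefer an explicit CLI token; otherwise fall back to the
    HF_TOKEN environment variable (None if neither is set)."""
    return cli_token or os.environ.get("HF_TOKEN")

os.environ["HF_TOKEN"] = "hf_dummy_env_token"  # illustrative value
assert resolve_hf_token("hf_dummy_cli_token") == "hf_dummy_cli_token"
assert resolve_hf_token() == "hf_dummy_env_token"
```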

@mhuguesaws
Contributor

Let's keep the C4 dataset, since the cause was unauthenticated requests to HF. A new PR will probably be better and we can close this one. Thoughts?

4 participants