new fsdp dataset: nvidia-deeplearningexamples #706
Conversation
Internally discussed the motivation with @mvinci12. However, Nvidia-DeepLearningExamples appears to be a coding dataset, while the C4 dataset is a more general corpus. Can we see if another general corpus, such as FineWeb, could be used as an alternative?
Tip: the FineWeb dataset does not have a train/validation split, but you can always split the data manually:
`streaming_train = load_dataset("HuggingFaceFW/fineweb", split="train[:90%]", streaming=True)`
`streaming_val = load_dataset("HuggingFaceFW/fineweb", split="train[90%:]", streaming=True)`
cf. https://huggingface.co/docs/datasets/v1.11.0/splits.html
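Note that percent-style split slicing may not work together with `streaming=True` on all `datasets` versions. A minimal alternative sketch, assuming `IterableDataset.take`/`skip` and an illustrative validation size that is not from this PR:

```python
from datasets import load_dataset

# Carve a small validation set off the front of the streaming "train" split,
# then skip past it for training. VAL_SAMPLES is an illustrative number.
VAL_SAMPLES = 10_000

stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
streaming_val = stream.take(VAL_SAMPLES)
streaming_train = stream.skip(VAL_SAMPLES)
```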
…so new dataset arguments
The new commit uses the wikimedia/wikipedia dataset (https://huggingface.co/datasets/wikimedia/wikipedia). I also added new arguments for users:
--train_split : the specific HF split you'd like to use for training.
The training and validation splits were previously hardcoded to "train" and "validation", but newer popular datasets often only have "train" as a split.
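A minimal sketch of how such a split argument could be wired in; the other argument names, the defaults, and the dataset config below are assumptions for illustration, not the PR's exact code:

```python
import argparse
from datasets import load_dataset

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="wikimedia/wikipedia")
parser.add_argument("--dataset_config", default="20231101.en")
# Newer datasets often expose only a "train" split, so let the user pick it.
parser.add_argument("--train_split", default="train")
args = parser.parse_args()

train_data = load_dataset(args.dataset, name=args.dataset_config,
                          split=args.train_split, streaming=True)
```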
added comments
@@ -474,10 +474,12 @@ def create_streaming_dataloader(dataset,
                                 batch_size=1,
                                 max_context_width=4096,
                                 workers=4,
-                                split=None):
+                                split=None,
+                                n_samples=6150):
     print(f"dataset={dataset}, name={name}")
     tokenizer = AutoTokenizer.from_pretrained(tokenizer)
     data = load_dataset(dataset, name=name, streaming=True, split=split, trust_remote_code=True).shuffle(42+global_rank)
Aren't we getting a different shuffle per GPU, and hence loading different splits per rank?
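For comparison, a common pattern is to shuffle with the same seed on every rank and then shard the stream per rank with `datasets.distributed.split_dataset_by_node`. The sketch below is an illustration of that alternative, assuming an initialized torch.distributed process group; it is not what this PR implements:

```python
import torch.distributed as dist
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# Same shuffle seed on every rank, then a disjoint shard per rank,
# so ranks do not train on overlapping or divergent streams.
data = load_dataset("allenai/c4", name="en", split="train", streaming=True)
data = data.shuffle(seed=42, buffer_size=10_000)
data = split_dataset_by_node(data, rank=dist.get_rank(), world_size=dist.get_world_size())
```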
print(f"dataset={dataset}, name={name}") | ||
tokenizer = AutoTokenizer.from_pretrained(tokenizer) | ||
data = load_dataset(dataset, name=name, streaming=True, split=split, trust_remote_code=True).shuffle(42+global_rank) | ||
data = data.take(n_samples) |
Why do we need to sample here? The resultant dataset seems to be very small.
Yeah, it's not needed for this dataset, but I'm thinking it should be configurable for larger/custom datasets.
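A sketch of making the cap optional, assuming a hypothetical helper where `n_samples` defaults to `None` (not the PR's current signature):

```python
from datasets import load_dataset

def build_stream(dataset, name=None, split="train", n_samples=None):
    """Hypothetical helper: only truncate the stream when a cap is given."""
    data = load_dataset(dataset, name=name, split=split, streaming=True)
    if n_samples is not None:
        data = data.take(n_samples)
    return data
```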
Run failure due to unauthenticated requests to HF.
Check the results in CI: https://github.com/aws-samples/awsome-distributed-training/actions/runs/15500434999/job/43646863450. Let's keep the C4 dataset and ask to have an HF_TOKEN created.
@mvinci12 You can find README instructions for creating an HF token here: https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/picotron#build-environment
The newest commit is back to the C4 dataset, and I added an argument to pass in the Hugging Face access token. The regression test will probably fail because no Hugging Face token is passed as an argument, but I'm working with @amanshanbhag on this tomorrow.
The HF token was already in the doc for Mistral models, since it is mandatory there. Setting HF_TOKEN as an environment variable is sufficient.
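For reference, a short sketch of the two options; the `token=` keyword on `load_dataset` and automatic pick-up of the `HF_TOKEN` environment variable are assumptions about recent datasets/huggingface_hub versions:

```python
import os
from datasets import load_dataset

# Option 1: export HF_TOKEN before launching; huggingface_hub picks it up
# automatically, so no extra argument is needed.
data = load_dataset("allenai/c4", name="en", split="train", streaming=True)

# Option 2: pass the token explicitly, e.g. plumbed through a CLI argument.
data = load_dataset("allenai/c4", name="en", split="train", streaming=True,
                    token=os.environ.get("HF_TOKEN"))
```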
Let's keep the C4 dataset, since the cause was unauthenticated requests to HF. A new PR will probably be better and we can close this one. Thoughts?
Issue #, if available:
Description of changes:
Due to the allenai/c4 dataset error

**[rank3]: FileNotFoundError: gzip://c4-validation.00003-of-00008.json::hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz**

we are changing the dataset to https://huggingface.co/datasets/BEE-spoke-data/Nvidia-DeepLearningExamples/viewer/default/train?views%5B%5D=train

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.