
Give an option to either provide dataset or dataset_size in distributed sampler #1479


Open

ramanishsingh wants to merge 4 commits into main

Conversation

ramanishsingh (Contributor)

Currently, StatefulDistributedSampler takes the dataset as an argument but only uses its length. This PR adds an option to provide the size of the dataset instead, for more flexibility.
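
For context, here is a minimal sketch of what the proposed dual interface could look like. This is not the PR's actual implementation: the constructor arguments beyond `dataset`/`dataset_size` merely mirror torch's DistributedSampler, and the state_dict/load_state_dict machinery that makes the sampler stateful is omitted.

```python
import math
from typing import Iterator, Optional, Sized

import torch
from torch.utils.data.sampler import Sampler


class StatefulDistributedSampler(Sampler[int]):
    """Sketch only: accepts either a dataset or its bare size."""

    def __init__(
        self,
        dataset: Optional[Sized] = None,
        dataset_size: Optional[int] = None,
        num_replicas: int = 1,
        rank: int = 0,
        shuffle: bool = True,
        seed: int = 0,
        drop_last: bool = False,
    ) -> None:
        # Exactly one of `dataset` / `dataset_size` must be given;
        # internally only the length is ever used.
        if (dataset is None) == (dataset_size is None):
            raise ValueError("provide exactly one of `dataset` or `dataset_size`")
        self.size = len(dataset) if dataset is not None else dataset_size
        self.num_replicas = num_replicas
        self.rank = rank
        self.shuffle = shuffle
        self.seed = seed
        self.drop_last = drop_last
        self.epoch = 0
        if self.drop_last and self.size % self.num_replicas != 0:
            self.num_samples = self.size // self.num_replicas
        else:
            self.num_samples = math.ceil(self.size / self.num_replicas)
        self.total_size = self.num_samples * self.num_replicas

    def __iter__(self) -> Iterator[int]:
        if self.shuffle:
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(self.size, generator=g).tolist()
        else:
            indices = list(range(self.size))
        if self.drop_last:
            indices = indices[: self.total_size]
        else:
            # pad by repeating indices so every rank draws the same count
            padding = self.total_size - len(indices)
            if padding > 0:
                indices += (indices * math.ceil(padding / len(indices)))[:padding]
        # each rank takes a strided slice of the shared index list
        return iter(indices[self.rank : self.total_size : self.num_replicas])

    def __len__(self) -> int:
        return self.num_samples

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch
```

Constructed as `StatefulDistributedSampler(dataset_size=10_000, num_replicas=4, rank=0)`, this would yield the same indices as passing a dataset of length 10,000.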

facebook-github-bot added the CLA Signed label on Apr 29, 2025 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).

pytorch-bot bot commented Apr 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1479

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit 1116f96 with merge base aef2409, one job was cancelled. Please retry.
This comment was automatically generated by Dr. CI and updates every 15 minutes.

ramanishsingh marked this pull request as ready for review April 29, 2025 17:33
@@ -179,19 +182,66 @@ def __iter__(self):
        )


-class StatefulDistributedSampler(torch.utils.data.distributed.DistributedSampler):
+class StatefulDistributedSampler(Sampler[int]):
Contributor commented:

I think we should continue subclassing DistributedSampler for StatefulDistributedSampler - it is easy to understand that just from the naming, and we might otherwise trigger many type-checking issues in downstream code that uses StatefulDistributedSampler and expects a variant of DistributedSampler.

Since DistributedSampler is a common utility in PyTorch, StatefulDistributedSampler should be expected to be an extension of it.
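
For reference, one way to reconcile both goals without touching the upstream class would be a length-only shim passed to the parent constructor. This is only a sketch of the subclassing alternative; the `_SizeOnlyDataset` helper is hypothetical and not part of this PR.

```python
from typing import Optional

from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


class _SizeOnlyDataset(Dataset):
    # Hypothetical shim: DistributedSampler only ever calls len(dataset),
    # so an object that merely carries a length is enough to satisfy it.
    def __init__(self, size: int) -> None:
        self._size = size

    def __getitem__(self, index: int) -> None:
        raise IndexError("size-only placeholder; holds no items")

    def __len__(self) -> int:
        return self._size


class StatefulDistributedSampler(DistributedSampler):
    def __init__(
        self,
        dataset: Optional[Dataset] = None,
        dataset_size: Optional[int] = None,
        **kwargs,
    ) -> None:
        # Exactly one of `dataset` / `dataset_size` must be given.
        if (dataset is None) == (dataset_size is None):
            raise ValueError("provide exactly one of `dataset` or `dataset_size`")
        if dataset is None:
            dataset = _SizeOnlyDataset(dataset_size)
        super().__init__(dataset, **kwargs)
```

With this approach, `isinstance(sampler, DistributedSampler)` stays True for downstream type checks, e.g. for `StatefulDistributedSampler(dataset_size=50_000, num_replicas=8, rank=0)`.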

ramanishsingh (Contributor, Author) replied:

I decided to fork it instead of subclassing because I do not want to upstream these changes into torch.utils.data.distributed.DistributedSampler, as they might break other users' code.
Nevertheless, it is redundant to require a Dataset as an argument when we only need its length.

divyanshk (Contributor) left a comment:

Let's sync if there is a better way to do this - maybe create a new sampler?

Labels: CLA Signed
Projects: None yet
3 participants