Updating dataset code to avoid creating multiple iterators from a DataPipe #1708

Merged
merged 1 commit into from
May 11, 2022

NivekT
Contributor

@NivekT NivekT commented May 10, 2022

We will be introducing a change in torch/torchdata that prevents users from creating multiple iterators from a single DataPipe. So that nothing breaks when it lands, this PR updates the DataPipe usages that currently create multiple iterators from one DataPipe. This should not impact any functionality.

Let me know if there are usages of DataPipes outside of torchtext/datasets.

@NivekT NivekT requested a review from Nayef211 May 10, 2022 23:23
@NivekT NivekT requested a review from parmeet May 10, 2022 23:23
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
cache_decompressed_dp_1, cache_decompressed_dp_2 = cache_decompressed_dp.fork(num_instances=2)
Contributor Author

The recommended pattern is to use fork when you need the same DataPipe for multiple usages later.
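
To illustrate why forking matters, here is a minimal pure-Python analogy. It does not use torchdata: `read_records` is a hypothetical one-shot source standing in for a DataPipe, and `itertools.tee` stands in for `.fork(num_instances=2)`. It is only a sketch of the idea, not how DataPipes are implemented.

```python
import itertools


def read_records():
    """Hypothetical one-shot source standing in for a DataPipe."""
    yield from ["a.txt", "b.txt", "c.txt"]


# Naive sharing: two consumers pull from the same underlying stream,
# so each one only sees the items the other did not take.
shared = read_records()
seen_by_consumer_1 = next(shared)
seen_by_consumer_2 = next(shared)

# Fork-style sharing: split the stream once, up front, so that every
# branch observes every item. itertools.tee is the stand-in for fork here.
branch_1, branch_2 = itertools.tee(read_records(), 2)
names_1 = list(branch_1)
names_2 = list(branch_2)
```

In the naive case the two consumers end up with different items from the same stream, while each forked branch yields the full sequence. With DataPipes the corresponding call is the `cache_decompressed_dp.fork(num_instances=2)` line shown in the diff above.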

@@ -125,7 +125,7 @@
 # avoid additional conditional imports.
 def _filter_clean_cache(cache_decompressed_dp, full_filepath, uncleaned_filename):
     cache_inner_decompressed_dp = cache_decompressed_dp.on_disk_cache(filepath_fn=lambda x: full_filepath)
-    cache_inner_decompressed_dp = FileOpener(cache_inner_decompressed_dp, mode="b").load_from_tar()
+    cache_inner_decompressed_dp = cache_inner_decompressed_dp.open_files(mode="b").load_from_tar()
Contributor Author

FYI, we recently added .open_files as the functional form of FileOpener
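
For readers unfamiliar with the class form versus the functional form, the toy sketch below shows the general idea of exposing a wrapper class as a chainable method. Every name in it (`ToyPipe`, `ToyUpper`, `functional_form`, `upper_all`) is hypothetical and made up for illustration; it is only loosely modeled on the pattern and is not the torchdata implementation.

```python
import functools


class ToyPipe:
    """Hypothetical stand-in for a DataPipe base class."""

    _functions = {}  # maps a method name to the class that implements it

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return iter(self.source)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        try:
            cls = ToyPipe._functions[name]
        except KeyError:
            raise AttributeError(name)
        # Bind the current pipe as the first constructor argument.
        return functools.partial(cls, self)


def functional_form(name):
    """Hypothetical decorator that registers a class under a method name."""
    def decorator(cls):
        ToyPipe._functions[name] = cls
        return cls
    return decorator


@functional_form("upper_all")
class ToyUpper(ToyPipe):
    def __iter__(self):
        return (item.upper() for item in self.source)


words = ["a", "b"]
class_form_result = list(ToyUpper(ToyPipe(words)))        # class form
functional_result = list(ToyPipe(words).upper_all())      # functional form
```

Both spellings build the same wrapper, which is why swapping `FileOpener(dp, mode="b")` for `dp.open_files(mode="b")` in the diff should not change behavior; the functional form simply chains more readably.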

Contributor

@Nayef211 Nayef211 left a comment

LGTM thanks for making these changes @NivekT! Test failures aren't related to your changes.

@NivekT NivekT merged commit 4b4d50b into pytorch:main May 11, 2022