-
Notifications
You must be signed in to change notification settings - Fork 812
Adding secondary caching to datasets #1594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding secondary caching to datasets #1594
Conversation
I seem to intermittently get the following error when executing the following piece of code to download datasets from Google drive:
Stack trace
@ejguan any idea why this happens inconsistently (i.e. running the above snippet might work after 4 or 5 tries). |
It seems Did you encounter such Error before? I think the implementation here is copied from TorchText text/torchtext/_download_hooks.py Lines 21 to 48 in 4be2792
If you have never encountered it before, I think the problem comes from that we explicitly set streaming for the |
Codecov Report
@@ Coverage Diff @@
## main #1594 +/- ##
==========================================
+ Coverage 80.34% 80.35% +0.01%
==========================================
Files 58 58
Lines 2569 2571 +2
==========================================
+ Hits 2064 2066 +2
Misses 505 505
Continue to review full report at Codecov.
|
I've been noticing this error for the last week or so for all of the datasets that are being downloaded from Google Drive.
The documentation for |
I mean if you have encountered it before revamping Dataset into DataPipe. I believe when @parmeet implemented the GDriveDataPipe, we basically copied the code from TorchText as I linked in the previous comment.
I am trying to figure out which part causes the issue. If TorchText doesn't have any error with previous implementation in TorchText. This Error has to be introduced by the changes we made (1 or 2). And, I tried to reproduce it on my local machine but not succeed. Do you want to give a try by changing |
I don't think the change is that simple since setting
I wonder if it's because we need to modify additional logic if we start downloading the contents immediately instead of streaming. Should we create an issue on torchdata to keep track of this and move the discussions there? |
Yeah. Please do so! Thanks . |
@Nayef211, @ejguan This error existed in torchtext even before the migration, So I don't think any changes on torchdata would have triggered this. In my experience, I have seen this when the quota is exceeded. This is a transient error, though I never dig deeper into if there is a way to increase the quota limit or other alternatives to prevent this from happening. I would agree raising an issue on torchdata, so that we can discuss the problem and potential solutions in there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding cache :)
Reference issue #1589
Summary
Test
next(iter(dataset))
on the relevant datasets to verify data is being returned correctly