
Adding secondary caching to datasets #1594

Merged

Conversation

Nayef211
Contributor

@Nayef211 Nayef211 commented Feb 8, 2022

Reference issue #1589

Summary

  • Adding secondary caching to SogouNews and AmazonReviewFull datasets (see the sketch below)

Test

  • Delete cache and call next(iter(dataset)) on the relevant datasets to verify data is being returned correctly
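
Roughly, the change adds a second on_disk_cache/end_caching pair around the extracted csv, in addition to the existing one around the downloaded archive. A simplified sketch (names such as _PATH, MD5, and _EXTRACTED_FILES follow the pattern used by the other torchtext datasets; the exact diff may differ):

import os

from torchdata.datapipes.iter import FileOpener, GDriveReader, IterableWrapper


def _sketch_dataset_pipeline(root, split, url, _PATH, MD5, _EXTRACTED_FILES):
    # Primary cache: the compressed archive downloaded from Google Drive.
    url_dp = IterableWrapper([url])
    cache_compressed_dp = url_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _PATH),
        hash_dict={os.path.join(root, _PATH): MD5},
        hash_type="md5",
    )
    cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)

    # Secondary cache (what this PR adds): the csv extracted from the archive,
    # so re-running the dataset skips both the download and the extraction.
    cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _EXTRACTED_FILES[split])
    )
    cache_decompressed_dp = (
        FileOpener(cache_decompressed_dp, mode="b")
        .read_from_tar()
        .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
    )
    return cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)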

@Nayef211
Contributor Author

Nayef211 commented Feb 8, 2022

I intermittently get the following error when executing this snippet to download a dataset from Google Drive:

>>> from torchtext.datasets.amazonreviewfull import AmazonReviewFull
>>> dataset = AmazonReviewFull(split="train")
>>> next(iter(dataset))

Stack trace

  File "/home/nayef211/.local/lib/python3.8/site-packages/torchdata/datapipes/iter/load/online.py", line 98, in __iter__
    yield _get_response_from_google_drive(url, timeout=self.timeout)
  File "/home/nayef211/.local/lib/python3.8/site-packages/torchdata/datapipes/iter/load/online.py", line 74, in _get_response_from_google_drive
    raise RuntimeError("Internal error: headers don't contain content-disposition.")
RuntimeError: Internal error: headers don't contain content-disposition.

@ejguan any idea why this happens inconsistently (i.e., running the above snippet might only work after 4 or 5 tries)?

cc @parmeet @abhinavarora

@Nayef211 Nayef211 requested a review from parmeet February 8, 2022 16:39
@ejguan
Contributor

ejguan commented Feb 8, 2022

It seems content-disposition is needed to fetch the filename.

Did you encounter such an error before? I think the implementation here was copied from TorchText:

def _get_response_from_google_drive(url):
    confirm_token = None
    session = requests.Session()
    response = session.get(url, stream=True)
    for k, v in response.cookies.items():
        if k.startswith("download_warning"):
            confirm_token = v
    if confirm_token is None:
        if "Quota exceeded" in str(response.content):
            raise RuntimeError(
                "Google drive link {} is currently unavailable, because the quota was exceeded.".format(
                    url
                ))
        else:
            raise RuntimeError("Internal error: confirm_token was not found in Google drive link.")
    url = url + "&confirm=" + confirm_token
    response = session.get(url, stream=True)
    if 'content-disposition' not in response.headers:
        raise RuntimeError("Internal error: headers don't contain content-disposition.")
    filename = re.findall("filename=\"(.+)\"", response.headers['content-disposition'])
    if filename is None:
        raise RuntimeError("Filename could not be autodetected")
    filename = filename[0]
    return response, filename

If you have never encountered it before, I think the problem comes from the fact that we explicitly set streaming on the request: https://github.com/pytorch/data/blob/8942cf1d6cdb71013e7b93330a6b0d41445c5e6c/torchdata/datapipes/iter/load/online.py#L54-L57
We may need to switch back to a normal (non-streaming) request.
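
As a quick sanity check (just a sketch; the file id below is a placeholder, not a real dataset id), the headers of a streamed vs. non-streamed request can be compared directly with requests:

import requests

# Placeholder Google Drive download URL; substitute a real file id to reproduce.
url = "https://drive.google.com/uc?export=download&id=<file_id>"
session = requests.Session()

# Streaming request: only the headers are fetched up front, the body stays on the wire.
with session.get(url, stream=True) as response:
    print("streamed:", "content-disposition" in response.headers)

# Non-streaming request: the whole body is read into response.content immediately.
response = session.get(url)
print("non-streamed:", "content-disposition" in response.headers)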

@codecov

codecov bot commented Feb 8, 2022

Codecov Report

Merging #1594 (b04ae99) into main (da34de2) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1594      +/-   ##
==========================================
+ Coverage   80.34%   80.35%   +0.01%     
==========================================
  Files          58       58              
  Lines        2569     2571       +2     
==========================================
+ Hits         2064     2066       +2     
  Misses        505      505              
Impacted Files                            Coverage Δ
torchtext/datasets/amazonreviewfull.py    96.29% <100.00%> (+0.14%) ⬆️
torchtext/datasets/sogounews.py           96.29% <100.00%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da34de2...b04ae99. Read the comment docs.

@Nayef211
Contributor Author

Nayef211 commented Feb 8, 2022

Did you encounter such an error before? I think the implementation here was copied from TorchText

I've been noticing this error for the last week or so for all of the datasets that are being downloaded from Google Drive.

If you have never encountered it before, I think the problem comes from the fact that we explicitly set streaming on the request.

The documentation for Requests suggests using a context manager to avoid connection inefficiencies, though I'm not sure that alone would get rid of this error. Do you see any issues with setting the stream arg to False? The main issue I see is that the file contents would be downloaded before we decide whether we actually want them or should raise an exception (i.e. all the logic here).
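
For reference, the pattern from the Requests documentation looks roughly like this (a sketch with a placeholder URL, not the actual datapipe code); the with block guarantees the connection is released even if we bail out before reading the body:

import requests

url = "https://example.com/some_large_file.tar.gz"  # placeholder URL

# With stream=True the body is not downloaded until we read it, and the context
# manager ensures the connection is returned to the pool even if we raise early.
with requests.get(url, stream=True) as response:
    if "content-disposition" not in response.headers:
        raise RuntimeError("headers don't contain content-disposition")
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        pass  # consume the payload in 1 MiB chunks (e.g. hash it or write it to disk)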

@ejguan
Contributor

ejguan commented Feb 8, 2022

I've been noticing this error for the last week or so for all of the datasets that are being downloaded from Google Drive.

I mean, had you encountered it before the datasets were revamped from Dataset into DataPipe? I believe when @parmeet implemented the GDriveDataPipe, we basically copied the code from TorchText, as I linked in the previous comment.
The changes we made in TorchData are:

  1. Adding a context manager
  2. Adding streaming to the request: response = session.get(url, stream=True)

I am trying to figure out which part causes the issue. If the previous implementation in TorchText never hit this error, then it must have been introduced by one of the changes we made (1 or 2).

Also, I tried to reproduce it on my local machine but did not succeed. Do you want to give it a try by changing stream to False in your local env and testing this PR?

@Nayef211
Contributor Author

Nayef211 commented Feb 8, 2022

Also, I tried to reproduce it on my local machine but did not succeed. Do you want to give it a try by changing stream to False in your local env and testing this PR?

I don't think the change is that simple since setting stream=False results in the following error:

RuntimeError: The computed hash d41d8cd98f00b204e9800998ecf8427e of /home/nayef211/.torchtext/cache/AmazonReviewFull/amazon_review_full_csv.tar.gz does not match the expected hash 57d28bd5d930e772930baddf36641c7c. Delete the file manually and retry.

I wonder if it's because additional logic would need to change if we start downloading the contents immediately instead of streaming them. Should we create an issue on torchdata to keep track of this and move the discussion there?
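
One clue: the computed hash in that error looks like the MD5 of an empty byte string, which would suggest the cached file ends up empty when stream=False, e.g. because the response body was already consumed before it was written to disk (a quick check, not torchtext-specific):

import hashlib

# MD5 of zero bytes matches the "computed hash" in the error above,
# i.e. the file that got hashed was empty.
print(hashlib.md5(b"").hexdigest())  # d41d8cd98f00b204e9800998ecf8427e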

@ejguan
Contributor

ejguan commented Feb 8, 2022

Yeah. Please do so! Thanks.

@parmeet
Contributor

parmeet commented Feb 8, 2022

@Nayef211, @ejguan This error existed in torchtext even before the migration, so I don't think any changes in torchdata would have triggered it. In my experience, I have seen it when the quota is exceeded. It is a transient error, though I never dug deeper into whether there is a way to increase the quota limit or other alternatives to prevent it from happening. I agree with raising an issue on torchdata so that we can discuss the problem and potential solutions there.

Contributor

@parmeet parmeet left a comment


LGTM! Thanks for adding cache :)
