
Adding secondary caching to datasets #1594

Merged

Conversation

Nayef211
Contributor

@Nayef211 Nayef211 commented Feb 8, 2022

Reference issue #1589

Summary

  • Adding secondary caching to SogouNews and AmazonReviewFull datasets (see the sketch below)

Test

  • Delete cache and call next(iter(dataset)) on the relevant datasets to verify data is being returned correctly
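
Roughly, the change adds a second on_disk_cache/end_caching pair around the extracted csv, in addition to the existing one around the downloaded archive. A simplified sketch (names such as _PATH, MD5, and _EXTRACTED_FILES follow the pattern used by the other torchtext datasets; the exact diff may differ):

import os

from torchdata.datapipes.iter import FileOpener, GDriveReader, IterableWrapper


def _sketch_dataset_pipeline(root, split, url, _PATH, MD5, _EXTRACTED_FILES):
    # Primary cache: the compressed archive downloaded from Google Drive.
    url_dp = IterableWrapper([url])
    cache_compressed_dp = url_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _PATH),
        hash_dict={os.path.join(root, _PATH): MD5},
        hash_type="md5",
    )
    cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)

    # Secondary cache (what this PR adds): the csv extracted from the archive,
    # so re-running the dataset skips both the download and the extraction.
    cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _EXTRACTED_FILES[split])
    )
    cache_decompressed_dp = (
        FileOpener(cache_decompressed_dp, mode="b")
        .read_from_tar()
        .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
    )
    return cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)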

@Nayef211
Contributor Author

Nayef211 commented Feb 8, 2022

I intermittently get the following error when executing this snippet to download a dataset from Google Drive:

>>> from torchtext.datasets.amazonreviewfull import AmazonReviewFull
>>> dataset = AmazonReviewFull(split="train")
>>> next(iter(dataset))

Stack trace

  File "/home/nayef211/.local/lib/python3.8/site-packages/torchdata/datapipes/iter/load/online.py", line 98, in __iter__
    yield _get_response_from_google_drive(url, timeout=self.timeout)
  File "/home/nayef211/.local/lib/python3.8/site-packages/torchdata/datapipes/iter/load/online.py", line 74, in _get_response_from_google_drive
    raise RuntimeError("Internal error: headers don't contain content-disposition.")
RuntimeError: Internal error: headers don't contain content-disposition.

@ejguan any idea why this happens inconsistently (i.e., running the above snippet might only work after 4 or 5 tries)?

cc @parmeet @abhinavarora

@Nayef211 Nayef211 requested a review from parmeet February 8, 2022 16:39
@ejguan
Contributor

ejguan commented Feb 8, 2022

It seems content-disposition is needed to fetch the filename.

Did you encounter such an error before? I think the implementation here was copied from TorchText:

def _get_response_from_google_drive(url):
    confirm_token = None
    session = requests.Session()
    response = session.get(url, stream=True)
    for k, v in response.cookies.items():
        if k.startswith("download_warning"):
            confirm_token = v
    if confirm_token is None:
        if "Quota exceeded" in str(response.content):
            raise RuntimeError(
                "Google drive link {} is currently unavailable, because the quota was exceeded.".format(
                    url
                ))
        else:
            raise RuntimeError("Internal error: confirm_token was not found in Google drive link.")
    url = url + "&confirm=" + confirm_token
    response = session.get(url, stream=True)
    if 'content-disposition' not in response.headers:
        raise RuntimeError("Internal error: headers don't contain content-disposition.")
    filename = re.findall("filename=\"(.+)\"", response.headers['content-disposition'])
    if filename is None:
        raise RuntimeError("Filename could not be autodetected")
    filename = filename[0]
    return response, filename

If you have never encountered it before, I think the problem comes from the fact that we explicitly set streaming on the request: https://github.com/pytorch/data/blob/8942cf1d6cdb71013e7b93330a6b0d41445c5e6c/torchdata/datapipes/iter/load/online.py#L54-L57
We may need to switch back to a normal (non-streaming) request.
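
As a quick sanity check (just a sketch; the file id below is a placeholder, not a real dataset id), the headers of a streamed vs. non-streamed request can be compared directly with requests:

import requests

# Placeholder Google Drive download URL; substitute a real file id to reproduce.
url = "https://drive.google.com/uc?export=download&id=<file_id>"
session = requests.Session()

# Streaming request: only the headers are fetched up front, the body stays on the wire.
with session.get(url, stream=True) as response:
    print("streamed:", "content-disposition" in response.headers)

# Non-streaming request: the whole body is read into response.content immediately.
response = session.get(url)
print("non-streamed:", "content-disposition" in response.headers)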

@codecov

codecov bot commented Feb 8, 2022

Codecov Report

Merging #1594 (b04ae99) into main (da34de2) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1594      +/-   ##
==========================================
+ Coverage   80.34%   80.35%   +0.01%     
==========================================
  Files          58       58              
  Lines        2569     2571       +2     
==========================================
+ Hits         2064     2066       +2     
  Misses        505      505              
Impacted Files                            Coverage Δ
torchtext/datasets/amazonreviewfull.py    96.29% <100.00%> (+0.14%) ⬆️
torchtext/datasets/sogounews.py           96.29% <100.00%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da34de2...b04ae99. Read the comment docs.

@Nayef211
Contributor Author

Nayef211 commented Feb 8, 2022

Did you encounter such an error before? I think the implementation here was copied from TorchText

I've been noticing this error for the last week or so for all of the datasets that are being downloaded from Google Drive.

If you have never encountered it before, I think the problem comes from the fact that we explicitly set streaming on the request.

The documentation for Requests suggests using a context manager to avoid connection inefficiencies, though I'm not sure that alone would get rid of this error. Do you see any issues with setting the stream arg to False? The main issue I see is that the file contents would be downloaded before we decide whether we actually want them or should raise an exception (i.e. all the logic here).
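
For reference, the pattern from the Requests documentation looks roughly like this (a sketch with a placeholder URL, not the actual datapipe code); the with block guarantees the connection is released even if we bail out before reading the body:

import requests

url = "https://example.com/some_large_file.tar.gz"  # placeholder URL

# With stream=True the body is not downloaded until we read it, and the context
# manager ensures the connection is returned to the pool even if we raise early.
with requests.get(url, stream=True) as response:
    if "content-disposition" not in response.headers:
        raise RuntimeError("headers don't contain content-disposition")
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        pass  # consume the payload in 1 MiB chunks (e.g. hash it or write it to disk)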

@ejguan
Contributor

ejguan commented Feb 8, 2022

I've been noticing this error for the last week or so for all of the datasets that are being downloaded from Google Drive.

I mean, had you encountered it before the datasets were revamped from Dataset into DataPipe? I believe when @parmeet implemented the GDriveDataPipe, we basically copied the code from TorchText, as I linked in the previous comment.
The changes we made in TorchData are:

  1. Adding a context manager
  2. Adding streaming to the request: response = session.get(url, stream=True)

I am trying to figure out which part causes the issue. If the previous implementation in TorchText never hit this error, then it must have been introduced by one of the changes we made (1 or 2).

Also, I tried to reproduce it on my local machine but did not succeed. Do you want to give it a try by changing stream to False in your local env and testing this PR?

@Nayef211
Contributor Author

Nayef211 commented Feb 8, 2022

Also, I tried to reproduce it on my local machine but did not succeed. Do you want to give it a try by changing stream to False in your local env and testing this PR?

I don't think the change is that simple since setting stream=False results in the following error:

RuntimeError: The computed hash d41d8cd98f00b204e9800998ecf8427e of /home/nayef211/.torchtext/cache/AmazonReviewFull/amazon_review_full_csv.tar.gz does not match the expected hash 57d28bd5d930e772930baddf36641c7c. Delete the file manually and retry.

I wonder if it's because additional logic would need to change if we start downloading the contents immediately instead of streaming them. Should we create an issue on torchdata to keep track of this and move the discussion there?
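
One clue: the computed hash in that error looks like the MD5 of an empty byte string, which would suggest the cached file ends up empty when stream=False, e.g. because the response body was already consumed before it was written to disk (a quick check, not torchtext-specific):

import hashlib

# MD5 of zero bytes matches the "computed hash" in the error above,
# i.e. the file that got hashed was empty.
print(hashlib.md5(b"").hexdigest())  # d41d8cd98f00b204e9800998ecf8427e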

@ejguan
Contributor

ejguan commented Feb 8, 2022

Yeah. Please do so! Thanks.

@parmeet
Contributor

parmeet commented Feb 8, 2022

@Nayef211, @ejguan This error existed in torchtext even before the migration, so I don't think any changes in torchdata would have triggered it. In my experience, I have seen it when the quota is exceeded. It is a transient error, though I never dug deeper into whether there is a way to increase the quota limit or other alternatives to prevent it from happening. I agree with raising an issue on torchdata so that we can discuss the problem and potential solutions there.

Contributor

@parmeet parmeet left a comment


LGTM! Thanks for adding cache :)
