Memory leak in file download from google drive #512
I think I came across a similar (or possibly the same) issue, but with GSheets. Environment:
Here's a simple reproducer:

```python
#!/usr/bin/env python3
import httplib2
import os

from apiclient import discovery
from memory_profiler import profile
from oauth2client import client, tools
from oauth2client.file import Storage
from time import sleep

SCOPES = "https://www.googleapis.com/auth/spreadsheets.readonly"
# See https://cloud.google.com/docs/authentication/getting-started
CLIENT_SECRET_FILE = ".client_secret.json"
APPLICATION_NAME = "ClientDebug"
DISCOVERY_URL = "https://sheets.googleapis.com/$discovery/rest?version=v4"


def get_credentials():
    """Get valid user credentials from storage, or run the OAuth2 flow."""
    home_dir = os.path.expanduser("~")
    credential_dir = os.path.join(home_dir, ".credentials")
    flags = None
    if not os.path.exists(credential_dir):
        os.makedirs(credential_dir)
    credential_path = os.path.join(credential_dir,
                                   "sheets.googleapis.com-clientdebug.json")

    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        credentials = tools.run_flow(flow, store, flags)
    return credentials


@profile(precision=4)
def get_responses(creds):
    """Fetch spreadsheet data."""
    sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    http = creds.authorize(httplib2.Http())
    service = discovery.build("sheets", "v4", http=http,
                              discoveryServiceUrl=DISCOVERY_URL,
                              cache_discovery=False)
    result = service.spreadsheets().values().get(
        spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])
    print("Got {} rows".format(len(values)))


if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 50):
        get_responses(creds)
        sleep(2)
```

For measurements I used `memory_profiler` (the `@profile` decorator above).

First and second iteration:

Last iteration:
There's clearly a memory leak: the reproducer fetches the same data over and over again, yet memory consumption keeps rising. The full log can be found here. As a temporary workaround for one of my long-running applications I use an explicit garbage-collector call, which mitigates the issue, at least for now:

```python
...
import gc
...
result = service.spreadsheets().values().get(
    spreadsheetId=sheet_id, range="A1:O").execute()
values = result.get("values", [])
gc.collect()
...
```
I went a little deeper, and the main culprit seems to be the `createMethod` function. (This method has a huge docstring.) Nevertheless, there is probably a reference loop somewhere, as the explicit `gc.collect()` call does free the memory.
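One way to confirm the reference-loop theory (a rough sketch, not from the original report, reusing `service` and `sheet_id` from the reproducer above) is to let the collector keep what it frees and inspect it:

```python
# Sketch: check whether the leaked objects are cyclic garbage.
# gc.collect() returns the number of unreachable objects it found;
# with DEBUG_SAVEALL they are kept in gc.garbage instead of being freed,
# so their types can be inspected afterwards.
import gc
from collections import Counter

gc.set_debug(gc.DEBUG_SAVEALL)

result = service.spreadsheets().values().get(
    spreadsheetId=sheet_id, range="A1:O").execute()

found = gc.collect()
print("unreachable objects collected: {}".format(found))
print(Counter(type(o).__name__ for o in gc.garbage).most_common(10))
```

If the per-iteration counts keep growing, the leaked objects are cyclic garbage that reference counting alone can't reclaim, which would explain why the explicit `gc.collect()` workaround helps.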
The issue in the OP and the followup about Gsheets are related, inasmuch as both seem to cause more memory than desirable to be consumed, but they have independent causes. Or, at least, the MediaIoBaseDownload.next_chunk() codepath never causes the createMethod function to be called. @mrc0mmand would you be kind enough to open a separate issue about the Gsheets memory issue?
Thanks so much @mrc0mmand
For the record, I created the following script in order to attempt to replicate your findings:
I ran this under Python 2.7.15 and got the following result against a 2.5GB file:
I got a similar result under Python 3.5. Do you think you can run it on your system and see if you get the same results?
@mcdonc I don't have access to your files so I can't run your code, but as I showed above, I had the same issue as you.
@alexanderwhatley thanks for the response. What I was getting at, though, is that I don't see an issue. The problematic result in your original analysis is, I believe:
I haven't been able to get a result like that. Instead, memory is reclaimed by the end of the script. Do you think you might be able to change the fileId of my script to a large file in your drive area and replicate your problematic result (possibly making changes to my script as necessary)?
Closing this issue. Thanks!
I'm running the following code in Google Colab to download a large file (approximately 2.5 GB).
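The exact snippet isn't reproduced above; a minimal sketch along those lines (the file ID is a placeholder, and the Colab auth helper and chunk size are assumptions) would be:

```python
# Rough sketch of a Colab download via the Drive v3 API (not the exact code
# from the report). "FILE_ID" is a placeholder; the chunk size is arbitrary.
from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

auth.authenticate_user()  # Colab's interactive OAuth helper
drive_service = build("drive", "v3", cache_discovery=False)

request = drive_service.files().get_media(fileId="FILE_ID")
fh = open("/tmp/large_file.bin", "wb")
downloader = MediaIoBaseDownload(fh, request, chunksize=50 * 1024 * 1024)

done = False
while not done:
    status, done = downloader.next_chunk()
    print("Download {:.0f}%".format(status.progress() * 100))

fh.close()
del downloader, request, fh  # drop all references, as described below
```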
I am using the following script to determine memory usage:
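That measurement script isn't shown either; one common way to take such a reading, using `psutil` (an assumption here; the original may have used something else, e.g. `!free -m`), is:

```python
# Sketch of a memory-usage check (assumed, not the original script):
# reports the notebook process's resident set size and the system's
# available memory, both in MiB.
import os
import psutil

def report_memory():
    rss = psutil.Process(os.getpid()).memory_info().rss
    avail = psutil.virtual_memory().available
    print("process RSS: {:.1f} MiB, available: {:.1f} MiB".format(
        rss / 2**20, avail / 2**20))

report_memory()
```

Calling something like this before the download, after it, and again after `%reset` gives the before/after numbers referred to below.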
However, it appears that there is a memory leak in the api client. I print the amount of memory usage at the very beginning, before running anything:
After running the download script, which as you can see deletes all of the variables, as well as the `%reset` magic, the amount of available memory is still much less than before: