
Memory leak in file download from google drive #512


Closed · alexanderwhatley opened this issue May 30, 2018 · 9 comments
Labels: needs more info · priority: p2 · 🚨 needs some love · type: bug

@alexanderwhatley commented May 30, 2018

I'm running the following code in Google Colab to download a large file (approximately 2.5 GB).

import io
from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

auth.authenticate_user()

drive_service = build('drive', 'v3')
file_id = '1M5ZTq0Iu0kttOZb1FpDPRG-TsagdYbIQ'
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while not done:
    status, done = downloader.next_chunk()
    print('Percent downloaded', int(100 * status.progress()))
downloaded.seek(0)
with open('data.tar.gz', 'wb') as f:
    f.write(downloaded.read())
del drive_service, downloader, downloaded, f
printm()  # printm() is defined in the measurement script below

I am using the following script to determine memory usage:

!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import os

import humanize
import psutil

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " I Proc size: " + humanize.naturalsize(process.memory_info().rss))

printm()

However, it appears that there is a memory leak in the API client. I print the memory usage at the very beginning, before running anything:

Requirement already satisfied: gputil in /usr/local/lib/python3.6/dist-packages (1.3.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from gputil) (1.14.3)
Requirement already satisfied: psutil in /usr/local/lib/python3.6/dist-packages (5.4.5)
Requirement already satisfied: humanize in /usr/local/lib/python3.6/dist-packages (0.5.1)
Gen RAM Free: 13.0 GB  I Proc size: 146.8 MB

After running the download script (which, as you can see, deletes all of the variables) as well as the %reset magic, the amount of free memory is still much lower than before:

Requirement already satisfied: gputil in /usr/local/lib/python3.6/dist-packages (1.3.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from gputil) (1.14.3)
Requirement already satisfied: psutil in /usr/local/lib/python3.6/dist-packages (5.4.5)
Requirement already satisfied: humanize in /usr/local/lib/python3.6/dist-packages (0.5.1)
Gen RAM Free: 6.7 GB  I Proc size: 6.5 GB
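
For reference, MediaIoBaseDownload writes each chunk to whatever writable file-like object it is given, so the 2.5 GB does not have to be staged in an io.BytesIO and copied to disk afterwards. A minimal stream-to-disk sketch (not from the original report; it removes the in-memory buffer from the picture but is not claimed to fix the leak):

from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

auth.authenticate_user()

drive_service = build('drive', 'v3')
file_id = '1M5ZTq0Iu0kttOZb1FpDPRG-TsagdYbIQ'
request = drive_service.files().get_media(fileId=file_id)

# Hand the downloader the file handle directly, so each chunk is written
# straight to disk instead of accumulating in process memory.
with open('data.tar.gz', 'wb') as f:
    downloader = MediaIoBaseDownload(f, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()
        print('Percent downloaded', int(100 * status.progress()))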
@JustinBeckwith added the '🚨 needs some love' and 'triage me' labels (Jun 8, 2018)
@mrc0mmand commented

I think I came across a similar (or possibly the same) issue, but with GSheets.

Environment:

$ python --version
Python 3.6.4
$ pip show google-api-python-client
Name: google-api-python-client
Version: 1.7.3

Here's a simple reproducer (without a .client_secret.json):

#!/usr/bin/env python3

import httplib2
import os
from apiclient import discovery
from memory_profiler import profile
from oauth2client import client, tools
from oauth2client.file import Storage
from time import sleep

SCOPES = "https://www.googleapis.com/auth/spreadsheets.readonly"
# See https://cloud.google.com/docs/authentication/getting-started
CLIENT_SECRET_FILE = ".client_secret.json"
APPLICATION_NAME = "ClientDebug"
DISCOVERY_URL = "https://sheets.googleapis.com/$discovery/rest?version=v4"

def get_credentials():
    home_dir = os.path.expanduser("~")
    credential_dir = os.path.join(home_dir, ".credentials")
    flags = None
    if not os.path.exists(credential_dir):
        os.makedirs(credential_dir)
    credential_path = os.path.join(credential_dir,
                                   "sheets.googleapis.com-clientdebug.json")

    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        credentials = tools.run_flow(flow, store, flags)

    return credentials

@profile(precision=4)
def get_responses(creds):
    """Fetch spreadsheet data."""
    sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"

    http = creds.authorize(httplib2.Http())
    service = discovery.build("sheets", "v4", http=http,
                discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    print("Got {} rows".format(len(values)))

if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 50):
        get_responses(creds)
        sleep(2)

For measurements I used the memory_profiler module, with the following results:

First and second iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  26.5195 MiB  26.5195 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  26.5195 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  26.5195 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  26.5195 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  29.2891 MiB   2.7695 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  49.5742 MiB  20.2852 MiB       result = service.spreadsheets().values().get(
    44  49.5742 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  49.5742 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  49.5742 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))


Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  49.5742 MiB  49.5742 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  49.5742 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  49.5742 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  49.5742 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  49.5742 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  67.9922 MiB  18.4180 MiB       result = service.spreadsheets().values().get(
    44  67.9922 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  67.9922 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  67.9922 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

Last iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35 229.6055 MiB 229.6055 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38 229.6055 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40 229.6055 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41 229.6055 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42 229.6055 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43 229.6055 MiB   0.0000 MiB       result = service.spreadsheets().values().get(
    44 229.6055 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45 229.6055 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47 229.6055 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

There's clearly a memory leak, as the reproducer fetches the same data over and over again, yet the memory consumption keeps rising. Full log can be found here.

As a temporary workaround for one of my long-running applications, I use an explicit garbage-collector call, which mitigates this issue, at least for now:

...
import gc
...
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    gc.collect()
...
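
One way to check whether the growth comes from uncollected reference cycles rather than truly leaked objects is to disable the automatic collector and count what a manual pass reclaims. A hypothetical diagnostic, reusing creds and get_responses() from the reproducer above:

import gc

gc.disable()                # rule out the automatic cyclic collector
get_responses(creds)
unreachable = gc.collect()  # returns the number of objects it freed
gc.enable()
print("Objects reclaimed only by cycle detection:", unreachable)

A large, roughly constant count per call would point at cycles being created on every request rather than at unbounded growth in live objects.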

@JustinBeckwith added the 'type: bug' and 'priority: p2' labels (Jun 11, 2018)
@JustinBeckwith removed the '🚨 needs some love' and 'triage me' labels (Jun 11, 2018)
@mrc0mmand commented

I went a little deeper, and the main culprit seems to be in the createMethod function when it creates the dynamic batchUpdate method:

Method 'batchUpdate', approx. __doc__ size: 2886834
<class 'function'>
Filename: /home/fsumsal/venv/googleapiclient/lib64/python3.6/site-packages/googleapiclient/discovery.py

Line #    Mem usage    Increment   Line Contents
================================================
  1064     48.7 MiB     48.7 MiB     @profile
  1065                               def _add_basic_methods(self, resourceDesc, rootDesc, schema):
  1066                                 # If this is the root Resource, add a new_batch_http_request() method.
  1067     48.7 MiB      0.0 MiB       if resourceDesc == rootDesc:
...
  1086
  1087                                 # Add basic methods to Resource
  1088     48.7 MiB      0.0 MiB       if 'methods' in resourceDesc:
  1089     66.8 MiB      0.0 MiB         for methodName, methodDesc in six.iteritems(resourceDesc['methods']):
  1090     56.0 MiB      0.0 MiB           fixedMethodName, method = createMethod(
  1091     66.8 MiB     18.1 MiB               methodName, methodDesc, rootDesc, schema)
  1092     66.8 MiB      0.0 MiB           print(type(method))
  1093     66.8 MiB      0.0 MiB           self._set_dynamic_attr(fixedMethodName,
  1094     66.8 MiB      0.0 MiB                                  method.__get__(self, self.__class__))
  1095                                     # Add in _media methods. The functionality of the attached method will
  1096                                     # change when it sees that the method name ends in _media.
  1097     66.8 MiB      0.0 MiB           if methodDesc.get('supportsMediaDownload', False):
  1098                                       fixedMethodName, method = createMethod(
  1099                                           methodName + '_media', methodDesc, rootDesc, schema)
  1100                                       self._set_dynamic_attr(fixedMethodName,
  1101                                                              method.__get__(self, self.__class__))

(This method has a huge docstring.)

Nevertheless, there is probably a reference cycle somewhere, as the gc.collect() call manages to collect all those unreachable objects.
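
As background (an illustration, not code from this library): CPython's reference counting can never free objects that refer to each other, so cyclic garbage survives until the collector in the gc module runs, which is consistent with an explicit gc.collect() mitigating the growth:

import gc

class Node:
    """Toy object used only to form a reference cycle."""
    def __init__(self):
        self.ref = None

def make_cycle():
    a, b = Node(), Node()
    a.ref, b.ref = b, a  # a <-> b keeps both refcounts above zero

make_cycle()         # both Nodes are now unreachable...
print(gc.collect())  # ...but only the cycle detector frees them; prints >= 2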

@mcdonc (Contributor) commented Jun 29, 2018

The issue in the OP and the follow-up about GSheets are related, inasmuch as both cause more memory to be consumed than is desirable, but they have independent causes. Or, at least, the MediaIoBaseDownload.next_chunk() code path never causes the createMethod function to be called.

@mrc0mmand, would you be kind enough to open a separate issue about the GSheets memory issue?

@mrc0mmand commented

@mcdonc Sure thing, see #535.

@mcdonc (Contributor) commented Jun 29, 2018

Thanks so much @mrc0mmand

@mcdonc (Contributor) commented Jun 29, 2018

@alexanderwhatley

For the record, I created the following script in order to attempt to replicate your findings:

import psutil
import humanize
import os
import io
from httplib2 import Http

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from oauth2client import file, client, tools

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " I Proc size: " + humanize.naturalsize(process.memory_info().rss))

printm()

# Setup the Drive v3 API
SCOPES = 'https://www.googleapis.com/auth/drive'
# (alternative: 'https://www.googleapis.com/auth/drive.metadata.readonly')

store = file.Storage('credentials.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store)
drive_service = build('drive', 'v3', http=creds.authorize(Http()))

file_id = '1RbGWNti3FNuSdX9bNgCTmMmDKRYMDby1'
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while not done:
    status, done = downloader.next_chunk()
    print('Percent downloaded', int(100 * status.progress()))

downloaded.seek(0)
with open('data.tar.gz', 'wb') as f:
    f.write(downloaded.read())
del drive_service, downloader, downloaded, f
printm()

I ran this under Python 2.7.15 against a 2.5 GB file and got the following result:

('Gen RAM Free: 58.4 GB', ' I Proc size: 32.4 MB')
('Percent downloaded', 5)
('Percent downloaded', 11)
('Percent downloaded', 17)
('Percent downloaded', 23)
('Percent downloaded', 29)
('Percent downloaded', 35)
('Percent downloaded', 41)
('Percent downloaded', 47)
('Percent downloaded', 53)
('Percent downloaded', 59)
('Percent downloaded', 65)
('Percent downloaded', 71)
('Percent downloaded', 77)
('Percent downloaded', 83)
('Percent downloaded', 89)
('Percent downloaded', 95)
('Percent downloaded', 100)
('Gen RAM Free: 56.8 GB', ' I Proc size: 36.6 MB')

I got a similar result under Python 3.5.

Do you think you can run it on your system and see if you get the same results?

@mcdonc added the 'needs more info' label (Jun 29, 2018)
@alexanderwhatley (Author) commented

@mcdonc I don't have access to your files so I can't run your code, but as I showed above, I had the same issue as you.

@mcdonc (Contributor) commented Jul 2, 2018

@alexanderwhatley thanks for the response.

What I was getting at, though, is that I don't see an issue. The problematic result in your original analysis is, I believe:

Requirement already satisfied: gputil in /usr/local/lib/python3.6/dist-packages (1.3.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from gputil) (1.14.3)
Requirement already satisfied: psutil in /usr/local/lib/python3.6/dist-packages (5.4.5)
Requirement already satisfied: humanize in /usr/local/lib/python3.6/dist-packages (0.5.1)
Gen RAM Free: 6.7 GB  I Proc size: 6.5 GB

I haven't been able to get a result like that. Instead, memory is reclaimed by the end of the script.

Do you think you might be able to change the fileId of my script to a large file in your drive area and replicate your problematic result (possibly making changes to my script as necessary)?

@JustinBeckwith added the '🚨 needs some love' label (Nov 26, 2018)
@busunkim96 (Contributor) commented

Closing this issue, as the 'needs more info' label was added over a year ago and there has been no further follow-up. Please open a new issue and reference this one if you encounter the problem again.

Thanks!
