
Memory leak in file download from google drive #512


Closed · alexanderwhatley opened this issue May 30, 2018 · 9 comments
Labels: needs more info · priority: p2 · 🚨 needs some love · type: bug

@alexanderwhatley commented May 30, 2018

I'm running the following code in Google Colab to download a large file (approximately 2.5 GB).

import io
from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

auth.authenticate_user()

drive_service = build('drive', 'v3')
file_id = '1M5ZTq0Iu0kttOZb1FpDPRG-TsagdYbIQ'
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while not done:
    status, done = downloader.next_chunk()
    print('Percent downloaded', int(100 * status.progress()))
downloaded.seek(0)
with open('data.tar.gz', 'wb') as f:
    f.write(downloaded.read())
del drive_service, downloader, downloaded, f
printm()  # printm() is defined in the measurement script below

I am using the following script to determine memory usage:

!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import os

import humanize
import psutil

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " I Proc size: " + humanize.naturalsize(process.memory_info().rss))

printm()

However, it appears that there is a memory leak in the API client. I print the memory usage at the very beginning, before running anything:

Requirement already satisfied: gputil in /usr/local/lib/python3.6/dist-packages (1.3.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from gputil) (1.14.3)
Requirement already satisfied: psutil in /usr/local/lib/python3.6/dist-packages (5.4.5)
Requirement already satisfied: humanize in /usr/local/lib/python3.6/dist-packages (0.5.1)
Gen RAM Free: 13.0 GB  I Proc size: 146.8 MB

After running the download script (which, as you can see, deletes all of the variables) as well as the %reset magic, the amount of free memory is still much lower than before:

Requirement already satisfied: gputil in /usr/local/lib/python3.6/dist-packages (1.3.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from gputil) (1.14.3)
Requirement already satisfied: psutil in /usr/local/lib/python3.6/dist-packages (5.4.5)
Requirement already satisfied: humanize in /usr/local/lib/python3.6/dist-packages (0.5.1)
Gen RAM Free: 6.7 GB  I Proc size: 6.5 GB
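
For reference, MediaIoBaseDownload writes each chunk to whatever writable file-like object it is given, so the 2.5 GB does not have to be staged in an io.BytesIO and copied to disk afterwards. A minimal stream-to-disk sketch (not from the original report; it removes the in-memory buffer from the picture but is not claimed to fix the leak):

from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

auth.authenticate_user()

drive_service = build('drive', 'v3')
file_id = '1M5ZTq0Iu0kttOZb1FpDPRG-TsagdYbIQ'
request = drive_service.files().get_media(fileId=file_id)

# Hand the downloader the file handle directly, so each chunk is written
# straight to disk instead of accumulating in process memory.
with open('data.tar.gz', 'wb') as f:
    downloader = MediaIoBaseDownload(f, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()
        print('Percent downloaded', int(100 * status.progress()))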
@JustinBeckwith added the '🚨 needs some love' and 'triage me' labels (Jun 8, 2018)
@mrc0mmand commented

I think I came across a similar (or possibly the same) issue, but with GSheets.

Environment:

$ python --version
Python 3.6.4
$ pip show google-api-python-client
Name: google-api-python-client
Version: 1.7.3

Here's a simple reproducer (without a .client_secret.json):

#!/usr/bin/env python3

import httplib2
import os
from apiclient import discovery
from memory_profiler import profile
from oauth2client import client, tools
from oauth2client.file import Storage
from time import sleep

SCOPES = "https://www.googleapis.com/auth/spreadsheets.readonly"
# See https://cloud.google.com/docs/authentication/getting-started
CLIENT_SECRET_FILE = ".client_secret.json"
APPLICATION_NAME = "ClientDebug"
DISCOVERY_URL = "https://sheets.googleapis.com/$discovery/rest?version=v4"

def get_credentials():
    home_dir = os.path.expanduser("~")
    credential_dir = os.path.join(home_dir, ".credentials")
    flags = None
    if not os.path.exists(credential_dir):
        os.makedirs(credential_dir)
    credential_path = os.path.join(credential_dir,
                                   "sheets.googleapis.com-clientdebug.json")

    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        credentials = tools.run_flow(flow, store, flags)

    return credentials

@profile(precision=4)
def get_responses(creds):
    """Fetch spreadsheet data."""
    sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"

    http = creds.authorize(httplib2.Http())
    service = discovery.build("sheets", "v4", http=http,
                discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    print("Got {} rows".format(len(values)))

if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 50):
        get_responses(creds)
        sleep(2)

For measurements I used the memory_profiler module, with the following results:

First and second iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  26.5195 MiB  26.5195 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  26.5195 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  26.5195 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  26.5195 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  29.2891 MiB   2.7695 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  49.5742 MiB  20.2852 MiB       result = service.spreadsheets().values().get(
    44  49.5742 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  49.5742 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  49.5742 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))


Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  49.5742 MiB  49.5742 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  49.5742 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  49.5742 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  49.5742 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  49.5742 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  67.9922 MiB  18.4180 MiB       result = service.spreadsheets().values().get(
    44  67.9922 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  67.9922 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  67.9922 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

Last iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35 229.6055 MiB 229.6055 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38 229.6055 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40 229.6055 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41 229.6055 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42 229.6055 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43 229.6055 MiB   0.0000 MiB       result = service.spreadsheets().values().get(
    44 229.6055 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45 229.6055 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47 229.6055 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

There's clearly a memory leak, as the reproducer fetches the same data over and over again, yet the memory consumption keeps rising. Full log can be found here.

As a temporary workaround for one of my long-running applications, I use an explicit garbage-collector call, which mitigates this issue, at least for now:

...
import gc
...
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    gc.collect()
...
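
One way to check whether the growth comes from uncollected reference cycles rather than truly leaked objects is to disable the automatic collector and count what a manual pass reclaims. A hypothetical diagnostic, reusing creds and get_responses() from the reproducer above:

import gc

gc.disable()                # rule out the automatic cyclic collector
get_responses(creds)
unreachable = gc.collect()  # returns the number of objects it freed
gc.enable()
print("Objects reclaimed only by cycle detection:", unreachable)

A large, roughly constant count per call would point at cycles being created on every request rather than at unbounded growth in live objects.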

@JustinBeckwith added the 'type: bug' and 'priority: p2' labels (Jun 11, 2018)
@JustinBeckwith removed the '🚨 needs some love' and 'triage me' labels (Jun 11, 2018)
@mrc0mmand commented

I went a little deeper, and the main culprit seems to be in the createMethod function when it creates the dynamic batchUpdate method:

Method 'batchUpdate', approx. __doc__ size: 2886834
<class 'function'>
Filename: /home/fsumsal/venv/googleapiclient/lib64/python3.6/site-packages/googleapiclient/discovery.py

Line #    Mem usage    Increment   Line Contents
================================================
  1064     48.7 MiB     48.7 MiB     @profile
  1065                               def _add_basic_methods(self, resourceDesc, rootDesc, schema):
  1066                                 # If this is the root Resource, add a new_batch_http_request() method.
  1067     48.7 MiB      0.0 MiB       if resourceDesc == rootDesc:
...
  1086
  1087                                 # Add basic methods to Resource
  1088     48.7 MiB      0.0 MiB       if 'methods' in resourceDesc:
  1089     66.8 MiB      0.0 MiB         for methodName, methodDesc in six.iteritems(resourceDesc['methods']):
  1090     56.0 MiB      0.0 MiB           fixedMethodName, method = createMethod(
  1091     66.8 MiB     18.1 MiB               methodName, methodDesc, rootDesc, schema)
  1092     66.8 MiB      0.0 MiB           print(type(method))
  1093     66.8 MiB      0.0 MiB           self._set_dynamic_attr(fixedMethodName,
  1094     66.8 MiB      0.0 MiB                                  method.__get__(self, self.__class__))
  1095                                     # Add in _media methods. The functionality of the attached method will
  1096                                     # change when it sees that the method name ends in _media.
  1097     66.8 MiB      0.0 MiB           if methodDesc.get('supportsMediaDownload', False):
  1098                                       fixedMethodName, method = createMethod(
  1099                                           methodName + '_media', methodDesc, rootDesc, schema)
  1100                                       self._set_dynamic_attr(fixedMethodName,
  1101                                                              method.__get__(self, self.__class__))

(This method has a huge docstring.)

Nevertheless, there is probably a reference cycle somewhere, as the gc.collect() call manages to collect all those unreachable objects.
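
As background (an illustration, not code from this library): CPython's reference counting can never free objects that refer to each other, so cyclic garbage survives until the collector in the gc module runs, which is consistent with an explicit gc.collect() mitigating the growth:

import gc

class Node:
    """Toy object used only to form a reference cycle."""
    def __init__(self):
        self.ref = None

def make_cycle():
    a, b = Node(), Node()
    a.ref, b.ref = b, a  # a <-> b keeps both refcounts above zero

make_cycle()         # both Nodes are now unreachable...
print(gc.collect())  # ...but only the cycle detector frees them; prints >= 2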

@mcdonc (Contributor) commented Jun 29, 2018

The issue in the OP and the follow-up about GSheets are related, inasmuch as both cause more memory to be consumed than is desirable, but they have independent causes. Or, at least, the MediaIoBaseDownload.next_chunk() code path never causes the createMethod function to be called.

@mrc0mmand, would you be kind enough to open a separate issue about the GSheets memory issue?

@mrc0mmand commented

@mcdonc Sure thing, see #535.

@mcdonc (Contributor) commented Jun 29, 2018

Thanks so much @mrc0mmand

@mcdonc (Contributor) commented Jun 29, 2018

@alexanderwhatley

For the record, I created the following script in order to attempt to replicate your findings:

import psutil
import humanize
import os
import io
from httplib2 import Http

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from oauth2client import file, client, tools

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " I Proc size: " + humanize.naturalsize(process.memory_info().rss))

printm()

# Setup the Drive v3 API
SCOPES = 'https://www.googleapis.com/auth/drive'
# (alternative: 'https://www.googleapis.com/auth/drive.metadata.readonly')

store = file.Storage('credentials.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store)
drive_service = build('drive', 'v3', http=creds.authorize(Http()))

file_id = '1RbGWNti3FNuSdX9bNgCTmMmDKRYMDby1'
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while not done:
    status, done = downloader.next_chunk()
    print('Percent downloaded', int(100 * status.progress()))

downloaded.seek(0)
with open('data.tar.gz', 'wb') as f:
    f.write(downloaded.read())
del drive_service, downloader, downloaded, f
printm()

I ran this under Python 2.7.15 against a 2.5 GB file and got the following result:

('Gen RAM Free: 58.4 GB', ' I Proc size: 32.4 MB')
('Percent downloaded', 5)
('Percent downloaded', 11)
('Percent downloaded', 17)
('Percent downloaded', 23)
('Percent downloaded', 29)
('Percent downloaded', 35)
('Percent downloaded', 41)
('Percent downloaded', 47)
('Percent downloaded', 53)
('Percent downloaded', 59)
('Percent downloaded', 65)
('Percent downloaded', 71)
('Percent downloaded', 77)
('Percent downloaded', 83)
('Percent downloaded', 89)
('Percent downloaded', 95)
('Percent downloaded', 100)
('Gen RAM Free: 56.8 GB', ' I Proc size: 36.6 MB')

I got a similar result under Python 3.5.

Do you think you can run it on your system and see if you get the same results?

@mcdonc added the 'needs more info' label (Jun 29, 2018)
@alexanderwhatley (Author) commented

@mcdonc I don't have access to your files so I can't run your code, but as I showed above, I had the same issue as you.

@mcdonc (Contributor) commented Jul 2, 2018

@alexanderwhatley thanks for the response.

What I was getting at, though, is that I don't see an issue. The problematic result in your original analysis is, I believe:

Requirement already satisfied: gputil in /usr/local/lib/python3.6/dist-packages (1.3.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from gputil) (1.14.3)
Requirement already satisfied: psutil in /usr/local/lib/python3.6/dist-packages (5.4.5)
Requirement already satisfied: humanize in /usr/local/lib/python3.6/dist-packages (0.5.1)
Gen RAM Free: 6.7 GB  I Proc size: 6.5 GB

I haven't been able to get a result like that. Instead, memory is reclaimed by the end of the script.

Do you think you might be able to change the fileId of my script to a large file in your drive area and replicate your problematic result (possibly making changes to my script as necessary)?

@JustinBeckwith added the '🚨 needs some love' label (Nov 26, 2018)
@busunkim96 (Contributor) commented

Closing this issue, as the 'needs more info' label was added over a year ago and there has been no further follow-up. Please open a new issue and reference this one if you encounter the problem again.

Thanks!
