================================
Analyzing PyPI package downloads
================================

This section covers how to use the `PyPI package dataset`_ to learn more
about downloads of a package (or packages) hosted on PyPI. For example, you can
use it to discover the distribution of Python versions used to download a
package.

.. contents:: Contents
   :local:


Background
==========

PyPI does not display download statistics because they are difficult to
collect and display accurately. Reasons for this are included in the
`announcement email
<https://mail.python.org/pipermail/distutils-sig/2013-May/020855.html>`__:

  There are numerous reasons for [download counts] removal/deprecation some
  of which are:

  - Technically hard to make work with the new CDN

    - The CDN is being donated to the PSF, and the donated tier does
      not offer any form of log access
    - The work around for not having log access would greatly reduce
      the utility of the CDN

  - Highly inaccurate

    - A number of things prevent the download counts from being
      inaccurate, some of which include:

      - pip download cache
      - Internal or unofficial mirrors
      - Packages not hosted on PyPI (for comparisons sake)
      - Mirrors or unofficial grab scripts causing inflated counts
        (Last I looked 25% of the downloads were from a known
        mirroring script).

  - Not particularly useful

    - Just because a project has been downloaded a lot doesn't mean
      it's good
    - Similarly just because a project hasn't been downloaded a lot
      doesn't mean it's bad

  In short because it's value is low for various reasons, and the tradeoffs
  required to make it work are high It has been not an effective use of
  resources.

As an alternative, the `Linehaul project
<https://github.com/pypa/linehaul>`__ streams download logs to `Google
BigQuery`_ [#]_. Linehaul writes an entry in a
``the-psf.pypi.downloadsYYYYMMDD`` table for each download. The table
contains information about what file was downloaded and how it was
downloaded. Some useful columns from the `table schema
<https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema>`__
include:

+------------------------+-----------------+-----------------------+
| Column                 | Description     | Examples              |
+========================+=================+=======================+
| file.project           | Project name    | ``pipenv``, ``nose``  |
+------------------------+-----------------+-----------------------+
| file.version           | Package version | ``0.1.6``, ``1.4.2``  |
+------------------------+-----------------+-----------------------+
| details.installer.name | Installer       | pip, `bandersnatch`_  |
+------------------------+-----------------+-----------------------+
| details.python         | Python version  | ``2.7.12``, ``3.6.4`` |
+------------------------+-----------------+-----------------------+

.. [#] `PyPI BigQuery dataset announcement email <https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html>`__

Setting up
==========

In order to use `Google BigQuery`_ to query the `PyPI package dataset`_,
you'll need a Google account and to enable the BigQuery API on a Google
Cloud Platform project. You can run up to 1 TB of queries per month `using
the BigQuery free tier without a credit card
<https://cloud.google.com/blog/big-data/2017/01/how-to-run-a-terabyte-of-google-bigquery-queries-each-month-without-a-credit-card>`__.

- Navigate to the `BigQuery web UI`_.
- Create a new project.
- Enable the `BigQuery API
  <https://console.developers.google.com/apis/api/bigquery-json.googleapis.com/overview>`__.

For more detailed instructions on how to get started with BigQuery, check out
the `BigQuery quickstart guide
<https://cloud.google.com/bigquery/quickstart-web-ui>`__.

Useful queries
==============

Run queries in the `BigQuery web UI`_ by clicking the "Compose query" button.

Note that the rows are stored in separate tables for each day, which helps
limit costs if you are only interested in recent downloads. To analyze the
full history, use `wildcard tables
<https://cloud.google.com/bigquery/docs/querying-wildcard-tables>`__ to
select all tables.

Counting package downloads
--------------------------

The following query counts the total number of downloads for the project
"pytest".

::

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE file.project = 'pytest'

+---------------+
| num_downloads |
+===============+
| 35534338      |
+---------------+

To only count downloads from pip, filter on the ``details.installer.name``
column.

::

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE
      file.project = 'pytest'
      AND details.installer.name = 'pip'

+---------------+
| num_downloads |
+===============+
| 31768554      |
+---------------+

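Comparing the two counts gives pip's share of recorded downloads. A quick
sketch in Python, using the example numbers from the tables above (the counts
will differ when you re-run the queries):

::

    # Example counts taken from the two query results above; re-running
    # the queries will give different (larger) numbers.
    total_downloads = 35534338
    pip_downloads = 31768554

    pip_share = pip_downloads / total_downloads
    print(f"{pip_share:.1%} of recorded pytest downloads came from pip")
    # -> 89.4% of recorded pytest downloads came from pip
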
Package downloads over time
---------------------------

To group by monthly downloads, use the ``_TABLE_SUFFIX`` pseudo-column. Also
use the pseudo-column to limit the tables queried and the corresponding
costs.

::

    #standardSQL
    SELECT
      COUNT(*) AS num_downloads,
      SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`
    FROM `the-psf.pypi.downloads*`
    WHERE
      file.project = 'pytest'
      AND _TABLE_SUFFIX BETWEEN '20171001' AND '20180131'
    GROUP BY `month`
    ORDER BY `month` DESC

+---------------+--------+
| num_downloads | month  |
+===============+========+
| 1956741       | 201801 |
+---------------+--------+
| 2344692       | 201712 |
+---------------+--------+
| 1730398       | 201711 |
+---------------+--------+
| 2047310       | 201710 |
+---------------+--------+

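If you re-run the query above for different projects or date ranges, it can
help to build the SQL string programmatically. A minimal sketch; the helper
name and its parameters are illustrative, not part of any library:

::

    def monthly_downloads_query(project, start_suffix, end_suffix):
        """Build the monthly-downloads SQL for a _TABLE_SUFFIX date range.

        start_suffix and end_suffix are YYYYMMDD strings, e.g. '20171001'.
        """
        return (
            "#standardSQL\n"
            "SELECT\n"
            "  COUNT(*) AS num_downloads,\n"
            "  SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`\n"
            "FROM `the-psf.pypi.downloads*`\n"
            "WHERE\n"
            f"  file.project = '{project}'\n"
            f"  AND _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'\n"
            "GROUP BY `month`\n"
            "ORDER BY `month` DESC"
        )

    query = monthly_downloads_query("pytest", "20171001", "20180131")
    print(query)

The resulting string can be pasted into the BigQuery web UI or passed to one
of the client libraries described below.
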
More queries
------------

- `Data driven decisions using PyPI download statistics
  <https://langui.sh/2016/12/09/data-driven-decisions/>`__
- `PyPI queries gist <https://gist.github.com/alex/4f100a9592b05e9b4d63>`__
- `Python versions over time
  <https://github.com/tswast/code-snippets/blob/master/2018/python-community-insights/Python%20Community%20Insights.ipynb>`__

Additional tools
================

You can also access the `PyPI package dataset`_ programmatically via the
BigQuery API.

pypinfo
-------

`pypinfo`_ is a command-line tool which provides access to the dataset and
can generate several useful queries. For example, you can query the total
number of downloads for a package with the command ``pypinfo package_name``.

::

    $ pypinfo requests
    Served from cache: False
    Data processed: 6.87 GiB
    Data billed: 6.87 GiB
    Estimated cost: $0.04

    | download_count |
    | -------------- |
    |      9,316,415 |

Install `pypinfo`_ using pip.

::

    pip install pypinfo

Other libraries
---------------

- `google-cloud-bigquery`_ is the official client library for accessing the
  BigQuery API.
- `pandas-gbq`_ loads query results into `Pandas`_ DataFrames.

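As a sketch of what programmatic access might look like with `pandas-gbq`_,
the snippet below wraps the total-downloads query from the "Counting package
downloads" section in a function. The project ID is a placeholder; calling
the function requires Google Cloud credentials, sends a real query to
BigQuery, and counts against your free-tier quota:

::

    # The total-downloads query from the "Counting package downloads"
    # section above.
    QUERY = (
        "#standardSQL\n"
        "SELECT COUNT(*) AS num_downloads\n"
        "FROM `the-psf.pypi.downloads*`\n"
        "WHERE file.project = 'pytest'"
    )


    def count_pytest_downloads(project_id):
        """Run QUERY via pandas-gbq and return the download count.

        Requires credentials for a Google Cloud project with the
        BigQuery API enabled; each call uses part of your quota.
        """
        import pandas_gbq  # pip install pandas-gbq

        df = pandas_gbq.read_gbq(QUERY, project_id=project_id)
        return int(df["num_downloads"][0])

Call it as, e.g., ``count_pytest_downloads("my-project-id")``, where
``my-project-id`` is your own Google Cloud project ID.
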
.. _PyPI package dataset: https://bigquery.cloud.google.com/dataset/the-psf:pypi
.. _bandersnatch: /key_projects/#bandersnatch
.. _Google BigQuery: https://cloud.google.com/bigquery
.. _BigQuery web UI: http://bigquery.cloud.google.com/
.. _pypinfo: https://github.com/ofek/pypinfo/blob/master/README.rst
.. _google-cloud-bigquery: https://cloud.google.com/bigquery/docs/reference/libraries
.. _pandas-gbq: https://pandas-gbq.readthedocs.io/en/latest/
.. _Pandas: https://pandas.pydata.org/