Commit afb3c35

tswast authored and ncoghlan committed

Provide pointer to Google BigQuery download statistics (pypa#433)

This guide points to the PyPI download statistics dataset and describes how
to query it. Example queries are limited to specific time periods to reduce
the amount of quota used when experimenting.
1 parent 7c906ff · 2 files changed (+244, −0)
================================
Analyzing PyPI package downloads
================================

This section covers how to use the `PyPI package dataset`_ to learn more
about downloads of a package (or packages) hosted on PyPI. For example, you can
use it to discover the distribution of Python versions used to download a
package.

.. contents:: Contents
   :local:

Background
==========

PyPI does not display download statistics because they are difficult to
collect and display accurately. Reasons for this are included in the
`announcement email
<https://mail.python.org/pipermail/distutils-sig/2013-May/020855.html>`__:

  There are numerous reasons for [download counts] removal/deprecation some
  of which are:

  - Technically hard to make work with the new CDN

    - The CDN is being donated to the PSF, and the donated tier does
      not offer any form of log access
    - The work around for not having log access would greatly reduce
      the utility of the CDN

  - Highly inaccurate

    - A number of things prevent the download counts from being
      accurate, some of which include:

      - pip download cache
      - Internal or unofficial mirrors
      - Packages not hosted on PyPI (for comparison's sake)
      - Mirrors or unofficial grab scripts causing inflated counts
        (Last I looked 25% of the downloads were from a known
        mirroring script).

  - Not particularly useful

    - Just because a project has been downloaded a lot doesn't mean
      it's good
    - Similarly just because a project hasn't been downloaded a lot
      doesn't mean it's bad

  In short, because its value is low for various reasons, and the tradeoffs
  required to make it work are high, it has not been an effective use of
  resources.
As an alternative, the `Linehaul project
<https://github.com/pypa/linehaul>`__ streams download logs to `Google
BigQuery`_ [#]_. Linehaul writes an entry in a
``the-psf.pypi.downloadsYYYYMMDD`` table for each download. The table
contains information about what file was downloaded and how it was
downloaded. Some useful columns from the `table schema
<https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema>`__
include:

+------------------------+-----------------+-----------------------+
| Column                 | Description     | Examples              |
+========================+=================+=======================+
| file.project           | Project name    | ``pipenv``, ``nose``  |
+------------------------+-----------------+-----------------------+
| file.version           | Package version | ``0.1.6``, ``1.4.2``  |
+------------------------+-----------------+-----------------------+
| details.installer.name | Installer       | pip, `bandersnatch`_  |
+------------------------+-----------------+-----------------------+
| details.python         | Python version  | ``2.7.12``, ``3.6.4`` |
+------------------------+-----------------+-----------------------+

.. [#] `PyPI BigQuery dataset announcement email <https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html>`__

Setting up
==========

In order to use `Google BigQuery`_ to query the `PyPI package dataset`_,
you'll need a Google account and to enable the BigQuery API on a Google
Cloud Platform project. You can run up to 1 TB of queries per month `using
the BigQuery free tier without a credit card
<https://cloud.google.com/blog/big-data/2017/01/how-to-run-a-terabyte-of-google-bigquery-queries-each-month-without-a-credit-card>`__.

- Navigate to the `BigQuery web UI`_.
- Create a new project.
- Enable the `BigQuery API
  <https://console.developers.google.com/apis/api/bigquery-json.googleapis.com/overview>`__.

For more detailed instructions on how to get started with BigQuery, check out
the `BigQuery quickstart guide
<https://cloud.google.com/bigquery/quickstart-web-ui>`__.

Useful queries
==============

Run queries in the `BigQuery web UI`_ by clicking the "Compose query" button.

Note that the rows are stored in separate tables for each day, which helps
limit the cost of queries. These example queries analyze downloads from
recent history by using `wildcard tables
<https://cloud.google.com/bigquery/docs/querying-wildcard-tables>`__ to
select all tables and then filter by ``_TABLE_SUFFIX``.
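
The ``_TABLE_SUFFIX`` bounds used below are plain ``YYYYMMDD`` date strings.
If you assemble queries outside of BigQuery, the same window can be computed
in Python; ``table_suffix_range`` is a hypothetical helper for illustration,
not part of any tool mentioned in this guide:

```python
from datetime import date, timedelta

def table_suffix_range(days, today=None):
    """Return (start, end) ``_TABLE_SUFFIX`` bounds for the last ``days``
    days, mirroring the FORMAT_DATE('%Y%m%d', ...) calls in the queries."""
    today = today or date.today()
    start = today - timedelta(days=days)
    return start.strftime("%Y%m%d"), today.strftime("%Y%m%d")

# A 30-day window ending on 2018-01-31 spans '20180101' to '20180131'.
print(table_suffix_range(30, date(2018, 1, 31)))
```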

Counting package downloads
--------------------------

The following query counts the total number of downloads for the project
"pytest".

::

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE file.project = 'pytest'
      -- Only query the last 30 days of history
      AND _TABLE_SUFFIX
        BETWEEN FORMAT_DATE(
          '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
        AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

+---------------+
| num_downloads |
+===============+
| 2117807       |
+---------------+

To only count downloads from pip, filter on the ``details.installer.name``
column.

::

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE file.project = 'pytest'
      AND details.installer.name = 'pip'
      -- Only query the last 30 days of history
      AND _TABLE_SUFFIX
        BETWEEN FORMAT_DATE(
          '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
        AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

+---------------+
| num_downloads |
+===============+
| 1829322       |
+---------------+
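
To run the same count for a different project, only the ``file.project``
literal changes. A small helper can template the query while keeping the
30-day limit so experiments stay cheap; ``count_downloads_query`` is a
hypothetical sketch, and it assumes project names come from trusted input
(use BigQuery query parameters for anything untrusted):

```python
def count_downloads_query(project, days=30, installer=None):
    """Build the standard-SQL download count shown above for any project.

    ``installer`` optionally adds the details.installer.name filter.
    Assumes ``project`` and ``installer`` are trusted strings.
    """
    installer_filter = (
        f"  AND details.installer.name = '{installer}'\n" if installer else ""
    )
    return f"""#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = '{project}'
{installer_filter}  -- Only query the last {days} days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())"""

print(count_downloads_query("pipenv"))
```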

Package downloads over time
---------------------------

To group by monthly downloads, use the ``_TABLE_SUFFIX`` pseudo-column. Also
use the pseudo-column to limit the tables queried and the corresponding
costs.

::

    #standardSQL
    SELECT
      COUNT(*) AS num_downloads,
      SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`
    FROM `the-psf.pypi.downloads*`
    WHERE
      file.project = 'pytest'
      -- Only query the last 6 months of history
      AND _TABLE_SUFFIX
        BETWEEN FORMAT_DATE(
          '%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
        AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
    GROUP BY `month`
    ORDER BY `month` DESC

+---------------+--------+
| num_downloads | month  |
+===============+========+
| 1956741       | 201801 |
+---------------+--------+
| 2344692       | 201712 |
+---------------+--------+
| 1730398       | 201711 |
+---------------+--------+
| 2047310       | 201710 |
+---------------+--------+
| 1744443       | 201709 |
+---------------+--------+
| 1916952       | 201708 |
+---------------+--------+
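
Once results like these are exported (for example as CSV from the web UI),
light post-processing is straightforward. A sketch using the figures from
the table above:

```python
# Monthly pytest download counts from the example results above.
monthly = {
    "201801": 1956741, "201712": 2344692, "201711": 1730398,
    "201710": 2047310, "201709": 1744443, "201708": 1916952,
}

total = sum(monthly.values())
busiest = max(monthly, key=monthly.get)
print(f"{total} downloads over 6 months; busiest month: {busiest}")
```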

More queries
------------

- `Data driven decisions using PyPI download statistics
  <https://langui.sh/2016/12/09/data-driven-decisions/>`__
- `PyPI queries gist <https://gist.github.com/alex/4f100a9592b05e9b4d63>`__
- `Python versions over time
  <https://github.com/tswast/code-snippets/blob/master/2018/python-community-insights/Python%20Community%20Insights.ipynb>`__
- `Non-Windows downloads, grouped by platform
  <https://bigquery.cloud.google.com/savedquery/51422494423:ff1976af63614ad4a1258d8821dd7785>`__

Additional tools
================

You can also access the `PyPI package dataset`_ programmatically via the
BigQuery API.

pypinfo
-------

`pypinfo`_ is a command-line tool which provides access to the dataset and
can generate several useful queries. For example, you can query the total
number of downloads for a package with the command ``pypinfo package_name``.

::

    $ pypinfo requests
    Served from cache: False
    Data processed: 6.87 GiB
    Data billed: 6.87 GiB
    Estimated cost: $0.04

    | download_count |
    | -------------- |
    |      9,316,415 |

Install `pypinfo`_ using pip.

::

    pip install pypinfo

Other libraries
---------------

- `google-cloud-bigquery`_ is the official client library to access the
  BigQuery API.
- `pandas-gbq`_ allows for accessing query results via `Pandas`_.
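
As a sketch of the programmatic route, assuming ``google-cloud-bigquery`` is
installed and Google Cloud application default credentials are configured
(the query is the 30-day pytest count from earlier; ``run_query`` is a
hypothetical wrapper, not part of the client library):

```python
QUERY = """
#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pytest'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
"""

def run_query(sql):
    # Requires `pip install google-cloud-bigquery` and configured
    # credentials; each call consumes BigQuery quota, so it is not run here.
    from google.cloud import bigquery
    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]
```

With `pandas-gbq`_, roughly the same result lands in a DataFrame via
``pandas_gbq.read_gbq(QUERY)``.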

.. _PyPI package dataset: https://bigquery.cloud.google.com/dataset/the-psf:pypi
.. _bandersnatch: /key_projects/#bandersnatch
.. _Google BigQuery: https://cloud.google.com/bigquery
.. _BigQuery web UI: http://bigquery.cloud.google.com/
.. _pypinfo: https://github.com/ofek/pypinfo/blob/master/README.rst
.. _google-cloud-bigquery: https://cloud.google.com/bigquery/docs/reference/libraries
.. _pandas-gbq: https://pandas-gbq.readthedocs.io/en/latest/
.. _Pandas: https://pandas.pydata.org/

source/guides/index.rst

Lines changed: 1 addition & 0 deletions

@@ -19,6 +19,7 @@ introduction to packaging, see :doc:`/tutorials/index`.
     supporting-windows-using-appveyor
     packaging-namespace-packages
     creating-and-discovering-plugins
+    analyzing-pypi-package-downloads
     index-mirrors-and-caches
     hosting-your-own-index
     migrating-to-pypi-org