Commit e64fb1e

Provide pointer to Google BigQuery download statistics
This guide points to the PyPI download statistics dataset and describes how to query it.
1 parent 7c906ff commit e64fb1e

File tree: 2 files changed, +225 -0 lines changed

source/guides/analyzing-pypi-package-downloads.rst

Lines changed: 224 additions & 0 deletions

@@ -0,0 +1,224 @@
================================
Analyzing PyPI package downloads
================================

This section covers how to use the `PyPI package dataset`_ to learn more
about downloads of a package (or packages) hosted on PyPI. For example, you can
use it to discover the distribution of Python versions used to download a
package.

.. contents:: Contents
   :local:

Background
==========

PyPI does not display download statistics because they are difficult to
collect and display accurately. Reasons for this are included in the
`announcement email
<https://mail.python.org/pipermail/distutils-sig/2013-May/020855.html>`__:

    There are numerous reasons for [download counts] removal/deprecation, some
    of which are:

    - Technically hard to make work with the new CDN

      - The CDN is being donated to the PSF, and the donated tier does
        not offer any form of log access
      - The workaround for not having log access would greatly reduce
        the utility of the CDN

    - Highly inaccurate

      - A number of things prevent the download counts from being
        accurate, some of which include:

        - pip download cache
        - Internal or unofficial mirrors
        - Packages not hosted on PyPI (for comparison's sake)
        - Mirrors or unofficial grab scripts causing inflated counts
          (Last I looked, 25% of the downloads were from a known
          mirroring script).

    - Not particularly useful

      - Just because a project has been downloaded a lot doesn't mean
        it's good
      - Similarly, just because a project hasn't been downloaded a lot
        doesn't mean it's bad

    In short, because its value is low for various reasons, and the tradeoffs
    required to make it work are high, it has not been an effective use of
    resources.

As an alternative, the `Linehaul project
<https://github.com/pypa/linehaul>`__ streams download logs to `Google
BigQuery`_ [#]_. Linehaul writes an entry in a
``the-psf.pypi.downloadsYYYYMMDD`` table for each download. The table
contains information about what file was downloaded and how it was
downloaded. Some useful columns from the `table schema
<https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema>`__
include:

+------------------------+-----------------+-----------------------+
| Column                 | Description     | Examples              |
+========================+=================+=======================+
| file.project           | Project name    | ``pipenv``, ``nose``  |
+------------------------+-----------------+-----------------------+
| file.version           | Package version | ``0.1.6``, ``1.4.2``  |
+------------------------+-----------------+-----------------------+
| details.installer.name | Installer       | pip, `bandersnatch`_  |
+------------------------+-----------------+-----------------------+
| details.python         | Python version  | ``2.7.12``, ``3.6.4`` |
+------------------------+-----------------+-----------------------+

.. [#] `PyPI BigQuery dataset announcement email <https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html>`__

Setting up
==========

In order to use `Google BigQuery`_ to query the `PyPI package dataset`_,
you'll need a Google account and to enable the BigQuery API on a Google
Cloud Platform project. You can run up to 1 TB of queries per month `using
the BigQuery free tier without a credit card
<https://cloud.google.com/blog/big-data/2017/01/how-to-run-a-terabyte-of-google-bigquery-queries-each-month-without-a-credit-card>`__.

- Navigate to the `BigQuery web UI`_.
- Create a new project.
- Enable the `BigQuery API
  <https://console.developers.google.com/apis/api/bigquery-json.googleapis.com/overview>`__.

For more detailed instructions on how to get started with BigQuery, check out
the `BigQuery quickstart guide
<https://cloud.google.com/bigquery/quickstart-web-ui>`__.

Useful queries
==============

Run queries in the `BigQuery web UI`_ by clicking the "Compose query" button.

Note that the rows are stored in separate tables for each day, which helps
limit costs if you are only interested in recent downloads. To analyze the
full history, use `wildcard tables
<https://cloud.google.com/bigquery/docs/querying-wildcard-tables>`__ to
select all tables.

Counting package downloads
--------------------------

The following query counts the total number of downloads for the project
"pytest".

::

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE file.project = 'pytest'

+---------------+
| num_downloads |
+===============+
| 35534338      |
+---------------+

To only count downloads from pip, filter on the ``details.installer.name``
column.

::

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE
      file.project = 'pytest'
      AND details.installer.name = 'pip'

+---------------+
| num_downloads |
+===============+
| 31768554      |
+---------------+

Package downloads over time
---------------------------

To group by monthly downloads, use the ``_TABLE_SUFFIX`` pseudo-column. Also
use the pseudo-column to limit the tables queried and the corresponding
costs.

::

    #standardSQL
    SELECT
      COUNT(*) AS num_downloads,
      SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`
    FROM `the-psf.pypi.downloads*`
    WHERE
      file.project = 'pytest'
      AND _TABLE_SUFFIX BETWEEN '20171001' AND '20180131'
    GROUP BY `month`
    ORDER BY `month` DESC

+---------------+--------+
| num_downloads | month  |
+===============+========+
| 1956741       | 201801 |
+---------------+--------+
| 2344692       | 201712 |
+---------------+--------+
| 1730398       | 201711 |
+---------------+--------+
| 2047310       | 201710 |
+---------------+--------+

More queries
------------

- `Data driven decisions using PyPI download statistics
  <https://langui.sh/2016/12/09/data-driven-decisions/>`__
- `PyPI queries gist <https://gist.github.com/alex/4f100a9592b05e9b4d63>`__
- `Python versions over time
  <https://github.com/tswast/code-snippets/blob/master/2018/python-community-insights/Python%20Community%20Insights.ipynb>`__

Additional tools
================

You can also access the `PyPI package dataset`_ programmatically via the
BigQuery API.

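For example, here is a minimal sketch (not an official example) that runs the
pytest download count from above through the `google-cloud-bigquery`_ client
library described under "Other libraries" below. It assumes the library is
installed, Google Cloud credentials are configured, and ``your-project-id``
is replaced with a real project that has the BigQuery API enabled.

::

    # Sketch: count pytest downloads for January 2018 via the BigQuery API.
    from google.cloud import bigquery

    # "your-project-id" is a placeholder for your own Google Cloud project.
    client = bigquery.Client(project="your-project-id")

    query = """
    SELECT COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE
      file.project = 'pytest'
      -- Limit the tables scanned (and the cost) to January 2018.
      AND _TABLE_SUFFIX BETWEEN '20180101' AND '20180131'
    """

    # The client library uses standard SQL by default, so no #standardSQL
    # prefix is needed. result() waits for the query job to finish.
    rows = client.query(query).result()
    for row in rows:
        print("pytest downloads in January 2018:", row.num_downloads)

The same approach works for any of the queries shown earlier in this guide.
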
pypinfo
-------

`pypinfo`_ is a command-line tool which provides access to the dataset and
can generate several useful queries. For example, you can query the total
number of downloads for a package with the command ``pypinfo package_name``.

::

    $ pypinfo requests
    Served from cache: False
    Data processed: 6.87 GiB
    Data billed: 6.87 GiB
    Estimated cost: $0.04

    | download_count |
    | -------------- |
    | 9,316,415      |

Install `pypinfo`_ using pip.

::

    pip install pypinfo

Other libraries
---------------

- `google-cloud-bigquery`_ is the official client library to access the
  BigQuery API (used in the sketch above).
- `pandas-gbq`_ allows for accessing query results via `Pandas`_ (see the
  sketch below).

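As a sketch of the `pandas-gbq`_ approach (again assuming the library is
installed and that ``your-project-id`` is replaced with your own project),
the following loads the distribution of Python versions used to download
pytest, based on the ``details.python`` column from the schema above, into a
DataFrame:

::

    # Sketch: Python version breakdown for pytest downloads, January 2018.
    import pandas_gbq

    query = r"""
    SELECT
      -- Keep only the major.minor part of the Python version, e.g. '3.6'.
      REGEXP_EXTRACT(details.python, r'^\d+\.\d+') AS python_version,
      COUNT(*) AS num_downloads
    FROM `the-psf.pypi.downloads*`
    WHERE
      file.project = 'pytest'
      AND _TABLE_SUFFIX BETWEEN '20180101' AND '20180131'
    GROUP BY python_version
    ORDER BY num_downloads DESC
    """

    # dialect="standard" is required for the backtick wildcard-table syntax.
    df = pandas_gbq.read_gbq(query, project_id="your-project-id",
                             dialect="standard")
    print(df)
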
.. _PyPI package dataset: https://bigquery.cloud.google.com/dataset/the-psf:pypi
.. _bandersnatch: /key_projects/#bandersnatch
.. _Google BigQuery: https://cloud.google.com/bigquery
.. _BigQuery web UI: http://bigquery.cloud.google.com/
.. _pypinfo: https://github.com/ofek/pypinfo/blob/master/README.rst
.. _google-cloud-bigquery: https://cloud.google.com/bigquery/docs/reference/libraries
.. _pandas-gbq: https://pandas-gbq.readthedocs.io/en/latest/
.. _Pandas: https://pandas.pydata.org/

source/guides/index.rst

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ introduction to packaging, see :doc:`/tutorials/index`.
    supporting-windows-using-appveyor
    packaging-namespace-packages
    creating-and-discovering-plugins
+   analyzing-pypi-package-downloads
    index-mirrors-and-caches
    hosting-your-own-index
    migrating-to-pypi-org
