Commit 9b1b421

update PyPI package download queries to use file_downloads table
1 parent e7d022b commit 9b1b421

1 file changed, 91 additions and 70 deletions

source/guides/analyzing-pypi-package-downloads.rst

Lines changed: 91 additions & 70 deletions
@@ -68,35 +68,35 @@ the `BigQuery quickstart guide
 Data schema
 -----------

-Linehaul writes an entry in a ``the-psf.pypi.downloadsYYYYMMDD`` table for each
+Linehaul writes an entry in a ``the-psf.pypi.file_downloads`` table for each
 download. The table contains information about what file was downloaded and how
 it was downloaded. Some useful columns from the `table schema
-<https://console.cloud.google.com/bigquery?pli=1&p=the-psf&d=pypi&t=downloads&page=table>`__
+<https://console.cloud.google.com/bigquery?pli=1&p=the-psf&d=pypi&t=file_downloads&page=table>`__
 include:

-+------------------------+-----------------+-----------------------+
-| Column                 | Description     | Examples              |
-+========================+=================+=======================+
-| file.project           | Project name    | ``pipenv``, ``nose``  |
-+------------------------+-----------------+-----------------------+
-| file.version           | Package version | ``0.1.6``, ``1.4.2``  |
-+------------------------+-----------------+-----------------------+
-| details.installer.name | Installer       | pip, `bandersnatch`_  |
-+------------------------+-----------------+-----------------------+
-| details.python         | Python version  | ``2.7.12``, ``3.6.4`` |
-+------------------------+-----------------+-----------------------+
++------------------------+-----------------+-----------------------------+
+| Column                 | Description     | Examples                    |
++========================+=================+=============================+
+| timestamp              | Date and time   | ``2020-03-09 00:33:03 UTC`` |
++------------------------+-----------------+-----------------------------+
+| file.project           | Project name    | ``pipenv``, ``nose``        |
++------------------------+-----------------+-----------------------------+
+| file.version           | Package version | ``0.1.6``, ``1.4.2``        |
++------------------------+-----------------+-----------------------------+
+| details.installer.name | Installer       | pip, `bandersnatch`_        |
++------------------------+-----------------+-----------------------------+
+| details.python         | Python version  | ``2.7.12``, ``3.6.4``       |
++------------------------+-----------------+-----------------------------+


 Useful queries
 --------------

 Run queries in the `BigQuery web UI`_ by clicking the "Compose query" button.

-Note that the rows are stored in separate tables for each day, which helps
+Note that the rows are stored in a partitioned table, which helps
 limit the cost of queries. These example queries analyze downloads from
-recent history by using `wildcard tables
-<https://cloud.google.com/bigquery/docs/querying-wildcard-tables>`__ to
-select all tables and then filter by ``_TABLE_SUFFIX``.
+recent history by filtering on the ``timestamp`` column.

 Counting package downloads
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
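
As an illustration of the ``timestamp`` filter that replaces ``_TABLE_SUFFIX`` in these queries, the following Python sketch computes the date range matched by ``DATE(timestamp) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()``. The function name ``download_window`` and the sample date are assumptions made for this example; they are not part of the guide or of BigQuery.

```python
from datetime import date, timedelta
from typing import Tuple


def download_window(today: date, days: int = 30) -> Tuple[date, date]:
    """Return the inclusive (start, end) dates matched by the filter.

    Hypothetical helper that mirrors
    DATE(timestamp) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL n DAY)
                        AND CURRENT_DATE()
    BETWEEN includes both endpoints.
    """
    return today - timedelta(days=days), today


start, end = download_window(date(2020, 3, 9))
print(start, end)              # 2020-02-08 2020-03-09
print((end - start).days + 1)  # 31 calendar dates fall inside the window
```

Because ``BETWEEN`` is inclusive at both ends, a 30-day interval covers 31 calendar dates, the current (partial) day included.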
@@ -108,18 +108,17 @@ The following query counts the total number of downloads for the project

     #standardSQL
     SELECT COUNT(*) AS num_downloads
-    FROM `the-psf.pypi.downloads*`
+    FROM `the-psf.pypi.file_downloads`
     WHERE file.project = 'pytest'
       -- Only query the last 30 days of history
-      AND _TABLE_SUFFIX
-        BETWEEN FORMAT_DATE(
-          '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
-        AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
+      AND DATE(timestamp)
+        BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
+        AND CURRENT_DATE()

 +---------------+
 | num_downloads |
 +===============+
-| 2117807       |
+| 20531925      |
 +---------------+

 To only count downloads from pip, filter on the ``details.installer.name``
@@ -129,71 +128,94 @@ column.

     #standardSQL
     SELECT COUNT(*) AS num_downloads
-    FROM `the-psf.pypi.downloads*`
+    FROM `the-psf.pypi.file_downloads`
     WHERE file.project = 'pytest'
       AND details.installer.name = 'pip'
       -- Only query the last 30 days of history
-      AND _TABLE_SUFFIX
-        BETWEEN FORMAT_DATE(
-          '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
-        AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
+      AND DATE(timestamp)
+        BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
+        AND CURRENT_DATE()

 +---------------+
 | num_downloads |
 +===============+
-| 1829322       |
+| 19391645      |
 +---------------+

 Package downloads over time
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

-To group by monthly downloads, use the ``_TABLE_SUFFIX`` pseudo-column. Also
-use the pseudo-column to limit the tables queried and the corresponding
-costs.
+To group by monthly downloads, use the ``DATE_TRUNC`` function. Also
+filter on the ``timestamp`` column to reduce the corresponding costs.
+(Warning: This query processes over 500 GB of data.)

 ::

     #standardSQL
     SELECT
       COUNT(*) AS num_downloads,
-      SUBSTR(_TABLE_SUFFIX, 1, 6) AS `month`
-    FROM `the-psf.pypi.downloads*`
+      DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
+    FROM `the-psf.pypi.file_downloads`
     WHERE
       file.project = 'pytest'
       -- Only query the last 6 months of history
-      AND _TABLE_SUFFIX
-        BETWEEN FORMAT_DATE(
-          '%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
-        AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
+      AND DATE(timestamp)
+        BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
+        AND CURRENT_DATE()
     GROUP BY `month`
     ORDER BY `month` DESC

-+---------------+--------+
-| num_downloads | month  |
-+===============+========+
-| 1956741       | 201801 |
-+---------------+--------+
-| 2344692       | 201712 |
-+---------------+--------+
-| 1730398       | 201711 |
-+---------------+--------+
-| 2047310       | 201710 |
-+---------------+--------+
-| 1744443       | 201709 |
-+---------------+--------+
-| 1916952       | 201708 |
-+---------------+--------+
-
-More queries
-~~~~~~~~~~~~
-
-- `Data driven decisions using PyPI download statistics
-  <https://langui.sh/2016/12/09/data-driven-decisions/>`__
-- `PyPI queries gist <https://gist.github.com/alex/4f100a9592b05e9b4d63>`__
-- `Python versions over time
-  <https://github.com/tswast/code-snippets/blob/master/2018/python-community-insights/Python%20Community%20Insights.ipynb>`__
-- `Non-Windows downloads, grouped by platform
-  <https://bigquery.cloud.google.com/savedquery/51422494423:ff1976af63614ad4a1258d8821dd7785>`__
++---------------+------------+
+| num_downloads | month      |
++===============+============+
+| 1956741       | 2018-01-01 |
++---------------+------------+
+| 2344692       | 2017-12-01 |
++---------------+------------+
+| 1730398       | 2017-11-01 |
++---------------+------------+
+| 2047310       | 2017-10-01 |
++---------------+------------+
+| 1744443       | 2017-09-01 |
++---------------+------------+
+| 1916952       | 2017-08-01 |
++---------------+------------+
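
To make the month arithmetic in the query above concrete, here is a small Python sketch of what ``DATE_TRUNC(..., MONTH)`` and the six-month lower bound evaluate to. The helper names ``month_start`` and ``window_start`` are hypothetical, chosen for this example only; the sketch assumes the usual behavior of truncating to the first day of the month.

```python
from datetime import date


def month_start(d: date) -> date:
    """Analogue of DATE_TRUNC(d, MONTH): the first day of d's month."""
    return d.replace(day=1)


def window_start(today: date, months: int = 6) -> date:
    """First day of the month that lies `months` months before `today`.

    Analogue of DATE_TRUNC(DATE_SUB(today, INTERVAL n MONTH), MONTH).
    """
    # Count months from year 0 so the subtraction can cross year boundaries.
    index = today.year * 12 + (today.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


today = date(2018, 1, 15)
print(month_start(today))   # 2018-01-01
print(window_start(today))  # 2017-07-01
```

Since the lower bound is truncated to the start of its month and the upper bound is the current date, the window can span up to seven month buckets: six earlier months plus the partial current month.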
+
+Python versions over time
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Extract the Python version from the ``details.python`` column.
+
+::
+
+    #standardSQL
+    SELECT
+      REGEXP_EXTRACT(details.python, r"[0-9]+\.[0-9]+") AS python_version,
+      COUNT(*) AS num_downloads,
+    FROM `the-psf.pypi.file_downloads`
+    WHERE
+      -- Only query the last 6 months of history
+      DATE(timestamp)
+        BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
+        AND CURRENT_DATE()
+    GROUP BY `python_version`
+    ORDER BY `num_downloads` DESC
+
++----------------+---------------+
+| python_version | num_downloads |
++================+===============+
+| 3.7            | 12990683561   |
++----------------+---------------+
+| 3.6            | 9035598511    |
++----------------+---------------+
+| 2.7            | 8467785320    |
++----------------+---------------+
+| 3.8            | 4581627740    |
++----------------+---------------+
+| 3.5            | 2412533601    |
++----------------+---------------+
+| null           | 1641456718    |
++----------------+---------------+
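
To show what the ``REGEXP_EXTRACT`` pattern above keeps from a full version string, this Python sketch applies the same regular expression with the standard ``re`` module. The function name ``python_major_minor`` is an assumption for illustration; the sketch relies on this simple pattern behaving the same way in Python's ``re`` as in BigQuery.

```python
import re
from typing import Optional

# Same pattern as in the query: digits, a literal dot, digits.
MAJOR_MINOR = re.compile(r"[0-9]+\.[0-9]+")


def python_major_minor(version: Optional[str]) -> Optional[str]:
    """Return the first major.minor match, or None when nothing matches."""
    if version is None:
        return None
    match = MAJOR_MINOR.search(version)
    return match.group(0) if match else None


for raw in ("2.7.12", "3.6.4", None):
    print(raw, "->", python_major_minor(raw))
```

Rows whose ``details.python`` value is missing or has no ``major.minor`` part end up in the ``null`` group of the result table.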

 Caveats
 =======
@@ -229,13 +251,12 @@ the official Python client library for BigQuery.

     query_job = client.query("""
     SELECT COUNT(*) AS num_downloads
-    FROM `the-psf.pypi.downloads*`
+    FROM `the-psf.pypi.file_downloads`
     WHERE file.project = 'pytest'
-        -- Only query the last 30 days of history
-        AND _TABLE_SUFFIX
-            BETWEEN FORMAT_DATE(
-                '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
-            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())""")
+      -- Only query the last 30 days of history
+      AND DATE(timestamp)
+        BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
+        AND CURRENT_DATE()""")

     results = query_job.result()  # Waits for job to complete.
     for row in results:
