@@ -68,35 +68,35 @@ the `BigQuery quickstart guide
6868Data schema
6969-----------
7070
71- Linehaul writes an entry in a ``the-psf.pypi.downloadsYYYYMMDD `` table for each
71+ Linehaul writes an entry in a ``the-psf.pypi.file_downloads`` table for each
7272download. The table contains information about what file was downloaded and how
7373it was downloaded. Some useful columns from the `table schema
74- <https://console.cloud.google.com/bigquery?pli=1&p=the-psf&d=pypi&t=downloads &page=table> `__
74+ <https://console.cloud.google.com/bigquery?pli=1&p=the-psf&d=pypi&t=file_downloads&page=table>`__
7575include:
7676
77- +------------------------+-----------------+-----------------------+
78- | Column | Description | Examples |
79- +========================+=================+=======================+
80- | file.project | Project name | ``pipenv ``, ``nose `` |
81- +------------------------+-----------------+-----------------------+
82- | file.version | Package version | ``0.1.6 ``, ``1.4.2 `` |
83- +------------------------+-----------------+-----------------------+
84- | details.installer.name | Installer | pip, `bandersnatch `_ |
85- +------------------------+-----------------+-----------------------+
86- | details.python | Python version | ``2.7.12 ``, ``3.6.4 `` |
87- +------------------------+-----------------+-----------------------+
77+ +------------------------+-----------------+-----------------------------+
78+ | Column                 | Description     | Examples                    |
79+ +========================+=================+=============================+
80+ | timestamp              | Date and time   | ``2020-03-09 00:33:03 UTC`` |
81+ +------------------------+-----------------+-----------------------------+
82+ | file.project           | Project name    | ``pipenv``, ``nose``        |
83+ +------------------------+-----------------+-----------------------------+
84+ | file.version           | Package version | ``0.1.6``, ``1.4.2``        |
85+ +------------------------+-----------------+-----------------------------+
86+ | details.installer.name | Installer       | pip, `bandersnatch`_        |
87+ +------------------------+-----------------+-----------------------------+
88+ | details.python         | Python version  | ``2.7.12``, ``3.6.4``       |
89+ +------------------------+-----------------+-----------------------------+
8890
8991
9092Useful queries
9193--------------
9294
9395Run queries in the `BigQuery web UI`_ by clicking the "Compose query" button.
9496
95- Note that the rows are stored in separate tables for each day , which helps
97+ Note that the rows are stored in a partitioned table, which helps
9698limit the cost of queries. These example queries analyze downloads from
97- recent history by using `wildcard tables
98- <https://cloud.google.com/bigquery/docs/querying-wildcard-tables> `__ to
99- select all tables and then filter by ``_TABLE_SUFFIX ``.
99+ recent history by filtering on the ``timestamp`` column.
100100
101101Counting package downloads
102102~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -108,18 +108,17 @@ The following query counts the total number of downloads for the project
108108
109109 #standardSQL
110110 SELECT COUNT(*) AS num_downloads
111- FROM `the-psf.pypi.downloads* `
111+ FROM `the-psf.pypi.file_downloads`
112112 WHERE file.project = 'pytest'
113113 -- Only query the last 30 days of history
114- AND _TABLE_SUFFIX
115- BETWEEN FORMAT_DATE(
116- '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
117- AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
114+ AND DATE(timestamp)
115+ BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
116+ AND CURRENT_DATE()
118117
119118+---------------+
120119| num_downloads |
121120+===============+
122- | 2117807 |
121+ | 20531925 |
123122+---------------+
124123
125124To only count downloads from pip, filter on the ``details.installer.name ``
@@ -129,71 +128,94 @@ column.
129128
130129 #standardSQL
131130 SELECT COUNT(*) AS num_downloads
132- FROM `the-psf.pypi.downloads* `
131+ FROM `the-psf.pypi.file_downloads`
133132 WHERE file.project = 'pytest'
134133 AND details.installer.name = 'pip'
135134 -- Only query the last 30 days of history
136- AND _TABLE_SUFFIX
137- BETWEEN FORMAT_DATE(
138- '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
139- AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
135+ AND DATE(timestamp)
136+ BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
137+ AND CURRENT_DATE()
140138
141139+---------------+
142140| num_downloads |
143141+===============+
144- | 1829322 |
142+ | 19391645 |
145143+---------------+
146144
147145Package downloads over time
148146~~~~~~~~~~~~~~~~~~~~~~~~~~~
149147
150- To group by monthly downloads, use the ``_TABLE_SUFFIX `` pseudo-column . Also
151- use the pseudo- column to limit the tables queried and the corresponding
152- costs.
148+ To group by monthly downloads, use the ``DATE_TRUNC`` function. Filtering on
149+ the ``timestamp`` column also limits the corresponding costs. (Warning: This
150+ query processes over 500 GB of data.)
153151
154152::
155153
156154 #standardSQL
157155 SELECT
158156 COUNT(*) AS num_downloads,
159- SUBSTR(_TABLE_SUFFIX, 1, 6 ) AS `month`
160- FROM `the-psf.pypi.downloads* `
157+ DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
158+ FROM `the-psf.pypi.file_downloads`
161159 WHERE
162160 file.project = 'pytest'
163161 -- Only query the last 6 months of history
164- AND _TABLE_SUFFIX
165- BETWEEN FORMAT_DATE(
166- '%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
167- AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
162+ AND DATE(timestamp)
163+ BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
164+ AND CURRENT_DATE()
168165 GROUP BY `month`
169166 ORDER BY `month` DESC
170167
171- +---------------+--------+
172- | num_downloads | month |
173- +===============+========+
174- | 1956741 | 201801 |
175- +---------------+--------+
176- | 2344692 | 201712 |
177- +---------------+--------+
178- | 1730398 | 201711 |
179- +---------------+--------+
180- | 2047310 | 201710 |
181- +---------------+--------+
182- | 1744443 | 201709 |
183- +---------------+--------+
184- | 1916952 | 201708 |
185- +---------------+--------+
186-
187- More queries
188- ~~~~~~~~~~~~
189-
190- - `Data driven decisions using PyPI download statistics
191- <https://langui.sh/2016/12/09/data-driven-decisions/> `__
192- - `PyPI queries gist <https://gist.github.com/alex/4f100a9592b05e9b4d63 >`__
193- - `Python versions over time
194- <https://github.com/tswast/code-snippets/blob/master/2018/python-community-insights/Python%20Community%20Insights.ipynb> `__
195- - `Non-Windows downloads, grouped by platform
196- <https://bigquery.cloud.google.com/savedquery/51422494423:ff1976af63614ad4a1258d8821dd7785> `__
168+ +---------------+------------+
169+ | num_downloads | month |
170+ +===============+============+
171+ | 1956741 | 2018-01-01 |
172+ +---------------+------------+
173+ | 2344692 | 2017-12-01 |
174+ +---------------+------------+
175+ | 1730398 | 2017-11-01 |
176+ +---------------+------------+
177+ | 2047310 | 2017-10-01 |
178+ +---------------+------------+
179+ | 1744443 | 2017-09-01 |
180+ +---------------+------------+
181+ | 1916952 | 2017-08-01 |
182+ +---------------+------------+
183+
184+ Python versions over time
185+ ~~~~~~~~~~~~~~~~~~~~~~~~~
186+
187+ Extract the Python version from the ``details.python`` column.
188+
189+ ::
190+
191+ #standardSQL
192+ SELECT
193+ REGEXP_EXTRACT(details.python, r"[0-9]+\.[0-9]+") AS python_version,
194+ COUNT(*) AS num_downloads,
195+ FROM `the-psf.pypi.file_downloads`
196+ WHERE
197+ -- Only query the last 6 months of history
198+ DATE(timestamp)
199+ BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
200+ AND CURRENT_DATE()
201+ GROUP BY `python_version`
202+ ORDER BY `num_downloads` DESC
203+
204+ +----------------+---------------+
205+ | python_version | num_downloads |
206+ +================+===============+
207+ | 3.7            | 12990683561   |
208+ +----------------+---------------+
209+ | 3.6            | 9035598511    |
210+ +----------------+---------------+
211+ | 2.7            | 8467785320    |
212+ +----------------+---------------+
213+ | 3.8            | 4581627740    |
214+ +----------------+---------------+
215+ | 3.5            | 2412533601    |
216+ +----------------+---------------+
217+ | null           | 1641456718    |
218+ +----------------+---------------+
197219
198220Caveats
199221=======
@@ -229,13 +251,12 @@ the official Python client library for BigQuery.
229251
230252 query_job = client.query("""
231253 SELECT COUNT(*) AS num_downloads
232- FROM `the-psf.pypi.downloads* `
254+ FROM `the-psf.pypi.file_downloads`
233255 WHERE file.project = 'pytest'
234- -- Only query the last 30 days of history
235- AND _TABLE_SUFFIX
236- BETWEEN FORMAT_DATE(
237- '%Y%m%d ', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
238- AND FORMAT_DATE('%Y%m%d ', CURRENT_DATE())""" )
256+ -- Only query the last 30 days of history
257+ AND DATE(timestamp)
258+ BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
259+ AND CURRENT_DATE()""")
239260
240261 results = query_job.result() # Waits for job to complete.
241262 for row in results:
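The 30-day filter introduced in this change is repeated in several queries, so it may help to build it in one place. The following is a minimal sketch, not part of the change itself: ``build_download_count_query`` is a hypothetical helper, and the ``@project`` named parameter is an assumption about how a caller might supply the project name instead of hard-coding ``'pytest'``.

```python
# Table name as used in the updated queries above.
TABLE = "the-psf.pypi.file_downloads"


def build_download_count_query(days=30):
    """Return SQL that counts downloads for @project over the last `days` days.

    Hypothetical helper for illustration; it only builds a string and does
    not contact BigQuery.
    """
    # Reject bools explicitly, since bool is a subclass of int in Python.
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        "SELECT COUNT(*) AS num_downloads\n"
        f"FROM `{TABLE}`\n"
        "WHERE file.project = @project\n"
        "  -- Filter on the timestamp column to limit the data scanned\n"
        "  AND DATE(timestamp)\n"
        f"    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)\n"
        "    AND CURRENT_DATE()"
    )


if __name__ == "__main__":
    print(build_download_count_query(30))
```

With the official client library, the resulting string could be passed to ``client.query(sql, job_config=...)``, where the job config is a ``bigquery.QueryJobConfig`` whose ``query_parameters`` list contains ``bigquery.ScalarQueryParameter("project", "STRING", "pytest")``. Using a query parameter avoids interpolating the project name into the SQL text.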