Skip to content

Commit e3f20b6

Browse files
init
0 parents  commit e3f20b6

18 files changed

+358
-0
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
.idea/

README.md

Lines changed: 92 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,92 @@
1+
# Social Media Queries
2+
3+
A collection of SQL queries to social media datasets. The queries return answers like "Most mentioned books on Hacker News", "Top apps on Reddit", and others. See the [list of queries](#queries) and [how to use them](#usage) below
4+
5+
## Queries
6+
7+
Queries are written for Google BigQuery [free public datasets](https://bigquery.cloud.google.com/) (requires a Google account) and stored in `.sql` files, organized by social media outlet (folder `hackernews` and so on). These datasets are snapshots taken on particular dates, so results do not include post-2015 content.
8+
9+
Each of the queries processes 0.5-10GB of data. Processing up to 1TB per month is free, and you have up to 2,000 queries to experiment with.
10+
11+
### [Hacker News](https://news.ycombinator.com/)
12+
13+
* [Most cited books (comments)](hackernews/amazon-books-in-text.sql) - Using links to Amazon.com as citations. Don't include text references. Also see [this thread](https://news.ycombinator.com/item?id=10924741). This and the three next queries can be extended to other items. Examples:
14+
- Movies on International Movie Database: `imdb.com/title/tt[0-9]+/`
15+
- Books on iTunes: `itunes.apple.com/book/id[0-9]+`
16+
- Apps on Google Play: `play.google.com/store/apps/details?id=.+`
17+
* [Most cited books (submissions)](hackernews/amazon-books-in-url.sql) - The same, but this counts submitted URLs.
18+
* [Popular iTunes Apps (comments)](hackernews/itunes-apps-in-text.sql) - Like "Most cited books", but this tracks links to Apple Store.
19+
* [Popular iTunes Apps (submissions)](hackernews/itunes-apps-in-url.sql) - Similarly.
20+
* [Social network (graph)](hackernews/social-network.sql) - A weighted directional graph based on users commenting each other. Weights correspond to the number of comments one user left to another. See [Social network analysis](https://en.wikipedia.org/wiki/Social_network_analysis) for more information.
21+
* [Top authors by median](hackernews/top-authors-median.sql) - List of authors based on the median score. A quick way to find founders and VCs submitting to HN.
22+
* [Top authors by mean](hackernews/top-authors-mean.sql) - Based on the mean score. Usually implies many low-scored posts with major hits due to the skewed distribution.
23+
* [Top news sources](hackernews/top-news-sources.sql) - Where most popular news come from? Separated by day of week and hour.
24+
* [Popular Wikipedia articles](hackernews/wikipedia-pages-in-url.sql) - Counting links to Wikipedia articles.
25+
26+
For simple queries, use Hacker News' Algolia search:
27+
28+
* [All-time stories ranked by score](https://hn.algolia.com/?query=&sort=byPopularity&prefix&page=0&dateRange=all&type=story)
29+
* ["Show HN" by score](https://hn.algolia.com/?query=show%20hn&sort=byPopularity&prefix&page=0&dateRange=all&type=story)
30+
* ["Ask HN" by score](https://hn.algolia.com/?query=ask%20hn&sort=byPopularity&prefix&page=0&dateRange=all&type=story)
31+
* [Comments ranked by score](https://hn.algolia.com/?query=&sort=byPopularity&prefix&page=0&dateRange=all&type=comment)
32+
33+
### [Reddit](http://reddit.com/)
34+
35+
All Hacker News queries can be applied to Reddit after minor edits. Examples:
36+
37+
* [Top authors by median](reddit/top-authors-median.sql) - Authors ranked by the median score with minor adjustments. Expect no poor content from them.
38+
* [Top sources of political news](reddit/posts-top-domains.sql) - Ranking sources submitted to [r/politics](http://reddit.com/r/politics).
39+
40+
Reddit comments on BigQuery are split into multiple tables. If you want to select from comments, use `TABLE_QUERY`:
41+
42+
`FROM (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'"))`.
43+
44+
Beware, this can quickly exhaust the free 1TB limit.
45+
46+
### [Wikipedia](https://www.wikipedia.org/)
47+
48+
* [Edits made from an IP address](wikipedia/edits-by-organization.sql) - Wikipedia records IP addresses of anonymous editors. With respect to privacy, some uses of this data:
49+
- Edits by organization. Many organizations reserve static IPs. One famous example is [US Congress' edits](https://en.wikipedia.org/wiki/United_States_Congressional_staff_edits_to_Wikipedia). This query is unlikely to return many edits done by a particular organization because the sample table contains only 300M edits. Too diluted to have a representative subset.
50+
- Edits by region. The sample is sufficient for statistics by region and other broad characteristics.
51+
52+
## Usage
53+
54+
### Web Interface
55+
56+
1. Locate a query in the repo's folder
57+
2. Login at <https://bigquery.cloud.google.com/welcome>
58+
3. Press "Compose query" in the top left corner
59+
4. Copy-paste the query and run it
60+
61+
See [web UI quickstart](https://cloud.google.com/bigquery/web-ui-quickstart) by Google.
62+
63+
### Command line: `bq`
64+
65+
1. Install [Google Cloud SDK](https://cloud.google.com/sdk/downloads)
66+
2. Initialize your account for [command line tools](https://cloud.google.com/bigquery/bq-command-line-tool)
67+
3. Run ``bq query `cat <path>` ``, where `<path>` leads to the `.sql` file
68+
69+
### Python in clouds: Jupyter, IPython notebooks
70+
71+
1. Get a Google Cloud account ([free trial](https://console.cloud.google.com/freetrial))
72+
2. Create a Jupyter notebook in [Datalab](https://cloud.google.com/datalab/)
73+
3. Do `import gcp.bigquery as bq`
74+
4. Run queries with `bq.Query()` function
75+
76+
See Felipe Hoffa's [Hacker News notebook](https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb) for example.
77+
78+
### BigQuery API
79+
80+
See [BigQuery API Quickstart](https://cloud.google.com/bigquery/bigquery-api-quickstart) for examples in Java, Python, C#, PHP, Ruby. You'll need a [credentials file](https://developers.google.com/identity/protocols/application-default-credentials) to run it locally.
81+
82+
## Contributing
83+
84+
Pull requests are welcomed. Suggestions:
85+
86+
* Adding new data mining queries
87+
* Rewriting `.sql` files related to Hacker News for Reddit and Wikipedia databases
88+
89+
## Acknowledgements
90+
91+
* Discussions on Hacker News and Reddit
92+
* [Felipe Hoffa](https://twitter.com/felipehoffa) for publishing the datasets
Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
SELECT
2+
CONCAT('http://amazon.com/', REGEXP_EXTRACT(text, r'amazon.com/([^ \"]+/dp/[0-9]+)')) AS link,
3+
SUM(score) as sum_score,
4+
COUNT(1) AS cnt
5+
FROM
6+
[fh-bigquery:hackernews.full_201510]
7+
WHERE
8+
text CONTAINS 'amazon.com'
9+
GROUP BY
10+
link
11+
HAVING
12+
link IS NOT NULL
13+
ORDER BY
14+
cnt DESC
15+
LIMIT
16+
100

hackernews/amazon-books-in-url.sql

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
SELECT
2+
CONCAT('http://amazon.com/', REGEXP_EXTRACT(url, r'amazon.com/([^ \"]+/dp/[0-9]+)')) AS link,
3+
SUM(score) as sum_score,
4+
COUNT(1) AS cnt
5+
FROM
6+
[fh-bigquery:hackernews.full_201510]
7+
WHERE
8+
url CONTAINS 'amazon.com' AND type = 'story'
9+
GROUP BY
10+
link
11+
HAVING
12+
link IS NOT NULL
13+
ORDER BY
14+
sum_score DESC
15+
LIMIT
16+
100

hackernews/itunes-apps-in-text.sql

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
-- iTunes apps mentioned in text
2+
SELECT
3+
CONCAT('https://itunes.apple.com/app/id=', REGEXP_EXTRACT(text, r'itunes.apple.com/app/id([0-9]+)')) AS link,
4+
COUNT(1) AS cnt
5+
FROM
6+
[fh-bigquery:hackernews.full_201510]
7+
WHERE
8+
text CONTAINS 'itunes.apple.com'
9+
GROUP BY
10+
link
11+
HAVING
12+
link IS NOT NULL
13+
ORDER BY
14+
cnt DESC
15+
LIMIT
16+
100

hackernews/itunes-apps-in-url.sql

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
-- iTunes apps submitted to Hacker News
2+
SELECT
3+
CONCAT('https://itunes.apple.com/app/id', REGEXP_EXTRACT(url, r'itunes.apple.com/app/id([0-9]+)')) AS link,
4+
SUM(score) as sum_score,
5+
COUNT(1) AS cnt
6+
FROM
7+
[fh-bigquery:hackernews.full_201510]
8+
WHERE
9+
url CONTAINS 'itunes.apple.com/app/' AND type = 'story'
10+
GROUP BY
11+
link
12+
HAVING
13+
link IS NOT NULL
14+
ORDER BY
15+
sum_score DESC
16+
LIMIT
17+
100

hackernews/new-year-stories.sql

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
SELECT
2+
id, url, score, title,
3+
MONTH(SEC_TO_TIMESTAMP(time)) AS month,
4+
DAY(SEC_TO_TIMESTAMP(time)) AS day
5+
FROM
6+
[fh-bigquery:hackernews.full_201510]
7+
WHERE
8+
type = 'story'
9+
HAVING
10+
month = 1 AND
11+
day = 1
12+
ORDER BY
13+
score DESC
14+
LIMIT
15+
100

hackernews/social-network.sql

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
-- This is a directed graph. Weights are based on "X comments Y" relationship.
2+
-- Modify `LIMIT` and `HAVING` if you want to build a complete graph in Gephi, networkx, or elsewhere.
3+
SELECT
4+
[tx.by] x,
5+
[ty.by] y,
6+
COUNT(1) weight
7+
FROM
8+
[fh-bigquery:hackernews.full_201510] tx
9+
LEFT JOIN EACH [fh-bigquery:hackernews.full_201510] ty ON ty.id=tx.parent
10+
WHERE
11+
[tx.by] IS NOT NULL AND
12+
[ty.by] IS NOT NULL AND
13+
tx.parent IS NOT NULL AND
14+
[tx.by] != [ty.by]
15+
GROUP BY
16+
x, y
17+
HAVING
18+
weight >= 5
19+
ORDER BY
20+
weight DESC
21+
LIMIT
22+
100

hackernews/top-authors-mean.sql

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
SELECT
2+
[by] author,
3+
COUNT(1) cnt,
4+
ROUND(AVG(score)) avg_score,
5+
CONCAT("https://news.ycombinator.com/submitted?id=", [by]) link,
6+
FROM
7+
[fh-bigquery:hackernews.full_201510]
8+
WHERE
9+
type = 'story'
10+
GROUP BY
11+
author, link
12+
HAVING
13+
cnt >= 5
14+
ORDER BY
15+
avg_score DESC
16+
LIMIT
17+
100

hackernews/top-authors-median.sql

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
-- Authors with a high median score tend to be founders and VCs.
2+
-- Also the high median indicates fewer low-quality submissions by the user.
3+
-- See also: `top-authors-mean.sql`
4+
SELECT
5+
[by] author,
6+
COUNT(1) cnt,
7+
NTH(11, QUANTILES(score, 21)) median_score,
8+
CONCAT("https://news.ycombinator.com/user?id=", [by]) link,
9+
FROM
10+
[fh-bigquery:hackernews.full_201510]
11+
WHERE
12+
type = 'story'
13+
GROUP BY
14+
author, link
15+
HAVING
16+
cnt >= 10
17+
ORDER BY
18+
median_score DESC
19+
LIMIT
20+
100

0 commit comments

Comments
 (0)