init

antontarasenko · antontarasenko · commit e3f20b65143f · 2016-03-28T19:44:48.000+03:00
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+.idea/
diff --git a/README.md b/README.md
@@ -0,0 +1,92 @@
+# Social Media Queries
+
+A collection of SQL queries to social media datasets. The queries return answers like "Most mentioned books on Hacker News", "Top apps on Reddit", and others. See the [list of queries](#queries) and [how to use them](#usage) below.
+
+## Queries
+
+Queries are written for Google BigQuery [free public datasets](https://bigquery.cloud.google.com/) (requires a Google account) and stored in `.sql` files, organized by social media outlet (folder `hackernews` and so on). These datasets are snapshots taken on particular dates, so results do not include post-2015 content.
+
+Each of the queries processes 0.5-10GB of data. Processing up to 1TB per month is free, which is enough for roughly 100 to 2,000 query runs to experiment with.
+
+### [Hacker News](https://news.ycombinator.com/)
+
+* [Most cited books (comments)](hackernews/amazon-books-in-text.sql) - Uses links to Amazon.com as citations and does not count plain-text references. Also see [this thread](https://news.ycombinator.com/item?id=10924741). This and the next three queries can be extended to other items. Examples:
+  - Movies on the Internet Movie Database: `imdb.com/title/tt[0-9]+/`
+  - Books on iTunes: `itunes.apple.com/book/id[0-9]+`
+  - Apps on Google Play: `play.google.com/store/apps/details?id=.+`
+* [Most cited books (submissions)](hackernews/amazon-books-in-url.sql) - The same, but this counts submitted URLs.
+* [Popular iTunes Apps (comments)](hackernews/itunes-apps-in-text.sql) - Like "Most cited books", but this tracks links to Apple's App Store.
+* [Popular iTunes Apps (submissions)](hackernews/itunes-apps-in-url.sql) - Similarly.
+* [Social network (graph)](hackernews/social-network.sql) - A weighted directed graph based on users replying to each other. Weights correspond to the number of comments one user left in reply to another. See [Social network analysis](https://en.wikipedia.org/wiki/Social_network_analysis) for more information.
+* [Top authors by median](hackernews/top-authors-median.sql) - List of authors based on the median score. A quick way to find founders and VCs submitting to HN.
+* [Top authors by mean](hackernews/top-authors-mean.sql) - Based on the mean score. Usually implies many low-scored posts with major hits due to the skewed distribution.
+* [Top news sources](hackernews/top-news-sources.sql) - Where does the most popular news come from? Separated by day of week and hour.
+* [Popular Wikipedia articles](hackernews/wikipedia-pages-in-url.sql) - Counting links to Wikipedia articles. 
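+
+The extension to other items mentioned above can be sketched as follows. This is a minimal adaptation of the "Most cited books (comments)" query to the IMDb pattern listed above, not a query shipped in this repo; it assumes the same `fh-bigquery:hackernews.full_201510` table and BigQuery legacy SQL.
+
+```sql
+-- Sketch: most linked IMDb titles in Hacker News comments (legacy SQL).
+-- Only the regular expression and the URL prefix differ from the Amazon query.
+SELECT
+  CONCAT('http://www.imdb.com/title/', REGEXP_EXTRACT(text, r'imdb\.com/title/(tt[0-9]+)')) AS link,
+  COUNT(1) AS cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  text CONTAINS 'imdb.com/title/'
+GROUP BY
+  link
+HAVING
+  link IS NOT NULL
+ORDER BY
+  cnt DESC
+LIMIT
+  100
+```
+
+The capture group keeps only the title ID, so links to the same movie with different trailing paths or query strings are counted together.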
+
+For simple queries, use Hacker News' Algolia search:
+
+* [All-time stories ranked by score](https://hn.algolia.com/?query=&sort=byPopularity&prefix&page=0&dateRange=all&type=story)
+* ["Show HN" by score](https://hn.algolia.com/?query=show%20hn&sort=byPopularity&prefix&page=0&dateRange=all&type=story)
+* ["Ask HN" by score](https://hn.algolia.com/?query=ask%20hn&sort=byPopularity&prefix&page=0&dateRange=all&type=story)
+* [Comments ranked by score](https://hn.algolia.com/?query=&sort=byPopularity&prefix&page=0&dateRange=all&type=comment)
+
+### [Reddit](http://reddit.com/)
+
+All Hacker News queries can be applied to Reddit after minor edits. Examples:
+
+* [Top authors by median](reddit/top-authors-median.sql) - Authors ranked by the median score with minor adjustments. Expect no poor content from them.
+* [Top sources of political news](reddit/posts-top-domains.sql) - Ranking sources submitted to [r/politics](http://reddit.com/r/politics).
+
+Reddit comments on BigQuery are split into multiple tables. If you want to select from comments, use `TABLE_QUERY`:
+
+  `FROM (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'"))`.
+
+Beware, this can quickly exhaust the free 1TB limit.
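+
+One way to stay within the limit is to narrow the `table_id` filter and select only the columns you need. The sketch below assumes a yearly table named `2014` exists, as the `BETWEEN '2007' AND '2014'` filter above implies; adjust the filter to the tables actually present in the dataset.
+
+```sql
+-- Sketch: most active subreddits by comment count, restricted to one table.
+-- Assumes a table called `2014` in fh-bigquery:reddit_comments.
+SELECT
+  subreddit,
+  COUNT(1) AS cnt
+FROM
+  (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id = '2014'"))
+GROUP BY
+  subreddit
+ORDER BY
+  cnt DESC
+LIMIT
+  100
+```
+
+BigQuery bills by the columns scanned, so reading only `subreddit` costs far less than also reading the comment text in `body`.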
+
+### [Wikipedia](https://www.wikipedia.org/)
+
+* [Edits made from an IP address](wikipedia/edits-by-organization.sql) - Wikipedia records the IP addresses of anonymous editors. Some uses of this data, keeping privacy in mind:
+  - Edits by organization. Many organizations reserve static IPs. One famous example is [US Congress' edits](https://en.wikipedia.org/wiki/United_States_Congressional_staff_edits_to_Wikipedia). This query is unlikely to return many edits by a particular organization: the sample table contains only 300M edits, so any single organization's subset is too small to be representative.
+  - Edits by region. The sample is sufficient for statistics by region and other broad characteristics.
+
+## Usage
+
+### Web Interface
+
+1. Locate a query in the repo's folder
+2. Log in at <https://bigquery.cloud.google.com/welcome>
+3. Press "Compose query" in the top left corner
+4. Copy-paste the query and run it
+
+See [web UI quickstart](https://cloud.google.com/bigquery/web-ui-quickstart) by Google.
+
+### Command line: `bq`
+
+1. Install [Google Cloud SDK](https://cloud.google.com/sdk/downloads)
+2. Initialize your account for [command line tools](https://cloud.google.com/bigquery/bq-command-line-tool)
+3. Run `bq query "$(cat <path>)"`, where `<path>` leads to the `.sql` file. The double quotes keep the shell from splitting the query or expanding characters in it.
+
+### Python in clouds: Jupyter, IPython notebooks
+
+1. Get a Google Cloud account ([free trial](https://console.cloud.google.com/freetrial))
+2. Create a Jupyter notebook in [Datalab](https://cloud.google.com/datalab/)
+3. Do `import gcp.bigquery as bq`
+4. Run queries with `bq.Query()` function
+
+See Felipe Hoffa's [Hacker News notebook](https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb) for an example.
+
+### BigQuery API
+
+See [BigQuery API Quickstart](https://cloud.google.com/bigquery/bigquery-api-quickstart) for examples in Java, Python, C#, PHP, Ruby. You'll need a [credentials file](https://developers.google.com/identity/protocols/application-default-credentials) to run it locally.
+
+## Contributing
+
+Pull requests are welcome. Suggestions:
+
+* Adding new data mining queries
+* Rewriting `.sql` files related to Hacker News for Reddit and Wikipedia databases
+
+## Acknowledgements
+
+* Discussions on Hacker News and Reddit
+* [Felipe Hoffa](https://twitter.com/felipehoffa) for publishing the datasets
diff --git a/hackernews/amazon-books-in-text.sql b/hackernews/amazon-books-in-text.sql
@@ -0,0 +1,16 @@
+SELECT
+  CONCAT('http://amazon.com/', REGEXP_EXTRACT(text, r'amazon.com/([^ \"]+/dp/[0-9]+)')) AS link,
+  SUM(score) as sum_score,
+  COUNT(1) AS cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  text CONTAINS 'amazon.com'
+GROUP BY
+  link
+HAVING
+  link IS NOT NULL
+ORDER BY
+  cnt DESC
+LIMIT
+  100
diff --git a/hackernews/amazon-books-in-url.sql b/hackernews/amazon-books-in-url.sql
@@ -0,0 +1,16 @@
+SELECT
+  CONCAT('http://amazon.com/', REGEXP_EXTRACT(url, r'amazon.com/([^ \"]+/dp/[0-9]+)')) AS link,
+  SUM(score) as sum_score,
+  COUNT(1) AS cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  url CONTAINS 'amazon.com' AND type = 'story'
+GROUP BY
+  link
+HAVING
+  link IS NOT NULL
+ORDER BY
+  sum_score DESC
+LIMIT
+  100
diff --git a/hackernews/itunes-apps-in-text.sql b/hackernews/itunes-apps-in-text.sql
@@ -0,0 +1,16 @@
+-- iTunes apps mentioned in text
+SELECT
+  CONCAT('https://itunes.apple.com/app/id', REGEXP_EXTRACT(text, r'itunes.apple.com/app/id([0-9]+)')) AS link,
+  COUNT(1) AS cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  text CONTAINS 'itunes.apple.com'
+GROUP BY
+  link
+HAVING
+  link IS NOT NULL
+ORDER BY
+  cnt DESC
+LIMIT
+  100
diff --git a/hackernews/itunes-apps-in-url.sql b/hackernews/itunes-apps-in-url.sql
@@ -0,0 +1,17 @@
+-- iTunes apps submitted to Hacker News
+SELECT
+  CONCAT('https://itunes.apple.com/app/id', REGEXP_EXTRACT(url, r'itunes.apple.com/app/id([0-9]+)')) AS link,
+  SUM(score) as sum_score,
+  COUNT(1) AS cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  url CONTAINS 'itunes.apple.com/app/' AND type = 'story'
+GROUP BY
+  link
+HAVING
+  link IS NOT NULL
+ORDER BY
+  sum_score DESC
+LIMIT
+  100
diff --git a/hackernews/new-year-stories.sql b/hackernews/new-year-stories.sql
@@ -0,0 +1,15 @@
+SELECT
+  id, url, score, title,
+  MONTH(SEC_TO_TIMESTAMP(time)) AS month,
+  DAY(SEC_TO_TIMESTAMP(time)) AS day
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  type = 'story'
+HAVING
+  month = 1 AND
+  day = 1
+ORDER BY
+  score DESC
+LIMIT
+  100
diff --git a/hackernews/social-network.sql b/hackernews/social-network.sql
@@ -0,0 +1,22 @@
+-- This is a directed graph. Weights are based on "X comments Y" relationship.
+-- Modify `LIMIT` and `HAVING` if you want to build a complete graph in Gephi, networkx, or elsewhere.
+SELECT
+  [tx.by] x,
+  [ty.by] y,
+  COUNT(1) weight
+FROM
+  [fh-bigquery:hackernews.full_201510] tx
+LEFT JOIN EACH [fh-bigquery:hackernews.full_201510] ty ON ty.id=tx.parent
+WHERE
+  [tx.by] IS NOT NULL AND
+  [ty.by] IS NOT NULL AND
+  tx.parent IS NOT NULL AND
+  [tx.by] != [ty.by]
+GROUP BY
+  x, y
+HAVING
+  weight >= 5
+ORDER BY
+  weight DESC
+LIMIT
+  100
diff --git a/hackernews/top-authors-mean.sql b/hackernews/top-authors-mean.sql
@@ -0,0 +1,17 @@
+SELECT
+  [by] author,
+  COUNT(1) cnt,
+  ROUND(AVG(score)) avg_score,
+  CONCAT("https://news.ycombinator.com/submitted?id=", [by]) link,
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  type = 'story'
+GROUP BY
+  author, link
+HAVING
+  cnt >= 5
+ORDER BY
+  avg_score DESC
+LIMIT
+  100
diff --git a/hackernews/top-authors-median.sql b/hackernews/top-authors-median.sql
@@ -0,0 +1,20 @@
+-- Authors with a high median score tend to be founders and VCs.
+-- Also the high median indicates fewer low-quality submissions by the user.
+-- See also: `top-authors-mean.sql`
+SELECT
+  [by] author,
+  COUNT(1) cnt,
+  NTH(11, QUANTILES(score, 21)) median_score,
+  CONCAT("https://news.ycombinator.com/user?id=", [by]) link,
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  type = 'story'
+GROUP BY
+  author, link
+HAVING
+  cnt >= 10
+ORDER BY
+  median_score DESC
+LIMIT
+  100
diff --git a/hackernews/top-f-word.sql b/hackernews/top-f-word.sql
@@ -0,0 +1,15 @@
+-- Who dropped most f-words on HN?
+SELECT
+  [by] author,
+  COUNT(1) cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  text CONTAINS 'fuck'
+GROUP BY
+  author
+ORDER BY
+  cnt DESC
+LIMIT
+  100
+IGNORE CASE
diff --git a/hackernews/top-news-sources.sql b/hackernews/top-news-sources.sql
@@ -0,0 +1,21 @@
+-- Most popular news sources averaged by day of week and hour.
+-- Remove (dow, hour) pairs for a simple ranking.
+SELECT
+  DOMAIN(url) domain,
+  HOUR(SEC_TO_TIMESTAMP(time)) hour,
+  DAYOFWEEK(SEC_TO_TIMESTAMP(time)) dow,
+  NTH(11, QUANTILES(score, 21)) median_score,
+  AVG(score) avg_score,
+  COUNT(url) cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  score > 0 AND
+  type = "story"
+GROUP BY
+  domain, dow, hour
+HAVING
+  cnt > 100 AND
+  domain IS NOT NULL
+ORDER BY
+  cnt DESC
diff --git a/hackernews/top-show-hn.sql b/hackernews/top-show-hn.sql
@@ -0,0 +1,12 @@
+-- List of "Show HN" submissions
+SELECT
+  title, url, score
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  title CONTAINS 'Show HN:'
+ORDER BY
+  score DESC
+LIMIT
+  1000
+IGNORE CASE
diff --git a/hackernews/wikipedia-pages-in-url.sql b/hackernews/wikipedia-pages-in-url.sql
@@ -0,0 +1,14 @@
+SELECT
+  url,
+  SUM(score) AS sum_score,
+  COUNT(1) AS cnt
+FROM
+  [fh-bigquery:hackernews.full_201510]
+WHERE
+  url CONTAINS 'wikipedia.org/wiki/'
+GROUP BY
+  url
+ORDER BY
+  sum_score DESC
+LIMIT
+  100
diff --git a/reddit/best-iama-faq.sql b/reddit/best-iama-faq.sql
@@ -0,0 +1,20 @@
+-- Sum up scores for question-answer pairs in IAmA sessions and show the 100 best pairs
+-- TODO JOIN multiple tables of Reddit comments (TABLE_QUERY works only in FROM)
+SELECT
+  q.body question,
+  a.body answer,
+  (q.score + a.score) sum_score
+FROM
+  (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'")) q
+LEFT JOIN EACH
+  (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'")) a
+    ON a.parent=q.id
+WHERE
+  q.subreddit IN ("IAmA") AND
+  a.parent IS NOT NULL AND
+  q.id IS NOT NULL AND
+  a.id IS NOT NULL
+ORDER BY
+  sum_score DESC
+LIMIT
+  100
diff --git a/reddit/famous-authors.sql b/reddit/famous-authors.sql
@@ -0,0 +1,14 @@
+SELECT
+  author,
+  COUNT(1) cnt,
+  NTH(11, QUANTILES(score, 21)) median_score,
+FROM
+  [fh-bigquery:reddit_posts.full_corpus_201512]
+GROUP BY
+  author
+HAVING
+  cnt >= 25
+ORDER BY
+  median_score DESC
+LIMIT
+  100
diff --git a/reddit/posts-top-domains.sql b/reddit/posts-top-domains.sql
@@ -0,0 +1,15 @@
+-- Sources of major news. Change `subreddit` for topical news.
+SELECT
+  DOMAIN(url) domain,
+  COUNT(1) cnt
+FROM
+  [fh-bigquery:reddit_posts.full_corpus_201512]
+WHERE
+  score > 1000 AND
+  subreddit = 'politics'
+GROUP BY
+  domain
+ORDER BY
+  cnt DESC
+LIMIT
+  100
diff --git a/wikipedia/edits-by-organization.sql b/wikipedia/edits-by-organization.sql
@@ -0,0 +1,15 @@
+-- Change the REGEXP_MATCH pattern to match the IP range of the organization of interest
+SELECT
+  title,
+  COUNT(1) cnt
+FROM
+  [bigquery-public-data:samples.wikipedia]
+WHERE
+  reversion_id IS NOT NULL AND
+  REGEXP_MATCH(contributor_ip, r'8\.8\.[0-9]+\.[0-9]+')
+GROUP BY
+  title
+ORDER BY
+  cnt DESC
+LIMIT
+  100