|
| 1 | +# Social Media Queries |
| 2 | + |
| 3 | +A collection of SQL queries to social media datasets. The queries return answers like "Most mentioned books on Hacker News", "Top apps on Reddit", and others. See the [list of queries](#queries) and [how to use them](#usage) below |
| 4 | + |
| 5 | +## Queries |
| 6 | + |
| 7 | +Queries are written for Google BigQuery [free public datasets](https://bigquery.cloud.google.com/) (requires a Google account) and stored in `.sql` files, organized by social media outlet (folder `hackernews` and so on). These datasets are snapshots taken on particular dates, so results do not include post-2015 content. |
| 8 | + |
| 9 | +Each of the queries processes 0.5-10GB of data. Processing up to 1TB per month is free, and you have up to 2,000 queries to experiment with. |
| 10 | + |
| 11 | +### [Hacker News](https://news.ycombinator.com/) |
| 12 | + |
| 13 | +* [Most cited books (comments)](hackernews/amazon-books-in-text.sql) - Using links to Amazon.com as citations. Don't include text references. Also see [this thread](https://news.ycombinator.com/item?id=10924741). This and the three next queries can be extended to other items. Examples: |
| 14 | + - Movies on International Movie Database: `imdb.com/title/tt[0-9]+/` |
| 15 | + - Books on iTunes: `itunes.apple.com/book/id[0-9]+` |
| 16 | + - Apps on Google Play: `play.google.com/store/apps/details?id=.+` |
| 17 | +* [Most cited books (submissions)](hackernews/amazon-books-in-url.sql) - The same, but this counts submitted URLs. |
| 18 | +* [Popular iTunes Apps (comments)](hackernews/itunes-apps-in-text.sql) - Like "Most cited books", but this tracks links to Apple Store. |
| 19 | +* [Popular iTunes Apps (submissions)](hackernews/itunes-apps-in-url.sql) - Similarly. |
| 20 | +* [Social network (graph)](hackernews/social-network.sql) - A weighted directional graph based on users commenting each other. Weights correspond to the number of comments one user left to another. See [Social network analysis](https://en.wikipedia.org/wiki/Social_network_analysis) for more information. |
| 21 | +* [Top authors by median](hackernews/top-authors-median.sql) - List of authors based on the median score. A quick way to find founders and VCs submitting to HN. |
| 22 | +* [Top authors by mean](hackernews/top-authors-mean.sql) - Based on the mean score. Usually implies many low-scored posts with major hits due to the skewed distribution. |
| 23 | +* [Top news sources](hackernews/top-news-sources.sql) - Where most popular news come from? Separated by day of week and hour. |
| 24 | +* [Popular Wikipedia articles](hackernews/wikipedia-pages-in-url.sql) - Counting links to Wikipedia articles. |
| 25 | + |
| 26 | +For simple queries, use Hacker News' Algolia search: |
| 27 | + |
| 28 | +* [All-time stories ranked by score](https://hn.algolia.com/?query=&sort=byPopularity&prefix&page=0&dateRange=all&type=story) |
| 29 | +* ["Show HN" by score](https://hn.algolia.com/?query=show%20hn&sort=byPopularity&prefix&page=0&dateRange=all&type=story) |
| 30 | +* ["Ask HN" by score](https://hn.algolia.com/?query=ask%20hn&sort=byPopularity&prefix&page=0&dateRange=all&type=story) |
| 31 | +* [Comments ranked by score](https://hn.algolia.com/?query=&sort=byPopularity&prefix&page=0&dateRange=all&type=comment) |
| 32 | + |
| 33 | +### [Reddit](http://reddit.com/) |
| 34 | + |
| 35 | +All Hacker News queries can be applied to Reddit after minor edits. Examples: |
| 36 | + |
| 37 | +* [Top authors by median](reddit/top-authors-median.sql) - Authors ranked by the median score with minor adjustments. Expect no poor content from them. |
| 38 | +* [Top sources of political news](reddit/posts-top-domains.sql) - Ranking sources submitted to [r/politics](http://reddit.com/r/politics). |
| 39 | + |
| 40 | +Reddit comments on BigQuery are split into multiple tables. If you want to select from comments, use `TABLE_QUERY`: |
| 41 | + |
| 42 | + `FROM (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'"))`. |
| 43 | + |
| 44 | +Beware, this can quickly exhaust the free 1TB limit. |
| 45 | + |
| 46 | +### [Wikipedia](https://www.wikipedia.org/) |
| 47 | + |
| 48 | +* [Edits made from an IP address](wikipedia/edits-by-organization.sql) - Wikipedia records IP addresses of anonymous editors. With respect to privacy, some uses of this data: |
| 49 | + - Edits by organization. Many organizations reserve static IPs. One famous example is [US Congress' edits](https://en.wikipedia.org/wiki/United_States_Congressional_staff_edits_to_Wikipedia). This query is unlikely to return many edits done by a particular organization because the sample table contains only 300M edits. Too diluted to have a representative subset. |
| 50 | + - Edits by region. The sample is sufficient for statistics by region and other broad characteristics. |
| 51 | + |
| 52 | +## Usage |
| 53 | + |
| 54 | +### Web Interface |
| 55 | + |
| 56 | +1. Locate a query in the repo's folder |
| 57 | +2. Login at <https://bigquery.cloud.google.com/welcome> |
| 58 | +3. Press "Compose query" in the top left corner |
| 59 | +4. Copy-paste the query and run it |
| 60 | + |
| 61 | +See [web UI quickstart](https://cloud.google.com/bigquery/web-ui-quickstart) by Google. |
| 62 | + |
| 63 | +### Command line: `bq` |
| 64 | + |
| 65 | +1. Install [Google Cloud SDK](https://cloud.google.com/sdk/downloads) |
| 66 | +2. Initialize your account for [command line tools](https://cloud.google.com/bigquery/bq-command-line-tool) |
| 67 | +3. Run ``bq query `cat <path>` ``, where `<path>` leads to the `.sql` file |
| 68 | + |
| 69 | +### Python in clouds: Jupyter, IPython notebooks |
| 70 | + |
| 71 | +1. Get a Google Cloud account ([free trial](https://console.cloud.google.com/freetrial)) |
| 72 | +2. Create a Jupyter notebook in [Datalab](https://cloud.google.com/datalab/) |
| 73 | +3. Do `import gcp.bigquery as bq` |
| 74 | +4. Run queries with `bq.Query()` function |
| 75 | + |
| 76 | +See Felipe Hoffa's [Hacker News notebook](https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb) for example. |
| 77 | + |
| 78 | +### BigQuery API |
| 79 | + |
| 80 | +See [BigQuery API Quickstart](https://cloud.google.com/bigquery/bigquery-api-quickstart) for examples in Java, Python, C#, PHP, Ruby. You'll need a [credentials file](https://developers.google.com/identity/protocols/application-default-credentials) to run it locally. |
| 81 | + |
| 82 | +## Contributing |
| 83 | + |
| 84 | +Pull requests are welcomed. Suggestions: |
| 85 | + |
| 86 | +* Adding new data mining queries |
| 87 | +* Rewriting `.sql` files related to Hacker News for Reddit and Wikipedia databases |
| 88 | + |
| 89 | +## Acknowledgements |
| 90 | + |
| 91 | +* Discussions on Hacker News and Reddit |
| 92 | +* [Felipe Hoffa](https://twitter.com/felipehoffa) for publishing the datasets |
0 commit comments