|
1 | 1 | # Data Science Question Answer
|
2 | 2 |
|
| 3 | +The purpose of this repo is twofold: |
| 4 | + |
| 5 | +* To help you (data science practitioners) prepare for data science related interviews |
| 6 | +* To introduce some basic data science concepts to people who don't know them but want to learn |
| 7 | + |
| 8 | +The focus is on breadth of knowledge, so this is more of a quick reference than in-depth study material. If you want to learn a specific topic in detail, please refer to other content or reach out; I'd love to point you to materials I found useful. |
| 9 | + |
| 10 | +I might add some topics from time to time but hey, this should also be a community effort, right? Any pull request is welcome! |
| 11 | + |
| 12 | +Here are the categories: |
| 13 | + |
3 | 14 | * [SQL](#sql)
|
4 | 15 | * [Statistics and ML In General](#statistics-and-ml-in-general)
|
5 | 16 | * [Supervised Learning](#supervised-learning)
|
|
10 | 21 |
|
11 | 22 | ## SQL
|
12 | 23 |
|
13 |
| -First off some good SQL resources: |
14 |
| - |
15 |
| -* [W3schools SQL](https://www.w3schools.com/sql/) |
16 |
| -* [SQLZOO](http://sqlzoo.net/) |
17 |
| - |
18 |
| -Questions: |
19 |
| - |
20 | 24 | * [Difference between joins](#difference-between-joins)
|
21 | 25 |
|
22 | 26 |
|
@@ -45,8 +49,8 @@ Questions:
|
45 | 49 | * [Bagging](#bagging)
|
46 | 50 | * [Stacking](#stacking)
|
47 | 51 | * [Generative vs discriminative](#generative-vs-discriminative)
|
48 |
| -* [Paramteric vs Nonparametric](#paramteric-vs-nonparametric) |
49 |
| - |
| 52 | +* [Parametric vs Nonparametric](#parametric-vs-nonparametric) |
| 53 | +* [Recommender System](#recommender-system) |
50 | 54 |
|
51 | 55 | ### Project Workflow
|
52 | 56 |
|
@@ -196,14 +200,31 @@ generated.
|
196 | 200 | [back to top](#data-science-question-answer)
|
197 | 201 |
|
198 | 202 |
|
199 |
| -### Paramteric vs Nonparametric |
| 203 | +### Parametric vs Nonparametric |
200 | 204 |
|
201 | 205 | * A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model.
|
202 | 206 | * A model where the number of parameters is not determined prior to training is called a nonparametric model. Nonparametric does not mean that the model has no parameters. On the contrary, nonparametric models (can) become more and more complex with an increasing amount of data.
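To make the distinction concrete, here is a minimal sketch in plain Python. It assumes a 1-D least-squares line as the parametric model and k-nearest neighbors as the nonparametric one; the function names and toy data are made up for illustration.

```python
def fit_linear(xs, ys):
    """Parametric: least squares for y = a*x + b, always exactly 2 parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return (a, b)


def fit_knn(xs, ys):
    """Nonparametric: k-NN "training" just stores the data, so the model grows with n."""
    return list(zip(xs, ys))


def predict_knn(model, x, k=1):
    """Average the targets of the k stored points closest to x."""
    nearest = sorted(model, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical toy data: a small and a larger sample of y = 2x + 1
small_x, small_y = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
large_x = [float(i) for i in range(100)]
large_y = [2.0 * x + 1.0 for x in large_x]

linear_small = fit_linear(small_x, small_y)
linear_large = fit_linear(large_x, large_y)
knn_small = fit_knn(small_x, small_y)
knn_large = fit_knn(large_x, large_y)

print(len(linear_small), len(linear_large))  # size of the linear model
print(len(knn_small), len(knn_large))        # size of the k-NN model
```

The linear model keeps the same two numbers no matter how much data it sees, while what k-NN has to store scales with the training set.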
|
203 | 207 |
|
204 | 208 | [back to top](#data-science-question-answer)
|
205 | 209 |
|
206 | 210 |
|
| 211 | +### Recommender System |
| 212 | + |
| 213 | +* I put recommender systems here since technically they fall under neither supervised nor unsupervised learning |
| 214 | +* A recommender system seeks to predict the 'rating' or 'preference' a user would give to items and then recommend items accordingly |
| 215 | +* Content based recommender systems recommend items similar to those a given user has liked in the past, based on either explicit (ratings, like/dislike button) or implicit (viewed/finished an article) feedback. Content based recommenders work solely with the past interactions of a given user and do not take other users into consideration. |
| 216 | +* Collaborative filtering is based on past interactions of the whole user base. There are two collaborative filtering approaches: **user-based** and **item-based** |
| 217 | + - user-based: for user u, a score for an unrated item is produced by combining the ratings of users similar to u. |
| 218 | + - item-based: a rating (u, i) is produced by looking at the set of items similar to i (interaction similarity), then the ratings by u of similar items are combined into a predicted rating |
| 219 | +* Traditionally, recommender systems use matrix factorization methods, although more recently there are also deep learning based methods |
| 220 | +* Cold start (new users or items with no interaction history) and sparse rating matrices can be issues for recommender systems |
| 221 | +* Widely used in movies, news, research articles, products, social tags, music, etc. |
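The item-based flavor of collaborative filtering can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the ratings dictionary is made-up toy data, cosine similarity over co-rating users is one of several possible similarity measures, and the function names are hypothetical.

```python
import math

# Hypothetical toy ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 4, "B": 5, "C": 2},
    "carol": {"A": 1, "B": 2, "C": 5},
    "dave":  {"A": 5, "C": 1},  # dave has not rated B
}


def item_similarity(ratings, item_i, item_j):
    """Cosine similarity between two items over the users who rated both."""
    common = [u for u, r in ratings.items() if item_i in r and item_j in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_i] * ratings[u][item_j] for u in common)
    norm_i = math.sqrt(sum(ratings[u][item_i] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings[u][item_j] ** 2 for u in common))
    return dot / (norm_i * norm_j)


def predict_item_based(ratings, user, item):
    """Similarity-weighted average of the user's own ratings of other items."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other_item, rating in ratings[user].items():
        if other_item == item:
            continue
        sim = item_similarity(ratings, item, other_item)
        weighted_sum += sim * rating
        weight_total += abs(sim)
    if weight_total == 0.0:
        return None  # cold start: nothing to base a prediction on
    return weighted_sum / weight_total


prediction = predict_item_based(ratings, "dave", "B")
print(round(prediction, 2))
```

Asking for an item nobody has rated returns `None`, which is the cold start problem in miniature. A user-based variant would instead compare users to each other and combine the ratings of the most similar users.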
| 222 | + |
| 223 | + |
| 224 | + |
| 225 | +[back to top](#data-science-question-answer) |
| 226 | + |
| 227 | + |
207 | 228 | ## Supervised Learning
|
208 | 229 |
|
209 | 230 | * [Linear regression](#linear-regression)
|
@@ -236,7 +257,7 @@ generated.
|
236 | 257 |
|
237 | 258 | ### Logistic regression
|
238 | 259 |
|
239 |
| -* Generalized linear model (GLM) for classification problems |
| 260 | +* Generalized linear model (GLM) for binary classification problems |
240 | 261 | * Apply the sigmoid function to the output of a linear model, squeezing the output
|
241 | 262 | to the range (0, 1)
|
242 | 263 | * Threshold to make prediction: usually if the output > .5, prediction 1; otherwise prediction 0
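The sigmoid and threshold steps above can be sketched as follows. The weights and bias are hand-picked, hypothetical values rather than fitted coefficients, and the function names are only for illustration.

```python
import math


def sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Linear model output passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Predict class 1 if the probability exceeds the threshold, else 0."""
    return 1 if predict_proba(weights, bias, features) > threshold else 0


# Hypothetical, hand-picked coefficients (not fitted to any data)
weights, bias = [2.0, -1.0], 0.5

print(predict_proba(weights, bias, [1.0, 1.0]))  # probability of class 1
print(predict(weights, bias, [1.0, 1.0]))        # thresholded prediction
print(predict(weights, bias, [-1.0, 2.0]))       # thresholded prediction
```

The 0.5 threshold is only a default. It can be moved when false positives and false negatives have different costs.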
|
@@ -510,3 +531,6 @@ Using **Ubuntu** as an example.
|
510 | 531 | * Install package: `sudo apt-get install <package>`
|
511 | 532 |
|
512 | 533 | [back to top](#data-science-question-answer)
|
| 534 | + |
| 535 | + |
| 536 | +Confession: some images are taken from the internet without proper credit. If you are the author and this is an issue for you, please let me know. |