Commit 75896d6

nb notebook link update
1 parent e89db80 commit 75896d6

File tree

1 file changed: +21 -9 lines changed

projects/practice_projects/naive_bayes_tutorial/Naive_Bayes_tutorial.ipynb

Lines changed: 21 additions & 9 deletions
@@ -8,7 +8,7 @@
 "\n",
 "Spam detection is one of the major applications of Machine Learning in the interwebs today. Pretty much all of the major email service providers have spam detection systems built in and automatically classify such mail as 'Junk Mail'. \n",
 "\n",
-"In this mission we will be using the Naive Bayes algorithm to create a model that can classify <a href = 'https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection'> SMS messages </a> as spam or not spam, based on the training we give to the model. It is important to have some level of intuition as to what a spammy text message might look like. Usually they have words like 'free', 'win', 'winner', 'cash', 'prize' and the like in them as these texts are designed to catch your eye and in some sense tempt you to open them. Also, spam messages tend to have words written in all capitals and also tend to use a lot of exclamation marks. To the recipient, it is usually pretty straightforward to identify a spam text and our objective here is to train a model to do that for us!\n",
+"In this mission we will be using the Naive Bayes algorithm to create a model that can classify [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) SMS messages as spam or not spam, based on the training we give to the model. It is important to have some level of intuition as to what a spammy text message might look like. Usually they have words like 'free', 'win', 'winner', 'cash', 'prize' and the like in them as these texts are designed to catch your eye and in some sense tempt you to open them. Also, spam messages tend to have words written in all capitals and also tend to use a lot of exclamation marks. To the recipient, it is usually pretty straightforward to identify a spam text and our objective here is to train a model to do that for us!\n",
 "\n",
 "Being able to identify spam messages is a binary classification problem as messages are classified as either 'Spam' or 'Not Spam' and nothing else. Also, this is a supervised learning problem, as we will be feeding a labelled dataset into the model, that it can learn from, to make future predictions. "
 ]
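For readers skimming the diff, the context lines above frame the task as a supervised, binary spam/ham classifier. The sketch below shows the end-to-end shape of such a flow; the file name `SMSSpamCollection`, the tab-separated label/message layout, the train/test split, and the choice of `MultinomialNB` are assumptions for illustration, not details taken from this commit.

```python
# Minimal sketch of the supervised binary classification flow described above.
# Assumptions: a local tab-separated 'SMSSpamCollection' file with a label and a
# message column, and scikit-learn's MultinomialNB as the Naive Bayes variant.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('SMSSpamCollection', sep='\t', header=None,
                 names=['label', 'sms_message'])

X_train, X_test, y_train, y_test = train_test_split(
    df['sms_message'], df['label'], random_state=1)

# Learn the vocabulary on the training messages only, then reuse it for the test set.
count_vector = CountVectorizer()
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
print(naive_bayes.score(testing_data, y_test))  # accuracy on held-out messages
```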
@@ -274,7 +274,7 @@
 "Lets break this down and see how we can do this conversion using a small set of documents.\n",
 "\n",
 "To handle this, we will be using sklearns \n",
-"<a href = 'http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer'> `sklearn.feature_extraction.text.CountVectorizer` </a> method which does the following:\n",
+"[count vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) method which does the following:\n",
 "\n",
 "* It tokenizes the string(separates the string into individual words) and gives an integer ID to each token.\n",
 "* It counts the occurrance of each of those tokens.\n",
@@ -519,7 +519,9 @@
 {
  "cell_type": "code",
  "execution_count": 9,
- "metadata": {},
+ "metadata": {
+  "collapsed": true
+ },
  "outputs": [],
  "source": [
  "'''\n",
@@ -813,7 +815,7 @@
 "\n",
 "There are a couple of ways to mitigate this. One way is to use the `stop_words` parameter and set its value to `english`. This will automatically ignore all words(from our input text) that are found in a built in list of English stop words in scikit-learn.\n",
 "\n",
-"Another way of mitigating this is by using the <a href = 'http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer'> `sklearn.feature_extraction.text.TfidfVectorizer`</a> method. This method is out of scope for the context of this lesson."
+"Another way of mitigating this is by using the [tfidf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) method. This method is out of scope for the context of this lesson."
 ]
 },
 {
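The paragraph changed in this hunk names two mitigations for common filler words. Below is a brief sketch of both; `stop_words='english'` and `TfidfVectorizer` are real scikit-learn options, but the example documents are invented, and how the notebook ultimately applies them is not shown in this diff.

```python
# Sketch of the two mitigations mentioned above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = ['Win money, win a prize from home!',
             'Are you free for a call tomorrow?']

# 1) Drop common English filler words before counting.
filtered_counts = CountVectorizer(stop_words='english').fit_transform(documents)

# 2) Keep every token but down-weight those that appear in many documents.
tfidf_weights = TfidfVectorizer().fit_transform(documents)

print(filtered_counts.toarray())
print(tfidf_weights.toarray())
```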
@@ -895,7 +897,9 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "metadata": {},
+ "metadata": {
+  "collapsed": true
+ },
  "outputs": [],
  "source": [
  "'''\n",
@@ -914,7 +918,9 @@
 {
  "cell_type": "code",
  "execution_count": 52,
- "metadata": {},
+ "metadata": {
+  "collapsed": true
+ },
  "outputs": [],
  "source": [
  "'''\n",
@@ -1413,7 +1419,9 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "metadata": {},
+ "metadata": {
+  "collapsed": true
+ },
  "outputs": [],
  "source": [
  "'''\n",
@@ -1470,7 +1478,9 @@
 {
  "cell_type": "code",
  "execution_count": 54,
- "metadata": {},
+ "metadata": {
+  "collapsed": true
+ },
  "outputs": [],
  "source": [
  "'''\n",
@@ -1519,7 +1529,9 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "metadata": {},
+ "metadata": {
+  "collapsed": true
+ },
  "outputs": [],
  "source": [
  "'''\n",
