@@ -119,12 +119,12 @@ reference the filenames are also available::

 Let's print the first lines of the first loaded file::

-    >>> print "\n".join(twenty_train.data[0].split("\n")[:3])
+    >>> print("\n".join(twenty_train.data[0].split("\n")[:3]))
     From: [email protected] (Michael Collier)
     Subject: Converting images to HP LaserJet III?
     Nntp-Posting-Host: hampton

-    >>> print twenty_train.target_names[twenty_train.target[0]]
+    >>> print(twenty_train.target_names[twenty_train.target[0]])
     comp.graphics

 Supervised learning algorithms will require a category label for each
@@ -143,7 +143,7 @@ integer id of each sample is stored in the ``target`` attribute::
 It is possible to get back the category names as follows::

     >>> for t in twenty_train.target[:10]:
-    ...     print twenty_train.target_names[t]
+    ...     print(twenty_train.target_names[t])
     ...
     comp.graphics
     comp.graphics
@@ -168,6 +168,8 @@ Extracting features from text files
 In order to perform machine learning on text documents, we first need to
 turn the text content into numerical feature vectors.

+.. currentmodule:: sklearn.feature_extraction.text
+

 Bags of words
 ~~~~~~~~~~~~~
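
To make the bag-of-words idea concrete before the tutorial's own example,
here is a minimal sketch on a toy corpus (the sentences and the
``toy_vect`` name are illustrative, not part of the tutorial)::

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> toy_vect = CountVectorizer()
    >>> X_toy = toy_vect.fit_transform(['the cat sat', 'the dog sat down'])
    >>> sorted(toy_vect.vocabulary_.keys())
    ['cat', 'dog', 'down', 'sat', 'the']
    >>> X_toy.toarray()
    array([[1, 0, 0, 1, 1],
           [0, 1, 1, 1, 1]])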
@@ -212,7 +214,7 @@ dictionary of features and transform documents to feature vectors::
     >>> X_train_counts.shape
     (2257, 35788)

-``CountVectorizer`` supports counts of N-grams of words or consequective characters.
+:class:`CountVectorizer` supports counts of N-grams of words or consecutive characters.
 Once fitted, the vectorizer has built a dictionary of feature indices::

     >>> count_vect.vocabulary_.get(u'algorithm')
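
As a quick illustration of the n-gram support mentioned above (a sketch;
the ``bigram_vect`` name and toy input are assumptions, not tutorial code)::

    >>> bigram_vect = CountVectorizer(ngram_range=(1, 2))
    >>> _ = bigram_vect.fit(['words and bigrams'])
    >>> sorted(bigram_vect.vocabulary_.keys())
    ['and', 'and bigrams', 'bigrams', 'words', 'words and']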
@@ -262,6 +264,14 @@ Both **tf** and **tf–idf** can be computed as follows::
     >>> X_train_tf.shape
     (2257, 35788)

+In the example above, we first use the ``fit(..)`` method to fit the
+estimator to the data, and then the ``transform(..)`` method to transform
+the count matrix to a tf-idf representation.
+These two steps can be combined to achieve the same end result faster,
+skipping the redundant processing, by using the ``fit_transform(..)``
+method as shown below, and as mentioned in the note in the previous
+section.
+
     >>> tfidf_transformer = TfidfTransformer()
     >>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
     >>> X_train_tfidf.shape
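
As a quick sanity check of the claim above, both routes yield the same
matrix; a sketch (the ``X_two_step``/``X_one_step`` names are illustrative)::

    >>> X_two_step = TfidfTransformer().fit(X_train_counts).transform(X_train_counts)
    >>> X_one_step = TfidfTransformer().fit_transform(X_train_counts)
    >>> abs(X_two_step - X_one_step).sum()
    0.0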
@@ -271,7 +281,7 @@ Both **tf** and **tf–idf** can be computed as follows::
 Training a classifier
 ---------------------

-Now that we have our feature, we can train a classifier to try to predict
+Now that we have our features, we can train a classifier to try to predict
 the category of a post. Let's start with a :ref:`naïve Bayes <naive_bayes>`
 classifier, which
 provides a nice baseline for this task. ``scikit-learn`` includes several
@@ -293,7 +303,7 @@ on the transformers, since they have already been fit to the training set::
     >>> predicted = clf.predict(X_new_tfidf)

     >>> for doc, category in zip(docs_new, predicted):
-    ...     print '%r => %s' % (doc, twenty_train.target_names[category])
+    ...     print('%r => %s' % (doc, twenty_train.target_names[category]))
     ...
     'God is love' => soc.religion.christian
     'OpenGL on the GPU is fast' => comp.graphics
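
For context, the ``clf``, ``docs_new`` and ``X_new_tfidf`` used above come
from earlier steps of the tutorial, along these lines (note that only
``transform`` is called on the new documents, since the vectorizer and
transformer were already fit on the training set)::

    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
    >>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
    >>> X_new_counts = count_vect.transform(docs_new)
    >>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)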
@@ -316,7 +326,7 @@ The names ``vect``, ``tfidf`` and ``clf`` (classifier) are arbitrary.
 We shall see their use in the section on grid search, below.
 We can now train the model with a single command::

-    >>> _ = text_clf.fit(twenty_train.data, twenty_train.target)
+    >>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

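
For reference, the ``text_clf`` pipeline fit above is built earlier in the
tutorial roughly as follows (a sketch inferred from the step names ``vect``,
``tfidf`` and ``clf``)::

    >>> from sklearn.pipeline import Pipeline
    >>> text_clf = Pipeline([('vect', CountVectorizer()),
    ...                      ('tfidf', TfidfTransformer()),
    ...                      ('clf', MultinomialNB())])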

 Evaluation of the performance on the test set
@@ -354,8 +364,8 @@ classifier object into our pipeline::
 analysis of the results::

     >>> from sklearn import metrics
-    >>> print metrics.classification_report(twenty_test.target, predicted,
-    ...     target_names=twenty_test.target_names)
+    >>> print(metrics.classification_report(twenty_test.target, predicted,
+    ...     target_names=twenty_test.target_names))
     ...                                         # doctest: +NORMALIZE_WHITESPACE
                  precision    recall  f1-score   support
     <BLANKLINE>
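
Beyond the per-class report, the same ``metrics`` module provides a
confusion matrix for a finer-grained look at the errors (output omitted
here, since it depends on the fitted model)::

    >>> metrics.confusion_matrix(twenty_test.target, predicted)  # doctest: +SKIP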
@@ -444,7 +454,7 @@ we can do::

     >>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
     >>> for param_name in sorted(parameters.keys()):
-    ...     print "%s: %r" % (param_name, best_parameters[param_name])
+    ...     print("%s: %r" % (param_name, best_parameters[param_name]))
     ...
     clf__alpha: 0.001
     tfidf__use_idf: True
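
The ``parameters`` grid and ``gs_clf`` searched above are defined earlier in
the tutorial; a sketch consistent with the output shown (the exact candidate
values are assumptions)::

    >>> from sklearn.grid_search import GridSearchCV
    >>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
    ...               'tfidf__use_idf': (True, False),
    ...               'clf__alpha': (1e-2, 1e-3)}
    >>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)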