@@ -119,12 +119,12 @@ reference the filenames are also available::

 Let's print the first lines of the first loaded file::

-    >>> print "\n".join(twenty_train.data[0].split("\n")[:3])
+    >>> print("\n".join(twenty_train.data[0].split("\n")[:3]))
     From: [email protected] (Michael Collier)
     Subject: Converting images to HP LaserJet III?
     Nntp-Posting-Host: hampton

-    >>> print twenty_train.target_names[twenty_train.target[0]]
+    >>> print(twenty_train.target_names[twenty_train.target[0]])
     comp.graphics

 Supervised learning algorithms will require a category label for each
@@ -143,7 +143,7 @@ integer id of each sample is stored in the ``target`` attribute::
 It is possible to get back the category names as follows::

     >>> for t in twenty_train.target[:10]:
-    ...     print twenty_train.target_names[t]
+    ...     print(twenty_train.target_names[t])
     ...
     comp.graphics
     comp.graphics
@@ -168,6 +168,8 @@ Extracting features from text files
 In order to perform machine learning on text documents, we first need to
 turn the text content into numerical feature vectors.

+.. currentmodule:: sklearn.feature_extraction.text
+

 Bags of words
 ~~~~~~~~~~~~~
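
To make the bag-of-words idea concrete before the tutorial's own example,
here is a minimal sketch on a toy corpus (the sentences and the
``toy_vect`` name are illustrative, not part of the tutorial)::

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> toy_vect = CountVectorizer()
    >>> X_toy = toy_vect.fit_transform(['the cat sat', 'the dog sat down'])
    >>> sorted(toy_vect.vocabulary_.keys())
    ['cat', 'dog', 'down', 'sat', 'the']
    >>> X_toy.toarray()
    array([[1, 0, 0, 1, 1],
           [0, 1, 1, 1, 1]])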
@@ -212,7 +214,7 @@ dictionary of features and transform documents to feature vectors::
     >>> X_train_counts.shape
     (2257, 35788)

-``CountVectorizer`` supports counts of N-grams of words or consequective characters.
+:class:`CountVectorizer` supports counts of N-grams of words or consecutive characters.
 Once fitted, the vectorizer has built a dictionary of feature indices::

     >>> count_vect.vocabulary_.get(u'algorithm')
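
As a quick illustration of the n-gram support mentioned above (a sketch;
the ``bigram_vect`` name and toy input are assumptions, not tutorial code)::

    >>> bigram_vect = CountVectorizer(ngram_range=(1, 2))
    >>> _ = bigram_vect.fit(['words and bigrams'])
    >>> sorted(bigram_vect.vocabulary_.keys())
    ['and', 'and bigrams', 'bigrams', 'words', 'words and']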
@@ -262,6 +264,14 @@ Both **tf** and **tf–idf** can be computed as follows::
     >>> X_train_tf.shape
     (2257, 35788)

+In the example above, we first use the ``fit(..)`` method to fit the
+estimator to the data, and then the ``transform(..)`` method to transform
+the count matrix to a tf-idf representation.
+These two steps can be combined to achieve the same end result faster,
+skipping the redundant processing, by using the ``fit_transform(..)``
+method as shown below, and as mentioned in the note in the previous
+section.
+
     >>> tfidf_transformer = TfidfTransformer()
     >>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
     >>> X_train_tfidf.shape
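
As a quick sanity check of the claim above, both routes yield the same
matrix; a sketch (the ``X_two_step``/``X_one_step`` names are illustrative)::

    >>> X_two_step = TfidfTransformer().fit(X_train_counts).transform(X_train_counts)
    >>> X_one_step = TfidfTransformer().fit_transform(X_train_counts)
    >>> abs(X_two_step - X_one_step).sum()
    0.0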
@@ -271,7 +281,7 @@ Both **tf** and **tf–idf** can be computed as follows::
 Training a classifier
 ---------------------

-Now that we have our feature, we can train a classifier to try to predict
+Now that we have our features, we can train a classifier to try to predict
 the category of a post. Let's start with a :ref:`naïve Bayes <naive_bayes>`
 classifier, which
 provides a nice baseline for this task. ``scikit-learn`` includes several
@@ -293,7 +303,7 @@ on the transformers, since they have already been fit to the training set::
     >>> predicted = clf.predict(X_new_tfidf)

     >>> for doc, category in zip(docs_new, predicted):
-    ...     print '%r => %s' % (doc, twenty_train.target_names[category])
+    ...     print('%r => %s' % (doc, twenty_train.target_names[category]))
     ...
     'God is love' => soc.religion.christian
     'OpenGL on the GPU is fast' => comp.graphics
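
For context, the ``clf``, ``docs_new`` and ``X_new_tfidf`` used above come
from earlier steps of the tutorial, along these lines (note that only
``transform`` is called on the new documents, since the vectorizer and
transformer were already fit on the training set)::

    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
    >>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
    >>> X_new_counts = count_vect.transform(docs_new)
    >>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)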
@@ -316,7 +326,7 @@ The names ``vect``, ``tfidf`` and ``clf`` (classifier) are arbitrary.
 We shall see their use in the section on grid search, below.
 We can now train the model with a single command::

-    >>> _ = text_clf.fit(twenty_train.data, twenty_train.target)
+    >>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

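
For reference, the ``text_clf`` pipeline fit above is built earlier in the
tutorial roughly as follows (a sketch inferred from the step names ``vect``,
``tfidf`` and ``clf``)::

    >>> from sklearn.pipeline import Pipeline
    >>> text_clf = Pipeline([('vect', CountVectorizer()),
    ...                      ('tfidf', TfidfTransformer()),
    ...                      ('clf', MultinomialNB())])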

 Evaluation of the performance on the test set
@@ -354,8 +364,8 @@ classifier object into our pipeline::
 analysis of the results::

     >>> from sklearn import metrics
-    >>> print metrics.classification_report(twenty_test.target, predicted,
-    ...     target_names=twenty_test.target_names)
+    >>> print(metrics.classification_report(twenty_test.target, predicted,
+    ...     target_names=twenty_test.target_names))
     ...                                         # doctest: +NORMALIZE_WHITESPACE
                  precision    recall  f1-score   support
     <BLANKLINE>
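
Beyond the per-class report, the same ``metrics`` module provides a
confusion matrix for a finer-grained look at the errors (output omitted
here, since it depends on the fitted model)::

    >>> metrics.confusion_matrix(twenty_test.target, predicted)  # doctest: +SKIP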
@@ -444,7 +454,7 @@ we can do::

     >>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
     >>> for param_name in sorted(parameters.keys()):
-    ...     print "%s: %r" % (param_name, best_parameters[param_name])
+    ...     print("%s: %r" % (param_name, best_parameters[param_name]))
     ...
     clf__alpha: 0.001
     tfidf__use_idf: True
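
The ``parameters`` grid and ``gs_clf`` searched above are defined earlier in
the tutorial; a sketch consistent with the output shown (the exact candidate
values are assumptions)::

    >>> from sklearn.grid_search import GridSearchCV
    >>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
    ...               'tfidf__use_idf': (True, False),
    ...               'clf__alpha': (1e-2, 1e-3)}
    >>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)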