Merge pull request scikit-learn#6285 from yenchenlin1994/update-DictVectorizer-doc-about-one-hot-encoding

amueller · amueller · commit a8e7e480b155 · 2016-02-09T15:27:15.000-05:00
[MRG+1] Doc Add doc in DictVectorizer when categorical features are numeric values (fixes scikit-learn#4413)
diff --git a/sklearn/feature_extraction/dict_vectorizer.py b/sklearn/feature_extraction/dict_vectorizer.py
@@ -37,6 +37,11 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
     a feature "f" that can take on the values "ham" and "spam" will become two
     features in the output, one signifying "f=ham", the other "f=spam".
 
+    However, note that this transformer will only do a binary one-hot encoding
+    when feature values are of type string. If categorical features are
+    represented as numeric values such as int, the DictVectorizer can be
+    followed by OneHotEncoder to complete binary one-hot encoding.
+
     Features that do not occur in a sample (mapping) will have a zero value
     in the resulting array/matrix.
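
The behaviour described in the added docstring paragraph can be checked with a small sketch. The sample dictionaries and variable names below are made up for illustration and are not part of this pull request. The sketch assumes a scikit-learn version in which `OneHotEncoder` accepts the float column produced by `DictVectorizer` and returns a sparse matrix by default.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical samples: the same categorical feature "f", once encoded
# as strings and once encoded as integers.
string_samples = [{"f": "ham"}, {"f": "spam"}]
numeric_samples = [{"f": 1}, {"f": 3}]

# String values: DictVectorizer creates one binary column per value.
vec_str = DictVectorizer(sparse=False)
X_str = vec_str.fit_transform(string_samples)

# Numeric values: DictVectorizer keeps a single column holding the raw numbers.
vec_num = DictVectorizer(sparse=False)
X_num = vec_num.fit_transform(numeric_samples)

# Following up with OneHotEncoder turns that numeric column into
# one binary column per distinct value.
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X_num).toarray()

print(vec_str.feature_names_, X_str.tolist())
print(vec_num.feature_names_, X_num.tolist())
print(X_onehot.tolist())
```

With string values the vectorizer reports two features, `f=ham` and `f=spam`. With integer values it reports the single feature `f`, and the one-hot columns appear only after the `OneHotEncoder` step.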