@@ -160,6 +160,44 @@ some variant of the SMOTE algorithm::
    >>> print(sorted(Counter(y_resampled).items()))
    [(0, 4674), (1, 4674), (2, 4674)]

+ When dealing with mixed data types such as continuous and categorical features,
+ none of the presented methods (apart from the class :class:`RandomOverSampler`)
+ can deal with the categorical features. The :class:`SMOTENC` [CBHK2002]_ is an
+ extension of the :class:`SMOTE` algorithm in which categorical data are
+ treated differently::
+
+     >>> # create a synthetic data set with continuous and categorical features
+     >>> rng = np.random.RandomState(42)
+     >>> n_samples = 50
+     >>> X = np.empty((n_samples, 3), dtype=object)
+     >>> X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
+     >>> X[:, 1] = rng.randn(n_samples)
+     >>> X[:, 2] = rng.randint(3, size=n_samples)
+     >>> y = np.array([0] * 20 + [1] * 30)
+     >>> print(sorted(Counter(y).items()))
+     [(0, 20), (1, 30)]
+
+ In this data set, the first and last features are considered categorical
+ features. One needs to provide this information to :class:`SMOTENC` via the
+ parameter ``categorical_features``, either by passing the indices of these
+ features or a boolean mask marking these features::
+
+     >>> from imblearn.over_sampling import SMOTENC
+     >>> smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
+     >>> X_resampled, y_resampled = smote_nc.fit_resample(X, y)
+     >>> print(sorted(Counter(y_resampled).items()))
+     [(0, 30), (1, 30)]
+     >>> print(X_resampled[-5:])
+     [['B' 0.1989993778979113 0]
+      ['A' -0.3657680728116921 1]
+      ['B' 0.8790828729585258 0]
+      ['A' 0.3710891618824609 0]
+      ['A' 0.3327240726719727 0]]
+
+ Therefore, it can be seen that the samples generated in the first and last
+ columns belong to the same categories originally present, without any
+ extra interpolation.
+
.. topic:: References

   .. [HWB2005] H. Han, W. Wen-Yuan, M. Bing-Huan, "Borderline-SMOTE: a new
@@ -198,8 +236,13 @@ interpolation will create a sample on the line between :math:`x_{i}` and
   :scale: 60
   :align: center

- Each SMOTE variant and ADASYN differ from each other by selecting the samples
- :math:`x_i` ahead of generating the new samples.
+ SMOTE-NC slightly changes the way a new sample is generated by handling the
+ categorical features separately. The categories of a newly generated sample
+ are decided by picking the most frequent category among the nearest neighbors
+ used during the generation.
+
+ The other SMOTE variants and ADASYN differ from each other in how they select
+ the samples :math:`x_i` ahead of generating the new samples.

The **regular** SMOTE algorithm --- cf. to the :class:`SMOTE` object --- does not
impose any rule and will randomly pick-up all possible :math:`x_i` available.
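
Note on the SMOTE-NC paragraph added in this diff: the rule it describes (interpolate the continuous features, take the most frequent category among the nearest neighbors for the categorical ones) can be illustrated with a small pure-Python sketch. This is not the ``SMOTENC`` implementation. The helper names ``most_frequent_category`` and ``generate_sample`` are hypothetical, the neighbors are supplied by hand rather than found by a nearest-neighbor search, and ties are broken arbitrarily.

```python
from collections import Counter
import random


def most_frequent_category(neighbor_categories):
    """Return the most common category among the neighbors.

    Ties go to the category seen first, an arbitrary choice for this sketch.
    """
    return Counter(neighbor_categories).most_common(1)[0][0]


def generate_sample(x_i, neighbors, categorical_idx, rng):
    """Build one synthetic sample from x_i and its (given) nearest neighbors.

    Continuous features are interpolated between x_i and one randomly chosen
    neighbor; categorical features take the majority category of all neighbors.
    """
    x_zi = rng.choice(neighbors)  # neighbor used for the interpolation
    lam = rng.random()            # interpolation factor in [0, 1)
    new_sample = []
    for j, value in enumerate(x_i):
        if j in categorical_idx:
            votes = [neighbor[j] for neighbor in neighbors]
            new_sample.append(most_frequent_category(votes))
        else:
            new_sample.append(value + lam * (x_zi[j] - value))
    return new_sample


rng = random.Random(0)
x_i = ['A', 0.5, 0]
neighbors = [['A', 0.7, 0], ['B', 0.1, 0], ['A', 0.9, 1]]
sample = generate_sample(x_i, neighbors, categorical_idx={0, 2}, rng=rng)
print(sample[0], sample[2])  # A 0
```

The categorical columns of the result can only take values already present among the neighbors, which matches the observation in the diff that no new categories appear in the first and last columns of ``X_resampled``.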