
Commit 2d4902e

ddudnik authored and glemaitre committed
EHN: implementation of SMOTE-NC for continuous and categorical mixed types (#412)
1 parent 444707b commit 2d4902e


9 files changed: +570 -22 lines changed


doc/api.rst

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@ Prototype selection
    over_sampling.ADASYN
    over_sampling.RandomOverSampler
    over_sampling.SMOTE
+   over_sampling.SMOTENC
 
 
 .. _combine_ref:

doc/over_sampling.rst

Lines changed: 45 additions & 2 deletions
@@ -160,6 +160,44 @@ some variant of the SMOTE algorithm::
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 4674), (1, 4674), (2, 4674)]
 
+When dealing with mixed data types such as continuous and categorical features,
+none of the presented methods (apart from the class :class:`RandomOverSampler`)
+can deal with the categorical features. :class:`SMOTENC` [CBHK2002]_ is an
+extension of the :class:`SMOTE` algorithm for which categorical data are
+treated differently::
+
+    >>> # create a synthetic data set with continuous and categorical features
+    >>> rng = np.random.RandomState(42)
+    >>> n_samples = 50
+    >>> X = np.empty((n_samples, 3), dtype=object)
+    >>> X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
+    >>> X[:, 1] = rng.randn(n_samples)
+    >>> X[:, 2] = rng.randint(3, size=n_samples)
+    >>> y = np.array([0] * 20 + [1] * 30)
+    >>> print(sorted(Counter(y).items()))
+    [(0, 20), (1, 30)]
+
+In this data set, the first and last features are considered categorical
+features. One needs to provide this information to :class:`SMOTENC` via the
+parameter ``categorical_features``, either by passing the indices of these
+features or a boolean mask marking these features::
+
+    >>> from imblearn.over_sampling import SMOTENC
+    >>> smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
+    >>> X_resampled, y_resampled = smote_nc.fit_resample(X, y)
+    >>> print(sorted(Counter(y_resampled).items()))
+    [(0, 30), (1, 30)]
+    >>> print(X_resampled[-5:])
+    [['B' 0.1989993778979113 0]
+     ['A' -0.3657680728116921 1]
+     ['B' 0.8790828729585258 0]
+     ['A' 0.3710891618824609 0]
+     ['A' 0.3327240726719727 0]]
+
+Therefore, it can be seen that the samples generated in the first and last
+columns belong to the same categories originally presented, without any
+other extra interpolation.
+
 .. topic:: References
 
    .. [HWB2005] H. Han, W. Wen-Yuan, M. Bing-Huan, "Borderline-SMOTE: a new
@@ -198,8 +236,13 @@ interpolation will create a sample on the line between :math:`x_{i}` and
    :scale: 60
    :align: center
 
-Each SMOTE variant and ADASYN differ from each other by selecting the samples
-:math:`x_i` ahead of generating the new samples.
+SMOTE-NC slightly changes the way a new sample is generated by treating the
+categorical features specifically. In fact, the categories of a newly
+generated sample are decided by picking the most frequent category among the
+nearest neighbors present during the generation.
+
+The other SMOTE variants and ADASYN differ from each other in how they select
+the samples :math:`x_i` ahead of generating the new samples.
 
 The **regular** SMOTE algorithm --- cf. to the :class:`SMOTE` object --- does not
 impose any rule and will randomly pick-up all possible :math:`x_i` available.
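
Note on the hunk above: the category selection it describes can be sketched in plain Python. This is only an illustration of the idea stated in the added documentation (most frequent category among the nearest neighbors, regular interpolation for continuous features), not the code added by this commit. The helper `generate_sample` and its arguments are hypothetical names, and the neighbors are assumed to be already computed.

```python
from collections import Counter
import random


def generate_sample(sample, neighbors, categorical_idx, rng):
    """Build one synthetic sample (hypothetical helper, not the imblearn code)."""
    # As in regular SMOTE, pick one neighbor and a step in [0, 1).
    target = rng.choice(neighbors)
    step = rng.random()
    new_sample = []
    for j, value in enumerate(sample):
        if j in categorical_idx:
            # Categorical feature: most frequent category among the neighbors.
            votes = Counter(neighbor[j] for neighbor in neighbors)
            new_sample.append(votes.most_common(1)[0][0])
        else:
            # Continuous feature: interpolate between the sample and the neighbor.
            new_sample.append(value + step * (target[j] - value))
    return new_sample


rng = random.Random(0)
sample = ['A', 0.5, 0]
neighbors = [['B', 1.0, 0], ['B', 0.0, 1], ['A', 2.0, 0]]
new_sample = generate_sample(sample, neighbors, {0, 2}, rng)
print(new_sample)
```

The first and last entries of `new_sample` are always existing categories, whereas the middle entry is a new interpolated value. This sketch assumes the vote is taken over the neighbors only; whether the sample itself takes part is not stated in the text above.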

doc/whats_new/v0.0.4.rst

Lines changed: 6 additions & 0 deletions
@@ -41,6 +41,12 @@ New features
   under-sampling stage before each boosting iteration of AdaBoost.
   :issue:`469` by :user:`Guillaume Lemaitre <glemaitre>`.
 
+- Add :class:`imblearn.over_sampling.SMOTENC` which generates synthetic samples
+  on data sets with heterogeneous data types (continuous and categorical
+  features).
+  :issue:`412` by :user:`Denis Dudnik <ddudnik>` and
+  :user:`Guillaume Lemaitre <glemaitre>`.
+
 Enhancement
 ...........
 

examples/over-sampling/plot_comparison_over_sampling.py

Lines changed: 26 additions & 1 deletion
@@ -21,7 +21,7 @@
 
 from imblearn.pipeline import make_pipeline
 from imblearn.over_sampling import ADASYN
-from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
+from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, SMOTENC
 from imblearn.over_sampling import RandomOverSampler
 from imblearn.base import BaseSampler
 

@@ -226,4 +226,29 @@ def _fit_resample(self, X, y):
 ax[1].set_title('Resampling using {}'.format(sampler.__class__.__name__))
 fig.tight_layout()
 
+###############################################################################
+# When dealing with a mix of continuous and categorical features, SMOTE-NC
+# is the only SMOTE variant shown here which can handle this case.
+
+# create a synthetic data set with continuous and categorical features
+rng = np.random.RandomState(42)
+n_samples = 50
+X = np.empty((n_samples, 3), dtype=object)
+X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
+X[:, 1] = rng.randn(n_samples)
+X[:, 2] = rng.randint(3, size=n_samples)
+y = np.array([0] * 20 + [1] * 30)
+
+print('The original imbalanced dataset')
+print(sorted(Counter(y).items()))
+print('The first and last columns contain categorical features:')
+print(X[:5])
+
+smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
+X_resampled, y_resampled = smote_nc.fit_resample(X, y)
+print('Dataset after resampling:')
+print(sorted(Counter(y_resampled).items()))
+print('SMOTE-NC will generate categories for the categorical features:')
+print(X_resampled[-5:])
+
 plt.show()

imblearn/over_sampling/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -8,6 +8,7 @@
 from ._smote import SMOTE
 from ._smote import BorderlineSMOTE
 from ._smote import SVMSMOTE
+from ._smote import SMOTENC
 
 __all__ = ['ADASYN', 'RandomOverSampler',
-           'SMOTE', 'BorderlineSMOTE', 'SVMSMOTE']
+           'SMOTE', 'BorderlineSMOTE', 'SVMSMOTE', 'SMOTENC']
