Work with imbalanced-learn#
Imbalanced-learn is an open-source, MIT-licensed library that builds on scikit-learn and provides tools for classification with imbalanced classes. In this tutorial, we show how to combine abess.linear.LogisticRegression with imbalanced-learn to handle an imbalanced binary classification task.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from abess.linear import LogisticRegression
from abess.datasets import make_glm_data
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours
Synthetic data#
Generate an imbalanced dataset (X, y). Here, we use make_glm_data to generate a balanced binary dataset data and then keep only the first n/20 = 250 positive samples, which drops about 90% of them. Thus, the imbalance ratio (negative to positive) of our example is around 10:1.
n, p, k = 5000, 2000, 10
random_state = 12345
np.random.seed(random_state)
data = make_glm_data(n=n, p=p, k=k, family='binomial')
idx0 = np.where(data.y == 0)[0]  # indices of negative samples
idx1 = np.where(data.y == 1)[0]  # indices of positive samples
# keep all negatives and only the first n/20 = 250 positives
idx = np.array(list(set(idx0).union(set(idx1[:int(n/20)]))))
X, y = data.x[idx], data.y[idx]
print('Generated dataset has {} positive samples and {} negative samples.'.format(np.sum(y==1), np.sum(y==0)))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
print('Train size: {}, Test size: {}.'.format(len(y_train), len(y_test)))
Generated dataset has 250 positive samples and 2553 negative samples.
Train size: 2102, Test size: 701.
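As a quick arithmetic check on the figures printed above, the imbalance ratio can be recomputed from the two class counts. This is only a sketch: the counts are copied by hand from the output, and the variable names are illustrative rather than part of the tutorial.

```python
# Class counts copied from the printed output above.
n_positive, n_negative = 250, 2553

# Imbalance ratio: majority count divided by minority count.
imbalance_ratio = n_negative / n_positive
# Share of the whole dataset that belongs to the minority class.
minority_share = n_positive / (n_positive + n_negative)

print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")  # Imbalance ratio: 10.2:1
print(f"Minority share: {minority_share:.1%}")      # Minority share: 8.9%
```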
Base estimator#
model = LogisticRegression(support_size=k)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))
Balanced accuracy score: 0.853
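Balanced accuracy is the average of the recall obtained on each class, which is why it is used here instead of plain accuracy. The sketch below reimplements that definition in pure Python on hypothetical toy labels (not the tutorial's data) to show how a classifier that always predicts the majority class scores well on accuracy but poorly on balanced accuracy. The helper name `balanced_accuracy` is illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        # Indices of the samples whose true class is `label`.
        members = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in members if y_pred[i] == label)
        recalls.append(hits / len(members))
    return sum(recalls) / len(recalls)


# Toy labels (hypothetical): 9 negatives and 1 positive.
y_true = [0] * 9 + [1]
always_negative = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
balanced = balanced_accuracy(y_true, always_negative)

print(plain_accuracy)  # 0.9
print(balanced)        # 0.5
```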
Over-sampling#
RandomOverSampler
ros = RandomOverSampler()
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
model = LogisticRegression(support_size=k)
model.fit(X_train_resampled, y_train_resampled)
y_pred = model.predict(X_test)
print('Resampled size: ', len(y_train_resampled))
print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))
Resampled size: 3832
Balanced accuracy score: 0.883
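To make the resampled size concrete, here is a minimal pure-Python sketch of the idea behind random over-sampling: minority rows are drawn with replacement until both classes have the same size. It is not imblearn's implementation, and the function name `random_over_sample` is hypothetical. The class counts (1916 negatives, 186 positives) are inferred from the printed sizes, assuming the classes end up exactly balanced: 3832 / 2 = 1916 and 2102 - 1916 = 186.

```python
import random


def random_over_sample(X, y, seed=0):
    """Duplicate minority rows at random until every class has equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Keep every original row, then draw extra copies with replacement.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data with the inferred training-split class counts.
X_toy = [[i] for i in range(2102)]
y_toy = [0] * 1916 + [1] * 186

X_res, y_res = random_over_sample(X_toy, y_toy)
print(len(y_res))  # 3832
```

Because the extra rows are exact copies, no new information is added; the classifier simply sees each minority sample more often.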
SMOTE
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)
model = LogisticRegression(support_size=k)
model.fit(X_train_resampled, y_train_resampled)
y_pred = model.predict(X_test)
print('Resampled size: ', len(y_train_resampled))
print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))
Resampled size: 3832
Balanced accuracy score: 0.897
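Unlike random over-sampling, SMOTE creates new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step, in pure Python, with toy coordinates; the neighbour search is omitted and the helper name `smote_point` is hypothetical.

```python
import random


def smote_point(x, neighbour, rng):
    """Return a random point on the segment between x and a neighbour."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbour)]


rng = random.Random(0)
x = [1.0, 2.0]          # a minority sample (toy values)
neighbour = [3.0, 6.0]  # one of its minority-class neighbours (toy values)

synthetic = smote_point(x, neighbour, rng)
print(synthetic)  # a point lying between x and neighbour
```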
ADASYN
X_train_resampled, y_train_resampled = ADASYN().fit_resample(X_train, y_train)
model = LogisticRegression(support_size=k)
model.fit(X_train_resampled, y_train_resampled)
y_pred = model.predict(X_test)
print('Resampled size: ', len(y_train_resampled))
print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))
Resampled size: 3792
Balanced accuracy score: 0.898
Under-sampling#
RandomUnderSampler
rus = RandomUnderSampler()
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)
model = LogisticRegression(support_size=k)
model.fit(X_train_resampled, y_train_resampled)
y_pred = model.predict(X_test)
print('Resampled size: ', len(y_train_resampled))
print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))
Resampled size: 372
Balanced accuracy score: 0.892
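The much smaller resampled size follows from how random under-sampling works: majority rows are discarded at random until the classes are the same size. Below is a minimal pure-Python sketch of that idea, not imblearn's implementation; the function name `random_under_sample` is hypothetical. The class counts (186 positives, 1916 negatives) are inferred from the printed sizes, assuming exact balancing: 372 / 2 = 186 and 2102 - 186 = 1916.

```python
import random


def random_under_sample(X, y, seed=0):
    """Randomly drop rows so every class shrinks to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Sample without replacement so no row appears twice.
        kept = rng.sample(rows, target)
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data with the inferred training-split class counts.
X_toy = [[i] for i in range(2102)]
y_toy = [0] * 1916 + [1] * 186

X_res, y_res = random_under_sample(X_toy, y_toy)
print(len(y_res))  # 372
```

The trade-off is that most majority rows are thrown away, so the fitted model sees far less data than with over-sampling.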
EditedNearestNeighbours
enn = EditedNearestNeighbours(kind_sel='all')
X_train_resampled, y_train_resampled = enn.fit_resample(X_train, y_train)
model = LogisticRegression(support_size=k)
model.fit(X_train_resampled, y_train_resampled)
y_pred = model.predict(X_test)
print('Resampled size: ', len(y_train_resampled))
print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))
Resampled size: 1650
Balanced accuracy score: 0.872
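EditedNearestNeighbours is a cleaning method rather than a balancing one, which is why the resampled size above is not twice the minority count. With kind_sel='all', a majority sample is kept only if all of its nearest neighbours share its label. The following pure-Python sketch illustrates that rule on 1-D toy data; it is not imblearn's implementation, the function name is hypothetical, and n_neighbors=3 is chosen here for illustration.

```python
def edited_nearest_neighbours(X, y, majority_label, n_neighbors=3):
    """Drop majority samples whose nearest neighbours do not all share their label."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority samples are never removed
            continue
        # Indices of the other samples, closest first (1-D absolute distance).
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: abs(X[j] - xi),
        )
        neighbours = others[:n_neighbors]
        if all(y[j] == yi for j in neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy 1-D data: the majority point at 9.0 sits next to the minority cluster.
X_toy = [0.0, 1.0, 2.0, 3.0, 4.0, 9.0, 10.0, 10.5, 11.0]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_res, y_res = edited_nearest_neighbours(X_toy, y_toy, majority_label=0)
print(X_res)  # [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 10.5, 11.0]
print(y_res)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Only the majority point at 9.0 is removed, because its three closest neighbours all belong to the minority class.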
Pipeline#
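In the following, we show how to construct a pipeline. Note that the Pipeline implemented by sklearn requires all intermediate steps to be transformers, and the resamplers in imblearn are not transformers. We therefore use the Pipeline implemented by imblearn, imported here as imbPipeline. Because the random resamplers are created without a fixed random_state, their scores can differ slightly from the ones reported in the previous sections.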
from imblearn.pipeline import Pipeline as imbPipeline
resamplers = {
    'RandomOverSampler': RandomOverSampler,
    'SMOTE': SMOTE,
    'ADASYN': ADASYN,
    'RandomUnderSampler': RandomUnderSampler,
    'EditedNearestNeighbours': EditedNearestNeighbours
}
for name in resamplers.keys():
    resampler = resamplers[name]()
    estimators = [('resampler', resampler), ('clf', LogisticRegression(support_size=k))]
    pipe = imbPipeline(estimators)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print('{}: {}'.format(name, balanced_accuracy_score(y_test, y_pred).round(3)))
RandomOverSampler: 0.89
SMOTE: 0.897
ADASYN: 0.898
RandomUnderSampler: 0.88
EditedNearestNeighbours: 0.872
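The reason a resampler cannot act as an ordinary transformer is that it changes both X and y, and it should run only while fitting, never on the data being predicted. The toy classes below are hypothetical stand-ins (not imblearn's code) that sketch this behaviour: the resampling step is called during fit and skipped during predict.

```python
class ToyResamplingPipeline:
    """Hypothetical stand-in showing when a resampling step runs."""

    def __init__(self, resampler, classifier):
        self.resampler = resampler
        self.classifier = classifier

    def fit(self, X, y):
        # Resampling alters both X and y, so it is applied at fit time only.
        X_res, y_res = self.resampler.fit_resample(X, y)
        self.classifier.fit(X_res, y_res)
        return self

    def predict(self, X):
        # Data to predict goes straight to the classifier, untouched.
        return self.classifier.predict(X)


class KeepFirstPerClass:
    """Toy resampler: keep at most `n` rows per class and count its calls."""

    def __init__(self, n):
        self.n = n
        self.calls = 0

    def fit_resample(self, X, y):
        self.calls += 1
        seen, X_out, y_out = {}, [], []
        for row, label in zip(X, y):
            if seen.get(label, 0) < self.n:
                seen[label] = seen.get(label, 0) + 1
                X_out.append(row)
                y_out.append(label)
        return X_out, y_out


class MajorityClassifier:
    """Toy classifier: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.n_seen_ = len(y)
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)


pipe = ToyResamplingPipeline(KeepFirstPerClass(2), MajorityClassifier())
pipe.fit([[0], [1], [2], [3], [4]], [0, 0, 0, 0, 1])
predictions = pipe.predict([[5], [6], [7]])

print(pipe.classifier.n_seen_)  # 3 rows reached the classifier during fit
print(pipe.resampler.calls)     # 1, the resampler was not called by predict
print(len(predictions))         # 3, every row to predict got a label
```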