.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/5-scikit-learn-connection/plot_6_imbalanced_learn.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_5-scikit-learn-connection_plot_6_imbalanced_learn.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_5-scikit-learn-connection_plot_6_imbalanced_learn.py:

===========================
Work with imbalanced-learn
===========================

``imbalanced-learn`` is an open-source, MIT-licensed library that relies on
scikit-learn and provides tools for classification with imbalanced classes.
In this tutorial, we show how to combine ``abess.linear.LogisticRegression``
with ``imbalanced-learn`` to handle an imbalanced binary classification task.

.. GENERATED FROM PYTHON SOURCE LINES 12-23

.. code-block:: Python

    import warnings
    warnings.filterwarnings('ignore')

    import numpy as np
    from abess.linear import LogisticRegression
    from abess.datasets import make_glm_data
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import balanced_accuracy_score
    from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
    from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours

.. GENERATED FROM PYTHON SOURCE LINES 24-26

Synthetic data
---------------

.. GENERATED FROM PYTHON SOURCE LINES 29-32

Generate an imbalanced dataset ``(X, y)``. Here, we use ``make_glm_data`` to
generate a balanced binary dataset ``data`` and then drop about 90% of the
positive samples, so the imbalance ratio of our example is around 10:1.

.. GENERATED FROM PYTHON SOURCE LINES 32-45

.. code-block:: Python

    n, p, k = 5000, 2000, 10
    random_state = 12345
    np.random.seed(random_state)

    data = make_glm_data(n=n, p=p, k=k, family='binomial')
    idx0 = np.where(data.y == 0)[0]  # indices of negative samples
    idx1 = np.where(data.y == 1)[0]  # indices of positive samples
    # keep all negative samples but only the first n/20 = 250 positive ones
    idx = np.array(list(set(idx0).union(set(idx1[:int(n / 20)]))))
    X, y = data.x[idx], data.y[idx]
    print('Generated dataset has {} positive samples and {} negative samples.'.format(np.sum(y == 1), np.sum(y == 0)))

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    print('Train size: {}, Test size: {}.'.format(len(y_train), len(y_test)))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Generated dataset has 250 positive samples and 2553 negative samples.
    Train size: 2102, Test size: 701.

.. GENERATED FROM PYTHON SOURCE LINES 46-48

Base estimator
---------------

First, we fit ``abess.linear.LogisticRegression`` on the raw training set as a
baseline. We report the balanced accuracy (the unweighted mean of per-class
recall) throughout, since plain accuracy is dominated by the majority class.

.. GENERATED FROM PYTHON SOURCE LINES 48-54

.. code-block:: Python

    model = LogisticRegression(support_size=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Balanced accuracy score: 0.853

.. GENERATED FROM PYTHON SOURCE LINES 55-57

Over-sampling
--------------

.. GENERATED FROM PYTHON SOURCE LINES 60-61

``RandomOverSampler``

.. GENERATED FROM PYTHON SOURCE LINES 61-70

.. code-block:: Python

    ros = RandomOverSampler()
    X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 3832
    Balanced accuracy score: 0.883

.. GENERATED FROM PYTHON SOURCE LINES 71-72

``SMOTE``

.. GENERATED FROM PYTHON SOURCE LINES 72-81

.. code-block:: Python

    X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 3832
    Balanced accuracy score: 0.897

.. GENERATED FROM PYTHON SOURCE LINES 82-83

``ADASYN``

.. GENERATED FROM PYTHON SOURCE LINES 83-92

.. code-block:: Python

    X_train_resampled, y_train_resampled = ADASYN().fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 3792
    Balanced accuracy score: 0.898
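By default, these over-samplers resample until the two classes are balanced.
All of them also accept a ``sampling_strategy`` argument; for a binary
problem, a float specifies the desired ratio of minority to majority samples
after resampling. The following is a minimal sketch (not part of the
generated script; the names ``X_half`` and ``y_half`` are illustrative) that
requests a 2:1 ratio instead of full balance.

.. code-block:: Python

    # Illustrative only: sampling_strategy=0.5 asks for
    # N_minority / N_majority = 0.5 after resampling (binary problems only).
    ros = RandomOverSampler(sampling_strategy=0.5, random_state=random_state)
    X_half, y_half = ros.fit_resample(X_train, y_train)
    print('Minority/majority ratio: {:.2f}'.format(np.sum(y_half == 1) / np.sum(y_half == 0)))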
.. GENERATED FROM PYTHON SOURCE LINES 93-95

Under-sampling
----------------

.. GENERATED FROM PYTHON SOURCE LINES 98-99

``RandomUnderSampler``

.. GENERATED FROM PYTHON SOURCE LINES 99-109

.. code-block:: Python

    rus = RandomUnderSampler()
    X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 372
    Balanced accuracy score: 0.892

.. GENERATED FROM PYTHON SOURCE LINES 110-111

``EditedNearestNeighbours``

.. GENERATED FROM PYTHON SOURCE LINES 111-121

.. code-block:: Python

    enn = EditedNearestNeighbours(kind_sel='all')
    X_train_resampled, y_train_resampled = enn.fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 1650
    Balanced accuracy score: 0.872

.. GENERATED FROM PYTHON SOURCE LINES 122-124

Pipeline
---------

.. GENERATED FROM PYTHON SOURCE LINES 126-131

Finally, we show how to construct a pipeline. Note that the ``Pipeline``
implemented by scikit-learn requires all intermediate estimators to be
transformers, and the resamplers in ``imblearn`` are not transformers.
Therefore, we use the ``Pipeline`` implemented by ``imblearn`` here, which
also accepts resamplers as intermediate steps.

.. GENERATED FROM PYTHON SOURCE LINES 131-149

.. code-block:: Python

    from imblearn.pipeline import Pipeline as imbPipeline

    resamplers = {
        'RandomOverSampler': RandomOverSampler,
        'SMOTE': SMOTE,
        'ADASYN': ADASYN,
        'RandomUnderSampler': RandomUnderSampler,
        'EditedNearestNeighbours': EditedNearestNeighbours
    }

    for name, Resampler in resamplers.items():
        estimators = [('resampler', Resampler()), ('clf', LogisticRegression(support_size=k))]
        pipe = imbPipeline(estimators)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        print('{}: {}'.format(name, balanced_accuracy_score(y_test, y_pred).round(3)))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    RandomOverSampler: 0.89
    SMOTE: 0.897
    ADASYN: 0.898
    RandomUnderSampler: 0.88
    EditedNearestNeighbours: 0.872
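Over- and under-sampling can also be combined. As a minimal sketch (not part
of the generated script), ``imblearn.combine.SMOTEENN`` chains SMOTE
over-sampling with edited-nearest-neighbours cleaning; since it exposes the
same ``fit_resample`` interface, it works as a drop-in resampler step in the
pipeline above.

.. code-block:: Python

    from imblearn.combine import SMOTEENN

    # SMOTEENN = SMOTE over-sampling followed by ENN cleaning; it can replace
    # any of the single resamplers used above.
    pipe = imbPipeline([('resampler', SMOTEENN(random_state=random_state)),
                        ('clf', LogisticRegression(support_size=k))])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print('SMOTEENN: ', balanced_accuracy_score(y_test, y_pred).round(3))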
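Wrapping the resampler in a pipeline has a further benefit: samplers act only
when the pipeline is fitted, so during cross-validation the resampling is
applied only to the training folds, while validation folds keep the original
class distribution. A minimal sketch with scikit-learn's ``cross_val_score``
(the choice of SMOTE and five folds is an arbitrary assumption, not taken
from the original script):

.. code-block:: Python

    from sklearn.model_selection import cross_val_score

    # Resampling happens inside each training fold only; the held-out folds
    # retain the roughly 10:1 imbalance of the original data.
    pipe = imbPipeline([('resampler', SMOTE(random_state=random_state)),
                        ('clf', LogisticRegression(support_size=k))])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='balanced_accuracy')
    print('CV balanced accuracy: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))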
.. GENERATED FROM PYTHON SOURCE LINES 150-150

.. sphinx_gallery_thumbnail_path = 'Tutorial/figure/imbalanced-learn.png'

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 10.010 seconds)

.. _sphx_glr_download_auto_gallery_5-scikit-learn-connection_plot_6_imbalanced_learn.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_6_imbalanced_learn.ipynb <plot_6_imbalanced_learn.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_6_imbalanced_learn.py <plot_6_imbalanced_learn.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_