.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/5-scikit-learn-connection/plot_6_imbalanced_learn.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_5-scikit-learn-connection_plot_6_imbalanced_learn.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_5-scikit-learn-connection_plot_6_imbalanced_learn.py:

===========================
Work with imbalanced-learn
===========================

``imbalanced-learn`` is an open-source, MIT-licensed library that relies on
scikit-learn and provides tools for classification with imbalanced classes.
In this tutorial, we show how to combine ``abess.linear.LogisticRegression``
with ``imbalanced-learn`` to handle an imbalanced binary classification task.

.. GENERATED FROM PYTHON SOURCE LINES 12-23

.. code-block:: Python

    import warnings
    warnings.filterwarnings('ignore')

    import numpy as np
    from abess.linear import LogisticRegression
    from abess.datasets import make_glm_data
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import balanced_accuracy_score
    from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
    from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours

.. GENERATED FROM PYTHON SOURCE LINES 24-26

Synthetic data
---------------

.. GENERATED FROM PYTHON SOURCE LINES 29-32

Generate an imbalanced dataset ``(X, y)``. Here, we use ``make_glm_data`` to
generate a balanced binary dataset ``data`` and then drop about 90% of the
positive samples, so the imbalance ratio of our example is around 10:1.

.. GENERATED FROM PYTHON SOURCE LINES 32-45

.. code-block:: Python

    n, p, k = 5000, 2000, 10
    random_state = 12345
    np.random.seed(random_state)

    data = make_glm_data(n=n, p=p, k=k, family='binomial')
    idx0 = np.where(data.y == 0)[0]  # indices of negative samples
    idx1 = np.where(data.y == 1)[0]  # indices of positive samples
    # keep all negative samples but only the first n/20 = 250 positive ones
    idx = np.array(list(set(idx0).union(set(idx1[:int(n / 20)]))))
    X, y = data.x[idx], data.y[idx]
    print('Generated dataset has {} positive samples and {} negative samples.'.format(np.sum(y == 1), np.sum(y == 0)))

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    print('Train size: {}, Test size: {}.'.format(len(y_train), len(y_test)))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Generated dataset has 250 positive samples and 2553 negative samples.
    Train size: 2102, Test size: 701.

.. GENERATED FROM PYTHON SOURCE LINES 46-48

Base estimator
---------------

First, we fit ``abess.linear.LogisticRegression`` on the raw training set as a
baseline. We report the balanced accuracy (the unweighted mean of per-class
recall) throughout, since plain accuracy is dominated by the majority class.

.. GENERATED FROM PYTHON SOURCE LINES 48-54

.. code-block:: Python

    model = LogisticRegression(support_size=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Balanced accuracy score: 0.853

.. GENERATED FROM PYTHON SOURCE LINES 55-57

Over-sampling
--------------

.. GENERATED FROM PYTHON SOURCE LINES 60-61

``RandomOverSampler``

.. GENERATED FROM PYTHON SOURCE LINES 61-70

.. code-block:: Python

    ros = RandomOverSampler()
    X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 3832
    Balanced accuracy score: 0.883

.. GENERATED FROM PYTHON SOURCE LINES 71-72

``SMOTE``

.. GENERATED FROM PYTHON SOURCE LINES 72-81

.. code-block:: Python

    X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 3832
    Balanced accuracy score: 0.897

.. GENERATED FROM PYTHON SOURCE LINES 82-83

``ADASYN``

.. GENERATED FROM PYTHON SOURCE LINES 83-92

.. code-block:: Python

    X_train_resampled, y_train_resampled = ADASYN().fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 3792
    Balanced accuracy score: 0.898
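By default, these over-samplers resample until the two classes are balanced.
All of them also accept a ``sampling_strategy`` argument; for a binary
problem, a float specifies the desired ratio of minority to majority samples
after resampling. The following is a minimal sketch (not part of the
generated script; the names ``X_half`` and ``y_half`` are illustrative) that
requests a 2:1 ratio instead of full balance.

.. code-block:: Python

    # Illustrative only: sampling_strategy=0.5 asks for
    # N_minority / N_majority = 0.5 after resampling (binary problems only).
    ros = RandomOverSampler(sampling_strategy=0.5, random_state=random_state)
    X_half, y_half = ros.fit_resample(X_train, y_train)
    print('Minority/majority ratio: {:.2f}'.format(np.sum(y_half == 1) / np.sum(y_half == 0)))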
.. GENERATED FROM PYTHON SOURCE LINES 93-95

Under-sampling
----------------

.. GENERATED FROM PYTHON SOURCE LINES 98-99

``RandomUnderSampler``

.. GENERATED FROM PYTHON SOURCE LINES 99-109

.. code-block:: Python

    rus = RandomUnderSampler()
    X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 372
    Balanced accuracy score: 0.892

.. GENERATED FROM PYTHON SOURCE LINES 110-111

``EditedNearestNeighbours``

.. GENERATED FROM PYTHON SOURCE LINES 111-121

.. code-block:: Python

    enn = EditedNearestNeighbours(kind_sel='all')
    X_train_resampled, y_train_resampled = enn.fit_resample(X_train, y_train)

    model = LogisticRegression(support_size=k)
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test)
    print('Resampled size: ', len(y_train_resampled))
    print('Balanced accuracy score: ', balanced_accuracy_score(y_test, y_pred).round(3))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Resampled size: 1650
    Balanced accuracy score: 0.872

.. GENERATED FROM PYTHON SOURCE LINES 122-124

Pipeline
---------

.. GENERATED FROM PYTHON SOURCE LINES 126-131

Finally, we show how to construct a pipeline. Note that the ``Pipeline``
implemented by scikit-learn requires all intermediate estimators to be
transformers, and the resamplers in ``imblearn`` are not transformers.
Therefore, we use the ``Pipeline`` implemented by ``imblearn`` here, which
also accepts resamplers as intermediate steps.

.. GENERATED FROM PYTHON SOURCE LINES 131-149

.. code-block:: Python

    from imblearn.pipeline import Pipeline as imbPipeline

    resamplers = {
        'RandomOverSampler': RandomOverSampler,
        'SMOTE': SMOTE,
        'ADASYN': ADASYN,
        'RandomUnderSampler': RandomUnderSampler,
        'EditedNearestNeighbours': EditedNearestNeighbours
    }

    for name, Resampler in resamplers.items():
        estimators = [('resampler', Resampler()), ('clf', LogisticRegression(support_size=k))]
        pipe = imbPipeline(estimators)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        print('{}: {}'.format(name, balanced_accuracy_score(y_test, y_pred).round(3)))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    RandomOverSampler: 0.89
    SMOTE: 0.897
    ADASYN: 0.898
    RandomUnderSampler: 0.88
    EditedNearestNeighbours: 0.872
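Over- and under-sampling can also be combined. As a minimal sketch (not part
of the generated script), ``imblearn.combine.SMOTEENN`` chains SMOTE
over-sampling with edited-nearest-neighbours cleaning; since it exposes the
same ``fit_resample`` interface, it works as a drop-in resampler step in the
pipeline above.

.. code-block:: Python

    from imblearn.combine import SMOTEENN

    # SMOTEENN = SMOTE over-sampling followed by ENN cleaning; it can replace
    # any of the single resamplers used above.
    pipe = imbPipeline([('resampler', SMOTEENN(random_state=random_state)),
                        ('clf', LogisticRegression(support_size=k))])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print('SMOTEENN: ', balanced_accuracy_score(y_test, y_pred).round(3))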
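Wrapping the resampler in a pipeline has a further benefit: samplers act only
when the pipeline is fitted, so during cross-validation the resampling is
applied only to the training folds, while validation folds keep the original
class distribution. A minimal sketch with scikit-learn's ``cross_val_score``
(the choice of SMOTE and five folds is an arbitrary assumption, not taken
from the original script):

.. code-block:: Python

    from sklearn.model_selection import cross_val_score

    # Resampling happens inside each training fold only; the held-out folds
    # retain the roughly 10:1 imbalance of the original data.
    pipe = imbPipeline([('resampler', SMOTE(random_state=random_state)),
                        ('clf', LogisticRegression(support_size=k))])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='balanced_accuracy')
    print('CV balanced accuracy: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))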
.. GENERATED FROM PYTHON SOURCE LINES 150-150

.. sphinx_gallery_thumbnail_path = 'Tutorial/figure/imbalanced-learn.png'

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 10.010 seconds)

.. _sphx_glr_download_auto_gallery_5-scikit-learn-connection_plot_6_imbalanced_learn.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_6_imbalanced_learn.ipynb <plot_6_imbalanced_learn.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_6_imbalanced_learn.py <plot_6_imbalanced_learn.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_