.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/5-scikit-learn-connection/plot_1_scikit_learn.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_5-scikit-learn-connection_plot_1_scikit_learn.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_5-scikit-learn-connection_plot_1_scikit_learn.py:

Work with scikit-learn
======================

.. GENERATED FROM PYTHON SOURCE LINES 7-11

``abess`` is easy to use together with the well-known package ``scikit-learn``, and here is an
example. We are going to illustrate how to integrate ``abess`` with ``scikit-learn``'s
pre-processing and model selection modules to build a non-linear model for diagnosing
malignant tumors. Let's start by importing the necessary dependencies:

.. GENERATED FROM PYTHON SOURCE LINES 11-23

.. code-block:: Python

    import numpy as np
    from abess.datasets import make_glm_data
    from abess.linear import LinearRegression, LogisticRegression
    from sklearn.datasets import fetch_openml, load_breast_cancer
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.metrics import roc_auc_score, make_scorer, roc_curve, auc
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures, StandardScaler
    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score
    from sklearn.feature_selection import SelectFromModel

.. GENERATED FROM PYTHON SOURCE LINES 24-29

Establish the process
---------------------

Suppose we would like to extend the original variables to their interactions and then fit a
``LogisticRegression`` on them. This process can be recorded with ``Pipeline``:

.. GENERATED FROM PYTHON SOURCE LINES 29-37

..
.. code-block:: Python

    pipe = Pipeline([
        ('poly', PolynomialFeatures(include_bias=False)),   # without intercept
        ('standard', StandardScaler()),
        ('alogistic', LogisticRegression())
    ])

.. GENERATED FROM PYTHON SOURCE LINES 38-43

Parameter grid
--------------

We can supply different parameter values to the model and let the program choose the best
combination. Here we give candidate parameters for ``PolynomialFeatures``, for example:

.. GENERATED FROM PYTHON SOURCE LINES 43-52

.. code-block:: Python

    param_grid = {
        # interaction_only=True drops the "self-combination" terms (e.g. X^2, X^3)
        'poly__interaction_only': [True, False],
        'poly__degree': [1, 2, 3]               # the degree of the polynomial
    }

.. GENERATED FROM PYTHON SOURCE LINES 53-60

Note that the program tries all combinations of the values we give, which means that
:math:`2\times3=6` parameter combinations will be tried.

Criterion
---------

After giving a grid of parameters, we should define what a "better" result is. For example,
the AUC (area under the ROC curve) can serve as a criterion: the larger, the better.

.. GENERATED FROM PYTHON SOURCE LINES 60-64

.. code-block:: Python

    scorer = make_scorer(roc_auc_score, greater_is_better=True)

.. GENERATED FROM PYTHON SOURCE LINES 65-68

Cross Validation
----------------

For more reliable results, cross validation (CV) is often performed.

.. GENERATED FROM PYTHON SOURCE LINES 70-75

Suppose that the data are independent and identically distributed (i.i.d.), i.e. all samples
stem from the same generative process and the generative process has no memory of past
generated samples. A typical CV strategy is K-fold, and a corresponding grid search procedure
can be set up as follows:

.. GENERATED FROM PYTHON SOURCE LINES 75-78

.. code-block:: Python

    grid_search = GridSearchCV(pipe, param_grid, scoring=scorer, cv=5)

.. GENERATED FROM PYTHON SOURCE LINES 79-83

However, if there is correlation between observations (e.g. time-series data), the K-fold
strategy is no longer appropriate. An alternative CV strategy is ``TimeSeriesSplit``.
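To make the split pattern concrete, here is a minimal sketch (not part of the original
tutorial) that prints the train/test indices ``TimeSeriesSplit`` produces on a small
time-ordered toy sample:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)    # 12 time-ordered samples
tscv = TimeSeriesSplit(n_splits=3)

# Each successive split extends the training window forward in time,
# and the test fold always comes strictly after the training fold,
# so no future information leaks into the past.
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print("Fold %d: train=%s test=%s" % (fold, train_idx.tolist(), test_idx.tolist()))
    # e.g. Fold 0: train=[0, 1, 2] test=[3, 4, 5]
```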
It is a variation of K-fold: in the :math:`k`-th split, it returns the first :math:`k` folds
as the train set and the :math:`(k+1)`-th fold as the test set.

.. GENERATED FROM PYTHON SOURCE LINES 85-88

The following example shows a combination of ``abess`` and ``TimeSeriesSplit`` applied to the
``Bike_Sharing_Demand`` dataset, and it returns the CV score for a specific choice of
``support_size``.

.. GENERATED FROM PYTHON SOURCE LINES 88-118

.. code-block:: Python

    bike_sharing = fetch_openml('Bike_Sharing_Demand', version=2, as_frame=True)
    df = bike_sharing.frame
    X = df.drop('count', axis='columns')
    y = df['count'] / df['count'].max()

    ts_cv = TimeSeriesSplit(
        n_splits=5,
        gap=48,
        max_train_size=10000,
        test_size=1000,
    )

    categorical_columns = ['weather', 'season', 'holiday', 'workingday']
    one_hot_encoder = OneHotEncoder(handle_unknown='ignore')
    one_hot_abess_pipeline = make_pipeline(
        ColumnTransformer(
            transformers=[
                ('categorical', one_hot_encoder, categorical_columns),
                ('one_hot_time', one_hot_encoder, ['hour', 'weekday', 'month']),
            ],
            remainder=MinMaxScaler(),
        ),
        LinearRegression(support_size=5),
    )

    scores = cross_val_score(one_hot_abess_pipeline, X, y, cv=ts_cv)
    print("%0.2f score with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.21 score with a standard deviation of 0.06

.. GENERATED FROM PYTHON SOURCE LINES 119-123

Model fitting
-------------

Everything is prepared now. We can simply load the data and pass it to ``grid_search``:

.. GENERATED FROM PYTHON SOURCE LINES 123-129

.. code-block:: Python

    X, y = load_breast_cancer(return_X_y=True)
    grid_search.fit(X, y)
    print([grid_search.best_score_, grid_search.best_params_])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [0.9714645755670978, {'poly__degree': 2, 'poly__interaction_only': True}]

..
.. GENERATED FROM PYTHON SOURCE LINES 130-137

The output reports the polynomial-feature settings of the model selected among the candidates,
together with its corresponding area under the curve (AUC), which is over 0.97, indicating
that the selected model should perform admirably in practice. Moreover, the best parameter
combination is shown above: degree 2 with interaction terms only (i.e. without
"self-combination" terms such as :math:`X^2`), implying that including the pairwise
interactions between features can lead to better model generalization.

.. GENERATED FROM PYTHON SOURCE LINES 139-140

Here is its ROC curve:

.. GENERATED FROM PYTHON SOURCE LINES 140-153

.. code-block:: Python

    import matplotlib.pyplot as plt

    proba = grid_search.predict_proba(X)
    fpr, tpr, _ = roc_curve(y, proba[:, 1])
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], 'k--', label="ROC curve (area = %0.2f)" % auc(fpr, tpr))
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title("Receiver operating characteristic (ROC) curve")
    plt.legend(loc="lower right")
    plt.show()

.. image-sg:: /auto_gallery/5-scikit-learn-connection/images/sphx_glr_plot_1_scikit_learn_001.png
    :alt: Receiver operating characteristic (ROC) curve
    :srcset: /auto_gallery/5-scikit-learn-connection/images/sphx_glr_plot_1_scikit_learn_001.png
    :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 154-156

Feature selection
-----------------

.. GENERATED FROM PYTHON SOURCE LINES 158-162

Besides being used to make predictions explicitly, ``abess`` can be exploited to select
important features. The following example shows how to perform abess-based feature selection
using ``sklearn.feature_selection.SelectFromModel``.

.. GENERATED FROM PYTHON SOURCE LINES 165-177

..
.. code-block:: Python

    np.random.seed(0)
    n, p, k = 300, 1000, 5
    data = make_glm_data(n=n, p=p, k=k, family='gaussian')
    X, y = data.x, data.y
    print('Shape of original data: ', X.shape)

    model = LinearRegression().fit(X, y)
    sfm = SelectFromModel(model, prefit=True)
    X_new = sfm.transform(X)
    print('Shape of transformed data: ', X_new.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Shape of original data:  (300, 1000)
    Shape of transformed data:  (300, 5)

.. GENERATED FROM PYTHON SOURCE LINES 178-179

sphinx_gallery_thumbnail_path = 'Tutorial/figure/scikit_learn.png'

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (1 minute 48.834 seconds)

.. _sphx_glr_download_auto_gallery_5-scikit-learn-connection_plot_1_scikit_learn.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_1_scikit_learn.ipynb <plot_1_scikit_learn.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_1_scikit_learn.py <plot_1_scikit_learn.py>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_