.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/1-glm/plot_1_LinearRegression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_1-glm_plot_1_LinearRegression.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_1-glm_plot_1_LinearRegression.py:

=================
Linear Regression
=================

In this tutorial, we demonstrate how to use the ``abess`` package to carry out
best subset selection in linear regression, using both simulated data and real data.

.. GENERATED FROM PYTHON SOURCE LINES 11-35

Our package ``abess`` implements a polynomial-time algorithm for the following
best-subset selection problem:

.. math::
    \min_{\beta\in \mathbb{R}^p} \frac{1}{2n} \|y-X\beta\|^2_2,\quad \text{s.t.}\ \|\beta\|_0\leq s,

where :math:`\| \cdot \|_2` is the :math:`\ell_2` norm,
:math:`\|\beta\|_0=\sum_{i=1}^p I(\beta_i\neq 0)` is the :math:`\ell_0` norm of
:math:`\beta`, and the sparsity level :math:`s` is an unknown non-negative integer
to be determined. Next, we present an example showing that the ``abess`` package
can obtain an optimal estimate.

Toward optimality: adaptive best-subset selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Synthetic dataset
~~~~~~~~~~~~~~~~~

We generate a design matrix :math:`X` containing :math:`n = 300` observations, each
with :math:`p = 1000` predictors. The response variable :math:`y` is linearly related
to the first, second, and fifth predictors in :math:`X`:

.. math::
    y = 3X_1 + 1.5X_2 + 2X_5 + \epsilon,

where :math:`\epsilon` is a standard normal random variable.

.. GENERATED FROM PYTHON SOURCE LINES 36-51

.. code-block:: Python

    import numpy as np
    from abess.datasets import make_glm_data

    np.random.seed(0)

    n = 300
    p = 1000
    true_support_set = [0, 1, 4]
    true_coef = np.array([3, 1.5, 2])
    real_coef = np.zeros(p)
    real_coef[true_support_set] = true_coef
    data1 = make_glm_data(n=n, p=p, k=len(true_coef), family="gaussian", coef_=real_coef)

    print(data1.x.shape)
    print(data1.y.shape)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (300, 1000)
    (300,)

.. GENERATED FROM PYTHON SOURCE LINES 52-65

This dataset is high-dimensional and poses a real challenge for subset selection.
It mimics the kind of data that appears in modern scientific research and data mining,
and it serves as a good quick example for demonstrating the power of the ``abess`` library.

Optimality
~~~~~~~~~~

Optimality of subset selection here means:

- ``true_support_set`` (i.e. ``[0, 1, 4]``) is exactly identified;
- the estimated coefficients are the ordinary least squares (OLS) estimates computed
  on the true subset, and are therefore very close to ``true_coef = np.array([3, 1.5, 2])``.

To understand the second criterion, let us look at the estimate given by the
``scikit-learn`` library:

.. GENERATED FROM PYTHON SOURCE LINES 65-70

.. code-block:: Python

    from sklearn.linear_model import LinearRegression as SKLLinearRegression

    sklearn_lr = SKLLinearRegression()
    sklearn_lr.fit(data1.x[:, [0, 1, 4]], data1.y)
    print("OLS estimator: ", sklearn_lr.coef_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    OLS estimator:  [3.00085183 1.44388323 2.00633204]

.. GENERATED FROM PYTHON SOURCE LINES 71-74

The fitted coefficients ``sklearn_lr.coef_`` are the OLS estimates obtained when the
true support set is known. They are very close to ``true_coef`` and are hard to improve
upon with a finite sample size.
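As a quick sanity check (an illustrative addition, not part of the original example),
the same OLS estimate can be reproduced with plain NumPy: the sketch below solves the
least-squares problem on the true support with ``np.linalg.lstsq``, reusing ``data1``
and ``n`` from the data-generation snippet above. Since ``scikit-learn`` fits an
intercept by default, an intercept column is added so that the slopes match
``sklearn_lr.coef_`` up to numerical precision.

.. code-block:: Python

    # Sanity check (illustrative): OLS on the true support via plain NumPy.
    # Assumes `data1` and `n` from the data-generation snippet above.
    X_true = data1.x[:, [0, 1, 4]]                   # predictors in the true support
    X_design = np.hstack([np.ones((n, 1)), X_true])  # prepend an intercept column
    beta_hat, *_ = np.linalg.lstsq(X_design, data1.y, rcond=None)
    print("NumPy OLS slopes:", beta_hat[1:])         # should match sklearn_lr.coef_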
.. GENERATED FROM PYTHON SOURCE LINES 76-82

Adaptive Best Subset Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The adaptive best subset selection (ABESS) algorithm is a powerful tool for selecting
the best subset. We will illustrate its power by showing that it reaches this optimality.
The following code shows the simple syntax for running the ABESS algorithm with the
``abess`` library.

.. GENERATED FROM PYTHON SOURCE LINES 82-87

.. code-block:: Python

    from abess import LinearRegression

    model = LinearRegression()
    model.fit(data1.x, data1.y)
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    LinearRegression()
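Like a ``scikit-learn`` estimator, the fitted ``model`` can also be used for prediction.
The short sketch below is an illustrative addition; it assumes the scikit-learn-style
``predict`` method of ``abess.LinearRegression`` and simply reports the in-sample mean
squared error on the training data.

.. code-block:: Python

    # Illustrative addition: in-sample predictions from the fitted abess model.
    # Assumes the scikit-learn-style `predict` method is available.
    y_pred = model.predict(data1.x)
    print("in-sample MSE:", np.mean((data1.y - y_pred) ** 2))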


.. GENERATED FROM PYTHON SOURCE LINES 88-97

The ``LinearRegression`` class in ``abess`` is designed for selecting the best subset
under the linear model, and it can be imported with ``from abess import LinearRegression``.
Following a syntax similar to ``scikit-learn``, we fit the data with the ABESS algorithm.
Next, we will see that this approach successfully recovers the true support set
``np.array([0, 1, 4])``. The fitted coefficients are stored in ``model.coef_``. We use
the ``np.nonzero`` function to find the subset selected by ``abess``, and then extract
the non-zero entries of ``model.coef_``, which are the coefficient estimates for the
selected predictors.

.. GENERATED FROM PYTHON SOURCE LINES 97-102

.. code-block:: Python

    ind = np.nonzero(model.coef_)
    print("estimated non-zero: ", ind)
    print("estimated coef: ", model.coef_[ind])

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    estimated non-zero:  (array([0, 1, 4]),)
    estimated coef:  [3.00085183 1.44388323 2.00633204]

.. GENERATED FROM PYTHON SOURCE LINES 103-106

From this result, we see that ``abess`` found exactly the true set ``np.array([0, 1, 4])``
among all 1000 predictors. Moreover, the estimated coefficients are quite close to the real
ones and are exactly the same as the estimate ``sklearn_lr.coef_`` given by ``scikit-learn``.

.. GENERATED FROM PYTHON SOURCE LINES 108-119

Real data example
^^^^^^^^^^^^^^^^^

Hitters Dataset
~~~~~~~~~~~~~~~

Now we turn to real data: the Hitters dataset. We hope to use several predictors related
to the performance of baseball athletes in the previous year to predict their salary.
First, let's have a look at this dataset. There are 19 variables besides `Salary`, and
322 observations.

.. GENERATED FROM PYTHON SOURCE LINES 119-127

.. code-block:: Python

    import os
    import pandas as pd

    data2 = pd.read_csv(os.path.join(os.getcwd(), 'Hitters.csv'))
    print(data2.shape)
    print(data2.head(5))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (322, 20)
       AtBat  Hits  HmRun  Runs  RBI  ...  PutOuts  Assists  Errors  Salary  NewLeague
    0    293    66      1    30   29  ...      446       33      20     NaN          A
    1    315    81      7    24   38  ...      632       43      10   475.0          N
    2    479   130     18    66   72  ...      880       82      14   480.0          A
    3    496   141     20    65   78  ...      200       11       3   500.0          N
    4    321    87     10    39   42  ...      805       40       4    91.5          N

    [5 rows x 20 columns]

.. GENERATED FROM PYTHON SOURCE LINES 128-130

Since the dataset contains some missing values, we simply drop the rows with missing
values, which leaves 263 observations:

.. GENERATED FROM PYTHON SOURCE LINES 130-135

.. code-block:: Python

    data2 = data2.dropna()
    print(data2.shape)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (263, 20)

.. GENERATED FROM PYTHON SOURCE LINES 136-138

Moreover, before fitting, we need to convert the categorical variables to dummy variables:

.. GENERATED FROM PYTHON SOURCE LINES 138-145

.. code-block:: Python

    data2 = pd.get_dummies(data2)
    data2 = data2.drop(['League_A', 'Division_E', 'NewLeague_A'], axis=1)
    print(data2.shape)
    print(data2.head(5))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (263, 20)
       AtBat  Hits  HmRun  Runs  ...  Salary  League_N  Division_W  NewLeague_N
    1    315    81      7    24  ...   475.0      True        True         True
    2    479   130     18    66  ...   480.0     False        True        False
    3    496   141     20    65  ...   500.0      True       False         True
    4    321    87     10    39  ...    91.5      True       False         True
    5    594   169      4    74  ...   750.0     False        True        False

    [5 rows x 20 columns]

.. GENERATED FROM PYTHON SOURCE LINES 146-150

Model Fitting
~~~~~~~~~~~~~

As in the simulated-data example, an adaptive best subset model can be fitted easily:

.. GENERATED FROM PYTHON SOURCE LINES 150-157
.. code-block:: Python

    x = np.array(data2.drop('Salary', axis=1))
    y = np.array(data2['Salary'])

    model = LinearRegression(support_size=range(20))
    model.fit(x, y)
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    LinearRegression(support_size=range(0, 20))
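Before inspecting the selected variables, we can glance at the fitted values. The sketch
below is an illustrative addition (again assuming the scikit-learn-style ``predict``
method of ``abess.LinearRegression``) that compares predicted and observed salaries for
the first few players.

.. code-block:: Python

    # Illustrative addition: predicted vs. observed salaries for the first few players.
    # Assumes the scikit-learn-style `predict` method is available.
    pred = model.predict(x)
    print(pd.DataFrame({"observed": y[:5], "predicted": pred[:5].round(1)}))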


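The support size above is tuned with the default information criterion (EBIC). As a side
note, it could also be tuned by cross-validation; the sketch below is an illustrative
addition that assumes the ``cv`` argument of ``abess.LinearRegression`` (check the
documentation of your installed version), and it does not overwrite the EBIC-tuned
``model`` used in the rest of this section.

.. code-block:: Python

    # Illustrative alternative: tune the support size by 5-fold cross-validation.
    # The `cv` constructor argument is assumed to exist in the installed abess version.
    model_cv = LinearRegression(support_size=range(20), cv=5)
    model_cv.fit(x, y)
    print("CV-selected support size:", np.count_nonzero(model_cv.coef_))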
.. GENERATED FROM PYTHON SOURCE LINES 158-159

The result of the fit above can be shown as follows:

.. GENERATED FROM PYTHON SOURCE LINES 159-165

.. code-block:: Python

    ind = np.nonzero(model.coef_)
    print("non-zero:\n", data2.drop('Salary', axis=1).columns[ind])
    print("coef:\n", model.coef_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    non-zero:
     Index(['Hits', 'CRBI', 'PutOuts', 'Division_W'], dtype='object')
    coef:
     [ 0. 2.67579779 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.681779 0. 0.27350022 0. 0. 0. -139.9538855 0. ]

.. GENERATED FROM PYTHON SOURCE LINES 166-168

The model automatically selects the variables `Hits`, `CRBI`, `PutOuts`, and `Division_W`
(the chosen sparsity level is 4). Note that the column names are looked up in
``data2.drop('Salary', axis=1)`` so that they line up with the entries of ``model.coef_``.

.. GENERATED FROM PYTHON SOURCE LINES 170-173

More on the results
~~~~~~~~~~~~~~~~~~~

We can also plot the solution path of the ABESS process:

.. GENERATED FROM PYTHON SOURCE LINES 173-191

.. code-block:: Python

    import matplotlib.pyplot as plt

    coef = np.zeros((20, 19))
    ic = np.zeros(20)
    for s in range(20):
        model = LinearRegression(support_size=s)
        model.fit(x, y)
        coef[s, :] = model.coef_
        ic[s] = model.eval_loss_

    for i in range(19):
        plt.plot(coef[:, i], label=i)
    plt.xlabel('support_size')
    plt.ylabel('coefficients')
    plt.title('ABESS Path')
    plt.show()

.. image-sg:: /auto_gallery/1-glm/images/sphx_glr_plot_1_LinearRegression_001.png
   :alt: ABESS Path
   :srcset: /auto_gallery/1-glm/images/sphx_glr_plot_1_LinearRegression_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 192-194

Besides, we can plot the tuning criterion against the support size. Recall that the
default EBIC was used to tune the support size.

.. GENERATED FROM PYTHON SOURCE LINES 194-201

.. code-block:: Python

    plt.plot(ic, 'o-')
    plt.xlabel('support_size')
    plt.ylabel('EBIC')
    plt.title('Model selection via EBIC')
    plt.show()

.. image-sg:: /auto_gallery/1-glm/images/sphx_glr_plot_1_LinearRegression_002.png
   :alt: Model selection via EBIC
   :srcset: /auto_gallery/1-glm/images/sphx_glr_plot_1_LinearRegression_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 202-205

Under the EBIC criterion, the subset with support size 4 attains the lowest value, so the
procedure adaptively chooses 4 variables. Note that under other information criteria the
result may be different.

.. GENERATED FROM PYTHON SOURCE LINES 207-211

R tutorial
^^^^^^^^^^

For the R tutorial, please view
https://abess-team.github.io/abess/articles/v01-abess-guide.html.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.675 seconds)


.. _sphx_glr_download_auto_gallery_1-glm_plot_1_LinearRegression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_1_LinearRegression.ipynb <plot_1_LinearRegression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_1_LinearRegression.py <plot_1_LinearRegression.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_