.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/3-advanced-features/plot_best_group.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_3-advanced-features_plot_best_group.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_3-advanced-features_plot_best_group.py:


Best Subset of Group Selection
==============================

.. GENERATED FROM PYTHON SOURCE LINES 6-33

Introduction
------------

Best subset of group selection (BSGS) aims to choose a small number of non-overlapping groups of variables that best explain the response variable.
BSGS is practically useful because variables with a group structure arise ubiquitously.
For instance, a categorical variable with several levels is often represented by a group of dummy variables.
Besides, in a nonparametric additive model, a continuous component can be represented by a set of basis functions (e.g., a linear combination of spline basis functions).
Finally, specific prior knowledge can impose group structures on variables.
A typical example is that genes belonging to the same biological pathway can be considered as a group in genomic data analysis.
The figure below illustrates the difference between BSGS and best-subset selection.

.. image:: ../../Tutorial/figure/best-subset-group-selection.png

BSGS can be achieved by solving:

.. math::
    \min_{\beta\in \mathbb{R}^p} \frac{1}{2n} ||y-X\beta||_2^2,\; \textup{s.t.}\ ||\beta||_{0,2}\leq s,

where :math:`||\beta||_{0,2} = \sum_{j=1}^J I(||\beta_{G_j}||_2\neq 0)`, :math:`||\cdot||_2` is the :math:`\ell_2` norm, and the model size :math:`s` is a positive integer to be determined from data.
Despite the NP-hardness of this problem, Zhang et al. developed a certifiably polynomial-time algorithm to solve it.
This algorithm is integrated in the ``abess`` package, and users can handily select the best group subset by assigning a proper value to the ``group`` argument.

Using best group subset selection
---------------------------------

We generate a dataset ``data`` with 100 samples and 20 variables, in which two groups of 5 variables are truly relevant and the remaining 10 variables are irrelevant.

.. GENERATED FROM PYTHON SOURCE LINES 33-52

.. code-block:: Python

    import numpy as np
    from abess.datasets import make_glm_data
    from abess.linear import LinearRegression

    np.random.seed(0)

    # generate data: two active groups of five variables each
    n = 100
    p = 20
    k = 5
    coef1 = 0.5 * np.ones(5)
    coef2 = np.zeros(5)
    coef3 = 0.5 * np.ones(5)
    coef4 = np.zeros(5)
    coef = np.hstack((coef1, coef2, coef3, coef4))
    data = make_glm_data(n=n, p=p, k=k, family='gaussian', coef_=coef)

    print('real coefficients:\n', data.coef_, '\n')

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    real coefficients:
     [0.5 0.5 0.5 0.5 0.5 0.  0.  0.  0.  0.  0.5 0.5 0.5 0.5 0.5 0.  0.  0.
     0.  0. ]

.. GENERATED FROM PYTHON SOURCE LINES 53-54

Suppose we have prior information that every 5 adjacent variables form a group:

.. GENERATED FROM PYTHON SOURCE LINES 54-58

.. code-block:: Python

    group = np.linspace(0, 3, 4).repeat(5)
    print('group index:\n', group)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    group index:
     [0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3.]

.. GENERATED FROM PYTHON SOURCE LINES 59-64

Then we can set the ``group`` argument in the model.
Note that ``support_size`` here indicates the number of selected groups, instead of the number of variables.
Similarly, ``always_select``, ``A_init`` and other "index"-related parameters should also be given as group indices, instead of variable indices (see the sketch below).
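For instance, the sketch below (not part of the generated output of this example) forces the first group into the model through ``always_select``; because ``group`` is supplied, the index ``0`` refers to the first *group* rather than the first variable. The variable name ``model0``, the forced group, and the candidate support sizes are purely illustrative.

.. code-block:: Python

    # a minimal sketch: with ``group`` supplied, ``always_select=[0]`` keeps
    # group 0 (the first five variables) in every candidate model, and
    # ``support_size`` counts groups rather than variables
    model0 = LinearRegression(support_size=range(1, 3), group=group,
                              always_select=[0])
    model0.fit(data.x, data.y)
    # report which group labels carry nonzero fitted coefficients
    print('selected group indices:\n', np.unique(group[model0.coef_ != 0]))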
.. GENERATED FROM PYTHON SOURCE LINES 64-70

.. code-block:: Python

    model1 = LinearRegression(support_size=range(3), group=group)
    model1.fit(data.x, data.y)
    print('coefficients:\n', model1.coef_)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    coefficients:
     [0.65915697 0.45713643 0.49044526 0.43927599 0.62863533 0.         0.
     0.         0.         0.         0.575272   0.41249505 0.37598688
     0.59901008 0.58798189 0.         0.         0.         0.         0.        ]

.. GENERATED FROM PYTHON SOURCE LINES 71-75

The fitted result suggests that exactly two groups are selected (since ``support_size`` ranges from 0 to 2), and the selected variables are shown above.
Next, we compare this result with the one obtained without a given group structure.

.. GENERATED FROM PYTHON SOURCE LINES 75-79

.. code-block:: Python

    model2 = LinearRegression()
    model2.fit(data.x, data.y)
    print('coefficients:\n', model2.coef_)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    coefficients:
     [0.61823344 0.54500673 0.59272352 0.42754021 0.65843857 0.         0.
     0.         0.         0.         0.         0.         0.
     0.66978731 0.55137187 0.         0.         0.         0.         0.        ]

.. GENERATED FROM PYTHON SOURCE LINES 80-89

The model fitted without the group structure omits three predictors that belong to the true active set.

The ``abess`` R package also supports best group subset selection.
For the R tutorial, please view https://abess-team.github.io/abess/articles/v07-advancedFeatures.html.

sphinx_gallery_thumbnail_path = 'Tutorial/figure/best-subset-group-selection.png'


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.007 seconds)


.. _sphx_glr_download_auto_gallery_3-advanced-features_plot_best_group.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_best_group.ipynb <plot_best_group.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_best_group.py <plot_best_group.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_