{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Best Subset of Group Selection\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\nBest subset of group selection (BSGS) aims to choose a small number of non-overlapping groups of variables that best explain the response variable.\nBSGS is useful in practice because variables with a group structure are ubiquitous.\nFor instance, a categorical variable with several levels is often represented by a group of dummy variables.\nSimilarly, in a nonparametric additive model, a continuous component can be represented by a set of basis functions\n(e.g., a linear combination of spline basis functions). Finally, specific prior knowledge can impose group structures on variables.\nA typical example is that genes belonging to the same biological pathway can be treated as a group in genomic data analysis.\nThe figure below illustrates the difference between BSGS and best-subset selection.\n\n\n\nBSGS can be achieved by solving:\n\n\\begin{align}\\min_{\\beta\\in \\mathbb{R}^p} \\frac{1}{2n} ||y-X\\beta||_2^2,\\; \\textup{s.t.}\\ ||\\beta||_{0,2}\\leq s ,\\end{align}\n\n\nwhere $||\\beta||_{0,2} = \\sum_{j=1}^J I(||\\beta_{G_j}||_2\\neq 0)$ counts the groups with a nonzero coefficient, $J$ is the number of groups, $\\beta_{G_j}$ is the subvector of $\\beta$ indexed by the $j$-th group $G_j$, $||\\cdot||_2$ is the $\\ell_2$ norm, and the model size $s$ is a positive integer to be determined from data.\n\nAlthough this problem is NP-hard, Zhang et al. developed a certifiably polynomial algorithm to solve it.\nThis algorithm is integrated in the ``abess`` package, and users can conveniently select the best group subset by assigning a proper value to the ``group`` argument.\n\n## Using best group subset selection\nWe generate a dataset with 100 samples and 20 variables, in which the first and third blocks of five variables have nonzero coefficients and the remaining ten variables are irrelevant.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\nfrom abess.datasets import make_glm_data\nfrom abess.linear import LinearRegression\n\nnp.random.seed(0)\n\n# generate data: 100 samples, 20 variables in four blocks of five\nn = 100\np = 20\nk = 5\n# blocks 1 and 3 carry signal (0.5 each); blocks 2 and 4 are all zero\ncoef1 = 0.5 * np.ones(5)\ncoef2 = np.zeros(5)\ncoef3 = 0.5 * np.ones(5)\ncoef4 = np.zeros(5)\ncoef = np.hstack((coef1, coef2, coef3, coef4))\n# coef_ supplies the true coefficients explicitly\ndata = make_glm_data(n=n, p=p, k=k, family='gaussian', coef_=coef)\nprint('real coefficients:\\n', data.coef_, '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose we have prior information that every five consecutive variables form a group:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# group labels 0..3, each repeated for five consecutive variables\ngroup = np.linspace(0, 3, 4).repeat(5)\nprint('group index:\\n', group)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can pass the ``group`` argument to the estimator. Note that\n``support_size`` now indicates the number of groups, not the\nnumber of variables. Similarly, ``always_select``, ``A_init`` and other\nparameters that take indices expect group indices rather than\nvariable indices.\n\n"
]
},
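{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked illustration (not part of the fitted output), consider the true coefficient vector used in this tutorial, with the four groups written as $G_1,\\dots,G_4$ (labels 0 to 3 in the zero-based ``group`` index):\n\n\\begin{align}\\beta = (\\underbrace{0.5,\\dots,0.5}_{G_1},\\ \\underbrace{0,\\dots,0}_{G_2},\\ \\underbrace{0.5,\\dots,0.5}_{G_3},\\ \\underbrace{0,\\dots,0}_{G_4}).\\end{align}\n\nCounting variables gives $||\\beta||_0 = 10$, whereas counting groups gives\n\n\\begin{align}||\\beta||_{0,2} = \\sum_{j=1}^{4} I(||\\beta_{G_j}||_2\\neq 0) = 1 + 0 + 1 + 0 = 2,\\end{align}\n\nsince $||\\beta_{G_1}||_2 = ||\\beta_{G_3}||_2 = \\sqrt{5\\times 0.5^2} \\approx 1.118$ and $||\\beta_{G_2}||_2 = ||\\beta_{G_4}||_2 = 0$. A group-level model size of 2 therefore describes the true model.\n\n"
]
},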
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# support_size counts groups here: candidate sizes are 0, 1 and 2 groups\nmodel1 = LinearRegression(support_size=range(3), group=group)\nmodel1.fit(data.x, data.y)\nprint('coefficients:\\n', model1.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fitted result suggests that only two groups are selected (``support_size`` ranges from 0 to 2, counted in groups); the selected variables are shown above.\n\nNext, we compare this result with the one obtained without a given group structure.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model2 = LinearRegression()\nmodel2.fit(data.x, data.y)\nprint('coefficients:\\n', model2.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model fitted without a group structure omits three predictors \nbelonging to the active set.\n\nThe ``abess`` R package also supports best group subset selection.\nFor the R tutorial, please see\nhttps://abess-team.github.io/abess/articles/v07-advancedFeatures.html.\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 0
}