{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Large-Sample Data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n\n \n\nA large sample size leads to a large range of possible support sizes which adds to the computational burdon.\nThe computational tip here is to use the golden-section searching to avoid support size enumeration.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A motivated observation\nHere we generate a simple example under linear model via make_glm_data.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom abess.datasets import make_glm_data\nfrom abess.linear import LinearRegression\n\nnp.random.seed(0)\ndata = make_glm_data(n=100, p=20, k=5, family='gaussian')\n\nic = np.zeros(21)\nfor sz in range(21):\n model = LinearRegression(support_size=[sz], ic_type='ebic')\n model.fit(data.x, data.y)\n ic[sz] = model.ic_\nprint(\"lowest point: \", np.argmin(ic))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generated data contains 100 observations with 20 predictors,\nwhile 5 of them are useful (has non-zero coefficients).\nUses extended Bayesian information criterion (EBIC), the abess successfully detect the true support size.\n\nWe go further and take a look on the support size versus EBIC returned\nby LinearRegression in abess.linear.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.plot(ic, 'o-')\nplt.xlabel('support size')\nplt.ylabel('EBIC')\nplt.title('Model Selection via EBIC')\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the figure, we can find that\nthe 
curve should be a strictly unimodal function achieving its minimum at the true support size,\nwhere support_size = 5 is the lowest point.\n\nMotivated by this observation, we consider a golden-section search technique to determine the optimal support size\nassociated with the minimum EBIC.\n\nCompared to sequential searching, the golden-section method is much faster because it skips support sizes that are unlikely to be optimal.\nPrecisely, while searching the optimal support size one by one over a candidate set takes $O(s_{max})$ steps,\n**golden-section** reduces the time complexity to $O(\\ln(s_{max}))$, a significant computational improvement.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage: golden-section\nIn the abess package, the golden-section technique can be invoked as follows:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = LinearRegression(path_type='gs', s_min=0, s_max=20)\nmodel.fit(data.x, data.y)\nprint(\"real coef:\\n\", np.nonzero(data.coef_))\nprint(\"predicted coef:\\n\", np.nonzero(model.coef_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where path_type = 'gs' means using golden-section search rather than searching the support sizes one by one.\ns_min and s_max indicate the lower and upper bounds of the support-size range.\nNote that with golden-section searching, we should not supply support_size, which is only used by the sequential strategy.\n\nThe output of the golden-section strategy shows that the optimal model size is accurately detected.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Golden-section v.s. 
Sequential searching: runtime comparison\nIn this part, we perform a runtime comparison to demonstrate the speed gain brought by golden-section search.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# time the sequential strategy over all 21 candidate sizes\nt1 = time()\nmodel = LinearRegression(support_size=range(21))\nmodel.fit(data.x, data.y)\nprint(\"sequential time: \", time() - t1)\n\n# time the golden-section strategy over the same range\nt2 = time()\nmodel = LinearRegression(path_type='gs', s_min=0, s_max=20)\nmodel.fit(data.x, data.y)\nprint(\"golden-section time: \", time() - t2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The golden-section search runs much faster than the sequential method.\nThe speed gain grows as the range of candidate support sizes widens.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The abess R package also supports golden-section search.\nFor the R tutorial, please see\nhttps://abess-team.github.io/abess/articles/v09-fasterSetting.html.\n\nsphinx_gallery_thumbnail_path = 'Tutorial/figure/large-sample.png'\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 0 }