.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/4-computation-tips/plot_large_sample.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_4-computation-tips_plot_large_sample.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_4-computation-tips_plot_large_sample.py:

Large-Sample Data
=================

.. GENERATED FROM PYTHON SOURCE LINES 7-14

Introduction
^^^^^^^^^^^^

.. image:: ../../Tutorial/figure/large-sample.png

A large sample size leads to a large range of possible support sizes, which
adds to the computational burden. The computational tip here is to use
golden-section search to avoid enumerating every support size.

.. GENERATED FROM PYTHON SOURCE LINES 16-19

A motivating observation
^^^^^^^^^^^^^^^^^^^^^^^^

Here we generate a simple example under the linear model via ``make_glm_data``.

.. GENERATED FROM PYTHON SOURCE LINES 19-36

.. code-block:: Python

    from time import time

    import numpy as np
    import matplotlib.pyplot as plt
    from abess.datasets import make_glm_data
    from abess.linear import LinearRegression

    np.random.seed(0)
    data = make_glm_data(n=100, p=20, k=5, family='gaussian')

    ic = np.zeros(21)
    for sz in range(21):
        model = LinearRegression(support_size=[sz], ic_type='ebic')
        model.fit(data.x, data.y)
        ic[sz] = model.eval_loss_

    print("lowest point: ", np.argmin(ic))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    lowest point:  5

.. GENERATED FROM PYTHON SOURCE LINES 37-43

The generated data contain 100 observations with 20 predictors, of which 5
are useful (i.e., have non-zero coefficients). Using the extended Bayesian
information criterion (EBIC), ``abess`` successfully detects the true
support size. We go further and take a look at the support size versus the
EBIC returned by ``LinearRegression`` in ``abess.linear``.

.. GENERATED FROM PYTHON SOURCE LINES 45-51

.. code-block:: Python

    plt.plot(ic, 'o-')
    plt.xlabel('support size')
    plt.ylabel('EBIC')
    plt.title('Model Selection via EBIC')
    plt.show()

.. image-sg:: /auto_gallery/4-computation-tips/images/sphx_glr_plot_large_sample_001.png
   :alt: Model Selection via EBIC
   :srcset: /auto_gallery/4-computation-tips/images/sphx_glr_plot_large_sample_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 52-65

From the figure, we can see that the curve is strictly unimodal, achieving
its minimum at the true subset size, where ``support_size = 5`` is the
lowest point. Motivated by this observation, we consider a golden-section
search technique to determine the optimal support size associated with the
minimum EBIC.

.. image:: ../../Tutorial/figure/goldenSection.png

Compared to sequential searching, golden-section search is much faster
because it skips support sizes that are unlikely to be optimal. Precisely,
while searching for the optimal support size one by one over a candidate set
requires :math:`O(s_{\max})` model fits, **golden-section** search reduces
the complexity to :math:`O(\ln(s_{\max}))`, giving a significant
computational improvement.
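To make the idea concrete, below is a minimal sketch of a golden-section
search over integer support sizes. It is **not** the internal implementation
of ``abess``; it simply illustrates the technique, reusing ``data`` and
``LinearRegression`` from the code above and assuming the criterion curve is
unimodal as observed. The helper ``ebic_at`` is a hypothetical name
introduced here for illustration.

.. code-block:: Python

    def ebic_at(sz):
        # hypothetical helper: fit at one support size and return its EBIC
        m = LinearRegression(support_size=[sz], ic_type='ebic')
        m.fit(data.x, data.y)
        return m.eval_loss_

    def golden_section(s_min, s_max):
        phi = (5 ** 0.5 - 1) / 2  # inverse golden ratio, about 0.618
        lo, hi = s_min, s_max
        while hi - lo > 2:
            m1 = int(round(hi - phi * (hi - lo)))
            m2 = int(round(lo + phi * (hi - lo)))
            if m1 == m2:  # keep the two probes distinct on the integer grid
                m2 = m1 + 1
            # unimodality lets us discard one side of the interval
            if ebic_at(m1) < ebic_at(m2):
                hi = m2
            else:
                lo = m1
        # enumerate the few remaining candidates
        return min(range(lo, hi + 1), key=ebic_at)

    print("golden-section estimate: ", golden_section(0, 20))

Each iteration shrinks the candidate interval by a constant factor, which is
exactly where the logarithmic number of model fits comes from; a production
implementation would also cache repeated evaluations.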
.. GENERATED FROM PYTHON SOURCE LINES 67-70

Usage: golden-section
^^^^^^^^^^^^^^^^^^^^^

In the ``abess`` package, the golden-section technique can be invoked as
follows:

.. GENERATED FROM PYTHON SOURCE LINES 70-76

.. code-block:: Python

    model = LinearRegression(path_type='gs', s_min=0, s_max=20)
    model.fit(data.x, data.y)
    print("real coef:\n", np.nonzero(data.coef_)[0])
    print("predicted coef:\n", np.nonzero(model.coef_)[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    real coef:
     [ 2  5 10 11 18]
    predicted coef:
     [ 2  5 10 11 18]

.. GENERATED FROM PYTHON SOURCE LINES 77-83

where ``path_type='gs'`` means using golden-section search rather than
searching the support sizes one by one, and ``s_min`` and ``s_max`` indicate
the left and right bounds of the support-size range. Note that with
golden-section searching we should not supply ``support_size``, which is
only useful for the sequential strategy. The output of the golden-section
strategy shows that the optimal model size is accurately detected.

.. GENERATED FROM PYTHON SOURCE LINES 86-90

Golden-section vs. sequential searching: runtime comparison
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this part, we perform a runtime comparison experiment to demonstrate the
speed gain brought by the golden-section search.

.. GENERATED FROM PYTHON SOURCE LINES 90-102

.. code-block:: Python

    t1 = time()
    model = LinearRegression(support_size=range(21))
    model.fit(data.x, data.y)
    print("sequential time: ", time() - t1)

    t2 = time()
    model = LinearRegression(path_type='gs', s_min=0, s_max=20)
    model.fit(data.x, data.y)
    print("golden-section time: ", time() - t2)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    sequential time:  0.002244710922241211
    golden-section time:  0.0012612342834472656

.. GENERATED FROM PYTHON SOURCE LINES 103-105

The golden-section search runs much faster than the sequential method, and
the speed gain grows as the range of support sizes becomes larger.

.. GENERATED FROM PYTHON SOURCE LINES 107-113

The ``abess`` R package also supports golden-section search. For the R
tutorial, please visit
https://abess-team.github.io/abess/articles/v09-fasterSetting.html.

sphinx_gallery_thumbnail_path = 'Tutorial/figure/large-sample.png'

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.163 seconds)

.. _sphx_glr_download_auto_gallery_4-computation-tips_plot_large_sample.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_large_sample.ipynb <plot_large_sample.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_large_sample.py <plot_large_sample.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_