.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/4-computation-tips/plot_large_dimension.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_4-computation-tips_plot_large_dimension.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_4-computation-tips_plot_large_dimension.py:

===========================
Ultra-high dimensional data
===========================

.. GENERATED FROM PYTHON SOURCE LINES 9-25

Introduction
^^^^^^^^^^^^
Recent technological advances have made it possible to collect ultra-high
dimensional data. A common feature of these data is that the number of
variables :math:`p` is generally much larger than the sample size :math:`n`.
For instance, the number of gene expression profiles is on the order of tens
of thousands, while the number of patient samples is on the order of tens or
hundreds. Ultra-high dimensional predictors increase the computational cost
and reduce the estimation accuracy of any statistical procedure. We visualize
linear regression analysis in the context of ultra-high dimensionality in the
following figure:

.. image:: ../../Tutorial/figure/highDimension.png

The ``abess`` library implements several features to analyze ultra-high
dimensional data efficiently. In this tutorial, we briefly describe these
helpful features, namely feature screening and important searching. These
features may also improve statistical accuracy and algorithmic stability.

.. GENERATED FROM PYTHON SOURCE LINES 27-51

Feature screening
^^^^^^^^^^^^^^^^^
Feature screening (FS, a.k.a. sure independence screening) is one of the most
famous frameworks for tackling the challenges brought by ultra-high
dimensional data. FS can theoretically retain all effective predictors with
high probability, a property known as "the sure screening property", and it
can cope with even exponentially growing dimensions. Practically, FS filters
out the features that make very little marginal contribution to the loss
function, effectively reducing the dimensionality :math:`p` to a moderate
scale so that the subsequent statistical algorithm runs efficiently.

In our program, to carry out FS, users need to pass an integer smaller than
the number of predictors to ``screening_size``. The program first computes
the marginal likelihood of each predictor and keeps the ``screening_size``
predictors with the largest marginal likelihoods. The ABESS algorithm is then
conducted only on this screened subset.
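To build intuition for the screening step, the sketch below ranks features by
their absolute marginal correlation with a synthetic response and keeps the
``screening_size`` highest-ranked ones. This is only a rough illustration on
made-up data; the criterion actually used inside ``abess`` is the marginal
likelihood of each predictor, as described above.

.. code-block:: Python

    import numpy as np

    # Rough illustration of marginal screening (not abess's internal code):
    # score each feature by its absolute correlation with the response and
    # keep the `screening_size` highest-scoring features.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((500, 10000))
    y = 3 * x[:, 0] - 2 * x[:, 1] + x[:, 2] + rng.standard_normal(500)

    screening_size = 100
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    utility = np.abs(xc.T @ yc)
    utility /= np.linalg.norm(xc, axis=0) * np.linalg.norm(yc)
    screened = np.argsort(utility)[-screening_size:]
    print({0, 1, 2}.issubset(screened))  # True: the effective predictors survive

Only the surviving columns are handed to the subsequent selection step, which
is why screening reduces the cost of everything that follows.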
Using feature screening
^^^^^^^^^^^^^^^^^^^^^^^
Here is an example under the sparse linear model, in which three variables
have an impact on the response. The dataset comprises 500 observations, and
each observation has 10000 features. We use ``LinearRegression`` to analyze
the synthetic dataset and set ``screening_size = 100`` to keep the 100
features with the largest marginal utilities.

.. GENERATED FROM PYTHON SOURCE LINES 51-62

.. code-block:: Python

    from time import time

    import numpy as np

    from abess.datasets import make_glm_data
    from abess.linear import LinearRegression, LogisticRegression

    data = make_glm_data(n=500, p=10000, k=3, family='gaussian')
    model = LinearRegression(support_size=range(0, 5), screening_size=100)
    model.fit(data.x, data.y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    LinearRegression(screening_size=100, support_size=range(0, 5))

.. GENERATED FROM PYTHON SOURCE LINES 63-67

.. code-block:: Python

    print('real coefficients\' indexes:', np.nonzero(data.coef_)[0])
    print('fitted coefficients\' indexes:', np.nonzero(model.coef_)[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    real coefficients' indexes: [7211 8688 8789]
    fitted coefficients' indexes: [7211 8688 8789]

.. GENERATED FROM PYTHON SOURCE LINES 68-71

It can be seen that the estimated support set is identical to the true
support set. We also study the runtime when FS is employed:

.. GENERATED FROM PYTHON SOURCE LINES 71-83

.. code-block:: Python

    model1 = LinearRegression(support_size=range(0, 20))
    model2 = LinearRegression(support_size=range(0, 20), screening_size=100)

    t1 = time()
    model1.fit(data.x, data.y)
    t2 = time()
    model2.fit(data.x, data.y)
    t3 = time()

    print("Runtime (without screening) : ", t2 - t1)
    print("Runtime (with screening) : ", t3 - t2)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Runtime (without screening) :  0.26412415504455566
    Runtime (with screening) :  0.2332899570465088

.. GENERATED FROM PYTHON SOURCE LINES 84-89

The runtimes reported above suggest that FS reduces the runtime.

Not all of the best subset selection methods support feature screening
(e.g., RobustPCA). Please see the Python API for more details.
.. GENERATED FROM PYTHON SOURCE LINES 92-108

Important searching
^^^^^^^^^^^^^^^^^^^
Suppose that only a few variables are important (i.e., there are many noise
variables). It may then be a wise choice to focus on some important variables
during the splicing process. This can save a lot of time, especially when
:math:`p` is large. In the ``abess`` package, an argument called
``important_search`` is used for this; it gives the size of the inactive set
considered in each splicing process. By default, this argument is set to 0,
and all inactive variables are contained in the inactive set. If a positive
integer is given, the splicing process focuses on the active set and the
``important_search`` most important inactive variables. After the splicing
iterations converge on this subset, we check whether the chosen variables are
still the most important ones by recomputing the importance on the full set
with the new active set. If not, we update the subset and perform splicing
again. In our empirical experience, only a few such iterations are needed to
reach a stable subset. After that, the active set found on the stable subset
is treated as the active set on the full set.

.. GENERATED FROM PYTHON SOURCE LINES 110-112

Using important searching
^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 112-120

.. code-block:: Python

    # Here, we use a classification task as an example to demonstrate how to
    # use important searching. The dataset comprises 200 observations, and
    # each observation has 5000 features.
    data = make_glm_data(n=200, p=5000, k=10, family="binomial")

.. GENERATED FROM PYTHON SOURCE LINES 121-123

We use ``LogisticRegression`` but focus on only the 500 most important
variables. The specific code is presented below:

.. GENERATED FROM PYTHON SOURCE LINES 123-129

.. code-block:: Python

    model1 = LogisticRegression(important_search=500)
    t1 = time()
    model1.fit(data.x, data.y)
    t2 = time()
    print("time : ", t2 - t1)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    time :  0.20044207572937012

.. GENERATED FROM PYTHON SOURCE LINES 130-132

However, if we turn off important searching (setting
``important_search = 0``) and use ``LogisticRegression`` as usual:

.. GENERATED FROM PYTHON SOURCE LINES 132-138

.. code-block:: Python

    t1 = time()
    model2 = LogisticRegression(important_search=0)
    model2.fit(data.x, data.y)
    t2 = time()
    print("time : ", t2 - t1)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    time :  0.47933244705200195

.. GENERATED FROM PYTHON SOURCE LINES 139-143

It is easy to see that the time consumption is larger than before. Finally,
we investigate the estimated support sets given by ``model1`` and ``model2``
as follows:

.. GENERATED FROM PYTHON SOURCE LINES 143-149

.. code-block:: Python

    print("support set (with important searching):\n",
          np.nonzero(model1.coef_)[0])
    print("support set (without important searching):\n",
          np.nonzero(model2.coef_)[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    support set (with important searching):
     [ 30 34 452 1920 3207 4626]
    support set (without important searching):
     [ 30 34 452 1920 3207 4626]

.. GENERATED FROM PYTHON SOURCE LINES 150-154

The estimated support sets are the same. From this example, we can see that
important searching uses much less time to reach the same result. Therefore,
we recommend using important searching in large-:math:`p` situations.

.. GENERATED FROM PYTHON SOURCE LINES 157-169

Experimental evidence: important searching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here we compare the AUC and runtime of ``LogisticRegression`` under different
``important_search`` values; the test code can be found here:
https://github.com/abess-team/abess/blob/master/docs/simulation/Python/plot_impsearch.py.
We present the numerical results under 100 replications below.

.. image:: ../../Tutorial/figure/impsearch.png

Even at a low level of ``important_search``, the performance (AUC) is already
very good. In this situation, a lower ``important_search`` can save a lot of
time and memory.
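A rough, self-contained version of such a comparison can be sketched as
follows. This is not the benchmark script linked above; it assumes
scikit-learn is available for ``roc_auc_score`` and scores held-out data with
the linear decision values ``x @ coef_ + intercept_`` rather than a dedicated
prediction method.

.. code-block:: Python

    from time import time

    import numpy as np
    from sklearn.metrics import roc_auc_score

    from abess.datasets import make_glm_data
    from abess.linear import LogisticRegression

    # Sketch of the comparison: vary `important_search`, record the fitting
    # time, and evaluate the AUC on a held-out split of the same data.
    sim = make_glm_data(n=400, p=5000, k=10, family="binomial")
    x_train, y_train = sim.x[:200], sim.y[:200]
    x_test, y_test = sim.x[200:], sim.y[200:]

    for s in [0, 100, 500, 2000]:
        clf = LogisticRegression(important_search=s)
        tic = time()
        clf.fit(x_train, y_train)
        runtime = time() - tic
        # AUC only depends on the ranking, so linear decision values suffice.
        decision = x_test @ np.ravel(clf.coef_) + clf.intercept_
        print("important_search = {:4d}, runtime = {:.3f}s, AUC = {:.3f}".format(
            s, runtime, roc_auc_score(y_test, decision)))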
.. GENERATED FROM PYTHON SOURCE LINES 171-177

The ``abess`` R package also supports feature screening and important
searching. For the R tutorials, please view
https://abess-team.github.io/abess/articles/v07-advancedFeatures.html and
https://abess-team.github.io/abess/articles/v09-fasterSetting.html.

sphinx_gallery_thumbnail_path = 'Tutorial/figure/highDimension.png'

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (11 minutes 23.429 seconds)

.. _sphx_glr_download_auto_gallery_4-computation-tips_plot_large_dimension.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_large_dimension.ipynb <plot_large_dimension.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_large_dimension.py <plot_large_dimension.py>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_