.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/2-pca/plot_6_PCA.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_2-pca_plot_6_PCA.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_2-pca_plot_6_PCA.py:

Principal Component Analysis
============================
This notebook introduces adaptive best-subset-selection principal component
analysis (SparsePCA) and uses a real data example to show how to use it.

.. GENERATED FROM PYTHON SOURCE LINES 8-66

PCA
---
Principal component analysis (PCA) is an important method in data science:
it reduces the dimension of the data and simplifies our model. It solves an
optimization problem of the form:

.. math::
    \max_{v} v^{\top}\Sigma v,\qquad s.t.\quad v^{\top}v=1.

where :math:`\Sigma = X^{\top}X / (n-1)` and :math:`X` is the **centered**
sample matrix. Here :math:`X` is an :math:`n\times p` matrix, where each row
is an observation and each column is a variable.

Before further analysis, we can project :math:`X` onto :math:`v` (thus
reducing the dimension) without losing too much information. However,
consider that:

- The PC is a linear combination of all primary variables (:math:`Xv`), but
  sometimes we prefer to use fewer variables for clearer interpretation
  (and lower computational complexity);
- It has been proved that if :math:`p/n` does not converge to :math:`0`, the
  classical PCA is not consistent, and this situation often arises in
  high-dimensional data analysis.

For example, in gene analysis the dataset may contain plenty of genes
(variables), and we would like to find a subset of them that explains most of
the information. Compared with using all genes, this small subset may be much
easier to interpret without losing much information, and we can then focus on
these variables in further analysis.

When we face these problems, the classical PCA may not be the best choice,
since it uses all variables. One alternative is `SparsePCA`, which seeks a
principal component under a sparsity constraint:

.. math::
    \max_{v} v^{\top}\Sigma v,\qquad s.t.\quad v^{\top}v=1,\ ||v||_0\leq s.

where :math:`s` is a non-negative integer indicating how many primary
variables are used in the principal component. With `SparsePCA`, we can
search for the best subset of variables to form the principal component, and
it retains consistency even when :math:`p \gg n`. We make two remarks:

- Clearly, if :math:`s` is equal to or larger than the number of primary
  variables, the sparsity constraint is inactive and the problem is
  equivalent to the classical PCA.
- With fewer variables, the PC necessarily has a lower explained variance.
  However, the decrease is slight if we choose a good :math:`s`, and at this
  price the PC becomes much easier to interpret. It is a worthwhile trade-off.
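
Before turning to `SparsePCA`, it may help to see the classical problem in
code. The following is a minimal sketch (not part of the original example; it
uses a small synthetic matrix named ``X_toy``) that computes the leading
principal component with NumPy. Note that every variable receives a non-zero
loading, which is exactly what the sparsity constraint changes.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(0)
    X_toy = rng.standard_normal((100, 5))            # synthetic stand-in for a data matrix
    X_toy = X_toy - X_toy.mean(axis=0)               # center the columns

    Sigma = X_toy.T @ X_toy / (X_toy.shape[0] - 1)   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
    v = eigvecs[:, -1]                               # leading eigenvector = classical PC

    print(v)                                         # typically all entries are non-zero
    print(eigvals[-1] / eigvals.sum())               # explained-variance ratio of this PC
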
In the next section, we will show how to perform `SparsePCA`.

Real Data Example (Communities and Crime Dataset)
--------------------------------------------------
Here we use a real data example to show how to perform `SparsePCA`. The data
come from the `UCI: Communities and Crime Data Set
<https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime>`__, and we
take 99 of its predictive variables as our sample matrix.

First, we read the data and select the variables we are interested in.

.. GENERATED FROM PYTHON SOURCE LINES 66-78

.. code-block:: Python

    import numpy as np
    from abess.decomposition import SparsePCA

    X = np.genfromtxt('communities.data', delimiter=',')
    X = X[:, 5:127]                        # numeric predictors
    X = X[:, ~np.isnan(X).any(axis=0)]     # drop variables with nan
    n, p = X.shape
    print(n)
    print(p)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    1994
    99

.. GENERATED FROM PYTHON SOURCE LINES 79-89

Model fitting
^^^^^^^^^^^^^
To build a SparsePCA model, we need to give the target sparsity to its
`support_size` argument. Our program also supports adaptively finding the
best sparsity within a given range.

Fixed sparsity
""""""""""""""
If we only focus on one fixed sparsity, we can simply give a single integer.
The fitted sparse principal component is then stored in `SparsePCA.coef_`:

.. GENERATED FROM PYTHON SOURCE LINES 89-93

.. code-block:: Python

    model = SparsePCA(support_size=20)

.. GENERATED FROM PYTHON SOURCE LINES 94-98

Giving either :math:`X` or :math:`\Sigma` to `model.fit()` starts the fitting
process. The argument `is_normal=False` here means that the program will not
normalize :math:`X`. Note that if both :math:`X` and :math:`\Sigma` are given,
the program prefers to use :math:`X`.

.. GENERATED FROM PYTHON SOURCE LINES 98-103

.. code-block:: Python

    model.fit(X=X, is_normal=False)
    # model.fit(Sigma = np.cov(X.T))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    SparsePCA(support_size=20)
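
As a side remark, the commented line above shows the covariance-based call.
Below is a small sketch of that variant (``model_cov`` is a hypothetical
name); the resulting loading may differ slightly from the :math:`X`-based fit.

.. code-block:: Python

    # Covariance-based variant: seek the sparse PC from the sample covariance alone.
    model_cov = SparsePCA(support_size=20)
    model_cov.fit(Sigma=np.cov(X.T))
    print(np.nonzero(model_cov.coef_)[0])
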


.. GENERATED FROM PYTHON SOURCE LINES 104-106

After fitting, `model.coef_` returns the sparse principal component, and its
non-zero positions correspond to the variables that are used.

.. GENERATED FROM PYTHON SOURCE LINES 106-113

.. code-block:: Python

    temp = np.nonzero(model.coef_)[0]
    print('sparsity: ', temp.size)
    print('non-zero position: \n', temp)
    print(model.coef_.T)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    sparsity:  20
    non-zero position:
     [ 3  4  5 17 28 49 55 56 57 58 59 60 61 62 65 66 67 69 90 96]
    [[ 0.          0.          0.         -0.20905321  0.15082783  0.26436836
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.13962306
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.14795039  0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.13366389  0.          0.          0.          0.
       0.          0.28227473  0.28982717  0.29245315  0.29340846 -0.27796975
       0.27817835  0.20793385  0.19020622  0.          0.          0.15462671
      -0.13627653  0.25848049  0.         -0.14433668  0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.27644605  0.          0.          0.          0.          0.
       0.16881151  0.          0.        ]]

.. GENERATED FROM PYTHON SOURCE LINES 114-124

Adaptive sparsity
"""""""""""""""""
Moreover, **abess** also supports searching over a range of sparsity levels
and adaptively choosing the best one. Note, however, that a higher sparsity
level usually leads to a larger explained variance.

To do so, we build an :math:`s_{max} \times 1` binary matrix, where
:math:`s_{max}` is the maximum target sparsity and each row corresponds to
one sparsity level (i.e., from :math:`1` up to :math:`s_{max}`). For each
position holding a :math:`1`, **abess** tries to fit the model under that
sparsity and finally returns the best one.

.. GENERATED FROM PYTHON SOURCE LINES 124-137

.. code-block:: Python

    # fit sparsity from 1 to 20
    support_size = np.ones((20, 1))

    # build model
    model = SparsePCA(support_size=support_size)
    model.fit(X, is_normal=False)

    # results
    temp = np.nonzero(model.coef_)[0]
    print('chosen sparsity: ', temp.size)
    print('non-zero position: \n', temp)
    print(model.coef_.T)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    chosen sparsity:  20
    non-zero position:
     [11 12 17 19 20 21 27 29 30 35 36 44 76 78 79 80 81 82 83 84]
    [[ 0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.         -0.27618663
      -0.23703262  0.          0.          0.          0.          0.18384733
       0.         -0.2275236  -0.21204903 -0.19753942  0.          0.
       0.          0.          0.          0.21358573  0.          0.18270928
      -0.18928695  0.          0.          0.          0.          0.1760962
      -0.17481418  0.          0.          0.          0.          0.
       0.          0.         -0.18581084  0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.23804122  0.
      -0.2415995  -0.24785373 -0.24947283 -0.23938391 -0.23605314 -0.28015859
      -0.23841083  0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.        ]]

.. GENERATED FROM PYTHON SOURCE LINES 138-141

Because of the warm start, the results here may not be the same as those from
a fixed sparsity. The explained variance can then be computed by:

.. GENERATED FROM PYTHON SOURCE LINES 141-149

.. code-block:: Python

    Xc = X - X.mean(axis=0)
    Xv = Xc @ model.coef_
    explained = Xv.T @ Xv                  # explained variance (information)
    total = sum(np.diag(Xc.T @ Xc))        # total variance (information)
    print('explained ratio: ', explained / total)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    explained ratio:  [[0.16920803]]
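
For reference, the largest ratio that any single (dense) principal component
can explain equals the top eigenvalue of :math:`X_c^{\top}X_c` divided by the
total variance. Here is a quick check of this bound (a small sketch, not part
of the original example), reusing the ``Xc`` and ``total`` computed above:

.. code-block:: Python

    # Upper bound for a single PC: top eigenvalue over the total variance.
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc)      # eigenvalues in ascending order
    print('dense PC explained ratio: ', eigvals[-1] / total)
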
.. GENERATED FROM PYTHON SOURCE LINES 150-157

More on the results
^^^^^^^^^^^^^^^^^^^
We can give different target sparsities (i.e., change the range passed to
`support_size`) to get different sparse loadings. Interestingly, we can look
for a smaller sparsity that still explains most of the variance. In this
example, we try a grid of sparsity levels from :math:`1` to :math:`p-1` and
calculate the ratio of explained variance:

.. GENERATED FROM PYTHON SOURCE LINES 157-178

.. code-block:: Python

    num = 30
    i = 0
    sparsity = np.linspace(1, p - 1, num, dtype='int')
    explain = np.zeros(num)
    Xc = X - X.mean(axis=0)
    for s in sparsity:
        model = SparsePCA(
            support_size=np.ones((s, 1)),
            exchange_num=int(s),
            max_iter=50
        )
        model.fit(X, is_normal=False)
        Xv = Xc @ model.coef_
        explain[i] = (Xv.T @ Xv).item()    # explained variance at this sparsity
        i += 1

    print('80%+ : ', sparsity[explain > 0.8 * explain[num - 1]])
    print('90%+ : ', sparsity[explain > 0.9 * explain[num - 1]])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    80%+ :  [31 34 37 41 44 47 51 54 57 61 64 67 71 74 77 81 84 87 91 94 98]
    90%+ :  [41 44 47 51 54 57 61 64 67 71 74 77 81 84 87 91 94 98]

.. GENERATED FROM PYTHON SOURCE LINES 179-182

If we denote the explained ratio obtained from all 99 variables as 100%, the
curve indicates that 31 variables are enough to reach 80% of it (blue dashed
line) and 41 variables are enough to reach 90% (red dashed line).

.. GENERATED FROM PYTHON SOURCE LINES 182-208

.. code-block:: Python

    import matplotlib.pyplot as plt

    plt.plot(sparsity, explain)
    plt.xlabel('Sparsity')
    plt.ylabel('Explained variance')
    plt.title('Sparsity versus Explained Variance')

    ind = np.where(explain > 0.8 * explain[num - 1])[0][0]
    plt.plot([0, sparsity[ind]], [explain[ind], explain[ind]], 'b--')
    plt.plot([sparsity[ind], sparsity[ind]], [0, explain[ind]], 'b--')
    plt.text(sparsity[ind], 0, str(sparsity[ind]))
    plt.text(0, explain[ind], '80%')

    ind = np.where(explain > 0.9 * explain[num - 1])[0][0]
    plt.plot([0, sparsity[ind]], [explain[ind], explain[ind]], 'r--')
    plt.plot([sparsity[ind], sparsity[ind]], [0, explain[ind]], 'r--')
    plt.text(sparsity[ind], 0, str(sparsity[ind]))
    plt.text(0, explain[ind], '90%')

    plt.plot([0, p], [explain[num - 1], explain[num - 1]],
             color='gray', linestyle='--')
    plt.text(0, explain[num - 1], '100%')
    plt.show()

.. image-sg:: /auto_gallery/2-pca/images/sphx_glr_plot_6_PCA_001.png
   :alt: plot 6 PCA
   :srcset: /auto_gallery/2-pca/images/sphx_glr_plot_6_PCA_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 209-212

This result shows that using fewer than half of the 99 variables already gets
us close to the full-variable result. For example, if we choose sparsity 31,
the variables used are:

.. GENERATED FROM PYTHON SOURCE LINES 212-219

.. code-block:: Python

    model = SparsePCA(support_size=31)
    model.fit(X, is_normal=False)

    temp = np.nonzero(model.coef_)[0]
    print('non-zero position: \n', temp)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    non-zero position:
     [ 2 11 12 15 17 19 20 21 22 25 27 28 29 30 31 32 35 36 42 43 44 45 49 76
      78 79 80 81 82 83 84]
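
These positions refer to columns of the filtered matrix ``X``. If we want to
report them in terms of the raw data file instead, the column filtering above
can be replayed. The snippet below is a small sketch under the same
preprocessing assumptions (columns 5-126 of ``communities.data`` kept, then
NaN-containing columns dropped); the names ``raw``, ``candidate`` and ``kept``
are ours, not part of the original example.

.. code-block:: Python

    # Map positions in X back to column indices of 'communities.data'.
    raw = np.genfromtxt('communities.data', delimiter=',')
    candidate = np.arange(5, 127)                              # candidate predictor columns
    kept = candidate[~np.isnan(raw[:, 5:127]).any(axis=0)]    # columns actually kept in X
    print('raw-file columns used by the sparse PC:\n', kept[temp])
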
.. GENERATED FROM PYTHON SOURCE LINES 220-256

Extension: Group PCA
--------------------
Group PCA
^^^^^^^^^
Furthermore, in some situations several variables may need to be considered
together, that is, they should be "used" or "unused" in the PC at the same
time; we call this "group information". The optimization problem becomes:

.. math::
    \max_{v} v^{\top}\Sigma v,\qquad s.t.\quad v^{\top}v=1,\ \sum_{g=1}^G I(||v_g||\neq 0)\leq s.

where we suppose there are :math:`G` groups, the :math:`g`-th of which
corresponds to :math:`v_g`, with
:math:`v = [v_1^{\top},v_2^{\top},\cdots,v_G^{\top}]^{\top}`. We are then
interested in finding :math:`s` (or fewer) important groups.

The group structure is extraordinarily important in real data analysis.
Taking gene analysis as an example again, several sites may be related to one
trait, and it is meaningless to consider each of them alone.

`SparsePCA` can also deal with group information. Here we require that
variables in the same group are located next to each other (if not, the data
should be sorted first).

Simulated Data Example
^^^^^^^^^^^^^^^^^^^^^^
Suppose that the data above have group information like:

- Group 0: {the 1st, 2nd, ..., 6th variables};
- Group 1: {the 7th, 8th, ..., 12th variables};
- ...
- Group 15: {the 91st, 92nd, ..., 96th variables};
- Group 16: {the 97th, 98th, 99th variables}.

Denote the different groups by different numbers:

.. GENERATED FROM PYTHON SOURCE LINES 256-264

.. code-block:: Python

    g_info = np.arange(17)
    g_info = g_info.repeat(6)
    g_info = g_info[0:99]

    print(g_info)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [ 0  0  0  0  0  0  1  1  1  1  1  1  2  2  2  2  2  2  3  3  3  3  3  3
      4  4  4  4  4  4  5  5  5  5  5  5  6  6  6  6  6  6  7  7  7  7  7  7
      8  8  8  8  8  8  9  9  9  9  9  9 10 10 10 10 10 10 11 11 11 11 11 11
     12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15
     16 16 16]

.. GENERATED FROM PYTHON SOURCE LINES 265-266

And fit a group SparsePCA model with the additional argument `group=g_info`:

.. GENERATED FROM PYTHON SOURCE LINES 266-271

.. code-block:: Python

    model = SparsePCA(support_size=np.ones((6, 1)), group=g_info)
    model.fit(X, is_normal=False)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    SparsePCA(group=array([ 0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,
                2,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,
                5,  5,  6,  6,  6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  8,  8,  8,
                8,  8,  8,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 10, 10, 11, 11,
               11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14,
               14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16]),
              support_size=array([[1.],
               [1.],
               [1.],
               [1.],
               [1.],
               [1.]]))


.. GENERATED FROM PYTHON SOURCE LINES 272-273

The result is:

.. GENERATED FROM PYTHON SOURCE LINES 273-283

.. code-block:: Python

    print(model.coef_.T)

    temp = np.nonzero(model.coef_)[0]
    temp = np.unique(g_info[temp])
    print('non-zero group: \n', temp)
    print('chosen sparsity: ', temp.size)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [[-0.04658786 -0.08867352 -0.04812687  0.20196704 -0.15626888 -0.24420959
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
      -0.03874654 -0.12602642 -0.05981463 -0.11432162 -0.13193075 -0.13909102
      -0.14829175 -0.28195208 -0.28747947 -0.28866805 -0.28772941  0.25581611
      -0.25726917 -0.19242565 -0.17526315 -0.08740502 -0.06877806 -0.14679731
       0.1392863  -0.24410922 -0.10815193  0.14329729 -0.03971005  0.00862702
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
      -0.26329492  0.12107731  0.07789811  0.04291127  0.0606106  -0.00599532
       0.          0.          0.        ]]
    non-zero group:
     [ 0  8  9 10 11 15]
    chosen sparsity:  6

.. GENERATED FROM PYTHON SOURCE LINES 284-307

Hence we can focus on the variables in Groups 0, 8, 9, 10, 11, and 15.

Extension: Multiple principal components
----------------------------------------
Multiple principal components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In some cases, we may seek more than one principal component under sparsity.
Actually, we can iteratively solve for the largest principal component and
then project the covariance matrix onto its orthogonal space:

.. math::
    \Sigma' = (I-vv^{\top})\Sigma(I-vv^{\top})

where :math:`\Sigma` is the current covariance matrix, :math:`v` is its
(sparse) principal component, and :math:`I` is the identity matrix. We map
:math:`\Sigma` to :math:`\Sigma'`, which corresponds to the orthogonal space
of :math:`v`, and then solve for the sparse principal component again. By
iterating this process, we can acquire multiple principal components, sorted
from the largest to the smallest.

In our program, there is an additional argument `number`, which indicates the
number of principal components we need and defaults to 1. Now `support_size`
has shape :math:`s_{max}\times \text{number}`, and each column corresponds to
one principal component.

.. GENERATED FROM PYTHON SOURCE LINES 307-313

.. code-block:: Python

    model = SparsePCA(support_size=np.ones((31, 3)))
    model.fit(X, is_normal=False, number=3)
    model.coef_.shape

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (99, 3)

.. GENERATED FROM PYTHON SOURCE LINES 314-316

Here, each column of `model.coef_` is a sparse PC (ordered from the largest
to the smallest); for example, the second one is:

.. GENERATED FROM PYTHON SOURCE LINES 316-320

.. code-block:: Python

    model.coef_[:, 1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([ 0.        ,  0.        ,  0.        , -0.19036307,  0.1546593 ,
            0.22909258,  0.        ,  0.        ,  0.        ,  0.        ,
            0.        ,  0.15753326,  0.        ,  0.        ,  0.        ,
            0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
            0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
            0.        ,  0.        ,  0.        ,  0.10274482,  0.        ,
            0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
            0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
            0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
            0.        ,  0.        ,  0.        ,  0.        ,  0.11546037,
            0.        ,  0.        ,  0.11150352,  0.1198269 ,  0.12927627,
            0.27435616,  0.28022643,  0.28220338,  0.28150593, -0.25022706,
            0.24872368,  0.17726069,  0.15920166,  0.        ,  0.        ,
            0.12531301, -0.13462701,  0.22695789,  0.11034058, -0.14302014,
            0.        ,  0.        , -0.11505769,  0.        ,  0.        ,
            0.        ,  0.        ,  0.        ,  0.10140235,  0.10764364,
            0.10731785,  0.        ,  0.        ,  0.        ,  0.        ,
            0.        ,  0.1151606 ,  0.        ,  0.        ,  0.        ,
            0.26358121, -0.11368877,  0.        ,  0.        ,  0.        ,
            0.        ,  0.16906762,  0.11129358,  0.        ])
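
Because each new component is sought in (approximately) the orthogonal space
of the previous ones, the loadings are typically close to orthonormal,
although exact orthogonality is not guaranteed once sparsity is imposed. A
quick sketch to inspect this (not part of the original example):

.. code-block:: Python

    # The Gram matrix of the loadings should be close to the 3 x 3 identity.
    print(np.round(model.coef_.T @ model.coef_, 3))
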
.. GENERATED FROM PYTHON SOURCE LINES 321-322

If we want to compute their total explained variance, it is also quite easy:

.. GENERATED FROM PYTHON SOURCE LINES 322-328

.. code-block:: Python

    Xv = Xc.dot(model.coef_)
    explained = np.sum(np.diag(Xv.T.dot(Xv)))
    print('explained ratio: ', explained / total)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    explained ratio:  0.46018459657037164

.. GENERATED FROM PYTHON SOURCE LINES 329-333

R tutorial
----------
For the R tutorial, please see
https://abess-team.github.io/abess/articles/v08-sPCA.html.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 10.199 seconds)

.. _sphx_glr_download_auto_gallery_2-pca_plot_6_PCA.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_6_PCA.ipynb <plot_6_PCA.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_6_PCA.py <plot_6_PCA.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_