Principal Component Analysis

This notebook introduces adaptive best-subset-selection principal component analysis (SparsePCA) and uses a real data example to show how to use it.

PCA

Principal component analysis (PCA) is an important method in data science: it reduces the dimension of the data and simplifies our model. It solves an optimization problem of the form:

\[\max_{v} v^{\top}\Sigma v,\qquad s.t.\quad v^{\top}v=1.\]

where \(\Sigma = X^{\top}X / (n-1)\) and \(X\) is the centered sample matrix. Here \(X\) is an \(n\times p\) matrix, where each row is an observation and each column is a variable.

Then, before further analysis, we can project \(X\) onto \(v\) (thus reducing the dimension) without losing too much information.
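To ground the notation, here is a minimal NumPy sketch of classical PCA on a toy centered matrix; the names X_toy, v and score are illustrative only and not part of abess.

import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((100, 5))
X_toy = X_toy - X_toy.mean(axis=0)              # center each column
Sigma = X_toy.T @ X_toy / (X_toy.shape[0] - 1)  # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
v = eigvecs[:, -1]                              # first principal component (loading)
score = X_toy @ v                               # projection of the data onto v
print('variance explained by the first PC:', eigvals[-1])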

However, consider that:

  • The PC is a linear combination of all primary variables (\(Xv\)), but sometimes we may prefer to use fewer variables for clearer interpretation (and lower computational complexity);

  • It has been proved that if \(p/n\) does not converge to \(0\), classical PCA is not consistent; this situation arises in high-dimensional data analysis.

For example, in gene analysis, the dataset may contain plenty of genes (variables), and we would like to find a subset of them that explains most of the information. Compared with using all genes, this small subset may be easier to interpret without losing much information. We can then focus on these variables in further analysis.

When we face these problems, classical PCA may not be the best choice, since it uses all variables. One alternative is SparsePCA, which seeks a principal component under a sparsity constraint:

\[\max_{v} v^{\top}\Sigma v,\qquad s.t.\quad v^{\top}v=1,\ \|v\|_0\leq s.\]

where \(s\) is a non-negative integer indicating how many primary variables are used in the principal component. With SparsePCA, we can search for the best subset of variables to form the principal component, and it remains consistent even when \(p \gg n\). We make two remarks (a small brute-force sketch after them illustrates the constraint):

  • Clearly, if \(s\) is equal to or larger than the number of primary variables, the sparsity constraint is inactive, and the problem is equivalent to classical PCA.

  • With fewer variables, the PC necessarily explains less variance. However, the decrease is slight if we choose a good \(s\), and at this price the PC becomes much easier to interpret, which is usually worthwhile.
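To make the constraint concrete, here is a minimal brute-force sketch on a toy covariance matrix: for each size-\(s\) support set, the best unit vector supported on it is the top eigenvector of the corresponding submatrix of \(\Sigma\). This is only feasible for tiny \(p\) (abess solves the problem efficiently in general), and all names below are illustrative.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
A = A - A.mean(axis=0)
Sigma = A.T @ A / (A.shape[0] - 1)

s = 2
best_val, best_v = -np.inf, None
for subset in combinations(range(Sigma.shape[0]), s):
    idx = list(subset)
    vals, vecs = np.linalg.eigh(Sigma[np.ix_(idx, idx)])
    if vals[-1] > best_val:              # keep the support with the largest top eigenvalue
        best_val = vals[-1]
        best_v = np.zeros(Sigma.shape[0])
        best_v[idx] = vecs[:, -1]        # embed the eigenvector as a sparse loading

print('objective v^T Sigma v:', best_val)
print('sparse loading v:', best_v)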

In the next section, we will show how to perform SparsePCA.

Real Data Example (Communities and Crime Dataset)

Here we use a real data analysis to show how to perform SparsePCA. The data come from the UCI Communities and Crime Data Set, and we keep its 99 predictive variables as our samples.

Firstly, we read the data and pick up those variables we are interested in.

import numpy as np
from abess.decomposition import SparsePCA

X = np.genfromtxt('communities.data', delimiter=',')
X = X[:, 5:127]                         # numeric predictors
X = X[:, ~np.isnan(X).any(axis=0)]    # drop variables with nan

n, p = X.shape
print(n)
print(p)
1994
99

Model fitting

To build a SparsePCA model, we need to give the target sparsity to its support_size argument. Our program also supports adaptively finding the best sparsity within a given range.

Fixed sparsity

If we only focus on one fixed sparsity, we can simply pass a single integer. The fitted sparse principal component is then stored in SparsePCA.coef_:

model = SparsePCA(support_size=20)

Passing either \(X\) or \(\Sigma\) to model.fit() starts the fitting process. The argument is_normal=False here means that the program will not normalize \(X\). Note that if both \(X\) and \(\Sigma\) are given, the program prefers to use \(X\).

model.fit(X=X, is_normal=False)
# model.fit(Sigma = np.cov(X.T))
SparsePCA(support_size=20)


After fitting, model.coef_ returns the sparse principal component and its non-zero positions correspond to variables used.

temp = np.nonzero(model.coef_)[0]
print('sparsity: ', temp.size)
print('non-zero position: \n', temp)
print(model.coef_.T)
sparsity:  20
non-zero position:
 [ 3  4  5 17 28 49 55 56 57 58 59 60 61 62 65 66 67 69 90 96]
[[ 0.          0.          0.         -0.20905321  0.15082783  0.26436836
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.13962306
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.14795039  0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.13366389  0.          0.          0.          0.
   0.          0.28227473  0.28982717  0.29245315  0.29340846 -0.27796975
   0.27817835  0.20793385  0.19020622  0.          0.          0.15462671
  -0.13627653  0.25848049  0.         -0.14433668  0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.27644605  0.          0.          0.          0.          0.
   0.16881151  0.          0.        ]]

Adaptive sparsity

What's more, abess also supports a range of sparsity levels and adaptively chooses the best-explaining one. Note, however, that a higher sparsity level usually leads to a better explanation.

Now we need to build an \(s_{max} \times 1\) binary matrix, where \(s_{max}\) indicates the maximum target sparsity and each row indicates one sparsity level (i.e. from \(1\) up to \(s_{max}\)). For each position holding a \(1\), abess tries to fit the model under that sparsity and finally returns the best one.

# fit sparsity from 1 to 20
support_size = np.ones((20, 1))
# build model
model = SparsePCA(support_size=support_size)
model.fit(X, is_normal=False)
# results
temp = np.nonzero(model.coef_)[0]
print('chosen sparsity: ', temp.size)
print('non-zero position: \n', temp)
print(model.coef_.T)
chosen sparsity:  20
non-zero position:
 [11 12 17 19 20 21 27 29 30 35 36 44 76 78 79 80 81 82 83 84]
[[ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.         -0.27618663
  -0.23703262  0.          0.          0.          0.          0.18384733
   0.         -0.2275236  -0.21204903 -0.19753942  0.          0.
   0.          0.          0.          0.21358573  0.          0.18270928
  -0.18928695  0.          0.          0.          0.          0.1760962
  -0.17481418  0.          0.          0.          0.          0.
   0.          0.         -0.18581084  0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.23804122  0.
  -0.2415995  -0.24785373 -0.24947283 -0.23938391 -0.23605314 -0.28015859
  -0.23841083  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]]

Because of the warm start, the results here may not be the same as in the fixed-sparsity fit.

Then, the explained variance can be computed by:

Xc = X - X.mean(axis=0)
Xv = Xc @ model.coef_
explained = Xv.T @ Xv                   # explained variance (information)
total = sum(np.diag(Xc.T @ Xc))         # total variance (information)
print('explained ratio: ', explained / total)
explained ratio:  [[0.16920803]]

More on the results

We can give different target sparsity ranges (by changing support_size) to get different sparse loadings. Interestingly, we can look for a smaller sparsity that still explains most of the variance.

In this example, we try sparsity levels from \(1\) to \(p-1\) and calculate the ratio of explained variance:

num = 30
i = 0
sparsity = np.linspace(1, p - 1, num, dtype='int')
explain = np.zeros(num)
Xc = X - X.mean(axis=0)
for s in sparsity:
    model = SparsePCA(
        support_size=np.ones((s, 1)),
        exchange_num=int(s),
        max_iter=50
    )
    model.fit(X, is_normal=False)
    Xv = Xc @ model.coef_
    explain[i] = (Xv.T @ Xv).item()     # explained variance at this sparsity
    i += 1

print('80%+ : ', sparsity[explain > 0.8 * explain[num - 1]])
print('90%+ : ', sparsity[explain > 0.9 * explain[num - 1]])
80%+ :  [31 34 37 41 44 47 51 54 57 61 64 67 71 74 77 81 84 87 91 94 98]
90%+ :  [41 44 47 51 54 57 61 64 67 71 74 77 81 84 87 91 94 98]

If we treat the explained ratio at the largest sparsity in the grid as 100%, the curve indicates that 31 variables already reach 80% of it (blue dashed line) and 41 variables reach 90% (red dashed line).

import matplotlib.pyplot as plt

plt.plot(sparsity, explain)
plt.xlabel('Sparsity')
plt.ylabel('Explained variance')
plt.title('Sparsity versus Explained Variance')

ind = np.where(explain > 0.8 * explain[num - 1])[0][0]
plt.plot([0, sparsity[ind]], [explain[ind], explain[ind]], 'b--')
plt.plot([sparsity[ind], sparsity[ind]], [0, explain[ind]], 'b--')
plt.text(sparsity[ind], 0, str(sparsity[ind]))
plt.text(0, explain[ind], '80%')

ind = np.where(explain > 0.9 * explain[num - 1])[0][0]
plt.plot([0, sparsity[ind]], [explain[ind], explain[ind]], 'r--')
plt.plot([sparsity[ind], sparsity[ind]], [0, explain[ind]], 'r--')
plt.text(sparsity[ind], 0, str(sparsity[ind]))
plt.text(0, explain[ind], '90%')

plt.plot([0, p], [explain[num - 1], explain[num - 1]],
         color='gray', linestyle='--')
plt.text(0, explain[num - 1], '100%')

plt.show()

This result shows that using fewer than half of the 99 variables comes close to the full result. For example, if we choose sparsity 31, the variables used are:

model = SparsePCA(support_size=31)
model.fit(X, is_normal=False)
temp = np.nonzero(model.coef_)[0]
print('non-zero position: \n', temp)
non-zero position:
 [ 2 11 12 15 17 19 20 21 22 25 27 28 29 30 31 32 35 36 42 43 44 45 49 76
 78 79 80 81 82 83 84]

Extension: Group PCA

Group PCA

Furthermore, in some situations, some variables may need to be considered together, that is, they should be "used" or "unused" in the PC at the same time; we call this "group information". The optimization problem becomes:

\[\max_{v} v^{\top}\Sigma v,\qquad s.t.\quad v^{\top}v=1,\ \sum_{g=1}^{G} I(\|v_g\|\neq 0)\leq s.\]

where we suppose there are \(G\) groups, the \(g\)-th of which corresponds to \(v_g\), and \(v = [v_1^{\top},v_2^{\top},\cdots,v_G^{\top}]^{\top}\). Then we are interested in finding \(s\) (or fewer) important groups.

The group structure is extremely important in real data analysis. Taking gene analysis as an example again, several sites may be related to one trait, and it is meaningless to consider each of them alone.

SparsePCA can also deal with group information. Here we require that variables in the same group be located next to each other (if not, the columns should be reordered first, e.g. as in the sketch below).
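If the columns are not already arranged group by group, a small reordering like the following sketch (with hypothetical labels) makes them contiguous before fitting.

import numpy as np

labels = np.array([2, 0, 1, 0, 2, 1])       # hypothetical group labels of six columns
order = np.argsort(labels, kind='stable')   # column order that makes groups contiguous
print(labels[order])                        # [0 0 1 1 2 2]
# for a data matrix X with these columns, X[:, order] and labels[order]
# would then be passed together with SparsePCA(group=...)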

Simulated Data Example

Suppose that the data above have group information like:

  • Group 0: {the 1st, 2nd, ..., 6th variable};

  • Group 1: {the 7th, 8th, ..., 12th variable};

  • ...

  • Group 15: {the 91st, 92nd, ..., 96th variable};

  • Group 16: {the 97th, 98th, 99th variables}.

Denote different groups as different numbers:

g_info = np.arange(17)
g_info = g_info.repeat(6)
g_info = g_info[0:99]

print(g_info)
[ 0  0  0  0  0  0  1  1  1  1  1  1  2  2  2  2  2  2  3  3  3  3  3  3
  4  4  4  4  4  4  5  5  5  5  5  5  6  6  6  6  6  6  7  7  7  7  7  7
  8  8  8  8  8  8  9  9  9  9  9  9 10 10 10 10 10 10 11 11 11 11 11 11
 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15
 16 16 16]

And we fit a group SparsePCA model with the additional argument group=g_info:

model = SparsePCA(support_size=np.ones((6, 1)), group=g_info)
model.fit(X, is_normal=False)
SparsePCA(group=array([ 0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,
        2,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,
        5,  5,  6,  6,  6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  8,  8,  8,
        8,  8,  8,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 10, 10, 11, 11,
       11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14,
       14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16]),
          support_size=array([[1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.]]))


The result is:

print(model.coef_.T)

temp = np.nonzero(model.coef_)[0]
temp = np.unique(g_info[temp])

print('non-zero group: \n', temp)
print('chosen sparsity: ', temp.size)
[[-0.04658786 -0.08867352 -0.04812687  0.20196704 -0.15626888 -0.24420959
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
  -0.03874654 -0.12602642 -0.05981463 -0.11432162 -0.13193075 -0.13909102
  -0.14829175 -0.28195208 -0.28747947 -0.28866805 -0.28772941  0.25581611
  -0.25726917 -0.19242565 -0.17526315 -0.08740502 -0.06877806 -0.14679731
   0.1392863  -0.24410922 -0.10815193  0.14329729 -0.03971005  0.00862702
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
  -0.26329492  0.12107731  0.07789811  0.04291127  0.0606106  -0.00599532
   0.          0.          0.        ]]
non-zero group:
 [ 0  8  9 10 11 15]
chosen sparsity:  6

Hence we can focus on the variables in Groups 0, 8, 9, 10, 11, and 15.
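For instance, to continue the analysis with only these groups, one could subset the columns of X as in this small sketch (reusing temp, g_info and X defined above).

mask = np.isin(g_info, temp)   # boolean mask over the 99 columns
X_sub = X[:, mask]             # data restricted to the selected groups
print(X_sub.shape)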

Extension: Multiple principal components

Multiple principal components

In some cases, we may want more than one principal component under the sparsity constraint. Actually, we can iteratively solve for the largest principal component and then map the covariance matrix to its orthogonal space:

\[\Sigma' = (I-vv^{\top})\Sigma(I-vv^{\top})\]

where \(\Sigma\) is the current covariance matrix, \(v\) is its (sparse) principal component, and \(I\) is the identity matrix. We map \(\Sigma\) to \(\Sigma'\), which corresponds to the orthogonal space of \(v\), and then solve for the sparse principal component again.
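For intuition, one deflation step can be written directly in NumPy; the sketch below reuses X and the fitted model from above and assumes the loading is taken as a \(p\times 1\) column vector.

import numpy as np

Sigma = np.cov(X.T)                     # current covariance matrix
v = model.coef_[:, [0]]                 # fitted sparse loading, shape (p, 1)
P = np.eye(Sigma.shape[0]) - v @ v.T    # projector onto the orthogonal space of v
Sigma_next = P @ Sigma @ P              # deflated covariance for the next component
# model.fit(Sigma=Sigma_next) would then solve for the next sparse component,
# mirroring the iteration described above.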

By iterating this process, we can acquire multiple principal components, sorted from the largest explained variance to the smallest. In our program, there is an additional argument number, which indicates the number of principal components we need and defaults to 1. Now support_size has shape \(s_{max}\times \text{number}\), and each column corresponds to one principal component.

model = SparsePCA(support_size=np.ones((31, 3)))
model.fit(X, is_normal=False, number=3)
model.coef_.shape
(99, 3)

Here, each column of model.coef_ is a sparse PC (from the largest to the smallest); for example, the second one (model.coef_[:, 1]) is:

array([ 0.        ,  0.        ,  0.        , -0.19036307,  0.1546593 ,
        0.22909258,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.15753326,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.10274482,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.11546037,
        0.        ,  0.        ,  0.11150352,  0.1198269 ,  0.12927627,
        0.27435616,  0.28022643,  0.28220338,  0.28150593, -0.25022706,
        0.24872368,  0.17726069,  0.15920166,  0.        ,  0.        ,
        0.12531301, -0.13462701,  0.22695789,  0.11034058, -0.14302014,
        0.        ,  0.        , -0.11505769,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.10140235,  0.10764364,
        0.10731785,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.1151606 ,  0.        ,  0.        ,  0.        ,
        0.26358121, -0.11368877,  0.        ,  0.        ,  0.        ,
        0.        ,  0.16906762,  0.11129358,  0.        ])

If we want to compute their total explained variance, that is also easy:

Xv = Xc.dot(model.coef_)
explained = np.sum(np.diag(Xv.T.dot(Xv)))
print('explained ratio: ', explained / total)
explained ratio:  0.46018459657037164