RobustPCA

Warning

In older versions of abess (before 0.4.0), this class was named abess.pca.abessRPCA. Please note that the old name will be deprecated in version 0.6.0.

class abess.decomposition.RobustPCA(max_iter=20, exchange_num=5, is_warm_start=True, support_size=None, ic_type='gic', ic_coef=1.0, always_select=None, thread=1, sparse_matrix=False, splicing_type=1)[source]

Adaptive Best-Subset Selection (ABESS) algorithm for robust principal component analysis.
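For orientation, robust PCA is commonly posed as splitting the data matrix into a low-rank part and a sparse part. The formulation below is a sketch under that assumption rather than a statement taken from this page: the symbols S, N and s are introduced here for illustration, while L and r correspond to the recovered matrix and the rank argument described under fit.

$$X = L + S + N, \qquad \operatorname{rank}(L) \le r, \qquad \|S\|_0 \le s$$

$$\min_{L,\,S} \; \|X - L - S\|_F^2 \quad \text{subject to} \quad \operatorname{rank}(L) \le r, \;\; \|S\|_0 \le s$$

Under this reading, support_size presumably plays the role of s, the number of non-zero entries allowed in the sparse part.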

Parameters
• splicing_type ({0, 1}, optional, default=1) -- The type of splicing. "0" for decreasing by half, "1" for decreasing by one.

• max_iter (int, optional, default=20) -- Maximum number of iterations allowed for the splicing algorithm to converge. Because every splicing step must reduce the loss, the algorithm is guaranteed to converge; the iteration limit exists only to simplify the implementation.

• is_warm_start (bool, optional, default=True) -- When tuning the optimal parameter combination, whether to use the last solution as a warm start to accelerate the iterative convergence of the splicing algorithm.

• path_type ({"seq", "gs"}, optional, default="seq") --

The method to be used to select the optimal support size.

• For path_type = "seq", we solve the best subset selection problem for each size in support_size.

• For path_type = "gs", we solve the best subset selection problem with the support size ranging over (s_min, s_max), where the specific support sizes to be considered are determined by golden-section search.

• support_size (array-like, optional) -- default=range(min(n, int(n/(log(log(n))log(p))))). An integer vector representing the alternative support sizes. Only used when path_type = "seq".

• alpha (float, optional, default=0) --

Constant that multiplies the L2 term in the loss function, controlling the regularization strength. It should be non-negative.

• If alpha = 0, it indicates ordinary least squares.

• s_min (int, optional, default=0) -- The lower bound of golden-section-search for sparsity searching.

• s_max (int, optional, default=min(n, int(n/(log(log(n))log(p))))) -- The upper bound of the golden-section search over sparsity levels.

• ic_type ({'aic', 'bic', 'gic', 'ebic'}, optional, default='gic') -- The type of information criterion used for choosing the support size.

• cv (int, optional, default=1) --

The number of folds used for cross-validation.

• If cv=1, cross-validation would not be used.

• If cv>1, support size will be chosen by CV's test loss, instead of IC.

• thread (int, optional, default=1) --

The maximum number of threads to use.

• If thread = 0, the maximum number of threads supported by the device will be used.

• screening_size (int, optional, default=-1) --

The number of variables remaining after screening. It should be a non-negative number smaller than p, but larger than any value in support_size.

• If screening_size=-1, screening will not be used.

• If screening_size=0, screening_size will be set as $$\min(p, \mathrm{int}(n / (\log(\log(n))\log(p))))$$.

• always_select (array-like, optional, default=[]) -- An array containing the indexes of the variables that should always be included in the model.

• primary_model_fit_max_iter (int, optional, default=10) -- The maximum number of iterations for primary_model_fit.

• primary_model_fit_epsilon (float, optional, default=1e-08) -- The convergence threshold (epsilon) for the primary_model_fit iterations.
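The default bound min(n, int(n/(log(log(n))log(p)))) and the golden-section strategy described for path_type = "gs" can be illustrated with a small standalone sketch. This is not the library's implementation: the function names and the loss callable are hypothetical, and the search assumes the loss is unimodal in the support size.

```python
import math


def default_max_support_size(n, p):
    """Bound quoted in the parameter list: min(n, int(n / (log(log(n)) * log(p))))."""
    return min(n, int(n / (math.log(math.log(n)) * math.log(p))))


def golden_section_support_size(loss, s_min, s_max):
    """Integer golden-section search for the support size that minimises `loss`.

    `loss` is a hypothetical callable mapping a support size to a score
    (for example an information criterion); it is assumed to be unimodal
    on [s_min, s_max].
    """
    shrink = 1.0 - (math.sqrt(5.0) - 1.0) / 2.0  # about 0.382
    cache = {}

    def f(s):
        # Evaluate each candidate support size at most once.
        if s not in cache:
            cache[s] = loss(s)
        return cache[s]

    lo, hi = s_min, s_max
    while hi - lo > 2:
        step = max(1, round(shrink * (hi - lo)))
        left, right = lo + step, hi - step
        if left >= right:  # the two probes can coincide on narrow intervals
            right = left + 1
        if f(left) <= f(right):
            hi = right  # the minimiser lies in [lo, right]
        else:
            lo = left  # the minimiser lies in [left, hi]
    # At most three candidates remain; pick the best directly.
    return min(range(lo, hi + 1), key=f)


s_upper = default_max_support_size(n=100, p=50)
best = golden_section_support_size(lambda s: (s - 7) ** 2, 0, s_upper)
```

Compared with path_type = "seq", which evaluates every size in support_size, this kind of search evaluates only a handful of sizes, at the cost of relying on the unimodality assumption.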

coef_

Estimated coefficients for the best subset selection problem.

Type

array-like, shape(p_features, ) or (p_features, M_responses)

intercept_

The intercept in the model.

Type

float or array-like, shape(M_responses,)

ic_

If cv=1, it stores the score under chosen information criterion.

Type

float

test_loss_

If cv>1, it stores the test loss under cross-validation.

Type

float

train_loss_

The loss on training data.

Type

float

References

• Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, and Xueqin Wang. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52):33117-33123, 2020.

Examples

>>> ### Sparsity known
>>>
>>> from abess.decomposition import RobustPCA
>>> import numpy as np
>>> np.random.seed(12345)
>>> model = RobustPCA(support_size = 10)
>>>
>>> ### X known
>>> X = np.random.randn(100, 50)
>>> model.fit(X, r = 10)
RobustPCA(always_select=[], support_size=10)
>>> print(model.coef_)
[[0.         0.         0.         ... 0.         3.71203604 0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
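The example above fits purely random data. A more telling test input is a matrix that really is low-rank plus sparse. The sketch below builds such a matrix with NumPy only; the helper names make_low_rank_plus_sparse and fit_rpca are hypothetical, and the abess call simply mirrors the constructor and fit usage shown above (it is imported lazily, so the data generation runs without abess installed).

```python
import numpy as np


def make_low_rank_plus_sparse(n, p, r, s, seed=0):
    """Return (X, L, S) with X = L + S, rank(L) = r and s non-zero entries in S."""
    rng = np.random.default_rng(seed)
    # Product of two Gaussian factors has rank r (with probability one).
    L = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))
    # Place s large-magnitude outliers at distinct random positions.
    S = np.zeros((n, p))
    idx = rng.choice(n * p, size=s, replace=False)
    signs = rng.choice([-1.0, 1.0], size=s)
    S.flat[idx] = signs * rng.uniform(5.0, 10.0, size=s)
    return L + S, L, S


def fit_rpca(X, r, s):
    """Fit RobustPCA as in the example above (requires abess to be installed)."""
    from abess.decomposition import RobustPCA  # lazy import, see lead-in

    model = RobustPCA(support_size=s)
    model.fit(X, r=r)
    return model


X, L, S = make_low_rank_plus_sparse(n=100, p=50, r=10, s=10)
```

With abess installed, `model = fit_rpca(X, r=10, s=10)` fits the model, and `model.coef_` can then be compared against S, assuming coef_ holds the estimated sparse part as the printed output above suggests.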

fit(X, y=None, r=None, group=None, A_init=None)[source]

Fits the model to the given data and returns the fit result.

Parameters
• X (array-like, shape(n_samples, p_features)) -- Training data.

• y (ignored) -- Not used; present only for API consistency.

• r (int) -- Rank of the (recovered) information matrix L. It should be smaller than the rank of X (and, at the very least, smaller than X.shape[1]).

• group (int, optional, default=np.ones(p)) -- The group index for each variable.
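The constraint on r can be checked before calling fit. The helper below is a hypothetical convenience written for illustration, not part of abess: it rejects values that violate the hard bound r < X.shape[1] and reports whether r is also below the numerical rank of X.

```python
import numpy as np


def check_rank_argument(X, r):
    """Validate a candidate rank r for a data matrix X.

    Raises ValueError if r is not a positive integer below X.shape[1].
    Returns True if r is also smaller than the numerical rank of X,
    and False otherwise.
    """
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError("X must be a 2-D array of shape (n_samples, p_features)")
    if not isinstance(r, (int, np.integer)) or r < 1:
        raise ValueError("r must be a positive integer")
    if r >= X.shape[1]:
        raise ValueError("r must be smaller than X.shape[1]")
    return bool(r < np.linalg.matrix_rank(X))
```

For instance, `check_rank_argument(X, 10)` could be called just before `model.fit(X, r=10)` to catch an unsuitable rank early.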