.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/2-pca/plot_7_RPCA.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_2-pca_plot_7_RPCA.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_2-pca_plot_7_RPCA.py:


Robust Principal Component Analysis
===================================
This notebook introduces robust principal component analysis (RobustPCA) based on adaptive best subset selection, and then shows how it works with the **abess** package on an artificial example.

.. GENERATED FROM PYTHON SOURCE LINES 8-18

PCA
---
Principal component analysis (PCA) is an important method in data science: it reduces the dimension of the data and thereby simplifies the model. It solves an optimization problem of the form:

.. math::
    \max_{v} v^T\Sigma v,\qquad s.t.\quad v^Tv=1.


where :math:`\Sigma = X^TX/(n-1)` and :math:`X\in \mathbb{R}^{n\times p}` is the centered sample matrix with each row containing one observation of :math:`p` variables.
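
As a minimal sketch of the problem above (NumPy only, on synthetic data chosen for illustration; the helper name ``first_principal_component`` is hypothetical and not part of **abess**), the maximizer :math:`v` is the eigenvector of :math:`\Sigma` with the largest eigenvalue, and the maximum is that eigenvalue:

.. code-block:: Python

    import numpy as np


    def first_principal_component(X):
        """Return (v, value) solving max v^T Sigma v subject to v^T v = 1."""
        Xc = X - X.mean(axis=0)                    # center each column
        Sigma = Xc.T @ Xc / (Xc.shape[0] - 1)      # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
        return eigvecs[:, -1], eigvals[-1]


    # Synthetic data: the first variable has by far the largest variance.
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

    v, value = first_principal_component(X_demo)
    print("direction:", np.round(v, 3))
    print("explained variance:", round(float(value), 3))

Here the first variable dominates the variance, so the returned direction is expected to be close to the first coordinate axis (up to sign).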


.. GENERATED FROM PYTHON SOURCE LINES 20-98

Robust-PCA (RPCA)
-----------------
However, the original PCA is sensitive to outliers, which may be unavoidable in real data:

- A subject shows an extreme result by chance, but performs normally in repeated tests;
- Errors in observation, recording, or computing, e.g. missing or dead pixels, X-ray spikes.

In this situation, PCA may pay too much attention to unnecessary variables.
That is why Robust-PCA (RPCA) was proposed: it can be used to recover the (low-rank) sample for subsequent processing.
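
The sensitivity of ordinary PCA can be illustrated with a small sketch (NumPy only; the toy data and the helper name ``leading_direction`` are assumptions made for this illustration). A single corrupted observation is enough to change the leading principal direction:

.. code-block:: Python

    import numpy as np


    def leading_direction(X):
        """First principal direction of X (rows are observations)."""
        Xc = X - X.mean(axis=0)
        _, eigvecs = np.linalg.eigh(Xc.T @ Xc / (Xc.shape[0] - 1))
        return eigvecs[:, -1]          # eigenvector of the largest eigenvalue


    # 50 clean points lying on the horizontal axis
    clean = np.column_stack((np.linspace(-1.0, 1.0, 50), np.zeros(50)))
    # the same data plus one corrupted observation far away on the vertical axis
    dirty = np.vstack((clean, [0.0, 50.0]))

    print("clean:", np.abs(leading_direction(clean)).round(3))
    print("dirty:", np.abs(leading_direction(dirty)).round(3))

With the clean data the leading direction follows the horizontal axis; after adding the single outlier it follows the vertical axis instead.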

Mathematically, RPCA splits the sample matrix :math:`X` into two parts:

.. math::
    X = S + L,


where :math:`S` is the sparse "outlier" matrix and :math:`L` is the "information" matrix with a low rank.
Generally, we also assume that :math:`S` is not low-rank and :math:`L` is not sparse, so that the decomposition is unique.

.. image:: ../../Tutorial/figure/rpca.png

In constrained form,

.. math::
    \min_{S, L}\|X-S-L\|_{F}, \qquad s.t.\quad \operatorname{rank}(L)=r,\ \|S\|_{0} \leq s,


where :math:`r` is the rank of :math:`L` and :math:`s` is the sparsity of :math:`S`.
After RPCA, the information matrix :math:`L` can be used in further analysis.

Note that RPCA does NOT deal with "noise", which may remain in :math:`L` and need further processing.

Hard Impute
^^^^^^^^^^^
To solve the sub-problem of RPCA with known outlier positions, we follow a process called "Hard Impute".
The main idea is to replace the outlier values with estimates obtained from PCA with :math:`K` components (KPCA), where :math:`K=r`.

Here are the steps:

1. Input :math:`X, \text{outliers}, M, \varepsilon`, where :math:`\text{outliers}` records the non-zero positions in :math:`S`;

2. Initialize :math:`X_{\text{new}} \leftarrow {\bf 0}` with the same shape as :math:`X`;

3. For :math:`i = 1,2, \dots, M`:

   - :math:`X_{\text{old}} = \begin{cases} X_{\text{new}},&\text{at the outlier positions}\\X,&\text{elsewhere}\end{cases}`;

   - Perform KPCA on :math:`X_{\text{old}}` with :math:`K=r`, and denote by :math:`v` the matrix whose columns are the eigenvectors;

   - :math:`X_{\text{new}} = X_{\text{old}}\cdot v\cdot v^T`;

   - If :math:`\|X_{\text{new}} - X_{\text{old}}\| < \varepsilon`, break;

   End for;

4. Return :math:`X_{\text{new}}` as :math:`L`;

where :math:`M` is the maximum number of iterations and :math:`\varepsilon` is the convergence tolerance.

The final :math:`X_{\text{new}}` is taken as the estimate of :math:`L` under the given outlier positions.
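
The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the **abess** implementation: the function name ``hard_impute``, the boolean ``outlier_mask`` argument and the default values are assumptions, and KPCA is taken here to be the uncentered projection onto the top :math:`r` right singular vectors of :math:`X_{\text{old}}`.

.. code-block:: Python

    import numpy as np


    def hard_impute(X, outlier_mask, r, max_iter=100, eps=1e-8):
        """Estimate the rank-r matrix L when the outlier positions are known."""
        X_new = np.zeros_like(X, dtype=float)
        for _ in range(max_iter):
            # keep the clean entries, fill outlier positions with the current estimate
            X_old = np.where(outlier_mask, X_new, X)
            # top-r right singular vectors = top-r eigenvectors of X_old^T X_old
            _, _, Vt = np.linalg.svd(X_old, full_matrices=False)
            V = Vt[:r].T
            X_new = X_old @ V @ V.T
            if np.linalg.norm(X_new - X_old) < eps:
                break
        return X_new


    # Toy problem: a rank-2 matrix with 10 corrupted entries at known positions.
    rng = np.random.default_rng(0)
    L_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
    mask = np.zeros(L_true.shape, dtype=bool)
    mask.flat[rng.choice(L_true.size, 10, replace=False)] = True
    X_obs = L_true + 10.0 * mask

    L_hat = hard_impute(X_obs, mask, r=2, max_iter=500, eps=1e-10)
    rel_err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
    print(f"relative error: {rel_err:.2e}")

Only the entries outside ``outlier_mask`` are trusted; the corrupted values themselves never enter the estimate, because they are overwritten in every iteration.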

RPCA Application
^^^^^^^^^^^^^^^^
RPCA has become widely used in recent years, for example:

- Video decomposition:
  in a surveillance video, the background may stay unchanged for a long time while only a few pixels (e.g. people) change.
  To store and analyze the video more efficiently, we need to decompose it into background and
  foreground. Since the background is unchanged, it can be stored well in a low-rank matrix, while the foreground, which is
  usually quite small, can be represented by a sparse matrix. That is what RPCA does.

- Face recognition:
  due to complex lighting conditions, a small part of the facial features may be unrecognizable (e.g. in shadow).
  In face recognition, we need to remove the effects of shadows and focus on the face data. Since the face data is almost unchanged (for one person) and the shadows affect only a small part, RPCA is also suitable here. Here are some examples:

.. image:: ../../Tutorial/figure/rpca_shadow.png
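
To make the video decomposition use case concrete, the following sketch builds a tiny synthetic "video" (the sizes and variable names are assumptions for illustration only). Each row is one flattened frame; a static background gives a rank-one matrix, and one moving bright pixel per frame gives a sparse matrix:

.. code-block:: Python

    import numpy as np

    height, width, n_frames = 8, 8, 20
    rng = np.random.default_rng(0)

    # One static background image, flattened and repeated in every frame.
    background = rng.random(height * width)
    L_video = np.tile(background, (n_frames, 1))

    # Foreground: a single bright pixel that moves by one position per frame.
    S_video = np.zeros_like(L_video)
    for t in range(n_frames):
        S_video[t, t] = 1.0

    X_video = L_video + S_video   # what the camera records

    print("rank of background matrix:", np.linalg.matrix_rank(L_video))
    print("non-zero foreground entries:", np.count_nonzero(S_video))

The background matrix has rank 1 and the foreground has only 20 non-zero entries out of 1280, which is the structure :math:`X = S + L` that RPCA looks for.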


.. GENERATED FROM PYTHON SOURCE LINES 100-106

Simulated Data Example
----------------------
Fitting model
^^^^^^^^^^^^^
Now we generate an example with :math:`100` rows and :math:`100` columns that contains :math:`200` outliers.
We aim to recover it with a low rank of :math:`10`.

.. GENERATED FROM PYTHON SOURCE LINES 106-132

.. code-block:: Python


    from abess.decomposition import RobustPCA
    import numpy as np


    def gen_data(n, p, s, r, seed=0):
        np.random.seed(seed)
        outlier = np.random.choice(n * p, s, replace=False)
        outlier = np.vstack((outlier // p, outlier % p)).T
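        # low-rank part: an (n, r) times (r, p) product has rank at most r
        L = np.dot(np.random.rand(n, r), np.random.rand(r, p))
        S = np.zeros((n, p))
        # all outliers share one random magnitude
        S[outlier[:, 0], outlier[:, 1]] = np.random.randn() * 10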
        X = L + S
        return X, S


    n = 100     # rows
    p = 100     # columns
    s = 200     # outliers
    r = 10      # rank(L)

    X, S = gen_data(n, p, s, r)
    print(f'X shape: {X.shape}')
    # print(f'outlier: \n{outlier}')


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    X shape: (100, 100)


.. GENERATED FROM PYTHON SOURCE LINES 133-137

To use the program, create a ``RobustPCA()`` object and pass the number
of outliers as ``support_size``. It can be a specific integer or an
integer interval. In the latter case, a support size is chosen
adaptively by an information criterion (e.g. GIC).

.. GENERATED FROM PYTHON SOURCE LINES 137-142

.. code-block:: Python


    # support_size can be an interval like `range(s_min, s_max)`
    model = RobustPCA(support_size=s)


.. GENERATED FROM PYTHON SOURCE LINES 143-146

The model is fitted with the ``RobustPCA.fit`` method. Given the original
sample matrix :math:`X` and the desired :math:`\operatorname{rank}(L)`, the
program returns a result quickly.

.. GENERATED FROM PYTHON SOURCE LINES 146-149

.. code-block:: Python


    model.fit(X, r=r)  # r=rank(L)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    RobustPCA(support_size=200)


.. GENERATED FROM PYTHON SOURCE LINES 150-151

Now the estimated outlier matrix is stored in ``model.coef_``.

.. GENERATED FROM PYTHON SOURCE LINES 151-156

.. code-block:: Python


    S_est = model.coef_
    print(f'estimated sparsity: {np.count_nonzero(S_est)}')


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    estimated sparsity: 200
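
Since :math:`X = S + L`, an estimate of the low-rank part follows from the estimated outlier matrix. The sketch below is an assumption-laden illustration rather than an **abess** API: ``low_rank_part`` is a hypothetical helper that subtracts the sparse estimate and then truncates to rank :math:`r` by SVD, and the toy check uses a known :math:`S` as a stand-in for ``model.coef_``.

.. code-block:: Python

    import numpy as np


    def low_rank_part(X, S_est, r):
        """Best rank-r approximation of X - S_est (truncated SVD)."""
        U, sigma, Vt = np.linalg.svd(X - S_est, full_matrices=False)
        return (U[:, :r] * sigma[:r]) @ Vt[:r]


    # Toy check with a known decomposition.
    rng = np.random.default_rng(0)
    L_toy = rng.random((20, 3)) @ rng.random((3, 15))
    S_toy = np.zeros((20, 15))
    S_toy[rng.integers(0, 20, 5), rng.integers(0, 15, 5)] = 10.0

    L_rec = low_rank_part(L_toy + S_toy, S_toy, r=3)
    print("recovered L:", np.allclose(L_rec, L_toy))

In the tutorial setting, the corresponding call would be ``low_rank_part(X, model.coef_, r)``; the truncation step only matters when the estimated outlier positions are not exactly right.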


.. GENERATED FROM PYTHON SOURCE LINES 157-160

More on the result
^^^^^^^^^^^^^^^^^^
To check the performance of the program, we use the true positive rate (TPR) and the false positive rate (FPR) of the detected outlier positions as criteria.

.. GENERATED FROM PYTHON SOURCE LINES 160-182

.. code-block:: Python


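    def TPR(pred, real):
        # fraction of true outlier positions that are detected
        TP = (pred != 0) & (real != 0)
        P = (real != 0)
        return np.sum(TP) / np.sum(P)


    def FPR(pred, real):
        # fraction of clean positions wrongly flagged as outliers
        FP = (pred != 0) & (real == 0)
        N = (real == 0)
        return np.sum(FP) / np.sum(N)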


    def test_model(pred, real):
        tpr = TPR(pred, real)
        fpr = FPR(pred, real)
        return np.array([tpr, fpr])


    print(f'[TPR  FPR] = {test_model(S_est, S)}')


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [TPR  FPR] = [0.925      0.00153061]
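
To see how the two rates are computed, here is a small worked example on hand-made matrices (the matrices and the names ``real_toy`` and ``pred_toy`` are made up for illustration):

.. code-block:: Python

    import numpy as np

    # True outlier matrix: outliers at positions (0, 0) and (1, 1).
    real_toy = np.array([[5.0, 0.0, 0.0],
                         [0.0, 3.0, 0.0]])
    # Estimate: finds (0, 0), misses (1, 1), wrongly flags (0, 2).
    pred_toy = np.array([[1.0, 0.0, 2.0],
                         [0.0, 0.0, 0.0]])

    true_pos = np.count_nonzero((pred_toy != 0) & (real_toy != 0))   # 1
    false_pos = np.count_nonzero((pred_toy != 0) & (real_toy == 0))  # 1

    tpr_toy = true_pos / np.count_nonzero(real_toy != 0)    # 1 of 2 outliers found
    fpr_toy = false_pos / np.count_nonzero(real_toy == 0)   # 1 of 4 clean entries flagged
    print(tpr_toy, fpr_toy)  # 0.5 0.25

Only the positions of the non-zero entries matter for both rates; the estimated values themselves are not compared.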


.. GENERATED FROM PYTHON SOURCE LINES 183-184

We can also vary the random seed to test more situations:

.. GENERATED FROM PYTHON SOURCE LINES 184-195

.. code-block:: Python


    M = 30  # use 30 different seeds
    res = np.zeros(2)
    for seed in range(M):
        X, S = gen_data(n, p, s, r, seed)
        model = RobustPCA(support_size=s).fit(X, r=r)
        res += test_model(model.coef_, S)

    print(f'[TPR  FPR] = {res/M}')


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [TPR  FPR] = [0.89866667 0.00206803]


.. GENERATED FROM PYTHON SOURCE LINES 196-197

Across these situations, ``RobustPCA`` performs well: on average it recovers about 90% of the true outlier positions with a false positive rate of about 0.2%.

.. GENERATED FROM PYTHON SOURCE LINES 199-206

R tutorial
----------
For the R tutorial, please see
https://abess-team.github.io/abess/articles/v08-sPCA.html.



.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.339 seconds)


.. _sphx_glr_download_auto_gallery_2-pca_plot_7_RPCA.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_7_RPCA.ipynb <plot_7_RPCA.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_7_RPCA.py <plot_7_RPCA.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_