Develop New Features#

In this tutorial, we show how to develop a new algorithm for a specific best-subset problem within abess's procedure.

Preliminaries#

We have endeavored to make developing new features easy. Before writing any code, please make sure you are familiar with the components described below.

Core C++#

The main files related to the core are in src and are written in C++. Among them, the important files are:

  • api.cpp/api.h contain the APIs, which are the entry points for both R and Python;

  • AlgorithmXXX.h records the implementation of each concrete algorithm.

If you want to add a new algorithm, both of them need to be modified.

Besides, we have implemented ABESS algorithms for generalized linear models in src/AlgorithmGLM.h and for principal component analysis (PCA) in src/AlgorithmPCA.h. You can refer to them when developing your own.

Write an API#

APIs are all defined in src/api.cpp and the related header file src/api.h. We have written some API functions (e.g. abessGLM_API()), so you can either add a new function for the new algorithm or simply add it to an existing one.

First of all, the algorithm name and its number should be determined.

A new algorithm's name should have the format "abess + your_algorithm", meaning that abess is used to solve the problem, and its number should be an integer not used by any other algorithm.

In the following, suppose we create an algorithm named abess_new_algorithm with number 123.

Next, four important data types should be determined:

  • T1 : type of Y

  • T2 : type of coefficients

  • T3 : type of intercept

  • T4 : type of X

The algorithm variables are templated on them:

vector<Algorithm<{T1}, {T2}, {T3}, {T4}>*> algorithm_list(algorithm_list_size);

Take LinearRegression (the linear regression algorithm in abess) as an example:

  • Y: dense vector

  • Coefficients: dense vector

  • Intercept: numeric

  • X: dense/sparse matrix

so that we define:

vector<Algorithm<Eigen::VectorXd, Eigen::VectorXd, double, Eigen::MatrixXd> *> algorithm_list_uni_dense(algorithm_list_size);
vector<Algorithm<Eigen::VectorXd, Eigen::VectorXd, double, Eigen::SparseMatrix<double>> *> algorithm_list_uni_sparse(algorithm_list_size);

After that, request memory to initialize the algorithm:

for (int i = 0; i < algorithm_list_size; i++)
{
    if (model_type == 123)    // the number of the new algorithm
    {
        algorithm_list[i] = new abess_new_algorithm<{T4}>(...);
    }
}

Finally, call abessWorkflow(), which computes the result:

// "List" is a preset structure to store results
List out_result = abessWorkflow<{T1}, {T2}, {T3}, {T4}>(..., algorithm_list);

Implement your Algorithm#

The implemented algorithms are stored in src/AlgorithmXXX.h. We have implemented some algorithms (e.g. in AlgorithmGLM.h), so you can either create a new file containing the new algorithm or simply add it to an existing one.

The new algorithm should inherit the base class Algorithm, which is defined in Algorithm.h, and then override some virtual function interfaces to fit the specific problem. The implementation is modularized so that you can easily extend the package.

A concrete algorithm looks like this:

#include "Algorithm.h"

template <class T4>
class abess_new_algorithm : public Algorithm<{T1}, {T2}, {T3}, T4>  // T1, T2, T3 are the same as above, which are fixed.
{
public:
    // constructor and destructor
    abess_new_algorithm(...) : Algorithm<...>::Algorithm(...){};
    ~abess_new_algorithm(){};

    void primary_model_fit(...){
        // solve the subproblem under given active set
        // record the sparse answer in variable "beta"
    };

    double loss_function(...){
        // define and compute loss under given active set
        // return the current loss
    };

    void sacrifice(...){
        // define and compute sacrifice for all variables (both forward and backward)
        // record sacrifice in variable "bd"
    };

    double effective_number_of_parameter(...){
        // return effective number of parameter
    };
};
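
As a concrete illustration, here is a minimal sketch of what primary_model_fit and loss_function might look like for a least-squares model. The signatures are simplified for exposition (the real interfaces in Algorithm.h take more arguments), so treat this as a sketch rather than the package's exact code:

#include <Eigen/Dense>

// Hypothetical least-squares versions with simplified signatures; here "x"
// and "y" are already restricted to the given active set.
void primary_model_fit(Eigen::MatrixXd &x, Eigen::VectorXd &y, Eigen::VectorXd &beta)
{
    // solve the least-squares subproblem on the active set and
    // record the answer in "beta"
    beta = (x.transpose() * x).ldlt().solve(x.transpose() * y);
}

double loss_function(Eigen::MatrixXd &x, Eigen::VectorXd &y, Eigen::VectorXd &beta)
{
    // mean squared error under the given active set
    return (y - x * beta).squaredNorm() / (2.0 * x.rows());
}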

Note that the sacrifice function computes the "forward/backward sacrifices" and records them in bd (see the sketch after this list).

  • For an active variable, the lower its (backward) sacrifice is, the more likely it is to be dropped;

  • For an inactive variable, the higher its (forward) sacrifice is, the more likely it is to come into use.
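
Conceptually, the backward sacrifice of an active variable approximates the increase in loss caused by dropping it, while the forward sacrifice of an inactive variable approximates the decrease in loss gained by adding it with its best single-variable update. Continuing the hypothetical least-squares example above (again with simplified signatures; "A" and "I" denote the active and inactive index sets), a sketch could be:

#include <Eigen/Dense>

// Hypothetical sacrifice for least squares; results are recorded in "bd".
void sacrifice(Eigen::MatrixXd &x, Eigen::VectorXd &y, Eigen::VectorXd &beta,
               Eigen::VectorXi &A, Eigen::VectorXi &I, Eigen::VectorXd &bd)
{
    int n = x.rows();
    Eigen::VectorXd residual = y - x * beta;
    // backward sacrifice: approximate loss increase if an active variable
    // were dropped; the smaller it is, the more likely the drop
    for (int k = 0; k < A.size(); k++)
    {
        int j = A(k);
        bd(j) = x.col(j).squaredNorm() * beta(j) * beta(j) / (2.0 * n);
    }
    // forward sacrifice: approximate loss decrease if an inactive variable
    // entered with its univariate least-squares update; the larger it is,
    // the more likely the variable comes into use
    for (int k = 0; k < I.size(); k++)
    {
        int j = I(k);
        double d = x.col(j).dot(residual) / x.col(j).squaredNorm();
        bd(j) = x.col(j).squaredNorm() * d * d / (2.0 * n);
    }
}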

If you create a new file to store the algorithm, remember to include it inside src/api.cpp.
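
For instance, assuming the new header is named AlgorithmNew.h (an illustrative name):

// at the top of src/api.cpp, alongside the existing includes
#include "AlgorithmNew.h"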

Now your new method is connected to the whole framework. In the next section, we focus on how to build the R or Python package based on the core code.

R & Python Package#

R Package#

To make your code available for R, run

R CMD INSTALL R-package

Then this package will be installed into the R session, provided the R package dependencies (Rcpp and Matrix) have been installed.

After that, objects in R can be passed to C++ via the unified API abessCpp. We strongly suggest naming the R function abessXXX, using roxygen2 to write the R documentation, and devtools to configure your package.

Python Package#

First of all, ensure the C++ code is available for Python: cd into the directory abess/python and run $ python setup.py install (the same steps as in Installation).

It may take a few minutes to install:

  • if the installation throws some errors, the C++ code may be wrong;

  • if the installation runs without errors, it will finish with a message like "Finished processing dependencies for abess".

Then create a new Python file in python/abess, or open an existing file such as python/abess/linear.py, and add a Python API for your new method.

A simple new method can be added like this:

# all algorithms should inherit the base class `bess_base`
from .bess_base import bess_base

class new_algorithm(bess_base):
    """
    Here is some introduction.
    """
    def __init__(self, ...):
        super(new_algorithm, self).__init__(
            algorithm_type="abess",
            model_type="new_algorithm",
            # other init
        )
    def fit(self, ...):
        # override `bess_base.fit()`, if necessary

    def custom_function(self, ...):
        # some custom functions, e.g. predict

The base class implements a fit function, which checks the input and calls the C++ API to compute the result. You may want to override it for custom features.

Then, the final step is to link this Python class with the model type number chosen earlier (in the Write an API section). In the fit function, you will find code like:

if self.model_type == "new_algorithm":
    model_type_int = 123    # same number in C++

Finally, don’t forget to import the new algorithm in python/abess/__init__.py.

Now run $ python setup.py install again; this time the installation will finish quickly. Congratulations! Your work can now be used by:

from abess import new_algorithm

bess_base#

As shown above, all new methods are based on bess_base, which can be found in bess_base.py:

from sklearn.base import BaseEstimator

class bess_base(BaseEstimator):
    def __init__(...):
        # some init
    def fit(...):
        # check input, wrap with cpp

It is based on sklearn.base.BaseEstimator, so the two methods get_params and set_params are offered by this base class.

In our package, we write a method called fit to realize the abess process. Of course, you can also override it, as SparsePCA does.

Verify your result#

After programming the code, it is necessary to verify that the contributed function returns reasonable results. Here, we share our experience for doing so. Notice that the core of our algorithm is the forward and backward sacrifices: as long as they are properly programmed, the contributed function should work well.

  • Check primary_model_fit and loss_function

First, we recommend checking primary_model_fit, which underlies the computation of backward sacrifices. To check whether it works well, you can leverage the parameter always.include in R. When the number of elements passed to always.include equals support.size (always_include and support_size in Python), our algorithm does not need to do variable selection, since all elements must be selected; thus, the implementation framework simply solves a convex problem by calling primary_model_fit, and the solution should match (or be close to) that of the corresponding function implemented in R/Python. Taking the PCA task as an example, we should expect that the results returned by abess:

data(USArrests)
spca_fit <- abesspca(USArrests, always.include = c(1:3), support.size = 3)
as.vector(spca_fit[["coef"]])[1:3]

should match those returned by the princomp function:

princomp_fit <- loadings(princomp(USArrests[, 1:3]))[, 1]
princomp_fit

In our implementation, the results returned by the two code blocks match in magnitude (the signs may differ). If your results match, congratulations on your correct coding. We also recommend writing an automatic test case for this, following the content below.

At the same time, you can check whether loss_function is right by comparing spca_fit[["loss"]] with the variance of the first principal component.

  • Check sacrifice

Second, check sacrifice, which needs more effort: Monte Carlo studies should be conducted to verify that sacrifice is properly programmed, so that the effective/relevant variables can be detected when the sample size is large. We strongly recommend checking the result under the following setting:

  • the sample size is at least 1000;

  • the dimension is less than 50;

  • the true support size is less than 5;

  • the variables are independent;

  • the candidate support sizes range from 0 to the ground truth;

  • the \(l_2\) regularization is zero.

In most cases, this setting is very helpful for checking code: the output of abess should match the ground truth under it. Taking linear regression in R as an example, the checking code is demonstrated below:

n <- 1000
p <- 50
support_size <- 3
dataset <- generate.data(n, p, support_size, seed = 1)
abess_fit <- abess(dataset[["x"]], dataset[["y"]], support.size = 0:support_size)
## estimated support:
extract(abess_fit, support.size = support_size)[["support.vars"]]
## true support:
which(dataset[["beta"]] != 0)

In this example, the estimated support set is the same as the true one.

Write test cases#

It is always a good habit to write some tests for the changed package. Contributions that include test cases are easier to accept.

We use testthat for unit tests in R and pytest in Python. You may need to install them first.

You can find some examples under the test folder; please feel free to add your test code to an existing file there, or create a new test file.

A good test should cover:

  • possible input modes, as well as some wrong inputs;

  • checks that the output is as expected;

  • possible extreme cases.

All tests under the pytest folder should be run (and pass) after coding.