Develop New Features
====================

In this tutorial, we will show you how to develop a new algorithm for a specific best-subset problem with the ``abess`` procedure.

Preliminaries
-------------

We have endeavored to make developing new features easy. Before writing any code, please make sure you have:

- understood the `ABESS algorithm `__;
- installed ``abess`` from the GitHub source by following the `Installation <../Installation.html>`__ instructions;
- read the `Appendix: Architecture of abess `__ of the ``abess`` library;
- some experience writing R or Python code.

Core C++
--------

The main files related to the core are in ``src`` and are written in C++. Among them, the important files are:

- ``api.cpp``/``api.h``, which contain the APIs serving as the entrance for both R and Python;
- ``AlgorithmXXX.h``, which records the implementation of each concrete algorithm.

If you want to add a new algorithm, both of them need to be modified. Besides, we have implemented the ABESS algorithms for generalized linear models in ``src/AlgorithmGLM.h`` `[code temp] `__ and for principal component analysis (PCA) in ``src/AlgorithmPCA.h`` `[code temp] `__. You can study them to guide your own development.

Write an API
~~~~~~~~~~~~

APIs are all defined in ``src/api.cpp`` `[code temp] `__ and the related header file ``src/api.h`` `[code temp] `__. We have already written some API functions (e.g. ``abessGLM_API()``), so you can either add a new function for the new algorithm or extend an existing one.

First of all, the algorithm's name and number should be determined. The name of a new algorithm takes the format "**abess+your_algorithm**", indicating that abess is used to solve the problem, and its number should be an integer unused by the other algorithms. In the following, we suppose we want to create an algorithm named ``abess_new_algorithm`` with number 123.
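The name-number pair is the contract between the interface layer and the C++ core: the R/Python side sends the integer, and the C++ side dispatches on it. As a minimal sketch of this convention (``abess_new_algorithm`` and 123 are this tutorial's running example; the dictionary and function below are hypothetical illustrations, not part of the real code base):

.. code:: python

   # Hypothetical registry mirroring the naming convention above: each
   # "abess + your_algorithm" name owns one unused integer code.
   MODEL_TYPE_CODES = {
       "abess_new_algorithm": 123,  # the running example of this tutorial
   }

   def model_type_int(name):
       """Return the integer code the C++ core dispatches on."""
       if name not in MODEL_TYPE_CODES:
           raise ValueError(f"no algorithm registered under {name!r}")
       return MODEL_TYPE_CODES[name]

In the real package, this lookup happens inside the ``fit`` method of the Python class, as shown later in this tutorial.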
Next, four important data types should be determined:

- T1: type of Y
- T2: type of coefficients
- T3: type of intercept
- T4: type of X

The algorithm variables are based on them: `[code link] `__

.. code:: cpp

   vector<Algorithm<{T1}, {T2}, {T3}, {T4}> *> algorithm_list(algorithm_list_size);

Take ``LinearRegression`` (the linear regression in abess) as an example:

- Y: dense vector
- Coefficients: dense vector
- Intercept: numeric
- X: dense/sparse matrix

so that we define:

.. code:: cpp

   vector<Algorithm<Eigen::VectorXd, Eigen::VectorXd, double, Eigen::MatrixXd> *> algorithm_list_uni_dense(algorithm_list_size);
   vector<Algorithm<Eigen::VectorXd, Eigen::VectorXd, double, Eigen::SparseMatrix<double>> *> algorithm_list_uni_sparse(algorithm_list_size);

After that, request memory to initialize the algorithm: `[code link] `__

.. code:: cpp

   for (int i = 0; i < algorithm_list_size; i++)
   {
       if (model_type == 123)  // the number of the new algorithm
       {
           algorithm_list[i] = new abess_new_algorithm<{T4}>(...);
       }
   }

Finally, call ``abessWorkflow()``, which computes the result: `[code link] `__

.. code:: cpp

   // "List" is a preset structure to store results
   List out_result = abessWorkflow<{T1}, {T2}, {T3}, {T4}>(..., algorithm_list);

Implement your Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~

The implemented algorithms are stored in ``src/AlgorithmXXX.h``. We have implemented some algorithms (e.g. in ``AlgorithmGLM.h``), so you can either create a new file containing the new algorithm or add it to an existing one.

The new algorithm should inherit from a base class called *Algorithm*, which is defined in ``Algorithm.h``, and then override some virtual function interfaces to fit the specific problem. The implementation is modularized so that you can easily extend the package.

A concrete algorithm looks like this: `[code link] `__

.. code:: cpp

   #include "Algorithm.h"

   template <class T4>
   class abess_new_algorithm : public Algorithm<{T1}, {T2}, {T3}, T4>  // T1, T2, T3 are the same as above, which are fixed
   {
   public:
       // constructor and destructor
       abess_new_algorithm(...) : Algorithm<...>::Algorithm(...){};
       ~abess_new_algorithm(){};

       void primary_model_fit(...)
       {
           // solve the subproblem under the given active set
           // record the sparse solution in the variable "beta"
       };

       double loss_function(...)
       {
           // define and compute the loss under the given active set
           // return the current loss
       };

       void sacrifice(...)
       {
           // define and compute the sacrifice for all variables (both forward and backward)
           // record the sacrifices in the variable "bd"
       };

       double effective_number_of_parameter(...)
       {
           // return the effective number of parameters
       };
   };

Note that the ``sacrifice`` function computes the "forward/backward sacrifices" and records them in ``bd``:

- for an active variable, the lower its (backward) sacrifice is, the more likely it will be dropped;
- for an inactive variable, the higher its (forward) sacrifice is, the more likely it will come into use.

If you create a new file to store the algorithm, remember to include it in ``src/api.cpp``. `[code temp] `__

Now your new method has been connected to the whole framework. In the next section, we focus on how to build the R or Python package based on the core code.

R & Python Package
------------------

R Package
~~~~~~~~~

To make your code available to R, run

.. code:: powershell

   R CMD INSTALL R-package

Then this package will be installed into the R session, provided the R package dependencies (``Rcpp`` and ``Matrix``) have been installed. After that, objects in R can be passed to C++ via the unified API ``abessCpp``. We strongly suggest naming the R function ``abessXXX``, using ``roxygen2`` to write the R documentation, and using ``devtools`` to configure your package.

Python Package
~~~~~~~~~~~~~~

First of all, you should ensure the C++ code is available for Python: ``cd`` into the directory ``abess/python`` and run ``$ python setup.py install``.
(Same steps as in `Installation `__.) It may take a few minutes to install:

- if the installation throws errors, the C++ code is probably wrong;
- if the installation runs without errors, it will finish with a message like "*Finished processing dependencies for abess*".

Then create a new Python file in ``python/abess``, or open an existing file such as ``python/abess/linear.py``, to add a Python API for your new method. A simple new method can be added like this: `[code temp] `__

.. code:: python

   # all algorithms should inherit the base class `bess_base`
   from .bess_base import bess_base

   class new_algorithm(bess_base):
       """
       Here is some introduction.
       """

       def __init__(self, ...):
           super(new_algorithm, self).__init__(
               algorithm_type="abess",
               model_type="new_algorithm",
               # other init
           )

       def fit(self, ...):
           # override `bess_base.fit()`, if necessary

       def custom_function(self, ...):
           # some custom functions, e.g. predict

The base class implements a ``fit`` function, which checks the input and calls the C++ API to compute the results. You may want to override it for custom features. `[code temp] `__

Then the final step is to link this Python class with the model type number (which was defined in Section **Core C++**). In the ``fit`` function, you will find somewhere like:

.. code:: python

   if self.model_type == "new_algorithm":
       model_type_int = 123  # the same number as in C++

Finally, don't forget to import the new algorithm in ``python/abess/__init__.py``. Now run ``$ python setup.py install`` again; this time the installation will finish quickly. Congratulations! Your work can now be used by:

.. code:: python

   from abess import new_algorithm

bess_base
^^^^^^^^^

As shown above, any new method is based on ``bess_base``, which can be found in ``bess_base.py``: `[code link] `__

.. code:: python

   from sklearn.base import BaseEstimator

   class bess_base(BaseEstimator):
       def __init__(...):
           # some init

       def fit(...):
           # check input, wrap with cpp

Actually, it is based on ``sklearn.base.BaseEstimator`` `[code link] `__. Two methods, ``get_params`` and ``set_params``, are offered by this base class. In our package, we write a method called ``fit`` to realize the abess process. Of course, you can also override it, as ``SparsePCA`` does.

Verify your result
------------------

After programming the code, it is necessary to verify that the contributed function returns reasonable results. Here we share our experience. Notice that the core of our algorithm is the forward and backward sacrifices: as long as they are properly programmed, the contributed function will work well.

- Check ``primary_model_fit`` and ``loss_function``

We recommend that you first check ``primary_model_fit``, which underlies the computation of backward sacrifices. To check whether it works well, you can leverage the parameter ``always.include`` in R. When the number of elements passed to ``always.include`` equals ``support.size`` (``always_include`` and ``support_size`` in Python), our algorithm does not need to do variable selection since all the elements must be selected; thus, the implementation framework simply solves a convex problem by conducting ``primary_model_fit``, and the solution should match (or be close to) that of the corresponding function implemented in R/Python.

Take the PCA task as an example. We expect that the results returned by ``abess``:

.. code:: r

   data(USArrests)
   spca_fit <- abesspca(USArrests, always.include = c(1:3), support.size = 3)
   as.vector(spca_fit[["coef"]])[1:3]

should match those returned by the ``princomp`` function:

.. code:: r

   princomp_fit <- loadings(princomp(USArrests[, 1:3]))[, 1]
   princomp_fit

Actually, in our implementation, the results returned by the two code blocks match in magnitude.
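Here, "match in magnitude" means the two loading vectors agree up to an overall sign flip, since a principal component is only identified up to sign. A small, self-contained sketch of such a comparison (the helper and the toy vectors below are made up for illustration):

.. code:: python

   def match_in_magnitude(u, v, tol=1e-6):
       """True if u equals v or -v elementwise within tolerance."""
       same = all(abs(a - b) <= tol for a, b in zip(u, v))
       flipped = all(abs(a + b) <= tol for a, b in zip(u, v))
       return len(u) == len(v) and (same or flipped)

   # toy loadings: identical directions but opposite global signs
   a = [0.62, -0.31, 0.72]
   b = [-0.62, 0.31, -0.72]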
If the results match, congratulations on your correct coding. We also recommend writing an automatic test case for this, following the content below. At the same time, you can check whether ``loss_function`` is right by comparing ``spca_fit[["loss"]]`` with the variance of the first principal component.

- Check ``sacrifice``

Checking the ``sacrifice`` function needs more effort. Monte Carlo studies should be conducted to check whether ``sacrifice`` is properly programmed, i.e. whether the effective/relevant variables can be detected when the sample size is large. We strongly recommend checking the result under the following setting:

- the sample size is at least 1000;
- the dimension is less than 50;
- the true support size is less than 5;
- the variables are independent;
- the candidate support sizes range from 0 to the ground truth;
- the :math:`l_2` regularization is zero.

In most cases, this setting is very helpful for checking code. Generally, the output of ``abess`` matches the ground truth under this setting. Take linear regression in R as an example; the checking code is demonstrated below:

.. code:: r

   n <- 1000
   p <- 50
   support_size <- 3
   dataset <- generate.data(n, p, support_size, seed = 1)
   abess_fit <- abess(dataset[["x"]], dataset[["y"]], support.size = 0:support_size)
   ## estimated support:
   extract(abess_fit, support.size = support_size)[["support.vars"]]
   ## true support:
   which(dataset[["beta"]] != 0)

In this example, the estimated support set is the same as the true one.

.. - Check ``effective_number_of_parameter``

.. Finally,

.. Miscellaneous
.. -------------

.. Code style
.. ~~~~~~~~~~

.. New R code should follow the `tidyverse style guide `__. You can use the
.. `styler `__ R package to apply this style by conducting the R command
.. ``style_file("path-to-newfile.R")``. New Python code should follow the
.. `PEP8 style guide `__. Please don't restyle code that has nothing to do
.. with your contribution.

.. Test cases
.. ~~~~~~~~~~

Write test cases
----------------

It is always a good habit to test the changed package. Contributions with test cases included are easier to accept. We use `testthat `__ for unit tests in R and `pytest `__ in Python; you may need to install them first.

You can find some examples in the test folders, and please feel free to add your test code to them (or create a new test file):

- `R test folder `__: ``abess/R-package/tests/testthat``;
- `Python test folder `__: ``python/pytest``.

A good test should contain:

- possible input modes as well as some wrong input;
- checks of whether the output is as expected;
- possible extreme cases.

All tests under the pytest folder should be checked after coding.
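As a hedged sketch of those three points in pytest style (``fit_support`` below is a toy stand-in invented here; a real test would import and exercise your contributed estimator instead):

.. code:: python

   # Stand-in for a contributed fitting routine: it returns the indices of
   # the `support_size` largest entries of |x|, mimicking a sparse fit.
   def fit_support(x, support_size):
       if not 0 <= support_size <= len(x):
           raise ValueError("support_size out of range")
       order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
       return sorted(order[:support_size])

   def test_expected_output():
       # check whether the output is as expected
       assert fit_support([0.1, -5.0, 2.0], 2) == [1, 2]

   def test_wrong_input():
       # wrong input should raise a clear error
       try:
           fit_support([1.0, 2.0], -1)
       except ValueError:
           pass
       else:
           raise AssertionError("expected ValueError")

   def test_extreme_case():
       # extreme case: an empty support is still valid
       assert fit_support([1.0, 2.0], 0) == []

Placed in a file named ``test_*.py`` under the pytest folder, such functions are discovered and run automatically by ``pytest``.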