Develop New Features: GLM#

In fact, you don't need to understand the whole ABESS algorithm before developing a best-subset solver for a Generalized Linear Model (GLM). In this tutorial, we will show you how to implement one.

For more general models, please check Develop New Features instead.

Preliminaries#

We have endeavored to make developing new features easy. Before developing the code, please make sure you have:

  • installed abess from the source code on GitHub by following the Installation instructions;

  • some experience writing R or Python code.

Generalized linear model#

In mathematics, we often denote the variables as \(\mathbf{X}\) and the outcome as \(\mathbf{y}\); the outcome is assumed to be generated from a distribution in an exponential family, e.g.

  • Normal Distribution;

  • Binomial Distribution;

  • Poisson Distribution;

  • ...

The generalized linear model then takes the form:

\[\mathbb{E}(\mathbf{y}|\mathbf{X}) = \mathbf{\mu} = g^{-1}(\mathbf{X\beta}),\]

where \(\mathbb{E}(\mathbf{y}|\mathbf{X})\) is the expected value of \(\mathbf{y}\) conditional on \(\mathbf{X}\), \(g\) is the link function, and \(\beta\) is the vector of model parameters. Let's take logistic regression as an example, where:

\[\mathbb{E}(y) = \mathbb{P}(y=1) = p,\quad g^{-1}(z) = \frac{1}{1+\exp(-z)}.\]
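
To make this concrete, here is a minimal Eigen sketch of the logistic inverse link applied to the linear predictor (the helper name sigmoid is ours for illustration, not part of abess):

#include <Eigen/Dense>

// Logistic inverse link g^{-1}(z) = 1 / (1 + exp(-z)),
// applied element-wise to the linear predictor z = X * beta.
Eigen::VectorXd sigmoid(const Eigen::MatrixXd &X, const Eigen::VectorXd &beta) {
    Eigen::VectorXd z = X * beta;
    return (1.0 + (-z.array()).exp()).inverse().matrix();
}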

Core C++ for GLM#

The first step is to write an API; this step is the same as in Develop New Features: Write an API.

Then, to implement an algorithm for GLMs, you only need to focus on src/AlgorithmGLM.h, where we have implemented a base class called _abessGLM; the new algorithm should inherit from it.

template <class T4>
class abess_new_GLM_algorithm : public _abessGLM<{T1}, {T2}, {T3}, T4>  // T1, T2, T3 are the same as above, which are fixed.
{
public:
    // constructor and destructor
    abess_new_GLM_algorithm(...) : _abessGLM<...>::_abessGLM(...){};
    ~abess_new_GLM_algorithm(){};

    Eigen::MatrixXd gradian_core(...) {
        // the gradient matrix can be expressed as G = X^T * A;
        // returns the gradient core A
    };
    Eigen::VectorXd hessian_core(...) {
        // the Hessian matrix can be expressed as H = X^T * D * X;
        // returns the (diagonal values of the) diagonal matrix D
    };
    {T1} inv_link_function(...) {
        // returns inverse link function g^{-1}(X, beta),
        // i.e. the predicted y
    };
    {T1} log_probability(...) {
        // returns log P(y | X, beta)
    };
    bool null_model(...) {
        // returns a null model,
        // i.e. given only y, fit an intercept
    };
};
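
The intended division of labor is that the framework assembles the full gradient and Hessian from the cores your subclass returns. The following standalone Eigen sketch illustrates that contract as described in the comments above; it is illustrative code, not the actual abess internals:

#include <Eigen/Dense>

// The cores returned by gradian_core / hessian_core are combined as
// G = X^T * A and H = X^T * D * X (see the comments above).
Eigen::MatrixXd full_gradient(const Eigen::MatrixXd &X, const Eigen::MatrixXd &A) {
    return X.transpose() * A;
}

Eigen::MatrixXd full_hessian(const Eigen::MatrixXd &X, const Eigen::VectorXd &D) {
    return X.transpose() * D.asDiagonal() * X;
}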

Compared with the general implementation, this does not require touching the core of abess; it only uses knowledge of the model itself.

Let's continue with the logistic model and suppose we want to maximize the log-likelihood: [code link]

\[\begin{split}l &= \frac{1}{2}\log\prod_i \mathbb{P}(y_i|X_i, \beta)\\ &= \frac{1}{2}\sum_i \left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]\\ &= -\frac{1}{2}\sum_i \left[y_i\log(1+e^{-X_i^T\beta}) + (1-y_i)\log(1+e^{X_i^T\beta})\right].\end{split}\]

From this formula, we can derive inv_link_function and log_probability. Next, we take its derivatives with respect to \(\beta\):

\[\begin{split}\frac{\partial l}{\partial \beta} &= \frac{1}{2}\sum_i X_i (y_i-\hat{y}_i),\\ \frac{\partial^2 l}{\partial \beta^2} &= -\frac{1}{2}\sum_i X_iX_i^T\,\hat{y}_i(1-\hat{y}_i).\end{split}\]

From these formulas, we can get gradian_core and hessian_core: the gradient core is \(A\) with \(A_i = (y_i-\hat{y}_i)/2\), and the Hessian core is \(D = \operatorname{diag}(\hat{y}_i(1-\hat{y}_i))/2\) (up to the sign convention in the solver). Finally, the null_model should be:

\[\mathbb{E}(y) = g^{-1}(C),\quad i.e.\quad C = g(\overline{y}),\]

where \(\overline{y}\) is the mean of \(y\).
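
Putting the pieces together, below is a minimal standalone sketch of the five members for the logistic case. For readability we write them as free functions with explicit arguments and hypothetical names; the actual members in src/AlgorithmGLM.h follow the _abessGLM signatures instead:

#include <Eigen/Dense>
#include <cmath>

// Inverse link: y_hat = 1 / (1 + exp(-X * beta)).
Eigen::VectorXd inv_link(const Eigen::MatrixXd &X, const Eigen::VectorXd &beta) {
    return (1.0 + (-(X * beta).array()).exp()).inverse().matrix();
}

// Half log-likelihood: l = 1/2 * sum_i [y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i)].
double log_probability(const Eigen::MatrixXd &X, const Eigen::VectorXd &y,
                       const Eigen::VectorXd &beta) {
    Eigen::ArrayXd y_hat = inv_link(X, beta).array();
    return 0.5 * (y.array() * y_hat.log() + (1.0 - y.array()) * (1.0 - y_hat).log()).sum();
}

// Gradient core A with A_i = (y_i - y_hat_i) / 2, so that G = X^T * A.
Eigen::VectorXd gradient_core(const Eigen::MatrixXd &X, const Eigen::VectorXd &y,
                              const Eigen::VectorXd &beta) {
    return 0.5 * (y - inv_link(X, beta));
}

// Hessian core D with D_ii = y_hat_i * (1 - y_hat_i) / 2, so that H = X^T * D * X.
Eigen::VectorXd hessian_core(const Eigen::MatrixXd &X, const Eigen::VectorXd &beta) {
    Eigen::ArrayXd y_hat = inv_link(X, beta).array();
    return (0.5 * y_hat * (1.0 - y_hat)).matrix();
}

// Null model: the intercept-only fit C = g(mean(y)) = logit(mean(y)).
double null_model(const Eigen::VectorXd &y) {
    double y_bar = y.mean();
    return std::log(y_bar / (1.0 - y_bar));
}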

Now your new method is connected to the whole framework. You can continue with the following steps in Develop New Features: R & Python Package.