make_glm_data

class abess.datasets.make_glm_data

Generate a dataset with a single response.

Parameters

n (int) -- The number of observations.
p (int) -- The number of predictors of interest.
k (int) -- The number of nonzero coefficients in the underlying regression model.
family ({gaussian, binomial, poisson, gamma, cox, ordinal}) -- The distribution of the simulated response. "gaussian" for a univariate quantitative response, "binomial" for a binary classification response, "poisson" for a count response, "gamma" for a positive continuous response, "cox" for a right-censored survival response, "ordinal" for an ordered categorical response.
rho (float, optional, default=0) -- A parameter used to characterize the pairwise correlation in predictors.
corr_type (string, optional, default="const") -- The structure of the correlation matrix. "const" for constant pairwise correlation, "exp" for pairwise correlation with exponential decay.
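sigma (float, optional, default=1) -- The standard deviation of the gaussian noise, i.e. \(\sigma\) in \(N(\mu, \sigma^2)\). It is ignored if snr is not None.
coef_ (array_like, optional, default=None) -- The coefficient values in the underlying regression model. If None, they are drawn at random as described in the Notes.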
censoring (bool, optional, default=True) -- For Cox data, it indicates whether censoring is applied.
c (int, optional, default=1) -- For Cox data with censoring=True, the maximum censoring time. Censoring times are drawn from (0, c), so every observation has a chance of being censored.
scal (float, optional, default=10) -- The scale of survival time in Cox data.
snr (float, optional, default=None) -- A numerical value controlling the signal-to-noise ratio (SNR) in gaussian data.
class_num (int, optional, default=3) -- The number of possible classes in the ordinal dataset, i.e. \(y \in \{0, 1, 2, ..., \text{class_num}-1\}\).
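
As a rough illustration of the two corr_type options, the sketch below builds a small correlation matrix for each. It assumes that "const" places rho on every off-diagonal entry and that "exp" uses rho raised to the power |i - j|. The helper name correlation_matrix is hypothetical and is not part of abess.

```python
def correlation_matrix(p, rho=0.0, corr_type="const"):
    """Hypothetical helper: build a p x p correlation matrix as nested lists."""
    if corr_type == "const":
        # Assumed: every pair of distinct predictors has correlation rho.
        return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]
    if corr_type == "exp":
        # Assumed: correlation decays as rho ** |i - j|.
        return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]
    raise ValueError("corr_type must be 'const' or 'exp'")


const_corr = correlation_matrix(3, rho=0.5, corr_type="const")
exp_corr = correlation_matrix(3, rho=0.5, corr_type="exp")
```

With rho=0.5, predictors two positions apart have correlation 0.5 under "const" and 0.25 under "exp".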

x

Design matrix of predictors.

Type: array-like, shape(n, p)

y

Response variable.

Type: array-like, shape(n,)

coef_

The coefficients used in the underlying regression model. It has k nonzero values.

Type: array-like, shape(p,)

Notes

The output, whose type is named data, contains three elements: x, y and coef_, which correspond to the predictors, responses and coefficients, respectively.

Each row of x or y corresponds to one sample, and samples are independent of one another.

In the formulas below, \(x\) and \(y\) denote the predictors and response of a single sample, and \(\beta\) denotes the coefficient vector.

Linear Regression
- Usage: family='gaussian'[, sigma=...]
- Model: \(y \sim N(\mu, \sigma^2),\ \mu = x^T\beta\).
  the coefficient \(\beta\sim U[m, 100m]\), where \(m = 5\sqrt{2\log p/n}\);
  
  the standard deviation \(\sigma = 1\).
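
The linear model above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the abess implementation: predictors are independent standard normals (rho = 0), sigma is treated as the standard deviation in \(N(\mu, \sigma^2)\), and the function name simulate_gaussian is hypothetical.

```python
import math
import random


def simulate_gaussian(n, p, k, sigma=1.0, seed=0):
    """Hypothetical sketch of y = x^T beta + noise with k nonzero coefficients."""
    rng = random.Random(seed)
    m = 5 * math.sqrt(2 * math.log(p) / n)
    # Pick k positions for the nonzero coefficients, each drawn from U[m, 100m].
    beta = [0.0] * p
    for j in rng.sample(range(p), k):
        beta[j] = rng.uniform(m, 100 * m)
    # Independent standard normal predictors (rho = 0 is assumed here).
    x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    # sigma scales the noise as a standard deviation, matching N(mu, sigma^2).
    y = [sum(xi * b for xi, b in zip(row, beta)) + sigma * rng.gauss(0, 1)
         for row in x]
    return x, y, beta


x, y, beta = simulate_gaussian(n=50, p=10, k=3)
```

The returned beta has exactly k nonzero entries, mirroring the description of coef_ above.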
Logistic Regression
- Usage: family='binomial'
- Model: \(y \sim \text{Binom}(\pi),\ \text{logit}(\pi) = x^T \beta\).
  the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\).
Poisson Regression
- Usage: family='poisson'
- Model: \(y \sim \text{Poisson}(\lambda),\ \lambda = \exp(x^T \beta)\).
  the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\).
Gamma Regression
- Usage: family='gamma'
- Model: \(y \sim \text{Gamma}(k, \theta),\ k\theta = -1/(x^T \beta + \epsilon),\ k\sim U[0.1, 100.1]\) in the shape-scale parameterization, where \(k\) here is the shape parameter rather than the number of nonzero coefficients.
  the coefficient \(\beta\sim U[m, 100m]\), where \(m = 5\sqrt{2\log p/n}\).
Cox PH Survival Analysis
- Usage: family='cox'[, scal=..., censoring=..., c=...]
- Model: \(y=\min(t,C)\), where \(t = \left[-\dfrac{\log U}{\exp(x^T \beta)}\right]^s,\ U\sim \text{Uniform}(0,1),\ s=\dfrac{1}{\text{scal}}\) and censoring time \(C\sim U(0, c)\).
  the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\);
  
  the scale of survival time \(\text{scal} = 10\);
  
  censoring is enabled, and max censoring time \(c=1\).
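
To make the survival-time formula concrete, here is a minimal sketch of the sampling step. It assumes \(U\) is uniform on (0, 1), since the logarithm needs a positive argument, and it returns an event indicator alongside each observed time, which is an assumption about the output layout. The function name simulate_cox_times is hypothetical and not part of abess.

```python
import math
import random


def simulate_cox_times(linear_predictors, scal=10.0, censoring=True, c=1.0, seed=0):
    """Hypothetical sketch: return (observed time, event indicator) pairs."""
    rng = random.Random(seed)
    samples = []
    for eta in linear_predictors:
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        u = 1.0 - rng.random()
        t = (-math.log(u) / math.exp(eta)) ** (1.0 / scal)
        if censoring:
            censor_time = rng.uniform(0.0, c)
            # The event is observed only if it happens before censoring.
            samples.append((min(t, censor_time), int(t <= censor_time)))
        else:
            samples.append((t, 1))
    return samples


observed = simulate_cox_times([0.2, -0.5, 1.0], scal=10.0, c=1.0)
```

Each element of linear_predictors stands for one value of \(x^T\beta\). With censoring enabled, no observed time can exceed c.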
Ordinal Regression
- Usage: family='ordinal'[, class_num=...]
- Model: \(y\in \{0, 1, \dots, n_{class}-1\}\), \(\mathbb{P}(y\leq i) = \dfrac{1} {1+\exp(-x^T\beta - \varepsilon_i)}\), where \(i\in \{0, 1, \dots, n_{class}-2\}\) and \(\forall i<j, \varepsilon_i < \varepsilon_j\).
  the coefficient \(\beta\sim U[-M, M]\), where \(M = 125\sqrt{2\log p/n}\);
  
  the intercept: \(\forall i,\varepsilon_i\sim U[-M, M]\);
  
  the number of classes \(n_{class}=3\).
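
The cumulative-probability model for ordinal data can be sampled as in the sketch below. It assumes class_num - 1 sorted intercepts, so that labels run from 0 to class_num - 1 as in the class_num parameter description. The intercept range and the linear predictor value are illustrative, and the function name simulate_ordinal is hypothetical.

```python
import math
import random


def simulate_ordinal(eta, intercepts, rng):
    """Hypothetical sketch: draw one label from cumulative logit probabilities."""
    # P(y <= i) = 1 / (1 + exp(-eta - intercepts[i])), increasing in i.
    cumulative = [1.0 / (1.0 + math.exp(-eta - e)) for e in intercepts]
    u = rng.random()
    for label, prob in enumerate(cumulative):
        if u < prob:
            return label
    # u exceeded every cumulative probability: return the top class.
    return len(intercepts)


rng = random.Random(0)
class_num = 3
# Assumed: class_num - 1 sorted intercepts; the range (-1, 1) is illustrative.
intercepts = sorted(rng.uniform(-1.0, 1.0) for _ in range(class_num - 1))
labels = [simulate_ordinal(0.3, intercepts, rng) for _ in range(200)]
```

Sorting the intercepts keeps the cumulative probabilities increasing in \(i\), which is what the condition \(\varepsilon_i < \varepsilon_j\) for \(i<j\) guarantees.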

__init__(n, p, k, family, rho=0, corr_type='const', sigma=1, coef_=None, censoring=True, c=1, scal=10, snr=None, class_num=3)