make_glm_data
- class abess.datasets.make_glm_data
Generate a dataset with a single response.
- Parameters
n (int) -- The number of observations.
p (int) -- The number of predictors of interest.
k (int) -- The number of nonzero coefficients in the underlying regression model.
family ({gaussian, binomial, poisson, gamma, cox, ordinal}) -- The distribution of the simulated response. "gaussian" for univariate quantitative response, "binomial" for binary classification response, "poisson" for count response, "gamma" for positive continuous response, "cox" for right-censored survival response, "ordinal" for ordered categorical response.
rho (float, optional, default=0) -- A parameter used to characterize the pairwise correlation in predictors.
corr_type (string, optional, default="const") -- The structure of the correlation matrix. "const" for constant pairwise correlation, "exp" for pairwise correlation with exponential decay.
sigma (float, optional, default=1) -- The standard deviation of the Gaussian noise. It is unused if snr is not None.
coef (array_like, optional, default=None) -- The coefficient values in the underlying regression model.
censoring (bool, optional, default=True) -- For Cox data, whether censoring is applied.
c (int, optional, default=1) -- For Cox data with censoring=True, the maximum censoring time. Censoring times fall in (0, c), so every observation has a chance of being censored.
scal (float, optional, default=10) -- The scale of survival time in Cox data.
snr (float, optional, default=None) -- A numerical value controlling the signal-to-noise ratio (SNR) in gaussian data.
class_num (int, optional, default=3) -- The number of possible classes in an ordinal dataset, i.e. \(y \in \{0, 1, 2, ..., \text{class_num}-1\}\)
- x
Design matrix of predictors.
- Type
array-like, shape(n, p)
- y
Response variable.
- Type
array-like, shape(n,)
- coef_
The coefficients used in the underlying regression model. It has k nonzero values.
- Type
array-like, shape(p,)
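The rho and corr_type parameters describe the predictor correlation only in words. A minimal sketch of one plausible reading follows, in plain Python. The helper name correlation_matrix and the exact formulas (rho off the diagonal for "const", rho ** |i - j| for "exp") are assumptions for illustration and are not taken from the abess source.

```python
def correlation_matrix(p, rho=0.0, corr_type="const"):
    """Build a p x p correlation matrix (assumed forms, see the note above)."""
    if corr_type == "const":
        # Constant pairwise correlation: rho off the diagonal, 1 on it.
        return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]
    if corr_type == "exp":
        # Exponential decay: correlation shrinks as rho ** |i - j|.
        return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]
    raise ValueError(f"unknown corr_type: {corr_type!r}")


const_corr = correlation_matrix(4, rho=0.5, corr_type="const")
exp_corr = correlation_matrix(4, rho=0.5, corr_type="exp")
```

Under this reading, distant predictors stay equally correlated with "const" but become nearly uncorrelated with "exp".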
Notes
The output, whose type is named data, contains three elements: x, y and coef_, which correspond to the variables, responses and coefficients, respectively.
Each row of x or y represents one sample and is independent of the others.
We denote \(x, y, \beta\) for one sample in the math formulas below.
Linear Regression
Usage: family='gaussian'[, sigma=...]
Model: \(y \sim N(\mu, \sigma^2),\ \mu = x^T\beta\).
the coefficient \(\beta\sim U[m, 100m]\), where \(m = 5\sqrt{2\log p/n}\);
the standard deviation \(\sigma = 1\).
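The linear model and its default coefficient range can be sketched in plain Python as below. This is an illustration of the formulas in this section, not the abess implementation: the helper names default_coef_bounds and simulate_gaussian are hypothetical, the predictors are drawn as independent standard normals for simplicity, and sigma is treated as the standard deviation in \(N(\mu, \sigma^2)\).

```python
import math
import random


def default_coef_bounds(n, p, low=1.0, high=100.0):
    """Return (low * m, high * m) with m = 5 * sqrt(2 * log(p) / n)."""
    m = 5.0 * math.sqrt(2.0 * math.log(p) / n)
    return low * m, high * m


def simulate_gaussian(x_rows, beta, sigma, rng):
    """Draw y_i ~ N(x_i^T beta, sigma^2) for each row x_i."""
    y = []
    for x in x_rows:
        mu = sum(xj * bj for xj, bj in zip(x, beta))
        y.append(rng.gauss(mu, sigma))
    return y


rng = random.Random(0)
n, p, k = 50, 10, 3
lower, upper = default_coef_bounds(n, p)  # the U[m, 100m] range

# Place k nonzero coefficients at random positions.
beta = [0.0] * p
for j in rng.sample(range(p), k):
    beta[j] = rng.uniform(lower, upper)

# Independent N(0, 1) predictors (a simplifying assumption).
x_rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = simulate_gaussian(x_rows, beta, sigma=1.0, rng=rng)
```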
Logistic Regression
Usage: family='binomial'
Model: \(y \sim \text{Binom}(\pi),\ \text{logit}(\pi) = x^T \beta\).
the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\).
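As a rough illustration of the logistic model above, the sketch below maps the linear predictor through the inverse logit and draws a 0/1 response. The names inv_logit and simulate_binomial are hypothetical, and the coefficients are hand-picked rather than sampled from the default range.

```python
import math
import random


def inv_logit(eta):
    """Numerically stable inverse of logit: 1 / (1 + exp(-eta))."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def simulate_binomial(x_rows, beta, rng):
    """Draw y_i in {0, 1} with logit(pi_i) = x_i^T beta."""
    y = []
    for x in x_rows:
        eta = sum(xj * bj for xj, bj in zip(x, beta))
        y.append(1 if rng.random() < inv_logit(eta) else 0)
    return y


rng = random.Random(0)
x_rows = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]
beta = [2.0, 0.0, -1.5]  # hand-picked for illustration
y = simulate_binomial(x_rows, beta, rng)
```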
Poisson Regression
Usage: family='poisson'
Model: \(y \sim \text{Poisson}(\lambda),\ \lambda = \exp(x^T \beta)\).
the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\).
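The Poisson model above can be sketched with the standard library alone. The count sampler uses Knuth's multiplication method, which is only practical for moderate rates, so the coefficients here are small hand-picked values rather than draws from the default range. The names poisson_draw and simulate_poisson are hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    """Knuth's multiplication method; suitable for moderate lam only."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def simulate_poisson(x_rows, beta, rng):
    """Draw y_i ~ Poisson(lambda_i) with lambda_i = exp(x_i^T beta)."""
    y = []
    for x in x_rows:
        eta = sum(xj * bj for xj, bj in zip(x, beta))
        y.append(poisson_draw(math.exp(eta), rng))
    return y


rng = random.Random(0)
x_rows = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(100)]
beta = [0.5, -0.3]  # small values keep lambda moderate
y = simulate_poisson(x_rows, beta, rng)
```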
Gamma Regression
Usage: family='gamma'
Model: \(y \sim \text{Gamma}(k, \theta),\ k\theta = -1/(x^T \beta + \epsilon), k\sim U[0.1, 100.1]\) in shape-scale definition.
the coefficient \(\beta\sim U[m, 100m]\), where \(m = 5\sqrt{2\log p/n}\).
Cox PH Survival Analysis
Usage: family='cox'[, scal=..., censoring=..., c=...]
Model: \(y=\min(t,C)\), where \(t = \left[-\dfrac{\log U}{\exp(x^T \beta)}\right]^s,\ U\sim U(0,1),\ s=\dfrac{1}{\text{scal}}\) and censoring time \(C\sim U(0, c)\).
the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\);
the scale of survival time \(\text{scal} = 10\);
censoring is enabled, and max censoring time \(c=1\).
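The survival-time construction above can be sketched as follows. The sketch assumes that \(U\) is uniform on (0, 1), which the logarithm requires, and it returns an event indicator alongside the observed time; that indicator, the name simulate_cox and the hand-picked coefficients are illustrative assumptions, not the documented output format.

```python
import math
import random


def simulate_cox(x_rows, beta, scal=10.0, censoring=True, c=1.0, rng=None):
    """Return (observed_times, event_flags) for the Cox-type model."""
    rng = rng or random.Random()
    observed, events = [], []
    for x in x_rows:
        eta = sum(xj * bj for xj, bj in zip(x, beta))
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        t = (-math.log(u) / math.exp(eta)) ** (1.0 / scal)
        if censoring:
            cens = rng.uniform(0.0, c)  # censoring time C ~ U(0, c)
            observed.append(min(t, cens))
            events.append(1 if t <= cens else 0)  # 1 = event observed
        else:
            observed.append(t)
            events.append(1)
    return observed, events


rng = random.Random(0)
x_rows = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]
beta = [1.0, -0.5, 0.0]  # hand-picked, not the U[2m, 10m] default
times, events = simulate_cox(x_rows, beta, rng=rng)
```

With censoring enabled, every observed time is bounded by c, because the observed value is the smaller of the survival time and the censoring time.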
Ordinal Regression
Usage: family='ordinal'[, class_num=...]
Model: \(y\in \{0, 1, \dots, n_{class}-1\}\), \(\mathbb{P}(y\leq i) = \dfrac{1} {1+\exp(-x^T\beta - \varepsilon_i)}\), where \(i\in \{0, 1, \dots, n_{class}-2\}\) and \(\forall i<j, \varepsilon_i < \varepsilon_j\).
the coefficient \(\beta\sim U[-M, M]\), where \(M = 125\sqrt{2\log p/n}\);
the intercept: \(\forall i,\varepsilon_i\sim U[-M, M]\);
the number of classes \(n_{class}=3\).
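One way to sample from the cumulative-logit model above is to draw a single uniform number per sample and return the first class whose cumulative probability covers it. The sketch below assumes class_num - 1 ordered intercepts (two thresholds for three classes), which follows from labels running from 0 to class_num - 1. The names inv_logit and simulate_ordinal and all numeric values are illustrative.

```python
import math
import random


def inv_logit(eta):
    """Numerically stable 1 / (1 + exp(-eta))."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def simulate_ordinal(x_rows, beta, intercepts, rng):
    """Draw y_i with P(y_i <= i) = inv_logit(x_i^T beta + eps_i)."""
    cutpoints = sorted(intercepts)  # enforce eps_i < eps_j for i < j
    y = []
    for x in x_rows:
        eta = sum(xj * bj for xj, bj in zip(x, beta))
        u = rng.random()
        label = len(cutpoints)  # top class unless a threshold covers u
        for i, eps in enumerate(cutpoints):
            if u <= inv_logit(eta + eps):
                label = i
                break
        y.append(label)
    return y


rng = random.Random(0)
x_rows = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(200)]
beta = [1.0, -1.0]        # hand-picked, not the U[-M, M] default
intercepts = [-1.0, 1.0]  # class_num - 1 = 2 thresholds for 3 classes
y = simulate_ordinal(x_rows, beta, intercepts, rng)
```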