make_glm_data

class abess.datasets.make_glm_data(n, p, k, family, rho=0, corr_type='const', sigma=1, coef_=None, censoring=True, c=1, scal=10, snr=None, class_num=3)

Generate a dataset with single response.

Parameters
  • n (int) -- The number of observations.

  • p (int) -- The number of predictors of interest.

  • k (int) -- The number of nonzero coefficients in the underlying regression model.

  • family ({gaussian, binomial, poisson, gamma, cox, ordinal}) -- The distribution of the simulated response. "gaussian" for univariate quantitative response, "binomial" for binary classification response, "poisson" for count response, "gamma" for positive continuous response, "cox" for right-censored survival response, "ordinal" for ordered categorical response.

  • rho (float, optional, default=0) -- A parameter used to characterize the pairwise correlation in predictors.

  • corr_type (string, optional, default="const") -- The structure of correlation matrix. "const" for constant pairwise correlation, "exp" for pairwise correlation with exponential decay.

  • sigma (float, optional, default=1) -- The standard deviation of the Gaussian noise, i.e. \(\sigma\) in \(N(\mu, \sigma^2)\). It is ignored if snr is not None.

  • coef_ (array_like, optional, default=None) -- The coefficient values in the underlying regression model.

  • censoring (bool, optional, default=True) -- For Cox data, whether censoring is applied.

  • c (int, optional, default=1) -- For Cox data with censoring=True, the maximum censoring time. Censoring times are drawn from (0, c), so every observation has a chance of being censored.

  • scal (float, optional, default=10) -- The scale of survival time in Cox data.

  • snr (float, optional, default=None) -- A numerical value controlling the signal-to-noise ratio (SNR) in gaussian data.

  • class_num (int, optional, default=3) -- The number of possible classes in an ordinal dataset, i.e. \(y \in \{0, 1, 2, ..., \text{class\_num}-1\}\)
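The two correlation structures selected by corr_type can be sketched in a few lines. This is an illustration using only the Python standard library, not the library's own code: the helper name corr_matrix is hypothetical, and the form rho ** |i - j| for "exp" is an assumption about what "exponential decay" means here.

```python
def corr_matrix(p, rho, corr_type="const"):
    """Build a p x p correlation matrix as nested lists (illustrative helper)."""
    if corr_type == "const":
        # Every pair of distinct predictors shares the same correlation rho.
        return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]
    if corr_type == "exp":
        # Assumed decay form: correlation shrinks as rho ** |i - j|.
        return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]
    raise ValueError(f"unknown corr_type: {corr_type!r}")


const = corr_matrix(4, 0.5, "const")
exp = corr_matrix(4, 0.5, "exp")
```

With rho=0 (the default) both structures reduce to the identity matrix, i.e. uncorrelated predictors.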

Attributes
  • x (array-like, shape(n, p)) -- Design matrix of predictors.

  • y (array-like, shape(n,)) -- Response variable.

  • coef_ (array-like, shape(p,)) -- The coefficients used in the underlying regression model. It has k nonzero values.

Notes

The output, whose type is named data, has three attributes: x, y and coef_, which hold the predictors, the responses and the coefficients, respectively.

Each row of x or y is one sample, and the samples are independent of one another.

In the formulas below, \(x\) and \(y\) denote a single sample and \(\beta\) the coefficient vector.

  • Linear Regression

    • Usage: family='gaussian'[, sigma=...]

    • Model: \(y \sim N(\mu, \sigma^2),\ \mu = x^T\beta\).

      • the coefficient \(\beta\sim U[m, 100m]\), where \(m = 5\sqrt{2\log p/n}\);

      • the noise standard deviation \(\sigma = 1\).
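As a rough illustration of the Gaussian model above, the sketch below draws one dataset with the stated coefficient range using only the Python standard library. It is not the library's implementation: independent standard-normal predictors (the rho=0 case) and the random choice of the k nonzero positions are assumptions made for the example.

```python
import math
import random

random.seed(0)
n, p, k, sigma = 100, 20, 3, 1.0

# Coefficient range from the notes: nonzero beta_j ~ U[m, 100m].
m = 5 * math.sqrt(2 * math.log(p) / n)
beta = [0.0] * p
support = random.sample(range(p), k)  # assumed: k positions chosen at random
for j in support:
    beta[j] = random.uniform(m, 100 * m)

# Assumed predictor distribution: independent standard normals.
x = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

# y ~ N(x^T beta, sigma^2); sigma enters as the standard deviation.
y = [
    sum(xi * b for xi, b in zip(row, beta)) + random.gauss(0, sigma)
    for row in x
]
```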

  • Logistic Regression

    • Usage: family='binomial'

    • Model: \(y \sim \text{Binom}(\pi),\ \text{logit}(\pi) = x^T \beta\).

      • the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\).
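The logistic model can be sketched the same way: compute \(\pi\) from the linear predictor through the inverse logit, then draw a Bernoulli response. This is a minimal standard-library illustration under the assumption of independent standard-normal predictors; the helper sigmoid is written for the example and is not part of the library.

```python
import math
import random

random.seed(1)
n, p, k = 200, 10, 2

# Coefficient range from the notes: nonzero beta_j ~ U[2m, 10m].
m = 5 * math.sqrt(2 * math.log(p) / n)
beta = [0.0] * p
for j in random.sample(range(p), k):
    beta[j] = random.uniform(2 * m, 10 * m)


def sigmoid(t):
    """Inverse logit, arranged so that exp() never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


# Assumed predictor distribution: independent standard normals.
x = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

# logit(pi) = x^T beta, then y is 1 with probability pi.
pi = [sigmoid(sum(xi * b for xi, b in zip(row, beta))) for row in x]
y = [1 if random.random() < q else 0 for q in pi]
```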

  • Poisson Regression

    • Usage: family='poisson'

    • Model: \(y \sim \text{Poisson}(\lambda),\ \lambda = \exp(x^T \beta)\).

      • the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\).

  • Gamma Regression

    • Usage: family='gamma'

    • Model: \(y \sim \text{Gamma}(k, \theta),\ k\theta = \exp(x^T \beta + \epsilon), k\sim U[0.1, 100.1]\) in shape-scale definition.

      • the coefficient \(\beta\sim U[m, 100m]\), where \(m = 5\sqrt{2\log p/n}\).

  • Cox PH Survival Analysis

    • Usage: family='cox'[, scal=..., censoring=..., c=...]

    • Model: \(y=\min(t,C)\), where \(t = \left[-\dfrac{\log u}{\exp(x^T \beta)}\right]^s,\ u\sim U(0,1),\ s=\dfrac{1}{\text{scal}}\) and censoring time \(C\sim U(0, c)\).

      • the coefficient \(\beta\sim U[2m, 10m]\), where \(m = 5\sqrt{2\log p/n}\);

      • the scale of survival time \(\text{scal} = 10\);

      • censoring is enabled, and max censoring time \(c=1\).
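The survival-time formula above can be traced step by step. The sketch below is an illustration with the standard library only, assuming the quantity inside the logarithm is uniform on (0, 1) (the logarithm must be defined) and independent standard-normal predictors. The list observed is a hypothetical event indicator added for the example; the notes do not say how censoring status is returned.

```python
import math
import random

random.seed(2)
n, p, k = 100, 10, 2
scal, c = 10, 1.0

# Coefficient range from the notes: nonzero beta_j ~ U[2m, 10m].
m = 5 * math.sqrt(2 * math.log(p) / n)
beta = [0.0] * p
for j in random.sample(range(p), k):
    beta[j] = random.uniform(2 * m, 10 * m)

# Assumed predictor distribution: independent standard normals.
x = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

times, observed = [], []
for row in x:
    eta = sum(xi * b for xi, b in zip(row, beta))
    u = 1.0 - random.random()  # uniform on (0, 1], so log(u) is finite
    # t = [-log(u) / exp(eta)] ** (1 / scal), with the division written as exp(-eta).
    t = (-math.log(u) * math.exp(-eta)) ** (1 / scal)
    censor = random.uniform(0, c)  # censoring time C ~ U(0, c)
    times.append(min(t, censor))   # recorded time y = min(t, C)
    observed.append(t <= censor)   # hypothetical indicator: event seen before censoring
```

Because the recorded time is the smaller of the survival and censoring times, no value in times can exceed c.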

  • Ordinal Regression

    • Usage: family='ordinal'[, class_num=...]

    • Model: \(y\in \{0, 1, \dots, n_{class}-1\}\), \(\mathbb{P}(y\leq i) = \dfrac{1} {1+\exp(-x^T\beta - \varepsilon_i)}\), where \(i\in \{0, 1, \dots, n_{class}-2\}\) and \(\forall i<j, \varepsilon_i < \varepsilon_j\).

      • the coefficient \(\beta\sim U[-M, M]\), where \(M = 125\sqrt{2\log p/n}\);

      • the intercept: \(\forall i,\varepsilon_i\sim U[-M, M]\);

      • the number of classes \(n_{class}=3\).
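One way to sample from the cumulative-probability model above is to draw a uniform number and count how many cumulative probabilities it exceeds. The sketch below is a standard-library illustration, not the library's code. It assumes class_num - 1 increasing intercepts, so that y takes values in 0, ..., class_num - 1 as in the class_num parameter description, and independent standard-normal predictors.

```python
import math
import random

random.seed(3)
n, p, k, class_num = 100, 10, 2, 3

# Coefficient and intercept range from the notes: U[-M, M].
M = 125 * math.sqrt(2 * math.log(p) / n)
beta = [0.0] * p
for j in random.sample(range(p), k):
    beta[j] = random.uniform(-M, M)

# Assumed: class_num - 1 intercepts, sorted so that eps[i] < eps[j] for i < j.
eps = sorted(random.uniform(-M, M) for _ in range(class_num - 1))


def sigmoid(t):
    """Logistic function, arranged so that exp() never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


# Assumed predictor distribution: independent standard normals.
x = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

y = []
for row in x:
    eta = sum(xi * b for xi, b in zip(row, beta))
    cum = [sigmoid(eta + e) for e in eps]  # P(y <= i), nondecreasing in i
    u = random.random()
    # y is the number of cumulative probabilities below u,
    # which gives P(y <= i) = cum[i] for each threshold i.
    y.append(sum(u > q for q in cum))
```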