MultinomialRegression
Warning
In the old version of abess (before 0.4.0), this model is named abess.linear.abessMultinomial. Please note that it will be deprecated in version 0.6.0.
- class abess.linear.MultinomialRegression
Adaptive Best-Subset Selection (ABESS) algorithm for multiclass classification problems.
- Parameters
path_type ({"seq", "gs"}, optional, default="seq") --
The method to be used to select the optimal support size.
For path_type = "seq", we solve the best subset selection problem for each size in support_size.
For path_type = "gs", we solve the best subset selection problem with support size ranging over (s_min, s_max), where the specific support sizes to be considered are determined by golden-section search.
support_size (array-like, optional) -- default=range(min(n, int(n/(log(log(n))log(p))))). An integer vector representing the alternative support sizes. Only used when path_type = "seq".
s_min (int, optional, default=0) -- The lower bound of golden-section-search for sparsity searching.
s_max (int, optional, default=min(n, int(n/(log(log(n))log(p))))) -- The upper bound of golden-section-search for sparsity searching.
group (int, optional, default=np.ones(p)) -- The group index for each variable.
alpha (float, optional, default=0) --
Constant that multiplies the L2 term in the loss function, controlling regularization strength. It should be non-negative.
If alpha = 0, no L2 regularization is applied.
fit_intercept (bool, optional, default=True) -- Whether to consider intercept in the model. We assume that the data has been centered if fit_intercept=False.
ic_type ({'aic', 'bic', 'gic', 'ebic', 'loss'}, optional, default='ebic') --
The type of criterion for choosing the support size if cv=1. The full name of each option:
'aic': Akaike information criterion
'bic': Bayesian information criterion
'gic': Generalized information criterion (see [2-4]). It refers to special information criterion (SIC) in [1].
'ebic': Extended Bayesian information criterion [5]
'loss': Loss value
ic_coef (float, optional, default=1.0) -- Constant that controls the regularization strength on chosen information criterion.
cv (int, optional, default=1) --
The number of folds used for cross-validation.
If cv=1, cross-validation would not be used.
If cv>1, support size will be chosen by CV's test loss, instead of IC.
cv_score ({'test_loss', ...}, optional, default='test_loss') --
The score used on test data for CV.
All methods support {'test_loss'}.
LogisticRegression also supports {'roc_auc'}.
MultinomialRegression also supports {'roc_auc_ovo', 'roc_auc_ovr'}, which indicate "One vs One/Rest" algorithm, respectively.
thread (int, optional, default=1) --
Maximum number of threads.
If thread = 0, the maximum number of threads supported by the device will be used.
A_init (array-like, optional, default=None) -- Initial active set before the first splicing.
always_select (array-like, optional, default=None) -- An array containing the indexes of variables that should always be included in the model. For group selection, it should contain the indexes of groups (starting from 0).
max_iter (int, optional, default=20) -- Maximum number of iterations taken for the splicing algorithm to converge. Because every splicing step must reduce the loss, the algorithm is guaranteed to converge; the iteration cap exists only to simplify the implementation.
is_warm_start (bool, optional, default=True) -- When tuning the optimal parameter combination, whether to use the last solution as a warm start to accelerate the iterative convergence of the splicing algorithm.
screening_size (int, optional, default=-1) --
The number of variables remaining after screening. It should be a non-negative number smaller than p, but larger than any value in support_size.
If screening_size=-1, screening will not be used.
If screening_size=0, screening_size will be set as \(\min(p, int(n / (\log(\log(n))\log(p))))\).
primary_model_fit_max_iter (int, optional, default=10) -- The maximum number of iterations for primary_model_fit.
primary_model_fit_epsilon (float, optional, default=1e-08) -- The epsilon (threshold) of iteration for primary_model_fit.
splicing_type ({0, 1}, optional, default=0) -- The type of splicing: "0" for decreasing by half, "1" for decreasing by one.
important_search (int, optional, default=128) -- The size of the inactive set considered when updating the active set during splicing. It should be a non-negative integer; if important_search=0, it is set to the size of the whole inactive set.
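The default bound quoted above for support_size and s_max can be evaluated directly. The following is a small sketch of that formula only; the helper name default_max_support_size is hypothetical and is not part of abess.

```python
import math

def default_max_support_size(n: int, p: int) -> int:
    """Evaluate min(n, int(n / (log(log(n)) * log(p)))) from the parameter docs.

    Hypothetical helper for illustration; it is not an abess function.
    Requires n > e (about 2.72) and p > 1 so that both logarithms are positive.
    """
    return min(n, int(n / (math.log(math.log(n)) * math.log(p))))

# Sample sizes matching the examples on this page (n=100, p=50).
bound = default_max_support_size(100, 50)
print(bound)  # 16
```

With these sizes, the documented default range(bound) would cover support sizes 0 through 15.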
Examples
Results may differ across versions of NumPy.
>>> ### Sparsity known
>>>
>>> from abess.linear import MultinomialRegression
>>> from abess.datasets import make_multivariate_glm_data
>>> import numpy as np
>>> np.random.seed(12345)
>>> data = make_multivariate_glm_data(
...     n = 100, p = 50, k = 10, M = 3, family = 'multinomial')
>>> model = MultinomialRegression(support_size = 10)
>>> model.fit(data.x, data.y)
MultinomialRegression(support_size=10)
>>> model.predict(data.x)[:10, ]
array([0, 2, 0, 0, 1, 1, 1, 1, 1, 0])

>>> ### Sparsity unknown
>>>
>>> # path_type="seq"
>>> model = MultinomialRegression(path_type = "seq")
>>> model.fit(data.x, data.y)
MultinomialRegression()
>>> model.predict(data.x)[:10, ]
array([0, 2, 0, 0, 1, 1, 1, 1, 1, 0])
>>>
>>> # path_type="gs"
>>> model = MultinomialRegression(path_type="gs")
>>> model.fit(data.x, data.y)
MultinomialRegression(path_type='gs')
>>> model.predict(data.x)[:10, ]
array([0, 2, 0, 0, 1, 1, 1, 1, 1, 0])
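For intuition about what path_type="gs" does differently from "seq", here is a generic integer golden-section search over support sizes. It is a sketch under assumptions, not the abess implementation: the function name, the stopping rule, and the toy criterion are hypothetical, and the search is only valid if the criterion is unimodal in the support size.

```python
import math

def golden_section_support_size(criterion, s_min, s_max):
    """Search the integers in [s_min, s_max] for the minimizer of `criterion`.

    Generic golden-section search, assuming `criterion` is unimodal.
    Illustration only; not the routine used inside abess.
    """
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    cache = {}

    def f(s):
        # Evaluate each distinct support size once (one model fit per size).
        if s not in cache:
            cache[s] = criterion(s)
        return cache[s]

    lo, hi = s_min, s_max
    while hi - lo > 2:
        step = int(round(inv_phi * (hi - lo)))
        left, right = hi - step, lo + step  # two interior probe points
        if left >= right:  # probes can coincide on short intervals
            left, right = lo + 1, hi - 1
        if f(left) <= f(right):
            hi = right  # the minimum lies in [lo, right]
        else:
            lo = left   # the minimum lies in [left, hi]
    # At most three candidates remain; compare them directly.
    return min(range(lo, hi + 1), key=f)

# Toy criterion with its minimum at support size 7 (made up for illustration).
best = golden_section_support_size(lambda s: (s - 7) ** 2, 0, 16)
print(best)  # 7
```

The sequential path evaluates every size in support_size, whereas a search of this kind probes only a handful of sizes, which is the trade-off the two path_type options expose.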
- coef_
Estimated coefficients for the best subset selection problem.
- Type
array-like, shape(p_features, ) or (p_features, M_responses)
- intercept_
The intercept in the model when fit_intercept=True.
- Type
float or array-like, shape(M_responses,)
- eval_loss_
If cv=1, it stores the score under chosen information criterion.
If cv>1, it stores the test loss under cross-validation.
- Type
References
[1] Zhu, J., Wen, C., Zhu, J., Zhang, H., & Wang, X. (2020). A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52), 33117-33123.
[2] Tang, B., Zhu, J., Zhu, J., Wang, X., & Zhang, H. (2023). A Consistent and Scalable Algorithm for Best Subset Selection in Single Index Models. arXiv preprint arXiv:2309.06230.
[3] Zhu, J., Zhu, J., Tang, B., Chen, X., Lin, H., & Wang, X. (2023). Best-subset selection in generalized linear models: A fast and consistent algorithm via splicing technique. arXiv preprint arXiv:2308.00251.
[4] Zhang, Y., Zhu, J., Zhu, J., & Wang, X. (2023). A splicing approach to best subset of groups selection. INFORMS Journal on Computing, 35(1), 104-119.
[5] Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759-771.
- __init__(path_type='seq', support_size=None, s_min=None, s_max=None, group=None, alpha=None, fit_intercept=True, ic_type='ebic', ic_coef=1.0, cv=1, cv_score='test_loss', thread=1, A_init=None, always_select=None, max_iter=20, exchange_num=5, is_warm_start=True, splicing_type=0, important_search=128, screening_size=-1, primary_model_fit_max_iter=10, primary_model_fit_epsilon=1e-08)
- predict_proba(X)
Give the probabilities of new data being assigned to different classes.
- Parameters
X (array-like, shape(n_samples, p_features)) -- Sample matrix to be predicted.
- Returns
proba -- Returns the probability of given samples for each class. Each column indicates one class.
- Return type
array-like, shape(n_samples, M_responses)
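A common way to turn multinomial coefficients into class probabilities is a row-wise softmax over the linear scores. The sketch below assumes that form and uses made-up coefficients laid out like coef_ and intercept_; the helper name is hypothetical and the code is not taken from abess.

```python
import math

def softmax_proba(X, coef, intercept):
    """Row-wise softmax of X @ coef + intercept, using plain Python lists.

    X: n_samples rows of p_features values.
    coef: p_features rows of M_responses values (same layout as coef_).
    intercept: M_responses values (same layout as intercept_).
    Assumes the usual softmax link; illustration only, not abess source code.
    """
    proba = []
    for row in X:
        scores = [
            sum(x_j * coef[j][m] for j, x_j in enumerate(row)) + intercept[m]
            for m in range(len(intercept))
        ]
        shift = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        proba.append([e / total for e in exps])
    return proba

# Made-up numbers: 2 samples, 2 features, 3 classes.
X = [[1.0, 0.0], [0.0, 2.0]]
coef = [[0.5, -0.5, 0.0], [0.0, 1.0, -1.0]]
intercept = [0.0, 0.0, 0.0]
proba = softmax_proba(X, coef, intercept)
```

Each row of proba sums to one and each column corresponds to one class, matching the return shape described above.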
- predict(X)
Return the most probable class for the given data.
- Parameters
X (array-like, shape(n_samples, p_features)) -- Sample matrix to be predicted.
- Returns
y -- Predicted class label for each sample in X.
- Return type
array-like, shape(n_samples, )
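Choosing the most probable class amounts to taking, for each row of a probability matrix such as the one predict_proba returns, the index of the largest column. A minimal sketch with made-up probabilities follows; the helper name is hypothetical.

```python
def most_probable_class(proba):
    """Return, for each row, the column index holding the largest probability.

    Ties resolve to the lowest index. Illustration only.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in proba]

# Made-up probabilities: 3 samples, 3 classes.
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
labels = most_probable_class(proba)
print(labels)  # [0, 2, 1]
```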
- score(X, y, sample_weight=None)
Given new data, return the prediction accuracy.
- Parameters
X (array-like, shape(n_samples, p_features)) -- Test data.
y (array-like, shape(n_samples, M_responses)) -- Test response (dummy variables of real class).
sample_weight (array-like, shape(n_samples,), default=None) -- Sample weights.
- Returns
score -- the mean prediction accuracy.
- Return type
- fit(X=None, y=None, is_normal=True, sample_weight=None, cv_fold_id=None, sparse_matrix=False, beta_low=None, beta_high=None)
Pass the training data to the algorithm and return the fit result.
- Parameters
X (array-like of shape(n_samples, p_features)) -- Training data matrix. It should be a numpy array.
y (array-like of shape(n_samples,) or (n_samples, M_responses)) --
Training response values. It should be a numpy array.
For regression problem, the element of y should be float.
For classification problem, the element of y should be either 0 or 1. In multinomial regression, the classes are encoded as dummy variables (one 0/1 column per class).
For survival data, y should be an \(n \times 2\) array, where the columns indicate "censoring" and "time", respectively.
is_normal (bool, optional, default=True) -- Whether to normalize the variables before fitting the algorithm.
sample_weight (array-like, shape (n_samples,), optional) -- Individual weights for each sample. Only used for is_weight=True. Default=np.ones(n).
cv_fold_id (array-like, shape (n_samples,), optional, default=None) -- An array indicating the fold of each sample in CV. Samples in the same fold should be given the same number.
sparse_matrix (bool, optional, default=False) -- Set as True to treat X as sparse matrix during fitting. It would be automatically set as True when X has the sparse matrix type defined in scipy.sparse.
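The layouts expected for y and cv_fold_id can be sketched as follows: a 0/1 dummy encoding of class labels (as described for score), and a fold assignment in which samples sharing a fold get the same number. The helper names are hypothetical, numbering folds from 0 is an assumption, and plain lists are used to keep the example dependency-free; in practice the results would be converted to numpy arrays before being passed to fit.

```python
import random

def one_hot(labels, n_classes):
    """Encode integer class labels as rows of 0/1 dummy variables."""
    return [[1 if k == label else 0 for k in range(n_classes)] for label in labels]

def make_fold_ids(n_samples, n_folds, seed=0):
    """Assign each sample a fold number in 0..n_folds-1 with balanced fold sizes.

    Numbering from 0 is an assumption made for this illustration.
    """
    fold_ids = [i % n_folds for i in range(n_samples)]
    random.Random(seed).shuffle(fold_ids)  # avoid tying folds to row order
    return fold_ids

labels = [0, 2, 1, 1, 0, 2]               # made-up class labels
y = one_hot(labels, n_classes=3)          # 6 rows, 3 columns
cv_fold_id = make_fold_ids(len(labels), n_folds=3)
```

Passing a fixed cv_fold_id makes the cross-validation split reproducible across runs, since the folds no longer depend on internal randomization.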
- get_params(deep=True)
Get parameters for this estimator.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
**params (dict) -- Estimator parameters.
- Returns
self -- Estimator instance.
- Return type
estimator instance
- __new__(*args, **kwargs)