.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/1-glm/plot_4_CoxRegression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_1-glm_plot_4_CoxRegression.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_1-glm_plot_4_CoxRegression.py:

=================================
Survival Analysis: Cox Regression
=================================

.. GENERATED FROM PYTHON SOURCE LINES 7-59

Cox Proportional Hazards Regression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Cox proportional hazards (CoxPH) regression describes survival in terms of several
covariates. The difference between CoxPH regression and Kaplan-Meier curves or
log-rank tests is that the latter model survival according to a single factor
(ideally a categorical predictor), while the former can take any number of
covariates into account simultaneously, whether quantitative or categorical.
The model is:

.. math::
    h(t) = h_0(t)\exp(\eta),

where:

- :math:`\eta = x\beta`;
- :math:`t` is the survival time;
- :math:`h(t)` is the hazard function, which evaluates the risk of dying at time :math:`t`;
- :math:`h_0(t)` is the baseline hazard, i.e. the value of the hazard when all predictors are zero;
- :math:`\beta` measures the impact of the covariates.

Consider two cases :math:`i` and :math:`i'` that have different :math:`x` values.
Their hazard functions can be written as

.. math::
    h_i(t) = h_0(t)\exp(\eta_i) = h_0(t)\exp(x_i\beta),

and

.. math::
    h_{i'}(t) = h_0(t)\exp(\eta_{i'}) = h_0(t)\exp(x_{i'}\beta).

The hazard ratio for these two cases is

.. math::
    \frac{h_i(t)}{h_{i'}(t)} & = \frac{h_0(t)\exp(\eta_i)}{h_0(t)\exp(\eta_{i'})} \\
    & = \frac{\exp(\eta_i)}{\exp(\eta_{i'})},

which is independent of time.
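This time-invariance is easy to check numerically. Below is a minimal sketch with a made-up baseline hazard :math:`h_0(t) = 0.1t` and two illustrative covariate vectors; none of these numbers come from the dataset analyzed below.

```python
import numpy as np

# Illustrative (made-up) coefficients and covariate vectors.
beta = np.array([0.5, -0.2])
x_i = np.array([1.0, 2.0])
x_ip = np.array([0.0, 1.0])

def hazard(x, t):
    """h(t) = h0(t) * exp(x beta), with an arbitrary baseline h0(t) = 0.1 * t."""
    return 0.1 * t * np.exp(x @ beta)

# The ratio h_i(t) / h_{i'}(t) is the same at every time point:
ratios = [hazard(x_i, t) / hazard(x_ip, t) for t in (1.0, 5.0, 10.0)]
print(ratios)  # each element equals exp((x_i - x_ip) @ beta)
```

Because the baseline :math:`h_0(t)` cancels in the ratio, only the difference of the linear predictors survives.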
Lung Cancer Dataset Analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We are going to apply best subset selection to the NCCTG Lung Cancer Dataset from
https://www.kaggle.com/ukveteran/ncctg-lung-cancer-data. This dataset contains
survival information on patients with advanced lung cancer from the North Central
Cancer Treatment Group. The proportional hazards model allows survival data to be
analyzed by regression modeling: linearity is assumed on the log scale of the
hazard, and the hazard ratio is assumed constant over time.

First, we load the data.

.. GENERATED FROM PYTHON SOURCE LINES 59-66

.. code-block:: Python

    import pandas as pd

    data = pd.read_csv('cancer.csv')
    data = data.drop(data.columns[[0, 1]], axis=1)
    print(data.head())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       time  status  age  sex  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
    0   306       2   74    1      1.0      90.0      100.0    1175.0      NaN
    1   455       2   68    1      0.0      90.0       90.0    1225.0     15.0
    2  1010       1   56    1      0.0      90.0       90.0       NaN     15.0
    3   210       2   57    1      1.0      90.0       60.0    1150.0     11.0
    4   883       2   60    1      0.0     100.0       90.0       NaN      0.0

.. GENERATED FROM PYTHON SOURCE LINES 67-69

Then we remove the rows containing any missing data. After that, we have a total
of 168 observations.

.. GENERATED FROM PYTHON SOURCE LINES 69-73

.. code-block:: Python

    data = data.dropna()
    print(data.shape)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (168, 9)

.. GENERATED FROM PYTHON SOURCE LINES 74-75

Then we convert the categorical variable ``ph.ecog`` into dummy variables:

.. GENERATED FROM PYTHON SOURCE LINES 75-81

.. code-block:: Python

    data['ph.ecog'] = data['ph.ecog'].astype("category")
    data = pd.get_dummies(data)
    data = data.drop('ph.ecog_0.0', axis=1)
    print(data.head())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       time  status  age  sex  ...  wt.loss  ph.ecog_1.0  ph.ecog_2.0  ph.ecog_3.0
    1   455       2   68    1  ...     15.0        False        False        False
    3   210       2   57    1  ...     11.0         True        False        False
    5  1022       1   74    1  ...      0.0         True        False        False
    6   310       2   68    2  ...     10.0        False         True        False
    7   361       2   71    2  ...      1.0        False         True        False

    [5 rows x 11 columns]

.. GENERATED FROM PYTHON SOURCE LINES 82-85

We split the dataset into a training set and a test set. The model will be built
on the training set, and later we will evaluate its performance on the test set.

.. GENERATED FROM PYTHON SOURCE LINES 85-97

.. code-block:: Python

    import numpy as np

    np.random.seed(0)
    # Note: this is a deterministic first-two-thirds split; the seed has no effect here.
    ind = np.linspace(1, 168, 168) <= round(168 * 2 / 3)
    train = np.array(data[ind])
    test = np.array(data[~ind])

    print('train size: ', train.shape[0])
    print('test size:', test.shape[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    train size:  112
    test size: 56

.. GENERATED FROM PYTHON SOURCE LINES 98-107

Model Fitting
~~~~~~~~~~~~~

The ``CoxPHSurvivalAnalysis()`` function in the ``abess`` package allows us to
perform best subset selection in a highly efficient way. By default, the function
implements the abess algorithm with the support size (sparsity level) changing
from 0 to :math:`\min\{p, n/\log(n)p\}`, and the best support size is determined
by EBIC. You can change the tuning criterion by specifying the argument
``ic_type`` and the candidate support sizes by ``support_size``. The available
tuning criteria are ``"gic"``, ``"aic"``, ``"bic"`` and ``"ebic"``. Here we give
an example.

.. GENERATED FROM PYTHON SOURCE LINES 107-113

.. code-block:: Python

    from abess import CoxPHSurvivalAnalysis

    model = CoxPHSurvivalAnalysis(ic_type='gic')
    model.fit(train[:, 2:], train[:, :2])

.. raw:: html

    CoxPHSurvivalAnalysis(ic_type='gic')

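Before reading off the estimated coefficients, it may help to recall how a Cox coefficient is interpreted: a one-unit increase in a covariate multiplies the hazard by :math:`\exp(\beta_j)`, holding the other covariates fixed. A minimal sketch with a hypothetical coefficient value (illustrative only, not the fitted one):

```python
import numpy as np

# Hypothetical log hazard ratio per year of age (made-up value for illustration).
beta_age = 0.02

# Hazard ratio implied for a 10-year age difference, all else held equal.
hr_10_years = np.exp(10 * beta_age)
print(hr_10_years)
```

A value above 1 would mean older patients face a proportionally higher hazard at every time point, which is exactly the proportional-hazards assumption.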
.. GENERATED FROM PYTHON SOURCE LINES 114-116

After fitting, the coefficients are stored in ``model.coef_``, and the non-zero
values indicate the variables used in our model.

.. GENERATED FROM PYTHON SOURCE LINES 116-119

.. code-block:: Python

    print(model.coef_)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [ 0.         -0.379564    0.02248522  0.          0.          0.
      0.43729712  1.42127851  2.42095755]

.. GENERATED FROM PYTHON SOURCE LINES 120-122

This result shows that 5 variables (the 2nd, 3rd, 7th, 8th and 9th) are chosen
into the Cox model. A further analysis can then be based on them.

.. GENERATED FROM PYTHON SOURCE LINES 124-131

More on the results
~~~~~~~~~~~~~~~~~~~

Hold on, we haven't finished yet. After getting the estimator, we can explore
further. For example, we can quickly extract more information from these
estimators with a few generic steps. By simply fixing ``support_size`` at
different levels, we can plot a path of coefficients:

.. GENERATED FROM PYTHON SOURCE LINES 131-151

.. code-block:: Python

    import matplotlib.pyplot as plt

    coef = np.zeros((10, 9))
    ic = np.zeros(10)
    for s in range(10):
        model = CoxPHSurvivalAnalysis(support_size=s, ic_type='gic')
        model.fit(train[:, 2:], train[:, :2])
        coef[s, :] = model.coef_
        ic[s] = model.eval_loss_

    for i in range(9):
        plt.plot(coef[:, i], label=i)
    plt.xlabel('support_size')
    plt.ylabel('coefficients')
    plt.title('ABESS Path')
    plt.legend()
    plt.show()

.. image-sg:: /auto_gallery/1-glm/images/sphx_glr_plot_4_CoxRegression_001.png
   :alt: ABESS Path
   :srcset: /auto_gallery/1-glm/images/sphx_glr_plot_4_CoxRegression_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 152-153

Or a view of the evolution of the information criterion:

.. GENERATED FROM PYTHON SOURCE LINES 153-160

.. code-block:: Python

    plt.plot(ic, 'o-')
    plt.xlabel('support_size')
    plt.ylabel('GIC')
    plt.title('Model selection via GIC')
    plt.show()

.. image-sg:: /auto_gallery/1-glm/images/sphx_glr_plot_4_CoxRegression_002.png
   :alt: Model selection via GIC
   :srcset: /auto_gallery/1-glm/images/sphx_glr_plot_4_CoxRegression_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 161-166

Prediction is allowed for all of the estimated models. Just call the
``predict()`` function under the model you are interested in. The values it
returns are :math:`\exp(\eta)=\exp(x\beta)`, which is the covariate-dependent
factor of the Cox PH hazard function. Here we give the prediction on the
``test`` data.

.. GENERATED FROM PYTHON SOURCE LINES 166-170

.. code-block:: Python

    pred = model.predict(test[:, 2:])
    print(pred)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [11.0015887  11.97954111  8.11705612  3.32130081  2.9957487   3.23167938
      5.88030263  8.83474265  6.94981468  2.79778448  4.80124013  8.32868839
      6.18472356  7.36597245  2.79540785  7.07729092  3.57284073  6.95551265
      3.59051464  8.73668805  3.51029827  4.28617052  5.21830511  5.11465146
      2.92670651  2.31996184  7.04845409  4.30246362  7.14805341  3.83570919
      6.27832924  6.54442227  8.39353611  5.41713824  4.17823079  4.01469621
      8.99693705  3.98562593  3.9922459   2.79743549  3.47347931  4.40471703
      6.77413094  4.33542254  6.62834299  9.99006885  8.1177072  20.28383502
     14.67346807  2.27915833  5.78151822  4.31221688  3.25950636  6.99318596
      7.4368521   3.86339324]

.. GENERATED FROM PYTHON SOURCE LINES 171-176

With these predictions, we can compute the hazard ratio between any two
observations by dividing their values. We can also compute the C-index for our
model, i.e., the probability that, for a pair of randomly chosen comparable
samples, the sample with the higher risk prediction will experience an event
before the other sample.

.. GENERATED FROM PYTHON SOURCE LINES 176-181

.. code-block:: Python

    test[:, 1] = test[:, 1] == 2
    cindex = model.score(test[:, 2:], test[:, :2])
    print(cindex)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.6839080459770115

.. GENERATED FROM PYTHON SOURCE LINES 182-188

On this dataset, the C-index is about 0.68.

We can also perform prediction in terms of the survival or cumulative hazard
function. Here we pick the 9th column (the dummy variable ``ph.ecog_1.0``) and
look at how positive or negative status on it affects each patient's survival
function.

.. GENERATED FROM PYTHON SOURCE LINES 188-207

.. code-block:: Python

    surv_fns = model.predict_survival_function(train[:, 2:])
    time_points = np.quantile(train[:, 0], np.linspace(0, 0.6, 100)).astype(float)
    legend_handles = []
    legend_labels = []
    _, ax = plt.subplots(figsize=(9, 6))
    for fn, label in zip(surv_fns, train[:, 8].astype(int)):
        line, = ax.step(time_points, fn(time_points), where="post",
                        color="C{:d}".format(label), alpha=0.5)
        if len(legend_handles) <= label:
            name = "positive" if label == 1 else "negative"
            legend_labels.append(name)
            legend_handles.append(line)

    ax.legend(legend_handles, legend_labels)
    ax.set_xlabel("time")
    ax.set_ylabel("Survival probability")
    ax.set_title("Survival curve plot")
    ax.grid(True)

.. image-sg:: /auto_gallery/1-glm/images/sphx_glr_plot_4_CoxRegression_003.png
   :alt: Survival curve plot
   :srcset: /auto_gallery/1-glm/images/sphx_glr_plot_4_CoxRegression_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 208-214

The graph above shows that patients with positive status on ``ph.ecog_1.0`` tend
to have a better prognosis.

The ``abess`` R package also supports CoxPH regression. For the R tutorial,
please view https://abess-team.github.io/abess/articles/v05-coxreg.html.

.. sphinx_gallery_thumbnail_number = 3

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.970 seconds)

.. _sphx_glr_download_auto_gallery_1-glm_plot_4_CoxRegression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_4_CoxRegression.ipynb <plot_4_CoxRegression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_4_CoxRegression.py <plot_4_CoxRegression.py>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_