Interpretable machine learning for mortality modeling on patients with chronic diseases considering the COVID-19 pandemic in a region of Chile: A Shapley value based approach

Barría-Sandoval, Claudia Paz; Ferreira, Guillermo Patricio; Espinoza Venegas, Maritza; Marchant, Vicente

Keywords: mortality, chronic diseases, Shapley value, palliative care, machine learning, interpretability, COVID-19

Abstract

The main objective of this study was to evaluate and interpret different machine learning models to predict the probability of mortality from chronic oncological diseases (COD), chronic non-oncological diseases (CNOD), and COVID-19 in the Biobío Region, Chile, from 2016 to 2022. In this study, the causes of death attributed to COD and CNOD were recognized as conditions that would have necessitated palliative care. This was a retrospective cohort study of mortality data from the Chilean Ministry of Health. A total of 57,623 mortality records due to chronic diseases in the Biobío Region were considered over the study years. Data characteristics included sociodemographic factors (age, gender, residence, place and date of death) and causes of death. Seven classification models were trained to predict the probability of mortality from COD, CNOD and COVID-19: Multinomial Regression, Random Forest, Decision Tree, Support Vector Machine, Naive Bayes, XGBoost and Neural Networks. The calibration, discrimination and accuracy of these models were evaluated. Additionally, Shapley Additive Explanations (SHAP) values were used to assess the interpretability of the models, and the BorutaShap algorithm was used for variable selection. The XGBoost, Random Forest and Multinomial Regression models had the best prediction performances. In all prediction cases, XGBoost had a slight advantage over the Random Forest and Multinomial Regression models, with an average global AUROC for repeated cross-validation of 0.624, 0.600, and 0.627, respectively. In addition, XGBoost achieved a global average accuracy of 0.642, compared to 0.641 and 0.633 for the other two models mentioned above. The variables selected by the BorutaShap method were age, place of death and date of death. XGBoost can be used to predict the probability of death from COD, CNOD and COVID-19 in mortality data from the Biobío Region. This model can be useful for allocating palliative care resources more effectively to the people who require them.
Among the relevant variables for the prediction were date of death, place of death and age.
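
To illustrate the idea behind the Shapley value attributions mentioned in the abstract, the following is a minimal sketch that computes exact Shapley values for a toy scoring function by enumerating feature subsets. It is not the authors' pipeline and does not use the SHAP or BorutaShap libraries; the function `toy_score`, the feature labels, the coefficients and the all-zero baseline are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical feature labels and toy model (assumptions for illustration only).
FEATURES = ["age", "place_of_death", "date_of_death"]


def toy_score(x):
    # Linear terms plus one interaction between the first and third features.
    return 0.5 * x[0] + 0.3 * x[1] + 0.2 * x[0] * x[2]


instance = [1.0, 1.0, 1.0]   # the record being explained (hypothetical)
baseline = [0.0, 0.0, 0.0]   # reference values for "absent" features (hypothetical)


def value_fn(coalition):
    # Features outside the coalition are replaced by their baseline value.
    x = [instance[j] if j in coalition else baseline[j]
         for j in range(len(instance))]
    return toy_score(x)


phi = shapley_values(value_fn, len(FEATURES))
for name, contribution in zip(FEATURES, phi):
    print(f"{name}: {contribution:.3f}")
```

In this toy case the attributions sum to the difference between the score of the instance and the score of the baseline, which is the additivity property that gives SHAP its name. Exact enumeration grows exponentially with the number of features, so practical tools rely on approximations or model-specific algorithms instead.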

More information

Journal title: RESEARCH IN STATISTICS
Volume: 1
Publisher: Taylor & Francis
Publication date: 2023
First page: 1
Last page: 8
Language: English
URL: https://doi.org/10.1080/27684520.2023.2240334