During the development and deployment of any machine learning method, being able to gauge real-world performance is a crucial step when choosing from the zoo of available options. For computer vision tasks, ablation studies that omit parts of the network have become the standard; for tabular machine learning, K-fold cross-validation is used. For survival analysis methods, however, K-fold scores alone are not necessarily a good indicator of practical effectiveness. Hence, in this paper we present a score-agnostic methodology based on training data manipulation that allows thorough assessment and comparison of the performance and stability of machine learning survival models of various kinds in an explainable manner.
Cox proportional hazards, survival tree, random survival forest, gradient boosted survival analysis, and mixture density neural network models were trained on multiple datasets of varying size and ratio of censored patients. Each model was subsequently tuned with Bayesian hyperparameter optimization. Several metrics were calculated: Uno's concordance index, the integrated Brier score, the mean cumulative/dynamic AUROC, and the log-rank score. For each optimized model, a censoring sensitivity analysis was performed to test robustness by artificially reducing the time horizon of the study and by randomly increasing the number of censored patients.
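As a minimal sketch of the two manipulations described above (function names, the boolean event encoding, and the usage values below are illustrative assumptions, not taken from the paper), the snippet administratively censors observations beyond a reduced horizon and randomly relabels a fraction of observed events as censored:

```python
import numpy as np

def reduce_time_horizon(time, event, tau):
    # Administrative censoring at the shortened study horizon tau:
    # anything observed after tau is treated as censored at tau.
    new_event = event & (time <= tau)
    new_time = np.minimum(time, tau)
    return new_time, new_event

def increase_censoring(time, event, fraction, seed=None):
    # Randomly relabel the given fraction of observed events as censored,
    # leaving the recorded follow-up times untouched.
    rng = np.random.default_rng(seed)
    event_idx = np.flatnonzero(event)
    n_flip = int(round(fraction * event_idx.size))
    flipped = rng.choice(event_idx, size=n_flip, replace=False)
    new_event = event.copy()
    new_event[flipped] = False
    return time, new_event

# Hypothetical usage on boolean event indicators and follow-up times:
# t2, e2 = reduce_time_horizon(time, event, tau=365.0)
# t3, e3 = increase_censoring(time, event, fraction=0.2, seed=0)
```

The manipulated arrays can then be rescored with the listed metrics, for instance via scikit-survival's concordance_index_ipcw (Uno's C), integrated_brier_score, and cumulative_dynamic_auc, assuming that library is used as the evaluation backend.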
To assess the quality degradation caused by the data manipulation, likelihood-ratio tests were performed between the original and the modified datasets; the resulting heat maps provided a baseline against which the heat maps produced by scoring the models could be compared. We found that the time-horizon reduction affects the models less than the data manipulation alone would suggest, while increasing the number of censored instances impacts the models substantially more.
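One plausible way to obtain such a likelihood-ratio baseline (an illustration under our own assumptions, not necessarily the paper's exact setup) is to fit a Cox model on each dataset variant with lifelines and compare the likelihood-ratio statistics of the fitted models against their null models:

```python
from lifelines import CoxPHFitter

def lr_statistic(df, duration_col="time", event_col="event"):
    # df is a pandas DataFrame of covariates plus duration/event columns;
    # the column names here are illustrative. Returns the likelihood-ratio
    # test statistic of the fitted Cox model against the null model.
    cph = CoxPHFitter()
    cph.fit(df, duration_col=duration_col, event_col=event_col)
    return cph.log_likelihood_ratio_test().test_statistic

# One cell of a baseline heat map: the drop in the LR statistic when a
# manipulated dataset (e.g. reduced horizon or added censoring) replaces
# the original one.
# degradation = lr_statistic(original_df) - lr_statistic(manipulated_df)
```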