Establishing guidelines for prediction models in medical deep learning is essential

The increase in scientific publications on deep learning for cancer diagnostics in recent years is impressive, but the translation of promising prototypes into automated systems for medical use remains modest. In a recent issue of the scientific journal "Nature Machine Intelligence", Paula Dhiman and colleagues published a comment highlighting the importance of planning evaluations of deep learning systems in advance by predefining study protocols.

Andreas Kleppe, Ole-Johan Skrede and Knut Liestøl from the Institute for Cancer Genetics and Informatics at Oslo University Hospital commend Dhiman and colleagues for highlighting the importance of good study design, and have now published a response to this comment in the January issue of "Nature Machine Intelligence".

Challenges in validating prediction models

Prototypes for medical deep learning systems are frequently claimed to perform comparably to or better than clinicians. Even among the best studies evaluating external cohorts, few predefine the primary analysis, which can lead to over-optimistic results due to adaptations of the system, patient selection, or analysis methodology. The lack of stringent evaluation on external data and the development or evaluation of systems on data that are narrow or inappropriate for the intended medical setting are significant concerns. This over-promising will erode trust in the technology and may hinder its adoption in the clinic. More concerning is the use of prediction models that have not been properly tested, which may result in harm to patients due to decisions made based on ill-founded evidence.

Recommended guidelines

In an article published in Nature Reviews Cancer in 2021, "Designing deep learning studies in cancer diagnostics", Kleppe et al. defined a list of recommended protocol items for external cohort evaluation of a deep learning system (PIECES). Among other recommendations, PIECES advocates explicit specification of the primary analysis and any pre-planned secondary analyses that authors wish to commit themselves to report on, and requests that authors describe precisely how the proposed system was developed and how its performance will be assessed.

Since the PIECES article was published, many publications have cited it in support of the need for predefined analyses and external cohort validation, and some have explicitly followed the guidelines.

Implementing these guidelines can enhance the medical utilisation of deep learning systems by way of proper evaluation and translation of promising prototypes into verified systems in clinical practice. Kleppe and colleagues additionally suggest incentives that may increase the uptake of the practice, for example through endorsement from investors, funders and publishers.