THE LANCET Digital Health, November 2019
Presentation of statistical metrics has been described previously in clinical research, epidemiology, and machine learning; however, in-depth discussion specific to deep learning is scarce. Deep learning using convolutional neural networks has gained substantial traction in medicine; however, inconsistencies exist in the literature on how predicted results from deep learning are reported. Given the rapid pace of deep-learning research in the medical field, over-reliance on specific statistical methods (eg, mean absolute error) in recent health-care-related deep-learning papers might not provide complete information. For example, when mean absolute error is used for evaluation, proportional bias cannot be detected, even though detecting it is important when evaluating the agreement between deep-learning-predicted results and ground-truth measurements (appendix p 1). In this instance, additional methods, discussed in this Comment, should be considered to allow a more comprehensive analysis of the results. We hereby discuss considerations and recommendations that might be useful to better assess the quality and potential utility of deep-learning algorithms for both continuous (eg, blood pressure and weight) and binary (eg, disease vs no disease) health outcomes.
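As a minimal sketch of the limitation noted above, the following Python snippet (using hypothetical systolic blood pressure values, not data from any study) shows that two sets of predictions, one that consistently overestimates the ground truth and one that consistently underestimates it, yield identical mean absolute errors; the direction of the error is invisible to this metric alone.

```python
import numpy as np

# Hypothetical ground-truth systolic blood pressure values (mm Hg)
ground_truth = np.array([110.0, 120.0, 130.0, 140.0, 150.0])

# Two hypothetical prediction sets: one overestimates every value by
# 5 mm Hg, the other underestimates every value by 5 mm Hg
pred_over = ground_truth + 5.0
pred_under = ground_truth - 5.0

def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between predictions and ground truth."""
    return np.mean(np.abs(y_pred - y_true))

# Both prediction sets give the same MAE (5.0 mm Hg),
# even though one is biased upwards and the other downwards
print(mean_absolute_error(ground_truth, pred_over))   # 5.0
print(mean_absolute_error(ground_truth, pred_under))  # 5.0

# The mean signed difference, by contrast, reveals the direction of the bias
print(np.mean(pred_over - ground_truth))   #  5.0
print(np.mean(pred_under - ground_truth))  # -5.0
```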
Mean absolute error, with or without scatter plots, is frequently presented for comparisons between the actual ground-truth values of continuous health parameters (eg, blood pressure and weight) and the measurements predicted by deep learning. Mean absolute error is the mean of the absolute differences between the deep-learning-predicted and ground-truth values. This approach potentially masks other useful information: mean absolute error gives no clear indication of whether the model overestimates or underestimates, or by how much. For instance, in appendix p 1, the scatter plot shows the line of best fit obtained by linear regression and seems to suggest that the deep-learning algorithm predicts blood pressure well. However, important information such as systematic and proportional error (ie, when the difference between the two measurements is proportional to the magnitude of the measurements) cannot be determined from this scatter plot. By contrast, Bland-Altman plots allow evaluation of fixed and proportional biases, together with presentation of the limits of agreement (the confidence limits of the difference between the deep-learning-predicted and the ground-truth measurements). The Bland-Altman plot could therefore be a useful statistical and graphical complement to typical correlation analysis.

Mean absolute error might also be complemented with the root mean square error (RMSE), which is more sensitive to outliers. To illustrate this point, we evaluated the correlation between ground truth and deep-learning-predicted values for diastolic blood pressure in two scenarios: with and without obvious outliers (appendix p 2). The mean absolute error values are largely similar in both scenarios (7·54 mm Hg without outliers and 7·46 mm Hg with obvious outliers). However, the values differ more noticeably when evaluated using RMSE (9·58 mm Hg without outliers and 10·30 mm Hg with obvious outliers). This finding suggests that RMSE might be a useful method and complementary to mean absolute error.
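The sketch below, using simulated diastolic blood pressure values rather than the data behind the appendix figures, illustrates how RMSE and the main Bland-Altman quantities (bias, limits of agreement, and a simple regression-based check for proportional bias) can be computed and plotted alongside mean absolute error; the simulated values, noise levels, and outliers are illustrative assumptions only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated ground-truth diastolic blood pressures (mm Hg) and hypothetical
# deep-learning predictions with random error; values are illustrative only
ground_truth = rng.uniform(60, 100, size=100)
predicted = ground_truth + rng.normal(0, 7, size=100)

# Add a few obvious outliers to mimic the scenario described above
predicted_outliers = predicted.copy()
predicted_outliers[:3] += np.array([30.0, -35.0, 40.0])

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# RMSE penalises large errors more heavily than MAE, so it shifts more
# than MAE when obvious outliers are added to the predictions
print(f"MAE:  {mae(ground_truth, predicted):.2f} vs "
      f"{mae(ground_truth, predicted_outliers):.2f} mm Hg")
print(f"RMSE: {rmse(ground_truth, predicted):.2f} vs "
      f"{rmse(ground_truth, predicted_outliers):.2f} mm Hg")

# Bland-Altman quantities: mean bias, 95% limits of agreement, and the slope
# of the differences regressed on the pairwise means (a non-zero slope
# suggests proportional bias)
diff = predicted - ground_truth
mean_pair = (predicted + ground_truth) / 2
bias = diff.mean()
loa_low = bias - 1.96 * diff.std(ddof=1)
loa_high = bias + 1.96 * diff.std(ddof=1)
slope, intercept = np.polyfit(mean_pair, diff, 1)

print(f"Bias: {bias:.2f} mm Hg, limits of agreement: [{loa_low:.2f}, {loa_high:.2f}]")
print(f"Proportional-bias slope: {slope:.3f}")

# Bland-Altman plot: differences against pairwise means, with bias and
# limits of agreement drawn as horizontal reference lines
plt.scatter(mean_pair, diff, s=10)
for y in (bias, loa_low, loa_high):
    plt.axhline(y, linestyle="--")
plt.xlabel("Mean of predicted and ground truth (mm Hg)")
plt.ylabel("Predicted minus ground truth (mm Hg)")
plt.title("Bland-Altman plot (simulated data)")
plt.show()
```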