Artifact-Based Domain Generalization of Skin Lesion Models
Deep Learning failure cases are abundant, particularly in the medical area. Recent studies in out-of-distribution generalization have advanced considerably on well-controlled synthetic datasets, but they do not represent medical imaging contexts. We propose a pipeline that relies on artifacts annotation to enable generalization evaluation and debiasing for the challenging skin lesion analysis context. First, we partition the data into levels of increasingly higher biased training and test sets for better generalization assessment. Then, we create environments based on skin lesion artifacts to enable domain generalization methods. Finally, after robust training, we perform a test-time debiasing procedure, reducing spurious features in inference images. Our experiments show our pipeline improves performance metrics in biased cases, and avoids artifacts when using explanation methods. Still, when evaluating such models in out-of-distribution data, they did not prefer clinically-meaningful features. Instead, performance only improved in test sets that present similar artifacts from training, suggesting models learned to ignore the known set of artifacts. Our results raise a concern that debiasing models towards a single aspect may not be enough for fair skin lesion analysis.
READ FULL TEXT