Mehmet Yigit Avci
November 24, 2023

Guarding AI Safety in Medical Imaging: Data Shift Pitfalls

Watchdog – A Cutting-Edge Monitoring System to Ensure Consistent AI Model Performance and Reliable Patient Care


Ensuring AI safety is paramount in the domain of medical imaging. This research project delves into the challenges posed by data shift in the context of Multiple Sclerosis (MS) lesion segmentation, emphasizing the impact of varying MRI data acquisition parameters on AI model performance. To address this critical issue, we have developed an innovative tool called 'Watchdog' that continuously monitors data shifts and potential performance drops, safeguarding the reliability of AI-driven solutions. Our work showcases the significance of maintaining consistency in AI model performance for accurate patient care, and it has broader implications for enhancing AI safety in medical practice.


In recent years, a growing number of deep learning solutions have passed rigorous regulatory approval processes [1]. These solutions assist radiologists in many parts of the diagnostic process, including early cancer detection, trauma triaging, and routine tasks. However, the effectiveness and reliability of these AI-driven medical solutions depend on the quality of their training data and on how faithfully that data represents the diverse spectrum of real-world clinical images [2]. AI models perform reliably only when the clinical data they are evaluated and applied on closely follows the distribution of the training dataset. Departing from that distribution can lead to unforeseen model behavior and misleading results, which is unacceptable in medical practice, particularly where patient care is concerned. This challenge is commonly referred to as "data shift," and it can have various underlying causes. In medical imaging, changes in data acquisition parameters, such as echo time or slice thickness in MRI, are one of the primary causes of data shift and can significantly affect a model's performance and predictions. For instance, if the training dataset predominantly comprises examples with a slice thickness of 1 mm, but the clinical data seen during deployment has a different slice thickness, the model may produce different results and its performance may deviate. It is therefore essential to monitor both the input clinical data and the AI model's performance, and to raise an alert about any anticipated drop in performance so that potentially misleading diagnoses can be avoided.

The reliability of cutting-edge AI-driven medical solutions hinges on the quality of their training data and their ability to address data shifts, safeguarding the accuracy and dependability of patient care.

In a recent research endeavor named "NeuroTEST", conducted in collaboration between deepc and Landshut University of Applied Sciences, we tackled the challenge of data shift. This project, funded by the German federal government to establish a platform for testing AI models with previously unseen data, involved an extensive exploration of how different data acquisition parameters impact the performance of AI models. Additionally, we delved into potential solutions for monitoring and mitigating this issue in real-world applications. To exemplify the significance of the data shift challenge, we selected Multiple Sclerosis (MS) lesion segmentation on T2 FLAIR Magnetic Resonance Imaging (MRI) images. The choice of this showcase is driven by the unique importance of addressing data shift in the MRI context [3]. MRI acquisition protocols encompass a wide range of sequence parameters, significantly affecting critical aspects such as image contrast, resolution, and signal-to-noise ratio (SNR). While this diversity empowers MR images to convey a broad spectrum of clinical information, it simultaneously introduces considerable heterogeneity across various radiology centers.

Our work encompasses three distinct levels of methodology and experimentation, as depicted in Figure 1, with each step briefly elucidated below:

  • Synthetic Data Generation (Red)
  • Stress Testing (Yellow)
  • Monitoring (a.k.a. Watchdog) (Purple)

Synthetic Data Generation

Recognizing the costliness of repeatedly scanning the same patient with different acquisition parameters, we have devised a methodology to generate new image data with varying acquisition parameters using existing real data. This approach involves simulating acquisition shifts based on the MRI signal equation for a T2w FLAIR sequence, wherein the signal contributions of individual tissues are adjusted by their volume fractions and enhanced with a texture map. All factors, excluding the sequence (such as anatomical structures, DICOM scaling, or texture), are synthesized, as depicted by the blue box in Figure 2, starting from the genuine baseline scan prior to the simulation phase, illustrated in the green box in Figure 2.
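To make the idea more concrete, the sketch below shows one way such a simulation could look. It uses the textbook inversion-recovery spin-echo signal equation, S = PD · (1 − 2·exp(−TI/T1) + exp(−TR/T1)) · exp(−TE/T2), mixes the per-tissue signals by volume fraction, and multiplies by a texture map. This is not the project's implementation: the tissue values in TISSUES, the default TR, and the function names flair_signal and simulate_voxels are illustrative assumptions.

```python
import numpy as np

# Hypothetical relaxation times (ms) and proton densities, for illustration only.
# These values are assumptions and are not taken from the article.
TISSUES = {
    "white_matter": {"t1": 850.0, "t2": 80.0, "pd": 0.70},
    "gray_matter": {"t1": 1300.0, "t2": 100.0, "pd": 0.80},
    "csf": {"t1": 4000.0, "t2": 2000.0, "pd": 1.00},
    "lesion": {"t1": 1400.0, "t2": 200.0, "pd": 0.85},
}


def flair_signal(pd, t1, t2, te, ti, tr):
    """Signed inversion-recovery spin-echo signal of a single tissue."""
    longitudinal = 1.0 - 2.0 * np.exp(-ti / t1) + np.exp(-tr / t1)
    return pd * longitudinal * np.exp(-te / t2)


def simulate_voxels(volume_fractions, te, ti, tr=9000.0, texture=None):
    """Mix per-tissue signals by volume fraction and apply an optional texture map.

    volume_fractions maps a tissue name to an array of per-voxel fractions.
    """
    first = next(iter(volume_fractions.values()))
    image = np.zeros_like(first, dtype=float)
    for name, fraction in volume_fractions.items():
        p = TISSUES[name]
        image += fraction * flair_signal(p["pd"], p["t1"], p["t2"], te, ti, tr)
    image = np.abs(image)  # magnitude image after summing the signed contributions
    if texture is not None:
        image = image * texture
    return image
```

Calling simulate_voxels with the same volume-fraction maps but different TE and TI values would yield differently contrasted versions of the same anatomy, which is the effect the simulation aims for.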


We have successfully synthesized a 7x7 dataset, encompassing a grid of 7 Echo Time (TE) and 7 Inversion Time (TI) values, recognized as the most influential parameters affecting image contrast, as demonstrated in Figure 3. To ensure the fidelity of the simulated data, we further validate it by comparing against actual values extracted from real MRI scans.

Stress Testing

Subsequently, the synthetic data is used to stress test state-of-the-art (SOTA) segmentation networks against data shift. For our experiments, we selected the nnUNet [4] and SegResNet [5] models, as they are trained on a dataset characterized by heterogeneous contrast, originating from various field strengths and diverse acquisition protocols. The objective of these stress tests is to assess how well these networks withstand variations in image contrast. Model performance is quantified by the Dice score, which measures the overlap between the model's segmentation predictions and the ground truth. Its dependence on the MRI protocol parameters (TI, TE) is modeled with a second-order polynomial function (Eq. 1). This model allows us to quantify the networks' resilience against shifts in acquisition parameters.
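As an illustration of such a model, a general second-order polynomial in two variables has the form Dice(TE, TI) ≈ c0 + c1·TE + c2·TI + c3·TE² + c4·TE·TI + c5·TI². The sketch below fits these six coefficients to measured Dice scores by ordinary least squares. The exact parameterization of Eq. 1 is not reproduced here, so the term ordering, the fitting method, and the function names are assumptions.

```python
import numpy as np


def design_matrix(te, ti):
    """Columns [1, TE, TI, TE^2, TE*TI, TI^2] of a second-order polynomial."""
    te = np.asarray(te, dtype=float).ravel()
    ti = np.asarray(ti, dtype=float).ravel()
    return np.column_stack([np.ones_like(te), te, ti, te**2, te * ti, ti**2])


def fit_dice_surface(te, ti, dice):
    """Least-squares fit of the six polynomial coefficients (assumed method)."""
    targets = np.asarray(dice, dtype=float).ravel()
    coeffs, *_ = np.linalg.lstsq(design_matrix(te, ti), targets, rcond=None)
    return coeffs


def predict_dice(coeffs, te, ti):
    """Evaluate the fitted polynomial at the given TE/TI values."""
    return design_matrix(te, ti) @ coeffs
```

With a 7x7 grid, te, ti, and dice would each hold 49 values, one per TE/TI combination, which is comfortably more than the six unknowns being fitted.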


Figure 4 presents the performance of both SegResNet and nnUNet. As anticipated, the models perform better with increasing TE and TI values, owing to enhanced contrast between lesions and normal brain tissue. Conversely, performance diminishes toward the edges of the experimental grid. These edges represent mathematical constraints, defined by the minimum and maximum combinations of TE and TI values in real sequences, and they may not align with the boundaries of the typical scan domain. The simulations at these extremes are therefore unlike the training data, which results in significant drops in the Dice score.

Figure 4: Performance of SegResNet and nnUNet for data with different acquisition parameters [6].

Continuous Monitoring/Watchdog

The term "Watchdog" pertains to a continuous monitoring system designed to assess the anticipated performance of the model. As illustrated in Figure 5, the user establishes a defined threshold, against which incoming data is scrutinized based on its parameters. By integrating these data parameters into Equation 1, we can ascertain whether the data falls outside the predefined safe zone. If it does, the system promptly alerts the user about the potential performance drop.

It is imperative to maintain a continuous vigil over the performance of the models to preempt any data-related issues. This vigilance is facilitated by our innovative solution, known as Watchdog.

In our demonstration, we maintain continuous real-time monitoring of MRI data to predict the segmentation models' performance. We achieve this by employing the second-order function derived from the results of the 7x7 dataset in step 2, enabling us to forecast the models' performance by inputting the TE and TI values of the incoming data.

Furthermore, we institute a performance threshold and notify the user if the expected performance falls below this predefined threshold. This ongoing performance monitoring serves to avert misleading outcomes and ensures that we extract the maximum benefit from our models. For more detailed technical insights, please refer to our collaborative work with HAW Landshut [6].
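The monitoring logic described above can be sketched in a few lines. The example assumes a set of already fitted second-order polynomial coefficients and a user-defined Dice threshold. The names WatchdogResult, predict_dice, and check_scan are hypothetical, and the article does not specify how TE and TI are obtained; in practice they might be read from the DICOM header attributes EchoTime and InversionTime.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WatchdogResult:
    te: float
    ti: float
    predicted_dice: float
    alert: bool  # True when the predicted Dice falls below the threshold


def predict_dice(coeffs: Sequence[float], te: float, ti: float) -> float:
    """Evaluate c0 + c1*TE + c2*TI + c3*TE^2 + c4*TE*TI + c5*TI^2."""
    c0, c1, c2, c3, c4, c5 = coeffs
    return c0 + c1 * te + c2 * ti + c3 * te**2 + c4 * te * ti + c5 * ti**2


def check_scan(coeffs: Sequence[float], te: float, ti: float,
               threshold: float) -> WatchdogResult:
    """Flag an incoming scan whose expected performance is below the threshold."""
    predicted = predict_dice(coeffs, te, ti)
    return WatchdogResult(te, ti, predicted, alert=predicted < threshold)
```

A deployment would call check_scan once per incoming series and notify the user whenever alert is True. The threshold itself is a clinical and product decision rather than something the code can determine.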


Figure 5: Watchdog concept. If the performance predicted for incoming data falls below the user-defined threshold, the system issues an alert indicating a potential performance drop.

Building on the steps described above to tackle the data shift challenge, we have introduced an additional safety measure with a dual purpose. First, it aims to enhance the diagnostic process by supporting a high level of reliability and trustworthiness in the results. Second, it aligns with one of our core objectives: prioritizing AI safety within the deep learning domain.

By not only enhancing accuracy but also bolstering safety within the diagnostic process, this initiative seeks to empower radiologists to make more informed and dependable decisions. This, in turn, benefits patients by translating into improved diagnosis and treatment plans, aligning with our overarching commitment to providing the highest quality healthcare solutions.

Moreover, this proof of concept represents a significant milestone in our journey. Its implications extend beyond the current application of MS Lesion Segmentation, holding promise for a wide range of use cases. This has the potential to reshape and elevate various domains through the utilization of AI-driven insights.


  1. “ACR List of FDA cleared AI medical products,” AI Central, ACR Data Science Institute, American College of Radiology, 21 February 2022 (accessed 21 February 2022).
  2. P. Omoumi et al., “To buy or not to buy—evaluating commercial AI solutions in radiology (the ECLAIR guidelines),” Eur Radiol 31(6), 3786–3796 (2021) [doi:10.1007/s00330-020-07684-x].
  3. D. C. Castro, I. Walker, and B. Glocker, “Causality matters in medical imaging,” Nature Communications 11(1), 3673 (2020) [doi:10.1038/s41467-020-17478-w].
  4. Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
  5. Myronenko, A. (2019). 3D MRI Brain Tumor Segmentation Using Autoencoder Regularization. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds) Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2018. Lecture Notes in Computer Science, vol 11384. Springer, Cham.
  6. C. Posselt, M.Y. Avci, M. Yigitsoy, P. Schuenke, C. Kolbitsch, T. Schaeffter, S. Remmele (2023). Simulation of acquisition shifts in T2 Flair MR images to stress test AI segmentation networks. arXiv: