Medical Mayhem: How Poor Data Labeling is Sabotaging AI in Healthcare

Authors: Amina Khalpey, PhD, Kirtana Roopan, Jessa Deckwa, BS, and Zain Khalpey, MD, PhD, FACS


Data labeling is a critical facet of AI in healthcare. Its significance grows in parallel with our reliance on machine learning and other AI technologies. This paper illuminates the role of data labeling in healthcare, outlines the risks of poorly labeled data, and explores the potential of human-machine reinforced learning in building reliable AI platforms.


As the healthcare sector increasingly adopts AI, the accuracy and integrity of data labeling become pivotal. Accurate data labeling profoundly impacts the AI’s task efficiency and helps avert medical errors, potentially saving lives. While digital records have enhanced healthcare delivery, they’ve also introduced unanticipated safety risks and contributed to severe treatment mistakes, stemming from the usability issues of electronic health records (EHRs). Such errors can carry significant weight, particularly when related to medications—wrong drugs, overdoses, and delayed treatments (1).

What is Data Labeling?

Data labeling is considered the foundation of machine learning and artificial intelligence algorithm development. It involves identifying raw data (e.g., images, text files, videos) and adding one or more labels that specify its context, allowing the machine learning model to make accurate predictions (2). This labeling or cleaning process is similar to preparing a data set for logistic regression in that it requires a human in the loop to confirm that the labels are accurate.

The Difference Between Labeled vs. Unlabeled Data in ML Models

Whether a dataset needs to be labeled depends on the type of machine learning application or artificial intelligence model. The labels identify the appropriate data vectors to pull in for model training, from which the model learns to make the best predictions (2).

Labeled Data

Advantages: Used in supervised learning and can be used to determine more actionable insights (forecasting tasks) (2).

Disadvantages: More difficult to acquire and store (time consuming and expensive) (2).

Unlabeled Data

Advantages: Easy to acquire and store; unsupervised learning methods can help discover new clusters of data, allowing for new categorizations when labeling (2).

Disadvantages: More limited in its usefulness (2).
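
To make the distinction concrete, the sketch below trains a nearest-centroid classifier on a handful of labeled vital-sign readings and then applies it to unlabeled ones. The readings, label names, and function names are illustrative assumptions, not data or methods from the cited sources.

```python
from statistics import mean

# Hypothetical toy data: (systolic_bp, heart_rate) feature vectors.
# Labeled examples carry a target; unlabeled examples do not.
labeled = [
    ((118, 70), "normal"),
    ((122, 74), "normal"),
    ((165, 98), "hypertensive"),
    ((172, 102), "hypertensive"),
]
unlabeled = [(120, 72), (168, 100)]


def fit_centroids(examples):
    """Average the feature vectors of each label (nearest-centroid training)."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


centroids = fit_centroids(labeled)
predictions = [predict(centroids, x) for x in unlabeled]
```

Without the labels there would be nothing to average per class, which is why the unlabeled readings can only be scored here and not used for supervised training.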

The Detrimental Effects of Poorly Labeled Data

Poorly labeled data can have serious implications, leading to errors in diagnosis, treatment, and overall patient management. A few instances of medical errors due to poorly labeled healthcare data include:

Inaccessible Information:

In one case, a doctor ordered a blood pressure-lowering drug, amlodipine, for a pediatric patient via the EHR. The physician added comments instructing staff not to administer the medication if the child’s blood pressure fell below a certain threshold. Regrettably, the nursing staff could not access this crucial directive, and the drug was administered contrary to the doctor’s instructions (3).

Human-Machine Reinforced Learning: Paving the Way for Trustworthy AI

Human-machine reinforced learning can drastically mitigate the risk of medical errors due to poorly labeled data. This approach involves a collaborative interaction between humans and AI, allowing both to learn and improve from their interactions. As humans correct AI’s mistakes, the AI learns, thereby enhancing its performance.

Through this process, AI can become more reliable, yielding a trustworthy AI platform. However, this necessitates a collaborative effort from health information technology developers, hospitals, and clinicians to design, implement, and rigorously test EHRs before and after deployment (1).

In the context of EHRs, the Office of the National Coordinator for Health Information Technology (ONC) could prioritize pediatric safety. Current federal requirements for testing EHRs do not adequately detect patient risks arising from usability issues. The ONC could mandate the inclusion of EHR features that focus on pediatric care risks and test those functions for their ability to effectively prevent medical errors (4).

Healthcare Data Labeling: Current State and Challenges

AI and machine learning tools are currently used in healthcare only in limited ways, for two main reasons. The first is the cost of using EHR data. In one descriptive study of randomized controlled trials (RCTs) that used EHRs, the per-patient cost in the 17 EHR-supported trials varied from US $44 to US $2,000, and total RCT costs from US $67,750 to US $5,026,000 (5). In the remaining 172 RCTs (91.0%), EHRs were used as a modality of intervention (5). The second challenge is adapting to multisource data input for data analysis. RCTs are frequently and increasingly conducted with the use of EHRs, but mainly as part of the intervention (5). In some trials, EHRs were used successfully to support recruitment and outcome assessment.

Multimodal Data Parameters in Which Data Labeling Could Have Significant Impact

The application of AI in healthcare typically requires labeled data, or data that has been tagged with information that helps the model to learn. The type of data that needs to be labeled depends largely on the specific use case. Here are some examples:

1. Medical Imaging Data: For AI applications in radiology or pathology, medical images such as X-rays, MRIs, CT scans, and pathology slides need to be labeled. For instance, in a chest X-ray, regions indicating a lung nodule or pneumonia need to be marked.

2. Electronic Health Records (EHR) Data: EHR data can be used for predicting disease progression, readmission risks, or for personalized treatment recommendations. The labeling might involve tagging the onset of a disease, patient symptoms, treatments administered, or other relevant clinical events.

3. Genomic Data: In precision medicine and genomics, genetic data needs to be labeled. This could involve tagging specific genetic variants or mutations that are associated with a particular disease.

4. Sensor Data: In applications that involve monitoring patient vital signs or physical activity (such as in a smartwatch or other wearable device), sensor data needs to be labeled. This could include labeling heart rate data to indicate periods of abnormal heart rhythms, or labeling accelerometer data to indicate different types of physical activity.

5. Medical Literature: For applications like automated medical literature analysis, scientific articles and clinical guidelines can be labeled to train the AI to understand and extract useful information.

The process of labeling data can be time-consuming and requires expert knowledge. For example, medical images often need to be labeled by experienced radiologists, while genomic data might need to be labeled by a geneticist. The quality of the labeled data is critical to the performance of the AI model: inaccurate or inconsistent labels can lead to a model that performs poorly or makes erroneous predictions.
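
One common way to surface inconsistent labels is to have several experts label the same item and compare their answers. The sketch below takes a majority vote per item and flags items where agreement falls below a threshold; the item IDs, labels, and the 0.6 threshold are hypothetical choices for illustration.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Majority-vote each item's labels; flag items below the agreement threshold."""
    consensus = {}
    needs_review = []
    for item_id, votes in annotations.items():
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            consensus[item_id] = label
        else:
            # Too little agreement to trust any single label.
            needs_review.append(item_id)
    return consensus, needs_review


# Hypothetical labels from three annotators per chest X-ray.
annotations = {
    "img_001": ["nodule", "nodule", "nodule"],
    "img_002": ["nodule", "normal", "nodule"],
    "img_003": ["pneumonia", "normal", "nodule"],
}

consensus, needs_review = aggregate_labels(annotations)
```

Items that land in `needs_review` would go back to a senior reader for adjudication instead of entering the training set with an arbitrary label.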

The Profound Impact of Quality Data Labeling on Patient Safety

While it’s challenging to estimate the exact number of lives potentially saved by implementing effective AI systems in healthcare, research suggests that improving data labeling and usability of EHRs can greatly impact patient safety. Here are some additional examples and key findings which show limitations in dataset analysis using AI:

Improper Labeling

An example of a data labeling error leading to mistrust of an AI algorithm comes from Hashimoto et al., who evaluated chest x-ray imaging for possible pneumothorax diagnosis (6). The idea was that AI could accurately diagnose different diseases from chest x-rays by recognizing pathology on the images. The study used a publicly available NIH chest x-ray dataset to label and develop a model that was initially thought to be particularly accurate in diagnosing pneumothorax. Deeper analysis of the algorithm showed that most of the x-rays labeled as containing pneumothorax also showed a chest tube, raising the question of whether, as a result of improper labeling, the AI was recognizing the pneumothorax or the chest tube (7).
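
A simple audit can reveal this kind of shortcut before training: compare how often a suspected confounder appears among positive and negative cases. The sketch below assumes each image has a hypothetical `chest_tube` metadata flag alongside its `pneumothorax` label; the records are invented for illustration and are not drawn from the NIH dataset.

```python
def attribute_rate(records, attribute):
    """Fraction of records in which the given boolean attribute is true."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[attribute]) / len(records)


def confound_check(records, label, attribute):
    """Return the attribute's rate among label-positive and label-negative records."""
    positives = [r for r in records if r[label]]
    negatives = [r for r in records if not r[label]]
    return attribute_rate(positives, attribute), attribute_rate(negatives, attribute)


# Hypothetical image metadata.
records = [
    {"pneumothorax": True, "chest_tube": True},
    {"pneumothorax": True, "chest_tube": True},
    {"pneumothorax": True, "chest_tube": True},
    {"pneumothorax": True, "chest_tube": False},
    {"pneumothorax": False, "chest_tube": False},
    {"pneumothorax": False, "chest_tube": False},
    {"pneumothorax": False, "chest_tube": False},
    {"pneumothorax": False, "chest_tube": False},
]

rate_pos, rate_neg = confound_check(records, "pneumothorax", "chest_tube")
```

A large gap between the two rates suggests the model could learn to detect the treatment device instead of the pathology, so the labeled set may need rebalancing or additional annotation.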

Small Dataset

IBM Watson Health’s cancer AI algorithm was built on a small dataset of hypothetical rather than real patient cases, with very limited input from actual oncologists, meaning that very little real data went into its design. This led to many incorrect treatment recommendations, such as bevacizumab, a cancer drug known to be associated with increased risk of bleeding, for a patient with a past medical history of severe bleeding (6).

Lessons Learned:

There are four broad categories of usability challenges (8):

1. Information display: EHRs may display data in confusing ways, or data may be hard to locate or missing.

2. Difficult data entry: Entering data in an EHR can be challenging, which may cause delays for orders and lead to clinicians using workaround solutions.

3. System feedback: In some situations, EHRs may not clearly communicate to users that an action has been taken, such as when a patient has already received medication.

4. Workflow support: Challenges can occur when clinicians must share information or tasks with others on the care team or across departments.

The Role of AI in Preventing These Errors:

AI can potentially help prevent these errors by providing more intuitive user interfaces, better decision support, and real-time error checking. For instance, AI can:

1. Improve information display: AI algorithms can help design more intuitive user interfaces that present information in a clear and easy-to-understand manner.

2. Simplify data entry: AI can assist in automating data entry processes or providing smart suggestions, reducing the chance of errors.

3. Enhance system feedback: AI can provide real-time feedback to ensure that all necessary actions have been taken and can alert clinicians if there is a potential error.

4. Support complex workflows: AI can help manage and coordinate complex workflows, ensuring that all team members are on the same page and reducing the chance of miscommunication.

Privacy Considerations in Data Labeling

Protecting patient data privacy is of paramount importance when labeling healthcare data for AI use. Medical records contain sensitive personal health information, and violating this privacy can lead to severe legal and ethical consequences. Therefore, privacy-preserving methods should be employed during data labeling. For instance, de-identification techniques can be used to remove personally identifiable information from the data. Also, advanced techniques like differential privacy, federated learning, or homomorphic encryption can enable data to be used and analyzed without revealing sensitive information.

However, these privacy-preserving methods are not foolproof and require careful implementation. In addition, they must be complemented by strong privacy policies and practices, including secure data storage and transfer, stringent access controls, and regular privacy audits.
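
As a minimal sketch of de-identification before labeling, the code below drops a few direct identifiers and replaces the patient ID with a keyed hash, so that labelers can still link records from the same patient. The field names and the key are hypothetical, and this is pseudonymization of a toy record only: real de-identification must cover many more identifier types (dates, locations, free text) and should follow the applicable regulation.

```python
import hashlib
import hmac

# Hypothetical set of direct-identifier fields; a real list would be longer.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def deidentify(record, secret_key):
    """Drop direct identifiers and replace the patient ID with a keyed hash."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    digest = hmac.new(
        secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256
    )
    # A stable pseudonym lets records from one patient be linked without the ID.
    cleaned["pseudonym"] = digest.hexdigest()[:16]
    return cleaned


record = {
    "patient_id": 10482,
    "name": "Jane Doe",
    "phone": "555-0100",
    "diagnosis": "pneumonia",
    "systolic_bp": 128,
}

safe_record = deidentify(record, secret_key=b"example-key-not-for-production")
```

The key must be stored separately from the labeled data; anyone holding both could recompute the mapping from IDs to pseudonyms.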

Bias in Data Labeling

Data labeling in healthcare is subject to various types of bias, which can have significant consequences for the fairness and performance of AI models. Labeling bias can occur when the data labelers, consciously or unconsciously, favor certain outcomes over others. For example, if data labelers are more familiar with certain conditions or patient demographics, they might be more likely to label data related to these areas accurately, leading to biased training data.

To mitigate bias in data labeling, it’s crucial to educate data labelers about potential sources of bias and encourage them to label data objectively. Bias detection tools can also be used to identify and correct biased labels. Moreover, efforts should be made to diversify the group of data labelers, including a variety of medical experts with different backgrounds and expertise.
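
One simple form of bias detection is to compare labeler accuracy across patient groups on a subset that also has trusted reference labels. The sketch below is an illustration under that assumption; the group names, labels, and the 0.1 gap threshold are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(samples):
    """samples: iterable of (group, assigned_label, reference_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, assigned, reference in samples:
        total[group] += 1
        if assigned == reference:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def has_accuracy_gap(accuracies, max_gap=0.1):
    """True when the best and worst group accuracies differ by more than max_gap."""
    return max(accuracies.values()) - min(accuracies.values()) > max_gap


# Hypothetical audit sample: labeler output vs. expert reference labels.
samples = [
    ("adult", "pneumonia", "pneumonia"),
    ("adult", "normal", "normal"),
    ("adult", "normal", "normal"),
    ("adult", "pneumonia", "pneumonia"),
    ("pediatric", "normal", "pneumonia"),
    ("pediatric", "normal", "normal"),
    ("pediatric", "pneumonia", "pneumonia"),
    ("pediatric", "normal", "pneumonia"),
]

accuracies = accuracy_by_group(samples)
gap_flagged = has_accuracy_gap(accuracies)
```

A flagged gap does not explain its cause, but it tells reviewers which group's labels deserve a second look or additional specialist labelers.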

Human-Machine Collaboration in Data Labeling

Human-machine collaboration can significantly improve the efficiency and accuracy of data labeling. In this approach, data labeling is a joint effort between humans and AI, where humans provide initial labels and AI refines these labels based on learned patterns. The refined labels are then reviewed by humans to ensure their accuracy.

By leveraging the strengths of both humans and AI, this collaborative approach can minimize errors and reduce the time and effort required for data labeling. Moreover, as the AI learns from human corrections, it can continuously improve its labeling performance, leading to a virtuous cycle of learning and improvement.
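
One possible variant of this loop is sketched below: the model proposes a label with a confidence score, confident proposals are accepted automatically, and the rest are queued for human review. The threshold, item IDs, and labels are hypothetical, and the confidence scores are assumed to come from some upstream model not shown here.

```python
def triage(model_outputs, threshold=0.9):
    """Split model proposals into auto-accepted labels and a human review queue."""
    auto_accepted = {}
    review_queue = []
    for item_id, (label, confidence) in model_outputs.items():
        if confidence >= threshold:
            auto_accepted[item_id] = label
        else:
            review_queue.append(item_id)
    return auto_accepted, review_queue


def apply_corrections(auto_accepted, corrections):
    """Merge human-reviewed labels over the automatically accepted ones."""
    final_labels = dict(auto_accepted)
    final_labels.update(corrections)
    return final_labels


# Hypothetical model proposals: item -> (label, confidence).
model_outputs = {
    "note_1": ("sepsis", 0.97),
    "note_2": ("no_sepsis", 0.55),
    "note_3": ("sepsis", 0.91),
}

auto_accepted, review_queue = triage(model_outputs)
# A clinician reviews the queued item and supplies the correct label.
corrections = {"note_2": "sepsis"}
final_labels = apply_corrections(auto_accepted, corrections)
```

The corrected labels can then be added to the training set for the next model version, which is the feedback step that makes the cycle self-improving.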

Policy and Regulatory Considerations

Data labeling in healthcare must comply with a range of policies and regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) in the U.S. and the General Data Protection Regulation (GDPR) in the EU. These regulations impose strict requirements for data privacy and security, and non-compliance can result in hefty fines and legal penalties.

Additionally, regulatory bodies can play a significant role in improving data labeling practices in healthcare. For example, they could establish guidelines or standards for data labeling, promote best practices, and provide funding for research and development in this area.

Multidisciplinary Approach to Data Labeling in Healthcare

Data labeling is a critical component of AI in healthcare, with significant implications for the performance and fairness of AI models. By addressing the challenges of data labeling, including data privacy, bias, tooling, and regulation, we can unlock the full potential of AI in healthcare, leading to improved patient outcomes, cost savings, and overall healthcare system efficiency. However, achieving this goal requires a multi-disciplinary approach.

Cross-Disciplinary Collaboration for Efficient Data Labeling

Given the intricacy and importance of data labeling in healthcare, fostering cross-disciplinary collaborations is essential (5). Collaborations between data scientists, healthcare professionals, ethicists, and legal experts can help develop comprehensive strategies for efficient and ethical data labeling.

Healthcare professionals can provide the necessary clinical context and expertise for accurate labeling. Data scientists can create the tools and machine learning models to support and enhance the labeling process. Ethicists can help ensure that data labeling practices respect patient rights and societal values, while legal experts can ensure compliance with relevant laws and regulations.

Integration of AI into Clinical Workflow

For AI to have a substantial impact in healthcare, it needs to be integrated into the clinical workflow. This implies that the labeled data must not only be accurate and reliable but also relevant to the tasks clinicians perform. For instance, in radiology, an AI model trained to identify specific pathologies needs to have been trained on accurately labeled data that reflects the diversity of cases radiologists encounter in their day-to-day practice (7).

The integration of AI into clinical workflows also involves considering user interface and experience. AI-driven tools should be designed in a way that fits seamlessly into existing workflows, reducing clinician burden rather than adding to it.

Use of Synthetic Data

Synthetic data can be a powerful tool to overcome some of the challenges of data labeling in healthcare. Synthetic data refers to data that’s artificially created, rather than collected from real-world events. It can be used to augment the training data for AI models, particularly when real-world data is scarce or privacy concerns limit its use (9).

In the context of healthcare, synthetic patient data can mimic the complexity and variability of real patient data, without containing any sensitive personal information (9). This can enable more robust AI model training while preserving patient privacy. However, synthetic data must be generated carefully to ensure it is representative of real-world scenarios.

Need for Ongoing Monitoring and Evaluation

Even after an AI model has been trained and deployed, ongoing monitoring and evaluation are crucial. This is to ensure that the model continues to perform well as new data comes in, and that any drift in model performance can be detected and addressed promptly.

In the context of data labeling, ongoing monitoring can involve periodic review and validation of the labeled data, as well as recalibration of the AI model based on updated or corrected labels (10).
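
A lightweight monitoring check is to compare the distribution of labels in recent data against a baseline period. The sketch below uses total variation distance for that comparison; the label lists and the 0.2 alert threshold are hypothetical, and a shift in labels is a prompt for review, not proof that the model has degraded.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its share of the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical labels from a baseline period and from the most recent period.
baseline = ["normal"] * 8 + ["abnormal"] * 2
recent = ["normal"] * 5 + ["abnormal"] * 5

drift = total_variation(label_distribution(baseline), label_distribution(recent))
needs_review = drift > 0.2
```

When the check fires, the periodic review described above would re-validate a sample of recent labels before the model is recalibrated.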

Future Developments in Data Labeling for Healthcare

In the future, we can expect several developments in data labeling for healthcare. Advances in AI and machine learning can lead to more efficient and accurate automated labeling methods. The increasing recognition of the importance of data privacy and security can lead to more robust privacy-preserving data labeling techniques.

Furthermore, the growing understanding of bias in AI can lead to better methods for detecting and correcting bias in labeled data (11). And finally, as AI continues to be integrated into healthcare, there will be a growing need for guidelines, standards, and best practices for data labeling, which can drive improvements in this area (11).

Ten-Step Process for Creating a Trustworthy AI with High-Quality Data Labeling in Healthcare

Please note that this is a simplified overview of a complex process, and each step can involve many sub-steps and considerations.


Data labeling is a complex and crucial process that plays a significant role in the efficacy of AI in healthcare. By addressing the challenges in data labeling, and embracing a collaborative, patient-centric approach, we can pave the way for AI to significantly improve healthcare delivery and outcomes. However, we must remain vigilant about the potential pitfalls and ethical considerations associated with data labeling, and strive for transparency, fairness, and privacy in all aspects of this process.

Key Takeaways:

1. The Importance of Data Labeling: Data labeling is a critical aspect of machine learning and artificial intelligence (AI), ensuring models are trained accurately and efficiently. In healthcare, this takes an even more profound significance as the quality of labeled data can directly impact patient care and outcomes.

2. Consequences of Incorrect Data Labeling: Poorly labeled data can lead to inaccurate model predictions, resulting in potential harm to patients (3,6). This could manifest as incorrect diagnoses, inappropriate treatment recommendations, or erroneous patient stratification.

3. Bias in Data Labeling: Bias is a significant concern in AI and can be introduced during data labeling. If the data used to train AI models is biased, this could lead to unfair or unequal treatment of certain patient groups.

4. Privacy Concerns: Data labeling in healthcare must be done with utmost care to preserve patient privacy. Any breach of this can lead to legal implications and loss of trust among patients and healthcare providers.

5. The Case of IBM Watson at MD Anderson: Watson’s failure at MD Anderson was due in part to a lack of contextual understanding and the inability to accurately interpret and label clinical data (7, 16). This highlights the need for robust, accurate data labeling that considers clinical context.


1. Sheller MJ., Edwards B., Reina GA., et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep 2020;10(1):12598.

2. International Business Machines Editorial Content Team. What is data labeling? International Business Machines. Accessed May 24, 2023.

3. Spasic I., Livsey J., Keane JA., et al. Text mining of cancer-related information: review of current status and future directions. Int J Med Inform 2014;83(9):605–623.

4. Chen J., Ran X. Deep Learning With Edge Computing: A Review. IEEE 2019; 107(8):1655-1674.

5. Mc Cord KA, Ewald H, Ladanie A, et al. Current use and costs of electronic health records for clinical trial research: a descriptive study. CMAJ Open. 2019;7(1):E23-E32. Published 2019 Feb 3. doi:10.9778/cmajo.20180096

6. Hashimoto DA, Rosman G, Rus D, Meireles OR. Artificial Intelligence in Surgery: Promises and Perils. Ann Surg. 2018 Jul;268(1):70-76.

7. Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44-56.

8. Jiang F, Jiang Y, Zhi H, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2(4):230-243. Published 2017 Jun 21. doi:10.1136/svn-2017-000101

9. Perez, M. V., Mahaffey, K. W., Hedlin, H., Rumsfeld, J. S., Garcia, A., Ferris, T.,… & Desai, S. (2019). Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation. New England Journal of Medicine, 381(20), 1909-1917.

10. Cabitza, F., Rasoini, R., & Gensini, G. F. (2017). Unintended Consequences of Machine Learning in Medicine. JAMA, 318(6), 517–518.

11. Matheny, M., Israni, S. T., Ahmed, M., & Whicher, D. (2020). Artificial Intelligence in HealthCare: Anticipating Challenges to Ethics, Privacy, and Bias. Penn Bioethics Journal, 16(1).

12. Davenport T., Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 2019 Jun; 6(2):94-98.

13. Norgeot B, Glicksberg BS, Trupin L, et al. Assessment of a Deep Learning Model Based on Electronic Health Record Data to Forecast Clinical Outcomes in Patients With Rheumatoid Arthritis. JAMA Netw Open. 2019;2(3):e190606. Published 2019 Mar 1. doi:10.1001/jamanetworkopen.2019.0606

14. Beam AL., Kohane IS. Big Data and Machine Learning in Health Care. JAMA 2018 Apr; 319(13):1317–1318.

15. Rajkomar A., Dean J., Kohane I. Machine Learning in Medicine. N Engl J Med 2019 Apr; 380(14):1347-1358

16. Hao, K. The messy, secretive reality behind OpenAI’s bid to save the world. MIT Technology Review. 2020.

17. AI Terms (AI Terminologies and Tools). Scale AI: AI Database Tools. Accessed 2023.

18. Litjens G., Kooi T., Bejnordi BE., et al. A survey on deep learning in medical image analysis. Med Image Anal 2017 Dec;42:60–88.

19. Esteva A., Robicquet A., Ramsundar B., et al. A guide to deep learning in healthcare. Nat Med 2019 Jan;25(1):24–29.

20. Sheller MJ., Edwards B., Reina GA., et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep 2020 Jul;10(1):12598.

21. Spasic I., Livsey J., Keane JA., et al. Text mining of cancer-related information: review of current status and future directions. Int J Med Inform 2014 Sep;83(9):605–623.