ChatGPT Takes on Thoracic Surgery: A Comparative Analysis of AI Performance on Board Exams
Authors: Amina Khalpey PhD, Ujjawal Kumar BA, Brynne Rozell BS, Ezekiel Mendoza BS, Zain Khalpey MD PhD FACS
Abstract:
Objective: This study aimed to assess the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, on the Self-Education and Self-Assessment in Thoracic Surgery (SESATS) X, XI, XII, and XIII board questions from the American Board of Thoracic Surgery (ABTS), and to investigate potential applications of large language models (LLMs) in surgical education and training.
Methods: The dataset comprised 400 questions from the SESATS X, XI, XII, and XIII exams conducted between 2016 and 2021. Both GPT-3.5 and GPT-4 models were evaluated, and their performance was compared using the chi-square test.
Results: GPT-3.5 achieved an overall accuracy of 52.0%, while GPT-4 achieved 81.3%, a significant improvement (p<.001). GPT-4 also performed consistently across all subspecialties, with accuracy rates ranging from 68.9% to 90.2%.
Conclusion: ChatGPT, particularly GPT-4, demonstrates a remarkable ability to process complex thoracic surgical clinical information, achieving an accuracy of 81.3% on the SESATS board questions. As LLM technology continues to advance, its potential applications in surgical education, training, and continuing medical education (CME) are anticipated to enhance patient outcomes and safety.
Introduction:
Large language models (LLMs) such as OpenAI's ChatGPT have shown strong performance across a range of fields, including medicine, law, and management. ChatGPT's successful performance on board exam questions in general surgery has been reported previously, suggesting its potential in surgical education and training. The present study evaluates the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, on the Self-Education and Self-Assessment in Thoracic Surgery (SESATS) X, XI, XII, and XIII board questions from the American Board of Thoracic Surgery (ABTS).
Thoracic surgery is a highly specialized field, requiring extensive knowledge and skills in the surgical treatment of disorders of the heart, lungs, esophagus, and other organs within the thoracic cavity. SESATS is a comprehensive online learning and assessment tool for thoracic surgeons, offering a self-paced curriculum that covers the breadth of thoracic surgery. This tool helps surgeons maintain their knowledge, skills, and certification, and it is also used as a study aid for those preparing for the ABTS board exams.
Methods:
The dataset used for model evaluation consisted of 400 questions obtained from the SESATS X, XI, XII, and XIII exams conducted between 2016 and 2021. Questions were divided into four main categories: adult cardiac surgery (55.0%), general thoracic surgery (35.0%), congenital cardiac surgery (5.0%), and critical care (5.0%). Questions requiring visual information, such as clinical images or radiologic imaging, were excluded from the dataset.
We evaluated OpenAI's ChatGPT, a generative pre-trained transformer (GPT) language model, on this question dataset, testing both the GPT-3.5 and GPT-4 models. Each question was entered manually into the ChatGPT web interface, and the model's answers were compared against the official SESATS answer key. Two independent researchers verified the grading and resolved any discrepancies through discussion.
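The grading step itself is simple to express in code. The following is a minimal sketch, not the study's actual tooling (questions were entered and graded manually): the example records and the score helper are hypothetical, shown only to make the scoring procedure concrete.

```python
from collections import defaultdict

# Hypothetical records of the form (subspecialty, model_answer, answer_key);
# the study graded answers manually against the official SESATS key.
responses = [
    ("adult_cardiac", "B", "B"),
    ("general_thoracic", "C", "A"),
    ("critical_care", "D", "D"),
]

def score(records):
    """Return overall and per-subspecialty accuracy as percentages."""
    correct, total = defaultdict(int), defaultdict(int)
    for subspecialty, answer, key in records:
        total[subspecialty] += 1
        correct[subspecialty] += (answer == key)  # bool counts as 0/1
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return overall, {s: 100.0 * correct[s] / total[s] for s in total}

overall, by_group = score(responses)
print(f"Overall accuracy: {overall:.1f}%")
for subspecialty, acc in sorted(by_group.items()):
    print(f"  {subspecialty}: {acc:.1f}%")
```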
Statistical analysis: The performance of both GPT-3.5 and GPT-4 models was assessed using overall accuracy, calculated as the percentage of correct answers out of the total number of questions. Subgroup analyses were performed to evaluate the performance of the models in different categories of thoracic surgery. The chi-square test was used to compare the performance of GPT-3.5 and GPT-4, with p-values less than .05 considered statistically significant. All analyses were performed using SPSS version 27.0 (IBM Corp, Armonk, NY, USA).
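For illustration, the overall model comparison can be reproduced as a 2x2 chi-square test with SciPy in place of SPSS. The correct/incorrect counts below are reconstructed from the reported accuracies (52.0% and 81.3% of 400 questions) and are an approximation, not raw study data.

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows are models, columns are (correct, incorrect).
# Counts reconstructed from reported accuracies over 400 questions:
# GPT-3.5: 52.0% -> 208 correct; GPT-4: 81.3% -> ~325 correct.
table = [
    [208, 192],  # GPT-3.5
    [325,  75],  # GPT-4
]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.1e}")  # expect p < .001
```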
Results:
Overall performance: GPT-3.5 achieved an overall accuracy of 52.0% on the SESATS board questions, while GPT-4 achieved 81.3%, a significant improvement (p<.001).
Subspecialty performance: GPT-4 performed consistently across all subspecialties of thoracic surgery, with accuracy rates ranging from 68.9% to 90.2%. In adult cardiac surgery, GPT-4 achieved an accuracy of 87.3%, compared to 49.1% for GPT-3.5 (p<.001). In general thoracic surgery, GPT-4 achieved 90.2% versus 56.8% for GPT-3.5 (p<.001). GPT-4 also outperformed GPT-3.5 in congenital cardiac surgery, with accuracy rates of 68.9% and 40.0%, respectively (p=.018). In critical care, GPT-4 achieved 80.0% compared to 60.0% for GPT-3.5, a difference that did not reach statistical significance (p=.214).
Discussion:
The results of our study demonstrate that ChatGPT, particularly the GPT-4 model, shows a remarkable ability to understand complex thoracic surgical clinical information, achieving an accuracy rate of 81.3% on the SESATS board questions. The GPT-4 model consistently outperformed GPT-3.5 across all subspecialties of thoracic surgery, indicating its potential for application in surgical education and training in this field.
As the field of thoracic surgery continues to evolve, it is critical for practitioners to maintain their knowledge and skills to ensure the best possible patient outcomes. LLMs such as GPT-4 could potentially transform surgical education and training by providing an interactive, adaptive learning platform tailored to the individual learner. Furthermore, LLMs may offer a valuable resource for continuing medical education (CME), helping surgeons stay up to date with the latest advancements in their field.
However, it is important to recognize the limitations of LLMs, including their potential to generate incorrect or misleading information. Ongoing evaluation and refinement of these models is therefore essential as they are introduced into educational and clinical settings.
The advent of advanced AI models such as ChatGPT has generated both excitement and concern within the medical community, particularly in surgery. This study demonstrates that GPT-4 can answer board-style thoracic surgery questions with high accuracy; whether AI-enhanced surgical education translates into fewer errors in practice remains to be demonstrated, and that open question continues to fuel debate over the future role of AI in medicine.
Conclusion:
ChatGPT, particularly GPT-4, demonstrates a remarkable ability to process complex thoracic surgical clinical information, achieving an accuracy of 81.3% on the SESATS board questions. As LLM technology continues to advance, its potential applications in surgical education, training, and continuing medical education (CME) are anticipated to enhance patient outcomes and safety. Further studies are needed to investigate the impact of LLMs on surgical skill acquisition, retention, and clinical decision-making.