Comparative Performance of Large Language Models: ChatGPT o3-mini and Deepseek R1 in Pediatric Hematology/Oncology Questions

Sahel Sharifpoor Saleh *, Sara Mohammadi Kalashani ℗, Mohammad Zarbi

Code: G-1493

Tag: Intelligent Virtual Assistant

Abstract

Background and aims: This study compares the performance of two large language models, ChatGPT o3-mini and Deepseek R1, in answering pediatric hematology/oncology multiple-choice questions. With the growing integration of artificial intelligence into clinical decision-making, it is essential to assess the ability of these models to accurately process complex medical queries. The study aims to evaluate their accuracy, response time, and reasoning capabilities under standardized conditions.

Method: A cross-sectional analysis was performed using 100 self-assessment multiple-choice questions originally developed by the American Society of Pediatric Hematology/Oncology. Because some items included images, eight questions were excluded, leaving 92 paired items for evaluation. Both models were tested under controlled conditions with memory features disabled to ensure independent processing of each question. Performance metrics included the accuracy of response selection, response time measured from query submission to answer output, and a qualitative assessment of the reasoning process. Statistical analyses were performed in IBM SPSS version 27; McNemar's test and the Mann-Whitney U test were used to identify significant differences between the models.

Results: Deepseek R1 demonstrated a significantly higher accuracy rate (approximately 85.87%) than ChatGPT o3-mini (65.5%). Although ChatGPT o3-mini provided faster response times, its performance on complex clinical scenarios was less consistent. The statistical analyses confirmed that the differences in both accuracy and response time between the models were significant (p < 0.001).

Conclusion: The findings indicate that while both models show promise for supporting clinical decision-making in pediatric hematology/oncology, Deepseek R1 offers superior accuracy and more reliable clinical reasoning, despite its slower response time. These results suggest that further research is warranted to optimize the trade-off between speed and precision and to evaluate the applicability of these models in real-world clinical settings.
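The two tests named in the Method section can be sketched as follows. This is a minimal, stdlib-only illustration: the discordant-pair counts and the per-question response times below are invented for demonstration and are not the study's data. McNemar's test compares paired accuracy using only the questions on which the two models disagree; the Mann-Whitney U test compares the two response-time distributions.

```python
# Illustrative sketch of the abstract's statistical tests on hypothetical data
# (all counts and timings below are invented, not the study's results).
from math import comb, erfc, sqrt

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts:
    b = model A correct & model B wrong, c = the reverse."""
    n, k = b + c, min(b, c)
    # Under H0 the discordant pairs split as Binomial(n, 0.5).
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

def mann_whitney_u(x: list, y: list) -> tuple:
    """Mann-Whitney U for sample x (normal approximation, no tie correction);
    returns (U, two-sided p-value)."""
    u = sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)
    mu = len(x) * len(y) / 2
    sd = sqrt(len(x) * len(y) * (len(x) + len(y) + 1) / 12)
    return u, erfc(abs(u - mu) / sd / sqrt(2))

# Hypothetical discordant counts over 92 paired questions:
p_acc = mcnemar_exact(b=22, c=3)  # 22: Deepseek right/ChatGPT wrong; 3: reverse

# Hypothetical per-question response times in seconds:
chatgpt_times  = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.2, 5.1, 3.6, 4.5]
deepseek_times = [21.4, 18.9, 25.2, 19.7, 23.1, 20.5, 24.8, 22.0, 19.3, 26.7]
u, p_time = mann_whitney_u(chatgpt_times, deepseek_times)

print(f"McNemar p = {p_acc:.2e}; Mann-Whitney U = {u}, p = {p_time:.2e}")
```

With the invented figures above, both tests fall well below the 0.001 threshold reported in the abstract; in practice the exact version of the Mann-Whitney test (or SPSS's implementation) would be preferred at these small sample sizes.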

Keywords

Artificial Intelligence, Medicine, LLM, Hematology, Oncology
