Comparing the Performance of the Large Language Models ChatGPT, Google Gemini, and DeepSeek in the Diagnostic and Therapeutic Approach to Treatment-Resistant Hypertension

Fatemeh Zahra Seyed-Kolbadi ℗, Sara Ghazizadeh, Ali Khorram, Ghazal Rezaee, Fatemeh Ardali, Shahin Abbaszadeh *

Code: G-1890

Tag: Clinical decision support systems

Abstract

Background: Large language models (LLMs) show promise for enhancing medical decision-making, but their performance in specialized domains such as hypertension management remains understudied. We evaluated three LLMs (ChatGPT-4, Google Gemini, and DeepSeek) using guideline-based clinical questions to assess their accuracy and reliability.

Methods: Thirty clinical questions (15 simple, 10 moderate, 5 hard) were extracted from the 2024 ESC Guidelines for the management of elevated blood pressure and hypertension, with emphasis on resistant hypertension. A senior cardiologist scored responses on a 4-point scale: (1) completely incorrect answer, (2) correct answer containing incorrect information, (3) correct but inadequate answer, and (4) correct and adequate answer. We summarized non-normally distributed data as median with interquartile range (IQR) and performed nonparametric analyses in Stata 18 (StataCorp LLC), including Friedman tests with post hoc Wilcoxon signed-rank tests (Bonferroni-corrected), to compare model performance across difficulty levels.

Results: ChatGPT-4 achieved the highest median scores for all question types (hard: 4 [IQR 4-4]; moderate: 4 [3-4]; simple: 4 [3.5-4]), with statistically superior performance over Gemini and DeepSeek for moderate questions (p=0.03). Differences were nonsignificant for hard (p=0.3) and simple questions (p=0.62), though ChatGPT showed marginally better overall performance (p=0.059). Gemini and DeepSeek performed comparably but scored lower than ChatGPT (median range: 3-3.5). Response consistency (IQRs) was similar across models.

Conclusion: While ChatGPT-4 outperformed its peers on moderate-difficulty questions, the absence of significant differences on hard questions, coupled with the small sample of hard queries, highlights the need for larger, more balanced datasets to robustly evaluate LLMs in complex clinical scenarios. These findings underscore the importance of context-specific validation when deploying LLMs in medicine.
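
The analysis described in the Methods (median with IQR, a Friedman test across the three models, and a Bonferroni adjustment for post hoc comparisons) can be sketched roughly as follows. This is a minimal standard-library Python illustration, not the authors' Stata code. The score table is hypothetical and is not the study's data, and all function names are invented for this sketch. The Wilcoxon signed-rank step itself is not implemented; the `bonferroni` helper only adjusts p-values supplied from elsewhere.

```python
import math
from statistics import median, quantiles

# HYPOTHETICAL scores on the 1-4 scale, for illustration only (NOT the study's data).
# Each row is one question; columns are (ChatGPT-4, Gemini, DeepSeek).
scores = [
    (4, 3, 3), (4, 4, 3), (3, 3, 4), (4, 3, 3), (4, 2, 3),
    (3, 3, 3), (4, 3, 2), (4, 4, 4), (4, 3, 3), (3, 2, 3),
]

def median_iqr(values):
    """Return (median, Q1, Q3) for a list of scores."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3

def average_ranks(row):
    """Rank the values within one question; tied values share their mean rank."""
    ordered = sorted(row)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in row]

def friedman(rows):
    """Tie-corrected Friedman chi-square statistic and p-value for 3 related samples."""
    n, k = len(rows), len(rows[0])
    if k != 3:
        # The closed-form p-value below only holds for 2 degrees of freedom.
        raise ValueError("this sketch supports exactly three models")
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
        tie_term += sum(row.count(v) ** 3 - row.count(v) for v in set(row))
    stat = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("every question is fully tied; the statistic is undefined")
    stat /= correction
    # Chi-square survival function with df = 2 is exp(-x / 2).
    return stat, math.exp(-stat / 2)

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

if __name__ == "__main__":
    for name, column in zip(("ChatGPT-4", "Gemini", "DeepSeek"), zip(*scores)):
        print(name, median_iqr(list(column)))
    print("Friedman (statistic, p):", friedman(scores))
```

With a 4-point scale, ties within a question are common, which is why the sketch applies the tie correction to the Friedman statistic rather than using the uncorrected formula.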

Keywords

LLMs, Hypertension, AI Validation

AIMS

2nd International Congress on Artificial Intelligence in Medical Sciences