Comparative Analysis of AI vs. Human Reasoning in Clinical Trial Interpretation: A Study Using Large Language Models

Authors

  • Arshjeet Singh Indiana University School of Medicine https://orcid.org/0009-0001-6538-5226
  • Gordon Mao Department of Neurological Surgery, Indiana University School of Medicine
  • Barnabas Obeng-Gyas Indiana University School of Medicine
  • Anoop Chinthala Indiana University School of Medicine
  • William Snyder Indiana University School of Medicine
  • Ethan Brown Indiana University School of Medicine
  • Kyle Ortiz Department of Neurological Surgery, Indiana University School of Medicine

DOI:

https://doi.org/10.18060/29686

Abstract

Introduction: The emergence of advanced large language models (LLMs) has created new possibilities for medical research interpretation and clinical decision support. While these models have demonstrated impressive capabilities across a range of medical tasks, their ability to independently reason through clinical trial data and arrive at sound medical recommendations remains understudied. Understanding the alignment between AI and human expert reasoning is crucial for the future integration of AI tools in medical research and practice.

Methods: We selected 20 landmark clinical trials published in the New England Journal of Medicine over the past 30 years. From each article we removed all text, leaving only the clinical trial data tables and figures. We then submitted these data to four AI platforms (ChatGPT, Gemini, Grok 3, and Claude) with the standardized prompt: “With the tables and figures provided, interpret the following: evidence, statistics, clinical relevance, limitations, and practical applicability. Do not reference the original paper this was obtained from.” Two reviewers then compared each AI analysis of the data with the conclusions drawn in the original article and rated each section on a scale of 1-5.
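The querying step described above can be sketched programmatically. The sketch below is illustrative only: the abstract does not state whether the models were accessed through web interfaces or APIs, so `query_model` is a hypothetical stand-in for each platform's input mechanism, and the de-texted tables and figures are assumed to be supplied as image data.

```python
# Illustrative sketch of the prompting workflow, under the assumptions
# stated above. query_model() is a hypothetical helper, not a real API.

STANDARD_PROMPT = (
    "With the tables and figures provided, interpret the following: "
    "evidence, statistics, clinical relevance, limitations, and practical "
    "applicability. Do not reference the original paper this was obtained from."
)

PLATFORMS = ["ChatGPT", "Gemini", "Grok 3", "Claude"]


def query_model(platform: str, prompt: str, figures: list[bytes]) -> str:
    """Hypothetical stand-in: send the prompt plus one trial's de-texted
    tables/figures to the named platform and return its interpretation."""
    raise NotImplementedError


def collect_interpretations(trials: dict[str, list[bytes]]) -> dict:
    """Run every platform on every trial's figures with the same prompt."""
    return {
        trial_id: {p: query_model(p, STANDARD_PROMPT, figs) for p in PLATFORMS}
        for trial_id, figs in trials.items()
    }
```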

Results: ChatGPT and Gemini accurately analyzed the clinical data and drew conclusions similar to those of the original articles. Claude and Grok 3 were less accurate and oversimplified much of the data in their interpretations. ChatGPT was generally the most accurate, with 97.9% accuracy, followed by Gemini at 88.3%. Grok 3 was 78.2% accurate in its conclusions, while Claude was only 64.5% accurate.
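The abstract does not specify how the two reviewers' 1-5 section ratings were converted into the percentages reported above. One plausible mapping, assumed here purely for illustration, is the pooled mean rating expressed as a fraction of the maximum score of 5, averaged over both reviewers, all five rated sections, and all 20 trials.

```python
# One plausible rating-to-percentage mapping (assumption, not the study's
# stated method): percent = sum(ratings) / (5 * number of ratings) * 100.

def percent_accuracy(ratings: list[int]) -> float:
    """Convert a pooled list of 1-5 ratings into a 0-100% score."""
    return sum(ratings) / (5 * len(ratings)) * 100

# e.g. a model rated 5,5,5,4,5 and 5,5,5,5,5 by the two reviewers across
# one trial's five sections would score 98%:
print(percent_accuracy([5, 5, 5, 4, 5, 5, 5, 5, 5, 5]))  # 98.0
```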

Conclusion: ChatGPT and Gemini analyzed the clinical data accurately and drew conclusions nearly identical to those in the original articles. These findings open the door for AI to be used in healthcare to accurately summarize clinical data.

Published

2026-03-30

Section

Abstracts