Regulate Artificial Intelligence Adoption in Medicine, not Artificial Intelligence Research and Development

Oct. 23, 2024, 4:57 p.m.


While artificial intelligence (AI) in medicine has been explored for more than 50 years, it remained largely out of reach for clinicians until recently. The release of chat-based large language models (LLMs) in 2022 and 2023 broke down the barriers to AI use: with a simple web-based chat interface, clinicians can now query AI models directly and in real time.

Initial results from using generative AI in various clinical settings were very encouraging. For instance, a group of Dutch researchers found that ChatGPT was highly accurate at producing a differential diagnosis for patients presenting to the emergency department.(1) Other reported uses of ChatGPT in clinical medicine include answering patient inquiries, decision-making, trial enrollment, data management, decision support, research support, scribing patient charts, and patient education.(2)

What's the Worry?

However, much of the initial excitement about generative AI has not stood up to rigorous evaluation. For instance, a recent study testing thirty-two different queries regarding bystander CPR found that nearly half were answered with information unrelated to CPR, often "constituting grossly inappropriate responses."(3) Our own recent study, "Repeatability, Reproducibility, and Diagnostic Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Emergency Department Triage Using the Canadian Triage and Acuity Scale," found that among 10,980 simulated triages only 47.5% were correct.(4) Similar results were found when ChatGPT was tested for disaster triage.(5)

This lack of accuracy should not be surprising given how these models are trained. Large language models such as ChatGPT are pre-trained prediction models: they are trained to predict text based on probability and have no intrinsic knowledge of the topic. Initial training of most models was based on human responses to text prompts, which has led to limitations such as bias, toxicity, and confabulation (also called hallucination). While the field is rapidly evolving and many steps are being taken to reduce these errors, they persist.
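
To see why prediction is not the same as knowledge, consider the toy sketch below. It is a minimal illustration in Python, with an invented vocabulary and made-up probabilities rather than real model weights: the model simply samples the next word from a probability distribution, with no step that checks medical truth.

```python
import random

# Toy next-word distribution. An LLM scores every candidate token and
# converts the scores into probabilities, then samples one token.
# The words and probabilities below are invented purely for illustration.
next_word_probs = {
    "aspirin": 0.40,
    "heparin": 0.35,
    "insulin": 0.20,
    "oxygen": 0.05,
}

def sample_next_word(probs: dict[str, float]) -> str:
    """Sample one word in proportion to its probability."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

prompt = "For this patient, the first medication to administer is"
print(prompt, sample_next_word(next_word_probs))
```

Nothing in this process checks whether the chosen word is medically appropriate for the patient; a fluent but wrong completion is simply another sample from the distribution.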

Perhaps the most concerning and persistent error encountered in LLMs is hallucination, which occurs when the model gives a response that is grammatically correct but factually wrong. Because these tools work on probability, they often do not "know" that a response is incorrect; instead, they typically present it as fact. Hallucination is also known as confabulation or, more simply, "getting it wrong." Furthermore, large language models do not always detect inconsistencies in input text. For instance, when asked to triage a patient who is "ambulatory and unconscious," ChatGPT frequently fails to recognize the contradiction in this statement and instead simply assigns a triage code.
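
One simple way to probe this failure mode is to feed a deliberately contradictory vignette to the model and check whether it flags the contradiction. The sketch below is a hypothetical harness using the OpenAI Python client; the model name, prompt wording, and naive keyword check are illustrative assumptions, not the protocol from our published studies.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

# A deliberately self-contradictory vignette: a patient cannot be
# both ambulatory and unconscious.
vignette = (
    "Assign a CTAS triage level for this patient: "
    "the patient is ambulatory and unconscious."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; substitute the model under evaluation
    messages=[{"role": "user", "content": vignette}],
)
answer = response.choices[0].message.content or ""

# Naive keyword check: did the model notice the contradiction,
# or did it simply assign a triage code anyway?
if any(word in answer.lower() for word in ("contradict", "inconsistent")):
    print("Model flagged the inconsistency:")
else:
    print("Model triaged anyway:")
print(answer)
```

Running many such adversarial vignettes, rather than a single query, gives a sense of how often the model triages a logically impossible patient.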

Unfortunately, the low barrier to use of chat-based AI has also led to an almost casual integration into patient workflows. Unlike new medications or medical devices, which usually require certification through an institutional process before being used for clinical care, chat-based AI tools are available through a web browser to anyone with internet access and can be put to clinical use in any setting. Frequently, these tools are used without any investigation of their efficacy, safety, or privacy.

The rapid pace of AI development and the uncertainty of its safety have led some health experts to propose slowing down, or even halting, research and development of AI in medicine.(6) Many fear that the sudden influx of AI research and development will proceed unchecked and lead to unexpected and dangerous results.

What did COVID-19 Teach Us?

In fact, a sudden influx of new technology and treatment ideas is not a new phenomenon. During the COVID-19 pandemic, new technologies arrived quickly and were rapidly assimilated into the medical ecosystem.

For example, while the concept of mRNA vaccines had been explored since the early 21st century, the rapid research and development of the COVID-19 vaccine was unprecedented.(7) Well-designed clinical trials demonstrated the safety and effectiveness of the mRNA vaccine, and its influence on the course of the pandemic cannot be overstated.(7) These vaccines were quickly adopted worldwide, and their rapid development was doubtless the most important medical intervention in controlling the pandemic.

[Figure: COVID-19 demanded rapid development of mRNA vaccines]

However, not all initial discoveries led to meaningful changes. The COVID-19 pandemic taught us that even when initial enthusiasm and empiric results suggest a treatment is safe and efficacious, further research may reveal otherwise. For instance, ivermectin was initially found, anecdotally, to be efficacious for treatment of COVID-19 and was widely used off-label.(8) However, rigorously controlled trials and a formal systematic review found little evidence of its effectiveness.(8) Many other treatments reported effective against COVID-19 in early investigational publications were later shown, in more rigorous studies, to be ineffective, leading to many journal retractions. In fact, the COVID-19 pandemic era contributed to a record number of scientific article retractions in 2023, with retractions rising at a rate that outpaces the growth of published papers.(9)

What's the Rush?

Unlike during the COVID-19 pandemic, there is no hurry to implement AI-guided tools in medicine. Rather than trying to slow research and development, we should carefully moderate adoption. Rapid research and development of new tools should be encouraged, supported, and funded. Implementation, however, should be moderated by the same methods used for generations to adopt new medications, devices, procedures, and technologies: AI-guided tools should undergo the same statistically rigorous evaluation as a new drug or a new medical device. The statistical tools for this type of research are well known to the medical community, and many concise study designs are available. For instance, we used the well-documented statistical technique of Gage Repeatability and Reproducibility in our evaluations of the accuracy of ChatGPT-guided triage, as sketched in the example below.(4,5) Numerous other tools exist in the medical and industrial literature. Furthermore, institutions should be encouraged to issue clear, direct policy on how AI solutions may be used and what procedures must be followed to confirm their safety and effectiveness.
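
As an illustration of what such an evaluation can look like, the sketch below computes the classical Gage R&R variance components from simulated repeated triage assignments. The vignettes, session structure, and the simplification of treating ordinal CTAS levels as numeric are all assumptions made for this example; it is not the design of our published study.

```python
import numpy as np
import pandas as pd

# Simulated data: 5 vignettes ("parts"), each triaged 3 times ("repeats")
# in each of 2 separate chat sessions ("operators"). The CTAS levels are
# invented for illustration, not data from our study.
rng = np.random.default_rng(0)
rows = []
for part in range(5):
    true_level = part % 5 + 1
    for operator in range(2):
        for rep in range(3):
            noise = rng.integers(-1, 2)          # simulated model inconsistency
            level = int(np.clip(true_level + noise, 1, 5))
            rows.append({"part": part, "operator": operator, "ctas": level})
df = pd.DataFrame(rows)

# Two-way crossed ANOVA mean squares (the standard Gage R&R decomposition,
# treating the ordinal CTAS level as numeric -- a simplification).
p = df["part"].nunique()      # number of vignettes
o = df["operator"].nunique()  # number of sessions
r = len(df) // (p * o)        # repeats per vignette-session cell

grand = df["ctas"].mean()
ss_part = o * r * ((df.groupby("part")["ctas"].mean() - grand) ** 2).sum()
ss_oper = p * r * ((df.groupby("operator")["ctas"].mean() - grand) ** 2).sum()
cell_means = df.groupby(["part", "operator"])["ctas"].transform("mean")
ss_error = ((df["ctas"] - cell_means) ** 2).sum()
ss_total = ((df["ctas"] - grand) ** 2).sum()
ss_inter = ss_total - ss_part - ss_oper - ss_error

ms_oper = ss_oper / (o - 1)
ms_inter = ss_inter / ((p - 1) * (o - 1))
ms_error = ss_error / (p * o * (r - 1))

# Variance components (negative estimates are truncated at zero).
repeatability = ms_error
v_oper = max((ms_oper - ms_inter) / (p * r), 0)
v_inter = max((ms_inter - ms_error) / r, 0)
reproducibility = v_oper + v_inter
print(f"Repeatability: {repeatability:.3f}, Reproducibility: {reproducibility:.3f}")
```

A large repeatability component means the model disagrees with itself across identical queries; a large reproducibility component means it disagrees across sessions. Either undermines clinical trust before diagnostic accuracy is even considered.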

[Figure: AI development can be approached with caution]

In summary, AI-guided tools offer tremendous potential in many aspects of medicine. Research and development of these tools should be encouraged; adoption, however, should be moderated. AI-guided tools should undergo the same statistically rigorous evaluation as a new drug or a new medical device, and institutions should provide clear guidelines outlining acceptable uses of AI and the requirements for adoption.

References

  1. Berg HT, van Bakel B, van de Wouw L, Jie KE, Schipper A, Jansen H, et al. ChatGPT and Generating a Differential Diagnosis Early in an Emergency Department Presentation. Ann Emerg Med. 2024;83(1):83-6.
  2. Garg RK, Urs VL, Agarwal AA, Chaudhary SK, Paliwal V, Kar SK. Exploring the role of ChatGPT in patient care (diagnosis and treatment) and medical research: A systematic review. Health Promot Perspect. 2023;13(3):183-91.
  3. Murk W, Goralnick E, Brownstein JS, Landman AB. Quality of Layperson CPR Instructions From Artificial Intelligence Voice Assistants. JAMA Netw Open. 2023;6(8):e2331205.
  4. Franc JM, Cheng L, Hart A, Hata R, Hertelendy A. Repeatability, Reproducibility, and Diagnostic Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Emergency Department Triage Using the Canadian Triage and Acuity Scale. CJEM. 2023.
  5. Franc JM, Hertelendy AJ, Cheng L, Hata R, Verde M. Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage of Simulated Patients Using the Simple Triage and Rapid Treatment (START) Protocol: Gage Repeatability and Reproducibility Study. J Med Internet Res. 2024;26:e55648.
  6. Federspiel F, Mitchell R, Asokan A, Umana C, McCoy D. Threats by artificial intelligence to human health and human existence. BMJ Glob Health. 2023;8(5).
  7. Polack FP, Thomas SJ, Kitchin N, Absalon J, Gurtman A, Lockhart S, et al. Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine. N Engl J Med. 2020;383(27):2603-15.
  8. Marcolino MS, Meira KC, Guimaraes NS, Motta PP, Chagas VS, Kelles SMB, et al. Systematic review and meta-analysis of ivermectin for treatment of COVID-19: evidence beyond the hype. BMC Infect Dis. 2022;22(1):639.
  9. Van Noorden R. More than 10,000 research papers were retracted in 2023 - a new record. Nature. 2023.

By: Jeffrey Franc

Categories: Artificial Intelligence Research
