AI as a Clinical Decision Support Tool in Psychiatry
Key Points
- An estimated two in three clinicians now use AI tools, with medical-specific LLMs (like OpenEvidence and DoxGPT) utilizing Retrieval Augmented Generation (RAG) to anchor responses in cited literature.
- Beyond clinical Q&A, modern platforms are increasing utility by integrating ambient scribes, drug interaction checkers, and CME credit awarding and tracking directly into the workflow.
- While user satisfaction is high, actual performance varies; general-purpose models (like GPT-4) have demonstrated low accuracy in complex real-world scenarios, and no RCTs yet prove these tools improve patient outcomes.
- If the underlying training data or medical literature contains historical prejudices, models can scale these biases and may recommend inferior treatments for marginalized demographics.
AI Clinical Decision Tools in Psychiatry
While the general population's uptake of AI is high, the proportion of physicians and other clinicians using these tools is even greater, an estimated two in three, representing a 78% increase in just two years, according to the AMA. This article will focus on LLMs as clinical decision support tools, including the way they are used, how they can augment your practice, and pitfalls to watch for.
Clinical decision support (CDS) systems are nothing new (see VisualDx and MDCalc), and are very useful in day-to-day practice. However, the way LLMs can be used to find answers and get actionable recommendations at the point of care is nothing short of amazing.
On the surface, many of these tools appear grossly similar. OpenEvidence, ReachRx, DoxGPT (which incorporates Pathway, another CDS LLM), and other medical LLMs all offer the same visual ‘front door’ into the application with an “Ask Anything” chat form.
Vendors differentiate their products from each other by (a) curated evidence bases, (b) integration into clinician workflows, (c) mechanisms to improve provenance (citation, peer-review overlays), and (d) additional tools such as ambient scribes, drug interaction checkers, and other communication features built into the platform. Another perk is that many of these are free for verified clinicians to use.
These tools differ from standard LLMs because they are strengthened with RAG, or retrieval-augmented generation. This process supplements the model’s internal knowledge with live searches of curated medical literature, which reduces hallucinations and anchors responses to identifiable primary sources.
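For the curious, here is a minimal sketch of the RAG idea in Python. It is not how OpenEvidence, DoxGPT, or any other vendor actually implements it. The tiny corpus, the placeholder citations, the keyword-overlap scoring, and the function names (`retrieve`, `build_prompt`) are all assumptions made up for illustration; real systems typically use far more sophisticated search over a much larger literature index. The point is only to show the two steps: find relevant passages first, then hand them to the model along with the question so the answer can cite them.

```python
import re
from dataclasses import dataclass


@dataclass
class Source:
    citation: str  # label the answer would cite (placeholder, not a real reference)
    text: str      # passage the model is allowed to draw on


# Hypothetical three-passage corpus standing in for a curated literature index.
CORPUS = [
    Source("Placeholder source 1",
           "Bupropion inhibits CYP2D6, the enzyme that metabolizes dextromethorphan."),
    Source("Placeholder source 2",
           "Lamotrigine is used for maintenance treatment of bipolar I disorder."),
    Source("Placeholder source 3",
           "Naltrexone is an opioid receptor antagonist."),
]

# Very small stopword list so common words do not count as matches.
STOPWORDS = {"a", "an", "the", "of", "is", "for", "that", "does", "what", "to"}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, corpus: list[Source], k: int = 2) -> list[Source]:
    """Retrieval step: rank passages by shared keywords and keep the top k matches."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(s.text)), s) for s in corpus]
    scored = [pair for pair in scored if pair[0] > 0]  # drop passages with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


def build_prompt(question: str, sources: list[Source]) -> str:
    """Augmentation step: prepend the retrieved passages, numbered for citation."""
    context = "\n".join(
        f"[{i}] {s.citation}: {s.text}" for i, s in enumerate(sources, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


question = "Does bupropion inhibit the metabolism of dextromethorphan?"
hits = retrieve(question, CORPUS)
prompt = build_prompt(question, hits)  # this text is what would be sent to the LLM
print(prompt)
```

The generation step (actually sending `prompt` to a language model) is left out on purpose, since that part depends entirely on the vendor. Notice that the model can only be as good as what the retrieval step hands it, which matters later when we talk about bias.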
Despite their easy availability, these tools remain very much under development. Tech companies are on a hiring spree trying to get this technology right in healthcare. PS – If you're interested, we've posted some of these AI jobs for healthcare professionals on our job board.
Benefits of an AI CDS Tool
If you are reading this, you are probably the kind of person who would already be using one or more of these tools.
I'm certainly not judging – I use them all the time too. Quick answers, free unlimited use, CME credits, and extra features certainly make LLMs pretty attractive as clinical decision support tools.
Quick Answers
Does adding bupropion to Auvelity (bupropion-dextromethorphan) further inhibit the metabolism of dextromethorphan? Does lamotrigine affect bone density? What is the proposed mechanism of low dose naltrexone?
These are just a few questions I've found myself asking OpenEvidence (and other medical LLMs) in between patient visits in recent weeks. I got near instant answers that took me just a few seconds to read and understand. I verified the answers I got (they were correct, this time) at a later time, when I had a bit more breathing room to look stuff up.
No doubt you have had similar experiences.
Free & Easy Category 1 CME Credits
If you are a regular reader, you’ll know this is one of my favorite perks. Platforms like OpenEvidence (free) and VisualDx (paid) award you CME credits every time you search for and read up on a query. The process is simple and painless, and awards you the credit you deserve for your ongoing learning.
Extra Features & Tools
DoxGPT is part of the Doximity web and mobile app, which also features Doximity Dialer, a video visit platform, digital fax, and an AI scribe. You are probably already on Doximity (80% of clinicians are, according to their numbers), so this is a huge advantage. Open the app on your phone and any of these tools you need are right at your fingertips.
OpenEvidence also has a HIPAA-compliant dialer and AI scribe built into its easy-to-use web app and its free Android and iOS apps. Having multiple tools in one place limits context switching, which itself contributes to mental fatigue. OpenEvidence is betting that its content partnerships with top journals will increase trust (and therefore adoption). Doximity, for its part, recently acquired Pathway, an established medical LLM that strengthens its product. Either one is a great medical app to add to your device.
A Review of the Literature
We all know about hallucinations, so the question for us is: can we trust a medically fine-tuned LLM to act as a true clinical decision support tool, or is it confidently spewing out nonsense?
Reliability & Validity
This is probably the most important factor for widespread adoption of LLMs in healthcare.
A 2025 study of healthcare workers found that 40% of respondents used LLMs like ChatGPT at least weekly and 70% rated their experiences as positive. However, many healthcare providers in the study expressed concerns about the accuracy and reliability of the models.
Studies of general medicine LLMs (OpenEvidence in particular) tend to show that the platform is accurate in medical scenarios. In practice, however, its output did not substantially impact clinical decision making or modify plans, though it did tend to reinforce existing plans.
General purpose LLMs often perform relatively poorly in real world medical situations, despite their ability to pass the USMLE. A 2024 Nature Communications study evaluating GPT-3.5-turbo and GPT-4-turbo found accuracy rates of 8% and 24%, respectively, when asked to provide clinical recommendations based on emergency department records.
A scoping review (an overview of a topic less targeted than a systematic review) published in 2025 identified several studies that attempted to evaluate LLM performance as a clinical decision support (CDS) tool in psychiatry.
I personally don’t think the LLMs did very well here – in one study using case based vignettes (which lack the full context of a real patient), ChatGPT 3.5 achieved a “Grade A” rating in only 61 out of 100 cases. There was no clinical validation performed on this analysis either.
Another study in this review evaluating GPT-4’s performance against community clinicians showed the model selected appropriate bipolar depression treatment in 50.8% of cases (which was slightly higher than that of community clinicians). This was another vignette-based prompt series comparing the outputs of LLMs under various conditions with expert opinion on the same vignettes.
No strong studies (and exactly zero RCTs that I could dig up) have found that any LLM improves patient outcomes, improves diagnostic accuracy or treatment selection, or reduces adverse events in psychiatric practice. Most studies evaluate LLM performance with standardized test sets, vignettes, and clinician surveys, so the strength of these findings remains up for debate. These are still the early days of AI, but none of this inspires particular confidence (or dread) that psychiatrists and therapists will be replaced by AI anytime soon.
Bias & Stigma
Bias is not a new problem in psychiatry, but LLMs risk scaling existing biases extremely easily. Large language models are trained on enormous volumes of text, including published research, clinical guidelines, and of course, disturbingly unfiltered internet content (and have you seen the internet lately – yikes).
Even when “medically fine-tuned,” these models inherit the assumptions, omissions, and structural biases present in their training data. Multiple peer-reviewed studies demonstrate that AI tools can exhibit demographic bias in clinical contexts. In JAMA Network Open, a cross-sectional vignette study showed that AI chatbots provided different recommendations based on patient gender, race, and socioeconomic status, underscoring the risk that algorithmic outputs may reproduce or amplify inequities present in healthcare systems.
A qualitative comparison of four LLMs showed that the models often proposed inferior treatments when the vignette explicitly or implicitly indicated that the patient was Black.
Importantly, retrieval-augmented generation (RAG) does not eliminate bias. It may improve factual accuracy, but if the underlying literature is biased (and much of psychiatry’s literature historically is), the model will faithfully reproduce those biases with impressive efficiency.
Clinicians should also be aware of ‘automation bias.’ This is the human tendency to trust and rely upon outputs generated by automated systems (like AI tools) while undervaluing our own critical judgment. Over-reliance on these systems could lead to significant consequences in psychiatry.
Lack of Regulatory Standards & Potential Liability
At present, most clinician-facing medical LLMs exist in a regulatory gray zone.
Many vendors are careful to label their products as “informational” or “educational,” and “not medical devices,” even when they clearly function as clinical decision support systems. This allows them to avoid FDA oversight while still being marketed directly to clinicians as point-of-care tools.
From a regulatory standpoint, this places the burden of responsibility squarely on the user. There is nothing new about this; even CME programs do not assume any liability for their content.
Key issues here include:
- Liability: If an AI-generated recommendation contributes to patient harm, responsibility almost certainly rests with the clinician, not the vendor. There is little legal precedent to suggest otherwise.
- Transparency: Most platforms do not fully disclose their training data, fine-tuning processes, or update cadence, making independent validation difficult.
- Version drift: Models can change behavior over time with updates, sometimes without clear documentation, raising concerns about consistency and auditability.
- Institutional governance: Many clinicians are using these tools outside formal health system approval or oversight, particularly when accessing them via mobile apps or personal accounts.
From a medico-legal perspective, using an LLM as a reference is defensible; using it as a decision-maker is not. Until regulatory frameworks mature, clinicians should assume that AI recommendations offer zero liability protection and full personal risk.
Best Practices
1. Select tools fine-tuned with medical information
Not all LLMs are created equal. General-purpose chatbots are optimized for linguistic fluency, not clinical accuracy. If you are going to use AI as a second opinion, prioritize tools that are explicitly trained or fine-tuned on medical literature and clinical workflows. Purpose-built medical LLMs are not perfect, but they tend to hallucinate less often and anchor responses to established guidelines and literature.
That alone is reason enough to avoid using consumer-grade chatbots for clinical questions when better options exist at the same price point (free).
2. Treat outputs as suggestions, not directives
LLM outputs are effectively digital curbside consults. As we mentioned above, the buck stops with you for any decisions you make, regardless of where the suggestion came from.
3. Understand its limitations
LLMs are particularly flawed when they have incomplete or conflicting information, or when decisions are heavily influenced by the patient’s own values (risk tolerance, side effect non-negotiables, etc). The way you frame your prompt also impacts the way the model will respond.
4. Stay up to date
These tools are evolving rapidly, and their behavior may change over time as they are updated. They may become better at some tasks and worse at others. New platforms will emerge with their own benefits and weaknesses. One way to make sure you are getting updates is to join our email list, where you’ll see the latest industry trends and jobs for healthcare professionals.
Modern MedEd Takeaway
This is one of the most promising areas of AI in my opinion. Keep an eye out for exponential improvements in this technology over the coming months and years. Try out a few medical LLMs for yourself and see how you like them. But keep your clinical reasoning sharp and don’t outsource your mind to these tools.
Additional Citations
American Medical Association. (2023). 2 in 3 physicians are using health AI, 78% in 2023. Retrieved from https://www.ama-assn.org/practice-management/digital-health/2-3-physicians-are-using-health-ai-78-2023
Bouguettaya, A., Stuart, E.M. & Aboujaoude, E. Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models. npj Digit. Med. 8, 332 (2025). https://doi.org/10.1038/s41746-025-01746-4
Hua, Y., Na, H., Li, Z. et al. A scoping review of large language models for generative tasks in mental health care. npj Digit. Med. 8, 230 (2025). https://doi.org/10.1038/s41746-025-01611-4
Hurt RT, Stephenson CR, Gilman EA, et al. The Use of an Artificial Intelligence Platform OpenEvidence to Augment Clinical Decision-Making for Primary Care Physicians. J Prim Care Community Health. 2025;16:21501319251332215. doi:10.1177/21501319251332215
Kim J, Cai ZR, Chen ML, Simard JF, Linos E. Assessing Biases in Medical Decisions via Clinician and AI Chatbot Responses to Patient Vignettes. JAMA Netw Open. 2023;6(10):e2338050. doi:10.1001/jamanetworkopen.2023.38050
Perlis, R.H., Goldberg, J.F., Ostacher, M.J. et al. Clinical decision support for bipolar depression using large language models. Neuropsychopharmacol. 49, 1412–1416 (2024). https://doi.org/10.1038/s41386-024-01841-2
Shah, N., Pfeffer, M., Liang, P., et al. (2025). Holistic evaluation of large language models for medical applications. Stanford HAI. Retrieved from https://hai.stanford.edu/news/holistic-evaluation-of-large-language-models-for-medical-applications
Sumner J, Wang Y, Tan SY, Chew EHH, Wenjun Yip A. Perspectives and Experiences With Large Language Models in Health Care: Survey Study. J Med Internet Res. 2025;27:e67383. Published 2025 May 1. doi:10.2196/67383
Williams, C.Y.K., Miao, B.Y., Kornblith, A.E. et al. Evaluating the use of large language models to provide clinical recommendations in the Emergency Department. Nat Commun 15, 8236 (2024). https://doi.org/10.1038/s41467-024-52415-1






