Millions of Americans use the Internet to answer questions about their health. The public availability of powerful artificial intelligence models like ChatGPT will only accelerate this trend.
In a large survey, more than half of American adults reported entering their health information into large language models (LLMs). There is reason to believe these models can bring real value to people. Consider the case of a mother whose son had chronic pain and who visited 17 doctors without ever receiving a diagnosis. She entered his MRI report and additional medical history into ChatGPT, which suggested tethered cord syndrome, a diagnosis later confirmed by a neurosurgeon who performed corrective surgery.
This story is not unique. Missed and delayed diagnoses harm patients every day. Each year, an estimated 795,000 Americans die or become permanently disabled because of misdiagnosis. And these misdiagnoses are not limited to rare "zebras" like tethered cord syndrome: many involve common conditions such as heart disease and breast cancer, and just 15 or so diseases account for about half of the serious toll. The sicker a person is, the more common such mistakes become. In a recent study of patients who were admitted to the hospital and then transferred to intensive care as their condition worsened, 23% had experienced a diagnostic error, and 17% of those errors caused serious harm or death.
Although many factors contribute to diagnostic errors, many of them outside a physician's control, human cognition plays a major role. The medical community has long recognized the problem. The Institute of Medicine's landmark 1999 report, "To Err Is Human," included comprehensive recommendations for addressing diagnostic errors. Yet 25 years later, these errors persist.
Many people might imagine that doctors approach diagnosis like Sherlock Holmes (or Dr. House), diligently gathering facts and weighing them against an encyclopedic knowledge of disease. The reality is much more mundane. Decades of psychological research, influenced by the pioneering work of Daniel Kahneman and Amos Tversky, have shown that diagnosis, like other forms of human judgment, is subject to predictable biases and heuristics. For example, if the triage information mentions heart failure, an emergency room physician is less likely to test for a pulmonary embolism (a blood clot in the lungs), even when objective data and documented symptoms point to one. In other words, doctors fixate on the information they are given first, a problem called anchoring bias.
Doctors are also poor at estimating how likely a patient is to have a disease and how a test result changes that probability, tasks at which general-purpose language models already outperform them. Decades of research have similarly shown that other cognitive biases, such as availability bias, confirmation bias, and premature closure, pervade the diagnostic process.
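To see what "how testing changes that probability" means in practice, here is a minimal sketch, in Python, of the standard Bayesian update clinicians are implicitly asked to perform; the pretest probability and likelihood ratios below are made-up illustrative values, not figures from the studies discussed here.

```python
# Illustrative only: how a test result should revise the probability of disease.
# The numbers (15% pretest probability, likelihood ratios of 1.7 and 0.1) are
# assumed example values, not clinical recommendations.

def posttest_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Convert pretest probability to odds, apply the test's likelihood ratio,
    and convert back to a probability (standard Bayesian updating)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Example: a clinician suspects pulmonary embolism with a 15% pretest probability.
# A positive test with an assumed likelihood ratio of 1.7 raises that only modestly,
# while a negative test with an assumed likelihood ratio of 0.1 nearly rules it out.
print(round(posttest_probability(0.15, 1.7), 2))   # ~0.23
print(round(posttest_probability(0.15, 0.1), 3))   # ~0.017
```

Research suggests that clinicians asked to do this updating in their heads tend to misjudge both the starting probability and the size of the revision, which is part of what makes the biases above so consequential.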
Since ChatGPT became publicly available in late 2022, there have been hundreds of demonstrations of the diagnostic reasoning capabilities of general-purpose large language models and other AI models across a wide range of common diagnostic tasks, some of them carried out with our own collaborators. We believe there is compelling evidence that AI, safely integrated into clinical workflows, can help address some of the limitations of human cognition in medical diagnosis today. In particular, AI could be used as a "second opinion" service within hospitals, assisting doctors and other medical professionals with difficult cases or checking for blind spots in their diagnostic reasoning. Second opinion services staffed by human physicians have already shown, albeit on a small scale, that they can provide real value to patients.
What would this actually look like?
Building second-opinion systems that leverage large language models is no longer the realm of science fiction. As a doctor who treats patients (AR) and a medical AI researcher (AM), we envision a system in which the treating physician places an "order" through the electronic medical record. But instead of selecting a diagnostic test, the physician summarizes the clinical question about the patient the same way they would when talking to a colleague. When the order is submitted, the question, along with the patient's entire chart, is sent to a secure computing environment and processed by an LLM, which returns recommendations about possible diagnoses, blind spots, and treatment options.
Just as the diagnosis of tethered cord syndrome in the case above was confirmed by a neurosurgeon, the model's recommendations would first be reviewed by a physician acting as the human in the loop, to catch obvious mistakes and hallucinations (instances in which the model confidently states factual inaccuracies). After this review, the second opinion would be sent back to the ordering physician and entered into the medical record.
As with a human second opinion, the ordering physician would not be obligated to follow the LLM's recommendations. But simply considering other possibilities can reduce diagnostic errors. And unlike human second-opinion services, the cost of running a model can be measured in cents, and a single model can serve large numbers of clinicians and their patients in parallel.
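To make the workflow concrete, here is a minimal sketch in Python of how such an ordering pipeline might be wired together. Every name in it (SecondOpinionOrder, call_llm, and so on) is a hypothetical illustration, not part of any real electronic health record or LLM vendor API, and a real deployment would need far more rigorous security, parsing, and review interfaces than shown.

```python
from dataclasses import dataclass

@dataclass
class SecondOpinionOrder:
    """A hypothetical 'order' a treating physician places in the EHR."""
    patient_id: str
    clinical_question: str   # free-text question, phrased as if asking a colleague
    chart_text: str          # the patient's chart, exported as text

@dataclass
class SecondOpinion:
    """The draft output that a reviewing physician must approve."""
    model_output: str
    reviewed_by_physician: bool = False

PROMPT_TEMPLATE = """You are providing a diagnostic second opinion for a treating physician.

Clinical question:
{question}

Patient chart:
{chart}

List (1) possible diagnoses, (2) blind spots in the current workup,
and (3) suggested next diagnostic or treatment steps."""

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM running inside the hospital's
    secure computing environment; the model, endpoint, and safeguards
    are implementation details left unspecified here."""
    raise NotImplementedError("Wire this to an institution-approved LLM endpoint.")

def draft_second_opinion(order: SecondOpinionOrder) -> SecondOpinion:
    """Assemble the prompt from the order and chart and query the model.
    The draft is not yet visible to the ordering physician."""
    prompt = PROMPT_TEMPLATE.format(
        question=order.clinical_question, chart=order.chart_text
    )
    return SecondOpinion(model_output=call_llm(prompt))

def release_to_chart(opinion: SecondOpinion, reviewer_approved: bool) -> SecondOpinion:
    """A physician acting as the human in the loop approves the draft,
    catching obvious mistakes and hallucinations, before it is filed
    in the medical record and returned to the ordering physician."""
    if not reviewer_approved:
        raise ValueError("Draft rejected by reviewing physician; do not file in the record.")
    opinion.reviewed_by_physician = True
    return opinion
```

The essential design choice is that the model's draft never reaches the chart or the ordering physician until a human reviewer has signed off, mirroring how the tethered cord diagnosis was confirmed by a neurosurgeon before surgery.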
To be sure, there are obvious risks that must be mitigated. LLMs absorb ethnic, racial, and gender biases from the data they are trained on, biases that could influence a second opinion in unpredictable and harmful ways. LLMs can also hallucinate. Humans make mistakes too, but AI's hallucinations may be stranger and more likely to cause harm. Particularly in early research, the close involvement of human experts is absolutely essential.
Still, we believe now is the time to begin studying these technologies: the risks of maintaining current diagnostic error rates are too high, and other attempts to reduce errors have failed to make a meaningful impact. To err, as the old saying goes, is human. That is exactly why AI should give its opinion.
Adam Rodman is a practicing internal medicine physician at Beth Israel Deaconess Medical Center and an assistant professor at Harvard Medical School. Arjun K. Manrai is an assistant professor of biomedical informatics at Harvard Medical School and a founding associate editor of NEJM AI.