AI in Medicine: Discovering GPT-4's Potential to Revolutionize Healthcare through Real-Life Examples

July 28, 2023

Hi friends,

I recently read a book called The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee, Carey Goldberg, and Isaac Kohane. I really enjoyed reading the book, and I wanted to share what stood out to me and the areas of medicine and clinical research where large language models (LLMs) can excel. To conclude, I share my main takeaways and briefly touch on AI ethics, GPT-4's limitations, and a novel way to detect hallucinations with a verifier instance of GPT-4.

Context

But first, some background info about the book and its co-authors. Peter Lee, PhD, is currently the head of Microsoft Research. Formerly, he led computing projects at DARPA and chaired the computer science department at Carnegie Mellon University. Carey Goldberg is a medical and science journalist with over 32 years of experience. She served as the Boston bureau chief for The New York Times and Bloomberg News. And Isaac Kohane, MD, PhD, is the inaugural Chair of Harvard Medical School's Department of Biomedical Informatics and has worked on medical AI since the 1990s.

The book features examples of real conversations with GPT-4, ranging from GPT-4 helping doctors with their paperwork to its successful diagnosis of a one-in-a-million rare disease. The co-authors evaluate the quality and accuracy of GPT-4's responses while flagging its current limitations and discussing ethical concerns.

What stood out to me?

What stood out to me was GPT-4's apparent superhuman ability to diagnose rare diseases on the order of one in a million.

In chapter 4, Dr. Isaac Kohane shared a true story from his time as a pediatric endocrinologist. A couple of years ago, another pediatrician called him to consult on a newborn with hypospadias and a small phallus. He confirmed the diagnosis and found no other abnormalities, but he could not feel testicles in the scrotum. More recently, Dr. Kohane gave GPT-4 all the details he had known about the newborn and asked for help diagnosing the baby's disease. I share GPT-4's complete response from the book because, according to Dr. Kohane, it captures what seems to be genuinely deep expertise.

GPT-4's response (in italics):

There are a few possible differential diagnoses for this boy with small phallus, hypospadias, and non-palpable gonads, such as:

  • Congenital adrenal hyperplasia (CAH): this is a group of inherited disorders that affect the adrenal glands and their ability to produce steroid hormones. One of the most common forms is 21-hydroxylase deficiency, which causes excess androgens and salt-wasting in affected newborns. Some of the features of CAH include ambiguous genitalia, hypospadias, micropenis, and cryptorchidism in males, and virilization, clitoromegaly, and labial fusion in females. CAH can be detected by newborn screening, and confirmed by measuring serum 17-hydroxyprogesterone, cortisol, and electrolytes.
  • Androgen insensitivity syndrome (AIS): this is a genetic disorder that affects the development and function of the male reproductive system. It is caused by mutations in the androgen receptor (AR) gene, which impair the ability of the body to respond to androgens. Depending on the degree of AR dysfunction, AIS can present with complete, partial, or mild phenotypes. Complete AIS (CAIS) results in a female external appearance with a blind-ending vagina, cryptorchid testes, and absent uterus and fallopian tubes. Partial AIS (PAIS) can cause variable degrees of virilization, ranging from female to ambiguous to male genitalia, with hypospadias, micropenis, bifid scrotum, and gynecomastia. Mild AIS (MAIS) can cause subtle features such as small phallus, infertility, or reduced body hair. AIS can be diagnosed by measuring serum testosterone, luteinizing hormone (LH), and anti-Müllerian hormone (AMH), and by genetic testing of the AR gene.
  • Gonadal dysgenesis: this is a heterogeneous group of disorders that affect the development and function of the gonads. It can be caused by chromosomal abnormalities, such as Turner syndrome (45,X), mixed gonadal dysgenesis (45,X/46,XY), or 46,XX testicular or ovotesticular disorder of sex development (DSD), or by gene mutations, such as SRY, NR5A1, MAP3K1, or WT1. Gonadal dysgenesis can cause various phenotypes, such as female, ambiguous, or male genitalia, with or without hypospadias, micropenis, cryptorchidism, or streak gonads. Gonadal dysgenesis can be diagnosed by karyotype analysis, hormonal evaluation, and molecular testing.
  • Pituitary or hypothalamic dysfunction: this is a rare cause of small phallus, hypospadias, and cryptorchidism in newborn males. It can be due to congenital or acquired defects in the hypothalamic-pituitary axis, which regulate the production and secretion of gonadotropins and other hormones. Some of the causes include genetic syndromes, such as Kallmann syndrome, Prader-Willi syndrome, or septo-optic dysplasia, or structural anomalies, such as hypophyseal aplasia or hypoplasia, pituitary tumors, or cysts. Pituitary or hypothalamic dysfunction can be diagnosed by measuring serum gonadotropins, prolactin, growth hormone, thyroid-stimulating hormone, and cortisol, and by imaging studies of the brain and pituitary.

When Dr. Kohane then asked GPT-4 for the next steps (i.e. what diagnostics to run) to narrow down the differential diagnosis, GPT-4 responded in the same way he would have with all his years of experience as a specialist and expert in his field. Dr. Kohane then told GPT-4 some hormonal levels and that the ultrasound showed the baby had a uterus in its abdomen.

GPT-4's response:

Based on these new results, I have to revise my differential diagnosis and now consider CAH as the most likely diagnosis for this baby. The elevated 17-hydroxyprogesterone, ACTH (a hormone made by the pituitary that stimulates the production of steroids by the adrenal gland) and 11-deoxycortisol levels are indicative of a defect in the steroidogenic pathway that leads to cortisol deficiency and androgen excess. The most common defect is 21-hydroxylase deficiency, which accounts for about 95% of CAH cases, but the normal newborn screening for 17-hydroxyprogesterone argues against this diagnosis. Therefore, I would suspect a less common form of CAH, such as 11-beta-hydroxylase deficiency or 3-beta-hydroxysteroid dehydrogenase deficiency, which can cause similar features, but are not detected by the newborn screening. The normal AMH levels and the XX karyotype suggest that the baby has ovaries as their gonads, and that their virilized genitalia are due to the prenatal exposure to excess androgens. To confirm the diagnosis, I would perform a molecular analysis of the genes involved in these forms of CAH, such as CYP11B1 or HSD3B2, and I would start glucocorticoid therapy to restore the cortisol levels and suppress the ACTH and androgen production…

Dr. Kohane confirmed that "in real life, further genetic testing did reveal a mutation in the gene for 11-beta-hydroxylase. GPT-4 had just diagnosed a disorder so rare it affects fewer than 1 in every 100,000 babies."

If this story captivated you, be sure to get the book and finish chapter 4, in which Dr. Kohane also shared the story of a boy he encountered in his work with the Undiagnosed Disease Network (UDN). Through genomic sequencing, a team of expert clinicians identified the single gene responsible for the boy's disease and treated him with the missing neurotransmitters, which allowed him to walk again. This one-in-a-million case is practically unique, and yet when fed the relevant medical data about the boy, GPT-4 generated a list of genes to investigate for mutations, along with its reasoning. That list included the actual POLR3A gene responsible for the boy's disease!

I’ve known all too many doctors with a superb bedside manner, who were beloved by their patients, and who dispensed incorrect advice and therapeutic plans with confidence. Certainly, great bedside manner at a societal scale would be one of the century’s major medical milestones — but only if coupled with reliable decision-making. -Dr. Isaac Kohane

Opportunities for Large Language Models (LLMs) in healthcare and clinical research

First, GPT-4 has the potential to significantly reduce the administrative burden of clinicians, reducing burnout and job dissatisfaction.

Several major studies have shown that paperwork – writing medical notes, filling out insurance claims, requesting prior authorization from insurance companies for lab tests and prescriptions, and more – are now taking up over 49 percent of the working day of many doctors and nurses. -Peter Lee

The issue of excessive paperwork is so prevalent in the healthcare industry that insiders have coined the phrase "pajama time" to describe when doctors and nurses spend a couple of hours in their pajamas before bedtime to finish the paperwork they accumulated throughout the day. The alternative is to do the paperwork during their consults, preventing them from giving their full attention to the patient. Doctors and nurses have already sacrificed so much in pursuing their profession that it feels unfair to ask them to sacrifice even more of their personal time to do paperwork.

GPT-4 can help with this excessive paperwork by generating medical notes from transcripts of clinicians' conversations with their patients. However, GPT-4 is not yet ready to be used without supervision and review in this context because it has a tendency to make things up. Or it might not. This uncertainty and inconsistency are what make it challenging to trust the accuracy and truthfulness of GPT-4's responses.

For instance, when given the transcript of a conversation between a physician and a patient with an eating disorder, GPT-4 generated the medical notes for the patient, including the patient's BMI. However, there was no mention of the patient's weight or height in the transcript. Without the height and weight, it is impossible to calculate the patient's BMI, which means GPT-4 made up the patient's BMI. Thus, medical professionals should always carefully review the medical notes and summaries produced by GPT-4 for accuracy before signing off.
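For context, BMI is computed entirely from those two measurements, so there is nothing legitimate for the model to base the number on if neither appears in the transcript. A minimal illustration in Python, with made-up values:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

# Both inputs are required. Example with made-up values: 70 kg and 1.75 m.
print(round(bmi(70, 1.75), 1))  # 22.9
```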

This problematic behavior is called a "hallucination", and this is a problem OpenAI is actively working to fix. In the last section of this article, I will discuss a possible solution to this problem proposed by the authors of the book: the AI-verifier triad.

Second, GPT-4's medical diagnosis, interpretation, and translation capabilities have the potential to democratize access to healthcare expertise in both the developed and the developing world. As shown earlier, GPT-4's medical expertise was put to the test by Dr. Kohane and it performed beyond expectations, helping diagnose a virtually unique genetic disease.

To demonstrate GPT-4's medical interpretation capability, the book features an example of GPT-4 helping patients understand their lab test results and explanation-of-benefits (EOB) insurance claim statements. Here, GPT-4 explains to the patient what his blood test results mean and gives the patient a line-by-line breakdown of his EOB statement, including what each CPT code means.

In terms of medical translation capabilities, you can ask GPT-4 to translate health record data in and out of different standards. In the book, the authors ask GPT-4 to create HL7 FHIR (Fast Healthcare Interoperability Resources) resources for the lab tests they ordered and to output them in JSON format so that they can be fed into the hospital's Epic EHR system. GPT-4 successfully produced the JSON.
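To make this concrete, here is a rough sketch of what a single FHIR resource for a lab result can look like when serialized to JSON. This is my own illustrative example based on the public FHIR R4 Observation resource, not the output shown in the book, and the patient reference and values are placeholders:

```python
import json

# Illustrative FHIR R4 Observation for a hemoglobin lab result (placeholder values).
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "718-7",  # LOINC code for hemoglobin in blood
            "display": "Hemoglobin [Mass/volume] in Blood",
        }]
    },
    "subject": {"reference": "Patient/example"},  # placeholder patient reference
    "valueQuantity": {
        "value": 13.8,
        "unit": "g/dL",
        "system": "http://unitsofmeasure.org",
        "code": "g/dL",
    },
}

# Serialize to JSON, the format an EHR integration would typically ingest.
print(json.dumps(observation, indent=2))
```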

Considering that GPT-4 was not specifically trained for medicine, is it not fascinating that it can help diagnose rare diseases, explain blood test results, and translate in and out of different, idiosyncratic health data standards?

One thought that occurred to me was that GPT-4's medical interpretation capabilities could very well disrupt the business models of certain medical diagnostics startups. Take, for example, startups like WellnessFX and InsideTracker, which were created to analyze people's blood test results and provide them with specific, actionable ways to improve their health based on those results. With the arrival of GPT-4, people might stop paying for this service, which costs anywhere from $300 to $700 per blood test. Instead, they might try feeding GPT-4 their blood test results and asking for feedback and advice. I certainly would.

Third, GPT-4 can accelerate the speed of clinical trials, thereby significantly reducing the cost of running a trial. In a clinical research study, one time-consuming step is selecting and enrolling patients who match the eligibility criteria. This requires a person to spend hours recruiting and reviewing each patient's clinical history to see whether their disease, age, and many other criteria match what the research study is looking for. This filtering step is both time-consuming and costly because it is currently done by humans, usually research assistants. The book puts forward an interesting use case for GPT-4.

Preparing for a trial may involve humans reading tens of thousands of clinic notes. Conservative estimates place the cost of reading all the relevant notes for a single patient at between $150 and $1,000. What if we could just have a large language model go through the entire electronic health record to look for eligible patients and exclude those that do not meet criteria? Having that capability could cut months to years from the timeline. It’s been estimated that a delay of one month can cost a pharmaceutical company between $600,000 and $8 million. -Peter Lee
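As a rough sketch of how such a screening pass could be wired up, here is what a first attempt might look like using the OpenAI Python SDK. The eligibility criteria, prompts, and model name are placeholders of my own, not from the book, and any real deployment would still need a human to review every decision:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = """Inclusion: adults 18-75 with type 2 diabetes diagnosed more than 1 year ago.
Exclusion: pregnancy, chronic kidney disease stage 4-5, current insulin use."""

def screen_note(clinic_note: str) -> str:
    """Ask the model whether a clinic note suggests the patient may be eligible.

    Returns the model's assessment as text; a human must still review it.
    """
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You screen clinic notes against trial eligibility criteria. "
                        "Answer ELIGIBLE, INELIGIBLE, or UNCLEAR, then explain briefly."},
            {"role": "user",
             "content": f"Criteria:\n{CRITERIA}\n\nClinic note:\n{clinic_note}"},
        ],
    )
    return response.choices[0].message.content
```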

Accelerating the speed of clinical trials has implications beyond just cost savings. If an effective therapy or drug for cancer can be given to patients earlier because GPT-4 helped speed up the clinical trial and gain approval a month early, then more lives will be saved.

Finally, I highly recommend you read the prologue, which showcases all of GPT-4's healthcare capabilities in one succinct, science-fiction-like story. It will give you a glimpse of the future of healthcare with GPT-4 by taking you through a fictional day in the life of a second-year resident physician as she works with GPT-4 to get through her rounds. The dialogue between the resident and GPT-4 gave me flashbacks to how Tony Stark talks to his AI assistant J.A.R.V.I.S. (Just A Rather Very Intelligent System) in the Iron Man movies. The entirety of the prologue is available for free in the book sample on Amazon.

My Takeaways

My main takeaway from this book was the need to be sober-minded about GPT-4's shortcomings, limitations, and potential biases in spite of its seemingly superhuman reasoning and expertise. This is especially important in healthcare where wrong decisions and information in certain contexts can seriously harm a person's health and livelihood.

Currently, one of GPT-4's limitations is the absence of active learning. In healthcare, it is important for doctors to stay up to date on the latest medical literature and best practices. However, once a large language model like GPT-4 is trained and deployed, it is impractical to update GPT-4's knowledge base every time a significant medical development or update comes out because retraining a large language model is a process that takes a lot of time and money.

GPT-4's lack of long-term memory is another limitation. Every time you start a new conversation with GPT-4, it does not remember anything you told it previously. This means that if you instruct GPT-4 to correct a mistake it made, it won't remember the correction the next time you start a new conversation. In addition to its tendency to "hallucinate", GPT-4 also sometimes forgets details about a patient even within the same conversation, which is equally concerning.
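This follows from how chat-style APIs typically work: the model only sees what is packed into the current request, so any standing correction has to be re-sent every time. A minimal sketch of that pattern, again assuming the OpenAI Python SDK and a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The model only sees the messages in this request; a correction made in an earlier
# conversation is gone unless it is explicitly included again.
messages = [
    {"role": "system", "content": "You draft medical notes from visit transcripts."},
    {"role": "user", "content": "Standing correction: never report a BMI unless both "
                                "height and weight appear in the transcript."},
    {"role": "user", "content": "Transcript: (supplied at run time)"},
]
reply = client.chat.completions.create(model="gpt-4", messages=messages)
print(reply.choices[0].message.content)
```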

One practical solution to GPT-4's hallucinations and forgetfulness is to use what the book calls an AI verifier. This means running a second, "verifier" instance of GPT-4 in parallel with the main instance. Effectively, you are now running two instances of GPT-4: the AI assistant and the AI verifier. You interact normally with the "assistant" GPT-4 while the "verifier" GPT-4 runs in the background, checking the correctness and accuracy of the assistant's responses. Surprisingly, this protocol has proven effective at catching GPT-4's hallucinations, like the made-up BMI from earlier! However, it is important to use a separate instance of GPT-4 for the AI verifier because GPT-4 has trouble admitting its own mistakes.
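Here is a minimal sketch of what this assistant/verifier split could look like in code, again assuming the OpenAI Python SDK; the prompts and the two-call structure are my own illustration of the idea, not code from the book:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def assistant_draft(transcript: str) -> str:
    """First instance: draft a medical note from the visit transcript."""
    r = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system", "content": "Draft a concise medical note from this transcript."},
            {"role": "user", "content": transcript},
        ],
    )
    return r.choices[0].message.content

def verifier_check(transcript: str, note: str) -> str:
    """Second, separate instance: flag anything in the note not supported by the transcript."""
    r = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Compare the note to the transcript. List every claim "
                                          "in the note (e.g. a BMI value) that the transcript "
                                          "does not support."},
            {"role": "user", "content": f"Transcript:\n{transcript}\n\nNote:\n{note}"},
        ],
    )
    return r.choices[0].message.content
```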

In machine learning, another issue is the potential for a model to be biased, and GPT-4 is no exception. One potential source of bias is the data GPT-4 was trained on. Pinning this down is difficult because GPT-4 is a general-purpose model trained on a large corpus of text from many different sources, including the internet, so it is hard to say exactly what information its "medical training" is based on. If GPT-4 was trained on medical data and case studies from hospital systems based in the US, the probabilities it assigns to its differential diagnoses might not generalize, for instance, to the medical data and statistical distribution of diseases in African countries. One could imagine a situation where GPT-4 confidently misdiagnoses a patient in Africa with something relatively benign when in fact they have something more serious like malaria, simply because the likelihood of malaria in the US is low relative to Africa.

Similarly, understanding the appropriate use of AI in healthcare is critical. In the book, John Halamka, president of the Mayo Clinic Platform and a researcher on medical information systems, divides the use cases of healthcare AI into two buckets: low-risk and high-risk. In his opinion, low-risk tasks like filling out insurance claims require little new regulation while high-risk tasks, namely those directly affecting patients, would always require a human review.

However, as recent events demonstrate, the implementation of AI in healthcare is not without its perils. Poorly designed policies and protocols around the use of AI models by hospital administrators across America have already begun to have dire consequences for patients. Just last month, a news article by Lisa Bannon from the Wall Street Journal detailed how nurses have been feeling pressure from hospital administrators to defer to AI models' diagnosis, leading to the preventable death of a retired professional tennis player at a Kaiser hospital. The hospital was fined $3 million. The WSJ article explained how some hospitals require nurses to get a doctor's approval to override the AI model. At other hospitals, while the nurses did not need a doctor's approval to override the algorithm, they were held liable and penalized when they made the wrong decision in overriding the AI model.

Whether a nurse is confident enough to trust her own judgment to override an algorithm often depends on hospital policy. Clinicians who are penalized for making a wrong decision may start deferring to the computer, nurses say. -Lisa Bannon, WSJ reporter

When policies and protocols around the use of AI are poorly designed, nurses are incentivized to go against their judgment honed over years of practice and to instead defer to the AI model, which ultimately ends up harming the patient they are caring for.

This brings us to another crucial point: even the seemingly low-risk uses of AI in healthcare can seriously harm or negatively impact a person's life. Consider the possibility that insurance companies might already be using AI and GPT-4 to automate insurance claim approvals and denials. Is this use case considered low risk or high risk? At first, it might seem low risk because an incorrectly denied claim is not a matter of life or death. But one might also argue that this is high risk because a denied insurance claim directly affects the patient, namely their finances and their future access (or lack thereof) to quality medical care. A denied claim might lead a patient to postpone or stop their medical treatment. A denied claim can also lead to financial stress, resulting in a decline in mental health which then exacerbates their pain levels. The fine line between low and high-risk becomes blurry when considering the severe financial repercussions and health complications that a denied insurance claim can create.

In March 2023, STAT News wrote an investigative journalism piece called Denied by AI: How Medicare Advantage plans use algorithms to cut off care for seniors in need. In the opening paragraphs, we are told about an algorithm predicting a rapid recovery for 85-year-old Frances Walter, who had a shattered left shoulder and a pain medication allergy. Despite her clinical notes noting her ongoing pain and inability to perform basic tasks, her insurer relied on the algorithm and cut off payment for her care on the 17th day of her treatment because the algorithm predicted she should have healed in 16.6 days. It took a federal judge ruling against the insurance company for her to get reimbursed a year later, but during that year she had to use up her life savings to continue her medical care, not knowing whether she would ever get reimbursed.

Conclusion

The implications of GPT-4's limitations and potential biases are clearly significant. As illustrated, its incapacity for active learning, lack of long-term memory, tendency to hallucinate, and unknown bias arising from the training data pose serious problems in the healthcare sector. These issues are compounded by the poorly designed policies around the use of AI models by hospital and insurance administrators, resulting in real harm to the sick and vulnerable.

Furthermore, many questions and ethical concerns remain unanswered. Is GPT-4 capable of causal reasoning? Can it truly understand what it reads and writes? Where is its "medical training" coming from? Should it consider the cost of the treatment plan it proposes? For people with no access to medical care in the developing world, is no advice better than unsupervised medical advice from an AI model that is sometimes incorrect?

Yet GPT-4 has shown incredible promise in the field of medicine and clinical research, with its superhuman medical diagnosis, interpretation, and translation capabilities complemented by its ability to ease the huge administrative burden that currently rests on the weary shoulders of healthcare professionals.

Given these realities, the crucial question becomes: how can we harness the full potential of GPT-4 to revolutionize healthcare in a responsible and ethical manner? One thing is certain, though: GPT-4 is here to stay, and unless we work hard to address its shortcomings and to build a comprehensive, ethically sound regulatory framework around its use, more people will get hurt as our collective trust in AI slowly erodes.

Jony Hu


There were so many more interesting examples and stories from the book that I was not able to share, but I hope this blog post convinced you to go out and read the entire book. Most of all, thank you to Peter Lee, Carey Goldberg and Dr. Isaac Kohane for the hard work they put into writing this book, without which this article would not have been possible!

If you enjoyed this read and would like to receive my bi-weekly blog posts, sign up for the email list by clicking on the blue Subscribe button!

I like to write about healthcare, science, and technology. I also write tech tutorials for people in their 50s and feature articles about pressing social issues. To learn more about my blog, read my introductory article here!