We’re excited to finally share the details of LLMD!
LLMD – an LLM trained like an MD – is our AI model for interpreting a patient’s medical records. Check out our prior post for a discussion of how we approach modeling a patient’s medical history from their records, and be sure to read our discussion of what makes this such a difficult and interesting AI problem to tackle.
The paper we posted today describes how we built LLMD and how it compares to other LLMs that have shown promising results in the medical domain.
In a lot of ways, the results speak for themselves – we proudly show off state-of-the-art results on PubMedQA, a leading benchmark of medical AI performance, and more importantly we show that there’s a huge performance gap between today’s most powerful LLMs and LLMD when working with patient records. Along the way, we found a few fascinating insights that shed light on how LLMs should be trained to work with real-world data.
First, yes, getting LLMs to (approximately) capture medical knowledge works! But getting them to recall it – to interpret and respond to a question – is harder, and few models do that well today. We saw that in their text responses, most models trained to maximize medical knowledge struggled to handle variations in context and questions. In fact, the most production-hardened general models, such as GPT-4o, responded better even though they knew less about medicine. This generality is critical because medical records are so messy, and it highlights that tolerance to noise, variation, and contradiction is as important as medical knowledge when building an AI model to work with real-world patient data.
When it comes to the data available to power real-world medical applications today, narrative text in medical records is the richest source of information out there – but medical records also exacerbate the shortcomings of many LLMs. Interpreting record contents is even harder than answering questions: records are written in their own "language" that differs from doctor to doctor and facility to facility, and any single record is just a partial slice of a patient’s healthcare journey. Only when an AI model looks at records longitudinally can it accurately piece together someone’s medical history. We trained LLMD directly to do that. First, we performed a large continued-pretraining step that tailored Llama3.1 to the patterns and peculiarities of medical records while imbuing the model with the medical knowledge needed downstream. Then we trained it on structuring tasks that get record data into a form suitable for modeling, and on abstraction tasks that mimic how doctors read records. Together, these steps open up a huge gap between LLMD and the other models we compared it to when working with records.
The results are worth it. In the coming months, we’ll release a beta showing off how compelling our AI has made our patient- and virtual-care-focused products – stay tuned. In the meantime, check out our paper! And read the researcher and patient testimonials describing how this work is already helping patients and researchers improve healthcare today using the PicnicHealth platform.