What we mean by "messy" real-world data
In our prior post, we introduced structuring and abstraction, two processes needed to get real-world medical records into a form that we can draw insights from. We also explained how those processes map to the latest AI techniques. In this post, we go deeper into what medical records look like in the wild and what happens when we try to apply AI techniques to them.
Real-world medical records quickly break common assumptions used to develop AI techniques in academic settings, and it’s these real-world issues that make our job difficult and interesting. They’re also the reason that “pretty good on a benchmark” has never translated into a great solution for the tasks we train on.
Noise
Medical records are filled with mistakes. Among our hemophilia patients, each of whom has one of the disease’s two subtypes, A or B, we see that nearly 30% have a contradictory diagnosis somewhere in their records. This could represent a true misdiagnosis, but more often it reflects mundane realities like a provider picking the wrong option from a drop-down menu and other folks copy/pasting that result forward in EHR software. My own record from a leading healthcare institution had my body temperature listed at 98 degrees C, which would be unpleasant if true. In another case, we received a faxed record with some unfortunate artifacts mangling the word “Not” where “Not Detected” was printed as the result of a disease test panel.
Tolerance to these types of errors is table stakes for working with records. For us, it most often means looking at records longitudinally to make sure we have a consistent picture of patient health given all of the evidence. No one record tells the story of a patient’s healthcare journey. In practice, this means training an LLM that looks at snippets from multiple records together, and considers who wrote a piece of information, when, and how it connects to other evidence, before drawing a conclusion.
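To make that concrete, here is a minimal sketch, with hypothetical field names, of what assembling that kind of longitudinal input can look like: each snippet carries its author, specialty, and date so the model can weigh conflicting evidence instead of trusting any single document.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Snippet:
    text: str          # excerpt pulled from one document
    author: str        # who wrote it
    specialty: str     # e.g. "hematology"
    written_on: date   # when it was written
    source_doc: str    # which document it came from

def build_longitudinal_prompt(question: str, snippets: list[Snippet]) -> str:
    # Order evidence chronologically so the model sees the care journey unfold.
    ordered = sorted(snippets, key=lambda s: s.written_on)
    evidence = "\n\n".join(
        f"[{s.written_on.isoformat()} | {s.specialty} | {s.author} | {s.source_doc}]\n{s.text}"
        for s in ordered
    )
    return (
        "You are reading excerpts from several medical records for one patient.\n"
        "Weigh who wrote each excerpt, when, and how it agrees or conflicts with "
        "the other excerpts before answering.\n\n"
        f"{evidence}\n\nQuestion: {question}"
    )
```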
It also means building a multi-layered QC system that lets us define both data conformance and plausibility rules. These rules make sure, for example, that a disease-modifying therapy didn’t start before the patient was diagnosed, or that a condition only possible in female patients isn’t recorded for a male patient (see the sketch below).
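Here is a minimal sketch, with invented rule names and a simplified patient-record shape, of what two plausibility checks of this kind could look like:

```python
# A minimal sketch (invented rule names, simplified record shape) of layered
# QC rules: conformance rules check that values are well-formed, plausibility
# rules check that they make clinical sense given the rest of the record.
from datetime import date

def check_therapy_after_diagnosis(patient: dict) -> list[str]:
    issues = []
    dx_date = patient.get("diagnosis_date")
    for rx in patient.get("therapies", []):
        if dx_date and rx["start_date"] < dx_date:
            issues.append(
                f"Plausibility: {rx['name']} starts {rx['start_date']} "
                f"before diagnosis on {dx_date}"
            )
    return issues

def check_sex_specific_conditions(patient: dict) -> list[str]:
    FEMALE_ONLY = {"pregnancy", "ovarian cyst"}  # illustrative, not exhaustive
    issues = []
    if patient.get("sex") == "male":
        for condition in patient.get("conditions", []):
            if condition.lower() in FEMALE_ONLY:
                issues.append(f"Plausibility: {condition} recorded for a male patient")
    return issues

# Example data only; records that fail a rule are routed to human review
# rather than silently corrected or dropped.
patient = {
    "sex": "male",
    "diagnosis_date": date(2020, 5, 1),
    "therapies": [{"name": "factor VIII concentrate", "start_date": date(2019, 1, 15)}],
    "conditions": ["Hemophilia A"],
}
print(check_therapy_after_diagnosis(patient) + check_sex_specific_conditions(patient))
```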
And most importantly, it means having a technical feedback loop. From the very start of PicnicHealth, it’s been clear that we can’t anticipate all of the artifacts and corner cases we’ll see when working with medical data. Instead, what is most important is building systems to identify mistakes, fix them, and incorporate that result into all of the future processing we do.

Records are big, so LLM context generation and provenance matter
The biggest single record we’ve received was over 24,000 pages. Yup. It combined documents from visits spanning decades within a large hospital system for a patient with sickle cell disease; patients with this condition tend to have many long tables of lab results in their records. Below, we show the distribution of record sizes in our data today, noting that even if we only processed 1,000 records a week, we’d see two with more than 1k pages each week, and usually one with >10k pages.

LLMs are improving in their ability to use and consistently latch onto information in long context windows, and meanwhile exciting new techniques like Selective State Space models might come fully online to handle even more input data. But today, to use the most battle-tested foundation models, we have to manage context window limitations carefully. For our average record of about 30 pages, with about 280 words per page, we need a 32k context window to grok one record at a time (if we assume 3.5 tokens per word). For an easy patient with 20 records, we would need at least a 600k context window to find patterns that occurred over time in their care journey.
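For concreteness, here is the back-of-the-envelope arithmetic behind those numbers, using the assumptions stated above (about 30 pages per record, about 280 words per page, and roughly 3.5 tokens per word):

```python
# Token budgeting with the assumptions from the text above.
PAGES_PER_RECORD = 30
WORDS_PER_PAGE = 280
TOKENS_PER_WORD = 3.5

tokens_per_record = PAGES_PER_RECORD * WORDS_PER_PAGE * TOKENS_PER_WORD
print(tokens_per_record)   # 29,400 tokens -> just fits in a 32k context window

records_per_patient = 20
tokens_per_patient = tokens_per_record * records_per_patient
print(tokens_per_patient)  # ~588,000 tokens -> roughly a 600k context window
```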
Identifying relevant context is particularly important for abstraction tasks. It is both a practical matter and an appropriate representation of how clinicians flip through a stack of records when their time is limited. When it comes to validating abstracted information (and building software interfaces to make that easy), well-designed context generation becomes a powerful way of showing how we uncovered the evidence behind a conclusion.
We find that provider specialty and dates are important first-pass guideposts, and they lend themselves to easily explainable provenance for the data we produce. Searching for individual concepts also helps, but only once we have normalized across the many variations of how they’re written and the many semantically-different but clinically-similar concepts. Connections between concepts – a multiple sclerosis patient must have their diagnosis confirmed by a neurologist to be definitive, a patient with paroxysmal nocturnal hemoglobinuria (PNH) will always have a particular panel of labs drawn – also help us build the right LLM inputs and iteratively refine answers.
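As a rough illustration, assuming the snippet objects sketched earlier plus a hypothetical mentioned_terms field and a normalize() helper that maps surface forms onto shared concept IDs, the first-pass filtering could look something like this:

```python
# A minimal sketch of first-pass context selection: filter candidate snippets
# by provider specialty and date range, then by normalized concept matches,
# before anything is handed to the LLM.
def select_context(snippets, specialties, date_range, target_concepts, normalize):
    start, end = date_range
    selected = []
    for s in snippets:
        if s.specialty not in specialties:
            continue
        if not (start <= s.written_on <= end):
            continue
        # normalize() maps surface variations ("MS", "multiple sclerosis") and
        # clinically similar concepts onto shared concept IDs.
        mentioned = {normalize(term) for term in s.mentioned_terms}
        if target_concepts & mentioned:
            selected.append(s)
    return selected
```

Keeping the filters this explicit is also what makes provenance easy to explain: the specialty, date range, and concept that pulled a snippet into context double as the answer to “why did we look here?”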
Nuance
When we say nuance, we mean uses of language in medical records that we don’t expect to find often in other corpora – and most importantly, in the pre-training dataset for common LLMs.
For example, simple things like the date of a visit – an obvious question from a patient’s perspective – can be absolutely buried within a record. In one recent example we looked at, a document had the date it was printed, the date and time the physician note was first written, the date and time it was amended, the date and time it was signed, the date labs were ordered, the date and time the samples were received at a laboratory (after the visit had concluded), the date and time when those results became available, etc.
In this sea of similar data, the dates are easy to spot with off-the-shelf Named Entity Recognition (NER) models, but the meaning of those dates requires interpretation. In the example we looked at, the dates mapped out the workflow of that particular facility for that particular outpatient visit type. When we ask even a powerful model like GPT-4 a simple question like “when did this appointment happen [as the patient thinks of it]?”, we find that models not trained directly on labeled records data easily miss the nuances in the data they’re looking at. We see the same behavior often when navigating the many names of providers, technicians, and other support staff documented in records, as illustrated below.

It’s in cases like this where the power of our training dataset shines – extensively labeling records (for almost 10 years!) has given us more than enough examples of the esoteric patterns of how medical encounters are documented. Our data captures patterns by facility, by specialty, among clusters of similarly-run practices, and so on, and these patterns get incorporated into our LLM when we fine-tune on such a large volume of data. This level of complexity also demonstrates how much further models need to go beyond off-the-shelf NER to perform even basic real-world applications that require interpreting records.
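To make the visit-date example above concrete, here is a minimal sketch, with an invented and simplified label set, of the interpretation step that has to follow NER: each extracted date gets classified by the text around it, and only one of those roles matches what the patient means by “when the appointment happened.”

```python
import re

# Invented, simplified role patterns; a real classifier learns these
# distinctions from labeled records rather than from a handful of regexes.
DATE_ROLES = [
    (r"date of service|visit date|seen on",      "visit_date"),      # what patients mean
    (r"electronically signed|signed by",          "note_signed"),
    (r"amended|addendum",                         "note_amended"),
    (r"specimen received|received (at|in) lab",   "lab_received"),
    (r"collected|drawn",                          "lab_collected"),
    (r"printed",                                  "document_printed"),
]

def classify_date_role(context_window: str) -> str:
    """Guess the role of a date from the text immediately around it."""
    text = context_window.lower()
    for pattern, role in DATE_ROLES:
        if re.search(pattern, text):
            return role
    return "unknown"  # the unknowns are exactly where labeled training data pays off

print(classify_date_role("Specimen received in lab: 03/14/2023 08:12"))  # lab_received
```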
Clinical Knowledge Matters in the Long Tail
Another tricky dynamic of medical records data is that what matters most is not always apparent in the statistics of the data. That lab panel we mentioned for PNH only happens once per patient and, across the full population of our users, only among a few patients – but it is of the utmost importance. Even an LLM with 99% recall may miss it, and that is a problem we have to solve.
For us, the practical place to start was getting clinicians, epidemiologists, biostatisticians, and AI experts to work together all day, every day. PicnicHealth has teams that intermingle all of these specialties, and we’ve built technical systems to encode domain expertise into even the lowest levels of our structuring tasks. This means tools to help quickly examine data (e.g., to report precision and recall on obscure concepts), QC rules that automatically flag plausibility and conformance issues for human review, and workflows to amplify and augment specific training examples so that our data distribution reflects importance rather than simply prevalence. It also means making sure we can flag and manually review records that may be affected by a blind spot.
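One of those workflows, amplifying specific training examples so the distribution reflects importance rather than prevalence, can be sketched roughly like this (hypothetical weights and example shape; in practice, weights like these would be set with clinician input):

```python
import random

# A minimal sketch of sampling training examples by clinical importance
# rather than raw prevalence, so rare-but-critical patterns like the PNH
# lab panel are not drowned out during fine-tuning.
def importance_weighted_sample(examples, importance, k, seed=0):
    # importance maps a concept tag (e.g. "pnh_panel") to an upweighting
    # factor; untagged examples default to a weight of 1.0.
    weights = [importance.get(ex["concept"], 1.0) for ex in examples]
    return random.Random(seed).choices(examples, weights=weights, k=k)

training_batch = importance_weighted_sample(
    examples=[
        {"concept": "pnh_panel",   "text": "..."},
        {"concept": "routine_cbc", "text": "..."},
    ],
    importance={"pnh_panel": 50.0},
    k=8,
)
```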
Today, our LLM powers both patient-facing and research-facing products. For patients, we provide a platform that gives them ownership of their data, along with tools for curating records, managing and coordinating care, and accessing our own care services. For researchers, we help run far more efficient observational studies, in part by developing software that helps abstractors generate regulatory-grade Case Report Form data quickly and accurately from the information already contained in patient records. This sidesteps the need to run physical sites that are expensive, complicated, and ultimately a burden to everyone involved. The features and products our AI enables are exciting, but what we do today is just the beginning.