By Dr. Cole Stanley

I’m sitting here at 36,000 feet somewhere over the Pacific, on a long-haul flight with no Wi-Fi, and I can finally sustain some focus. My team has given me no easy task this month: write an “engaging” article about the role of our Clinical Data Stewardship Committee (CDSC). Hmmm, maybe we could start by calling it something less yawn-worthy? Perhaps I’m feeling a little snarky due to travel-related lack of sleep. But wait, maybe I can make use of that attitude? My mind drifts to a truly engaging, once-in-a-lifetime experience I had in Whistler a couple of years back, amid social distancing requirements and “no dancing!”. The guests were mandatorily seated in the large high-ceilinged conference centre hall, where some of the more cerebral elements of the show had room to go way over my head. But still, the performer dazzled the audience. She proceeded through a loose interpretation of Dante’s Inferno, replete with a musically-numbered 9 circles of hell, and given it was a drag show, snarkiness was set to max.[1]

Here’s my attempt to, in true QI fashion, shamelessly steal that format, and repurpose it to convince you of this:

The perils and pitfalls of healthcare data are many, and we need teams who function like the CDSC to save us from these circles of data hell.

A short disclaimer before we start our ghoulish tour – I am coming at this from the perspective of a family physician who believes that we need to embed quality improvement (QI) within our daily work, and that measurement and healthcare data should be used as a tool for these efforts. I’m lucky to work in a team-based primary care clinic where we are putting this into practice. I’m also a Physician Quality Improvement (PQI) coach, have experience with team-based and Collaborative QI initiatives, and have spent a fair amount of time thinking about the intricacies of measurement using healthcare data (specifically at HDC and with the Measurement System for Physician Quality Improvement). These are the settings where I have come across the pitfalls, sometimes through firsthand experience.

What I do NOT want you to take away is that I think other system players’ perspectives (e.g. researchers’, data scientists’, policymakers’) aren’t important, or that their purposes for measurement are somehow less valid. Instead, we need to work together and be mindful of these pitfalls, so that we can measure safely and not let one group’s good intentions pave the road to, well, you know.

Now let’s start our descent to the 12 Circles of Healthcare Data Hell.

Circle 1: Blind Judgement

Circle 2: Hallucinations

Circle 3: Perversion

Circle 4: Human Targets

Circle 5: Submit to the Central Data Committee

Circle 6: Drowning by Data

Circle 1: Blind Judgement

If a picture contains a thousand words, then you could say a graph of healthcare data might contain a thousand explanations.
In this first circle, when it comes to the data, judgements swirl, stories are spun, and the fundamental attribution error is in full effect (where we discount contextual factors and blame the individual for the results we see). This is a fitting first circle, right below the surface, since it’s so easy for our human brains to slip to this level. Remember, when that truck on the freeway cuts you off, making you miss your exit, there are many potential explanations for the data you are seeing, and only a subset includes the one you can feel so certain of in your mind: “that guy is an @$!%#!!” (from firsthand experience, I’d suggest a cooldown period before reminding your road-raging partner of this). Similarly with healthcare data, we can quickly jump to an explanation for what we are seeing, then double down on it with confirmation bias and fail to consider context and other possible explanations. I asked GPT-4 to help me demonstrate, with eerily plausible output:

COLE: A graph of data from a primary care clinic shows that the proportion of the patient panel with diabetes and an A1c test result within target range has been decreasing over the few years it has been measured. From the perspective of the different individuals in the list below, give a few examples of possible explanations they might have for seeing this. Use direct quotes from these individuals: program funders, diabetic patients, patient family members, physicians, front desk staff, nurses, and an online troll looking at the graph

GPT-4:

Program Funders

  1. “It appears there’s a decline in the quality of care provided, which is concerning considering the resources we’ve allocated. Perhaps the clinic isn’t utilizing funds effectively.”
  2. “This graph shows that our investments in diabetes management aren’t yielding the expected results. Maybe it’s time to reevaluate our funding strategies and potentially invest in alternative approaches.”

Diabetic Patients

  1. “I’ve been coming here for years, and it feels like the doctors are always rushed now. Maybe that’s why fewer people are reaching their A1c targets.”
  2. “They started charging for parking recently. It might not sound like much, but it’s another hurdle to coming in for regular checks.”

Patient Family Members

  1. “My mom has to wait forever to get an appointment. By the time she’s seen, her blood sugar’s all over the place.”
  2. “They hardly ever follow up with us about test results. We have to chase them down, and by that time, it’s too late to adjust medications properly.”

Physicians

  1. “We’ve had an increase in new patients with already advanced diabetes; it skews the overall control numbers down.”
  2. “With the increasing administrative demands, we’ve had less time to focus on individualized patient care, affecting our performance metrics.”

Front Desk Staff

  1. “People have been missing their appointments more frequently. Maybe they’re going somewhere else, or they’re just not as committed to their health as before.”
  2. “We’ve had issues with the appointment scheduling software, causing double bookings and delays. That probably affects patient care.”

Nurses

  1. “Our staffing levels haven’t kept up with patient numbers, so we’re spread pretty thin. That could be affecting patient outcomes.”
  2. “We switched to a new brand of A1c test kits recently; maybe they’re not as accurate.”

Online Troll Looking at Graph

  1. “Ha, looks like this place is going down the drain. Good luck getting any decent care here!”
  2. “These guys can’t even keep their diabetics healthy. Must be a money-making scheme rather than healthcare.”

The scary part happens when these explanations turn into judgements that prompt decisions. Funding could be cut, patients could switch clinics, and real improvements in care could be cast aside simply because they are unseen in the data as presented. To avoid this, we must look for context, and think about how to test alternative hypotheses.

This is where the CDSC comes in. As a group including frontline experienced physicians, we have seen how lack of context can have disastrous effects, and so a core value of the group has been that we promote “context dependence”. In other words, we take steps to protect the data from being inappropriately judged. This includes a repeated message that our measures are best used as part of a suite, and that clinicians who know the clinical context agree that this suite of measures can be a useful tool for measuring improvement. We debate how widely available the measures should be and worry about the risks of certain audiences reviewing the results without proper context. We promote an exploratory approach, where the measures are initially reviewed with the question “what could be the possible explanations for what I am seeing, and how can I test these hypotheses?”

This team is a unique arena for physicians to develop nuanced understandings of the challenges of metrics selection & data stewardship in healthcare. ~Dr. Lawrence Yang, CDSC member

CDSC exaltations

  • Remember that context matters – involve people close to the problem
  • Test alternative hypotheses for what you are seeing
  • Don’t overfocus on a single measure, use a suite
  • Be careful about presenting data to audiences less familiar with the context

Return to list of 12 Circles

Circle 2: Hallucinations

Seeing a rhino in the clouds, a looming set of eyes on the floor tiling, or Jesus Christ himself on a grilled cheese sandwich? Now down to our next circle, where, much like the first, our brains are up to no good. This time though, starved for causal explanations for what we are seeing, we identify patterns amongst randomness. And again, the real problem happens when we use these to move to judgement and decision. How common is it for us to overcall the variation we see in the data? Was that 2-pound weight loss really because of the new diet you started, or is that change just part of the normal day-to-day fluctuations you experience? Or a throwback to 2020-2021: the news is telling me COVID-related deaths are up today vs. yesterday, and they are implying things must be getting worse. Well, not necessarily, because day-to-day counts occur within the complex adaptive system of healthcare, where we expect variation. To respond to this, the Institute for Healthcare Improvement (IHI) started using control chart methodology to detect meaningful change in the numbers, but this remains largely outside the mainstream approach to data [2].

In medicine, we are well-versed in research methodologies in which we compare our data before and after an intervention to detect statistical significance. But this before-and-after approach can lead us astray. I may attribute my 10% decrease in No Show rate this week to the appointment reminder emails we sent, while missing the fact that the No Show rate normally varies week to week by much more than this. I’ve hallucinated the effectiveness of what I did.
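
To make that concrete, here is a minimal sketch of the kind of check a control chart makes possible, using an XmR (“individuals”) chart, one common flavour in QI work. Every number below is made up for illustration, and this is not the CDSC’s or IHI’s actual tooling:

```python
# Hedged illustration: can a 10% drop in this week's no-show rate be called a
# real change, or is it within normal week-to-week (common cause) variation?
from statistics import mean

# Made-up weekly no-show rates (%) from before the reminder-email change
baseline = [18.0, 14.5, 21.0, 16.0, 19.5, 13.0, 20.0, 17.5, 15.0, 18.5]
latest = 16.7  # this week, after the change: roughly 10% lower than last week

# XmR chart limits: centre line = mean of the baseline points;
# limits = mean +/- 2.66 * average moving range (standard individuals-chart constant)
moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
centre = mean(baseline)
spread = 2.66 * mean(moving_ranges)
lower, upper = centre - spread, centre + spread

print(f"centre {centre:.1f}%, limits {lower:.1f}% to {upper:.1f}%")
if latest < lower:
    print("Below the lower limit: plausible special cause, worth investigating further.")
else:
    print("Inside the limits: likely common cause variation, not yet evidence the emails worked.")
```

With these invented numbers, the new week sits comfortably inside the limits, which is exactly the situation where a simple before-and-after comparison would have me hallucinating an effect.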

Again, the CDSC helps us out. With the guidance of our QI-trained physicians, we are careful to not overcall the changes we see in the data, and we value looking at data over time. We use the same question when looking at changes in the data: “what could be the possible explanations for what I am seeing, and how can I test these hypotheses?”. Was this just normal “common cause” variation over time, or maybe it’s because the part of my system I am measuring is not yet stable enough to detect meaningful changes?

CDSC exaltations

  • Enlist support of people trained in QI methodology and understanding of variation (common vs. special cause)
  • Look at data over time (e.g. on a run chart) and avoid limiting to before-and-after

Return to list of 12 Circles

Circle 3: Perversion

Our next ring down is rife with strange behaviours. Elementary school teachers are furiously erasing wrong answers from their pupils’ standardized tests, inflating scores for their school. Combat athletes are making use of the hellish temperature, sweating buckets in their garbage bag outfits, ensuring they hit their weight class. Wannabe social media influencers are being over-the-top, outrageous, and even self-harming, all in the name of getting more likes and follows. Social media companies hack human cognition to keep their users mindlessly scrolling, in the name of more “engagement” (read: ad revenue). Local residents in an area where cobras are real pests have actually started breeding them, so they can collect more of the bounty money offered by the colonial government. What’s going on here? It’s measurement gone wrong again, as we’re assigning some value judgement to our numbers, which can unleash human creativity (less charitably, some truly perverse behaviour). This has sometimes been called The Cobra Effect, named after the example above, which actually happened under British colonial rule in India [3].

Before we start measuring something, we usually have an underlying purpose in mind. The measure is a means to an end, but too often it can become an end in itself, where we forget about the underlying purpose. We are reduced to a puppet controlled by the all-powerful measure. Healthcare is rife with examples of where measures have led to perverse behaviour. Measuring and rewarding primary care with fee-for-service leads to quicker, lower-quality visits, as we assign value to volume. In acute care, incentivizing reduced length of stay can lead to increased readmissions and excess pressure to discharge before it’s safe. Paying for performance on quality metrics like A1c control in diabetes can overfocus generalists’ efforts on this problem, leaving too little time to manage problems that are less easily measured but more important to the patient in front of them. We focus on the problems we can easily see, and measuring something usually brings that problem into the spotlight. The low-priority, easily measurable problem supplants the more important yet harder-to-measure one (similar to the idea in Monkeys and Pedestals below).

CDSC guidance on our use of measures can again help us out, preventing some of this perversion. We view our measures as tools to help those closest to the problem (including patient voices) do QI to improve, rather than as something primarily used to assign value to our work and get us extrinsic rewards (e.g. praise, funding). In QI, we stay focused on the underlying problem and purpose of the work and use balancing measures for early detection of perverse behaviours. Discussion of outcomes needs to go beyond whether we hit our numeric goals (see Human Targets below) or got some extrinsic rewards from our QI work. Instead, we focus on team learnings and on impacts (quantitative AND qualitative) on quality of care for our patients.

CDSC exaltations

  • Use QI methodology and a suite of measures
  • Prioritize QI efforts with high potential impacts, instead of those easiest to measure
  • Our measures are primarily tools for improvement work, and not meant to be the goals themselves
  • Consider how you could “game the system” to get your outcome measures to improve without actually fixing the underlying problem, then add balancing measures to detect inklings of this
  • Beware of extrinsic motivating factors and separate out performance evaluation from QI (see The Wrong Toolkit below)

Return to list of 12 Circles

Circle 4: Human Targets

Next we descend to a circle that’s somewhat of a subset of Perversion above, where we set numeric goals or “targets”. The classic QI example of weight loss fits here, so let’s call up some imagery of a weight loss class from hell: a group of 10 individuals of various genders, ages, shapes, and sizes with elevated BMI. This group of motivated people believes in the power of social accountability, and so they have come together to collectively lose weight. In discussion, most agree the underlying purpose of their efforts is to benefit their long-term health, but this is difficult to measure, so they choose the group’s total pounds lost as their surrogate outcome and want to lose 100 lbs by 6 months from now. Not only that, but the group’s coach will also get a monetary bonus if they reach their goal. Cue the human ingenuity. The coach decides to focus on the most overweight individuals to get the most efficient losses, and others are left with no attention paid. The group chooses extreme calorie restriction diets, and avoids strength training, since they worry that putting on muscle mass will get in the way of their goal. A participant is shunned after coming back a few pounds heavier from an Alaska cruise, and decides to stop attending. Just 4 months in, it’s already celebration time: they’ve hit their goal! The coach gets his bonus, and the group feels the pressure is off. Soon after though, people stop coming back. In general, their diets and losses aren’t sustained. Some individuals have lost muscle mass, and one falls and breaks a hip. In sum, the numeric goal was reached, but we got almost the exact opposite of what we aimed to improve.

This rather extreme example is used to illustrate that reaching the numeric goal should not be the only thing we consider, and that setting these goals at all can be fraught. Though we may see real improvements, we can also perversely incentivize behaviour we don’t want. This can include distortions to processes that aren’t real improvements (e.g. losing weight by starving), distortions to the measurement (e.g. changing the time of day of the weight check), and some nasty human tendencies to shame and blame. In addition, we can leave some improvements unrealized if the goal is hit early, or we can lose motivation and start to feel hopeless if it looks like we might not reach our goal. Instead of continuously improving, we can also see “sandbagging”, where we intentionally lower expectations and coast, allowing us to set lower targets and achieve “better than expected” results.

It’s interesting to me that one of the founders of QI thinking, W. Edwards Deming, was not a fan of numeric goals.

…if management sets quantitative targets and makes people’s job depend on meeting them, “they will likely meet the targets – even if they have to destroy the enterprise to do it.”

W. Edwards Deming, quoted in Profits Beyond Measure by H. Thomas Johnson (in the foreword to the book)

We see here a phenomenon sometimes known as Goodhart’s Law, originally noted in an economics context: when a measure becomes a target, it ceases to be a good measure. It’s difficult to square this in my mind given how central our SMART (specific, measurable, achievable, realistic, time-bound) aims are to the Model for Improvement at the centre of QI methodology. I’ve now come to a point where I still believe SMART aims are useful, but we need to be more mindful of their pitfalls and not forget the underlying purpose of our improvement efforts. And we need to remember that a QI project that does not meet its numeric goal should not be called a failure, as doing so neglects the learnings and improvements that may have been made.

CDSC exaltations

  • Remember that hitting a numeric target is not the primary purpose of your efforts
  • Beware of targets tied to extrinsic rewards, as they can especially incentivize perverse behaviour
  • Value continuous learning and improvement over hitting numeric targets

Return to list of 12 Circles

Circle 5: Submit to the Central Data Committee

Descending to our next ring, we can conjure some Soviet-era central planning. Here, it’s assumed that everything needs to happen at the centre, with direction then spreading by decree to the masses. Data flows one way, from the frontlines to the higher-ups, from the “do-ers” to the “knowers”, who then use their “advanced analytics”, “data linkage”, “data science”, and other fancy methods beyond the reach of the lowly frontline worker. The result? Useful data is locked away in a vault and few people have the keys. The frontline worker is reduced to a cog in the machine, getting a paycheque to follow directions reliably set by the higher-ups. Now, that’s not to say this can’t work up to a point, as it did during the industrial revolution and with the teachings of Frederick Winslow Taylor. But by treating people like cogs, we destroy the morale of the workforce and waste the number one resource we have: brainpower capable of learning and adapting rapidly at the frontlines (see my article Cogwheel Counterfactual). The system becomes slow to learn, and often strips data of meaningful clinical context, resulting in decrees coming down too late and not fitting the current need.

Yes, data centralization has its benefits and I fully support it for certain use cases. I am not saying that setting this up is a bad idea. The error occurs when we become monomaniacal, putting all our eggs in the proverbial central basket. With that approach, we expose ourselves to some unneeded risks, and lose out on opportunities to improve our system and engage our frontline workers. That is to say, we should not sit and wait for the ultimate central data repository to solve all of our problems, but instead should be pushing data-informed decision-making capability to the periphery.

To reiterate, it is not that creating central data repositories is inherently bad. Instead, we need to be realistic about the risks associated with this approach, and realize that we can accomplish much more, and faster, if we include complementary approaches. For example, as part of the initial phases of setting up the Measurement System for Physician Quality Improvement (MSPQI), a group of family physicians came up with an initial set of measures. The vast majority of these, it turns out, did not need any central repository or data linkage, as the data already exists in the primary care EMR. HDC has worked on making some of these available. With this in mind, we can promote uptake of these measures as tools for improvement NOW, and not miss the opportunity to get to the hard work of improving care while we wait for the perfect central repository.

CDSC exaltations

  • We can accomplish system improvements NOW without needing to wait for a central data repository with linkage
  • We should ensure measures are available and timely for frontline users to do improvement work
  • We can reduce privacy risks by focusing on de-identified aggregate data in many use cases (see the sketch below)
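
Here is a minimal sketch of what that last point can look like in practice: patient-level rows stay in the clinic, and only counts and proportions are shared, with small cells suppressed. The field names, example rows, and cutoff of 5 are assumptions for illustration only, not HDC’s actual specification:

```python
# Hedged illustration: aggregate a (fictional) patient-level extract into
# de-identified, clinic-level counts, suppressing small cells.
from collections import Counter

patients = [  # made-up rows standing in for a local EMR extract
    {"clinic": "A", "diabetes": True,  "a1c_in_range": True},
    {"clinic": "A", "diabetes": True,  "a1c_in_range": False},
    {"clinic": "A", "diabetes": True,  "a1c_in_range": True},
    {"clinic": "A", "diabetes": True,  "a1c_in_range": True},
    {"clinic": "A", "diabetes": True,  "a1c_in_range": False},
    {"clinic": "A", "diabetes": False, "a1c_in_range": None},
    {"clinic": "B", "diabetes": True,  "a1c_in_range": True},
]

def aggregate(rows, min_cell=5):
    """Return only clinic-level counts/proportions; suppress cells below min_cell."""
    denom = Counter(r["clinic"] for r in rows if r["diabetes"])
    numer = Counter(r["clinic"] for r in rows if r["diabetes"] and r["a1c_in_range"])
    report = {}
    for clinic, n in denom.items():
        if n < min_cell:
            report[clinic] = f"suppressed (fewer than {min_cell} patients)"
        else:
            report[clinic] = {"patients_with_diabetes": n,
                              "a1c_in_range": numer[clinic],
                              "proportion": round(numer[clinic] / n, 2)}
    return report

print(aggregate(patients))
# Clinic A reports 3/5 in range; clinic B is suppressed rather than exposing a single identifiable patient.
```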

Return to list of 12 Circles

Circle 6: Drowning by Data

A descent to our next level reveals a vast and treacherous sea where too many have gone overboard. It’s never been easier to access vast amounts of data, with the ability to draw on new sources and link sets together. Now, with the advent of widespread generative AI, the amount of information available to us is expanding faster than ever before. Here we risk heeding the siren song of newer, fancier data sources, analysis techniques, and systems, and can start to think that these will be the solution to our problems. But too often, we may be unable to separate the truly useful from distraction and noise, and end up lost at sea. I often get the feeling I’m in something akin to this circle of hell when I pick up my phone, presumably wanting to do something useful. Fast forward 15 minutes and I have flitted between various messaging apps, scrolled a bit, read a few brief anxiety-provoking headlines, and promptly forgotten why I picked up the phone. It was all interesting, surely bathing my brain with some pleasing dopamine as I scrolled, but it fully distracted me from what I was otherwise trying to do.

In healthcare, a similar phenomenon can happen. EMR queries can be run, administrative data sets can be linked, visualization tools make it easier than ever to display vast data sets, and entire departments are dedicated to Decision Support. In short, it’s now relatively easy to gather and display more data, and we are flooded by it as a result. I’ve seen this in clinical operations reports in the past, where monthly graphs of many process measures are prepared, but when asked “what decisions do we make with this data?” there isn’t any clear answer. A little like ordering a lab test when we know the result won’t change our management, we end up with data and measures looking for a problem.

When faced with a hard problem, we often steer towards an easier problem nearby and solve that instead (stay tuned for a future circle where we are struggling to get our troop of monkeys to dance on their pedestals).

Easier: get more data and create more measures

Harder: use these measures to inform decisions and detect improvements

It’s relatively easy for us to brainstorm more and more measures, put them in reports, and give them to our frontline providers. I start to wonder, though: are we doing this at the expense of focusing on the hard problem, which in this case could be getting embedded QI into our daily frontline work? Though I agree that “limited access to data” can be a barrier for some QI projects, has it become an excuse to drag our heels on getting started? After all, in QI we don’t need “the perfect measure”, and I believe it’s usually better to get started with the understanding, learning, and improving sooner rather than later.

So how do we decide which data to look at and which to ignore? Well, instead of starting with the data set itself and unstructured exploration, it might be more useful to first think about what clinical problems are most pressing, and where our focused efforts may have the most impact. In short, starting with “what’s important to our clinical care?” and then seeing if there are useful data sources relevant to these topics may be the better approach: a sort of Quality Planning process. Here we prevent ourselves from being led astray by data that is easy to see, instead focusing on the problems most important to us.

At CDSC, we have debated how much guidance we should give HDC Discover users when they are considering using our measures to do QI. Some advocate for a more “paint by numbers” approach, where the topic is pre-selected, while others worry that we may be distracting users from working on what is most important for them. The compromise may be that our included measure set has been informed by practicing primary care clinicians on the CDSC, and that we attempt to prioritize the addition of measures relevant to problems common to primary care. We have started selecting some common problems to develop more structured QI guidance, while also advocating that users choose problems that are important and potentially impactful for their practice. When learning QI, the most impactful thing to do may be to first choose a small-scale project with this structured guidance.

Aside from this, we’ve recently been paying greater attention to the value of measures, focusing more on building out those tied to real QI projects (e.g. UBC CPD Mental Health modules, planetary health, heart failure QI Collaborative, opioid use disorder measures). Here we are trying to get closer to solving the hard problem of having our measures used in embedded QI, instead of solely making more measures available.

CDSC exaltations

  • Consider starting with a discussion on what important problems could require focus in your clinical context, then only after this see what data is available
  • Beware of analysis paralysis, getting lost in the vast amount of data available
  • If you are new to QI, starting with a structured small-scale QI project with some measurable clinical topic may be the most impactful given the learning potential (consider PSP support)
  • HDC is especially interested in developing measures that will be actively used in QI work

Return to list of 12 Circles

Curious to know what the next Circle holds? Check back in the new year as we continue our descent.

Footnotes:

1. Sincere apologies to my friends and co-workers who are Drag Race superfans, as it took some online digging for me to figure out it was BenDeLaCreme’s Inferno A-Go-Go Show.
2. https://www.ihi.org/Topics/COVID-19/Pages/COVID-19-Data-Dashboard.aspx
3. Bad Data: Why We Measure the Wrong Things and Often Miss the Metrics that Matter by Peter Schryvers (Prometheus 2020)