I’ve been thinking a lot about “big data” and how it is going to affect the practice of medicine. It’s not really my area of expertise, but here are a few thoughts on the tricky intersection of data mining and medicine.
First, some background: these days it’s rare to find companies that don’t use data mining and predictive models to make business decisions. For example, financial firms regularly use analytic models to figure out whether an applicant for credit will default; health insurance firms predict downstream medical utilization based on historic healthcare visits; and the IRS spots tax fraud by looking for suspicious patterns in tax returns. The predictive analytics vendors are seeing an explosion of growth: Forbes recently noted that big data hardware, software, and services will grow at a compound annual growth rate of 30% through 2018.
Big data isn’t rocket surgery. The key to each of these models is pattern recognition: correlating a particular variable with another and linking variables to a future result. More and better data typically leads to better predictions.
It seems that the unstated, implicit belief in the world of big data is that as you add more variables and get deeper into the weeds, interpretation improves and predictions become more accurate.
This is a belief system, or what some authors have termed the religion of big data. Gil Press wrote a wonderful critique on “The Government-Academia Complex and the Big Data Religion” in Forbes:
Bigger is better and data possesses unreasonable effectiveness. The more data you have, the more unexpected insights will rise from it, and the more previously unseen patterns will emerge. This is the religion of big data. As a believer, you see ethics and laws in a different light than the non-believers. You also believe that you are part of a new scientific movement which does away with annoying things such as making hypotheses and the assumptions behind traditional statistical techniques. No need to ask questions, just collect lots of data and let it speak.
I’m hesitant to say this (since doctors are always convinced that medicine is somehow an exceptional industry), but I’m not convinced that more and better computer models will necessarily lead to better diagnoses or an improved day-to-day practice of medicine.
That’s not to say that big data won’t revolutionize healthcare. I’m not referring to things like the personalization of genomic medicine, where data analysis will be essential. Or to computerized clinical aids such as Isabel, which has cracked all of the complex cases that Dr. Lisa Sanders published in her Diagnosis column in the New York Times, beating many physicians.
But, there are many things that data will never do well. For certain things, physician heuristics may lead to better decisions than any predictive model.
Heuristics are shortcuts, based on experience and training, that allow doctors to solve problems quickly. They are pattern maps that physicians are trained to recognize. But heuristics have a reputation for leading to imperfect answers: Wikipedia notes that heuristics lead to solutions that “(are) not guaranteed to be optimal, but good enough for a given set of goals…. (they) ease the cognitive load of making a decision.” Humans use them because we simply can’t process information in sequential binary fashion the way computers do.
It would be a mistake to call heuristics a sad substitute for big data. Some cognitive scientists have made the argument, and I think they’re right, that heuristics aren’t simply a shortcut for coming to good-enough answers. For the right kinds of problems, heuristically generated answers are often better than those generated by computers.
How can this be?
I often think of the following cartoon in Randall Munroe’s superb recent book, What If? Serious Scientific Answers to Absurd Hypothetical Questions. In trying to compare human and computer thinking, he rightly notes that each excels at different things. In this cartoon, for example, humans can quickly work out what probably happened: most people can tell you that the kid knocked over the vase and the cat is checking it out, without going through millions of alternate scenarios. Munroe notes that most computers would struggle to come to the same conclusion as quickly.
So, from the perspective of an emergency doctor, here are the three leading problems with the applied use of complex analytics in the clinical setting:
- 1. The garbage in, garbage out problem. In short, humans regularly obfuscate their medical stories and misattribute causality. You need humans to guide the patient narrative and ignore red herrings.
- 2. If we want to be able to diagnose, screen and manage an ER full of runny-nosed kids with fevers, we simply can’t afford the time it takes for computers to sequentially process millions of data points. The challenge is at once simple and nuanced: allowing 99% of uncomplicated colds to go home while catching the one case of meningitis. It’s not something that a computer does well. It’s a question of balancing sensitivity (finding all true cases of meningitis among a sea of colds) and specificity (excluding meningitis correctly), and doctors seem to do better than computers when hundreds of cases need to be seen a day.
- 3. There is a problem with excess information, where too much data actually obscures the answer you’re looking for. Statisticians call this “overfitting” the data. What they mean is that as you add more and more data points to an equation or regression model, the variability of random error around each point gets factored in as well, creating “noise”. The more variables, the more noise.
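To put rough numbers on the triage trade-off in point 2, here is a back-of-the-envelope sketch. All the figures are hypothetical (the prevalence, sensitivity, and specificity are made up for illustration), but they show why the balance is so hard: when the disease is rare, even a very good rule flags far more healthy kids than sick ones.

```python
# Toy numbers, chosen only for illustration: out of 1,000 febrile
# children, assume exactly 1 has meningitis and the rest have colds.
n_children = 1000
n_meningitis = 1
n_colds = n_children - n_meningitis

# A hypothetical screening rule that never misses meningitis but
# incorrectly flags 5% of ordinary colds.
sensitivity = 1.00   # fraction of true meningitis cases caught
specificity = 0.95   # fraction of colds correctly sent home

true_pos = sensitivity * n_meningitis        # the 1 real case, caught
false_pos = (1 - specificity) * n_colds      # ~50 colds flagged anyway

# Positive predictive value: the chance a flagged child is truly sick.
ppv = true_pos / (true_pos + false_pos)
print(round(false_pos), round(ppv, 3))       # ~50 false alarms; PPV about 2%
```

Under these assumed numbers, roughly fifty children get worked up for every true case, which is the cost of not missing the one meningitis in the sea of colds.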
The paradox is that ignoring information often leads to simpler and ultimately better decisions.
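A minimal sketch of the overfitting paradox, using made-up data: the true relationship below is y = 2x plus measurement noise. A plain least-squares line, which ignores the wiggles, recovers the trend; a degree-4 polynomial flexible enough to pass through every noisy point exactly (zero training error) predicts a new point worse.

```python
# Training data: true relationship y = 2x, with alternating noise of +/-0.5.
train_x = [0, 1, 2, 3, 4]
train_y = [0.5, 1.5, 4.5, 5.5, 8.5]

# Simple least-squares line: two parameters, averages out the noise.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) \
        / sum((x - mean_x) ** 2 for x in train_x)
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

# Lagrange interpolation: a degree-4 curve that "explains" every
# training point exactly -- noise included.
def interpolate(x):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        basis = 1.0
        for j, xj in enumerate(train_x):
            if i != j:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

# Held-out point: the true value at x = 3.5 is y = 7.
x_new, y_true = 3.5, 7.0
print(abs(line(x_new) - y_true))         # small error for the simple model
print(abs(interpolate(x_new) - y_true))  # larger error, despite a perfect training fit
```

The flexible model has memorized the noise; the simple one, by discarding information, gets closer to the truth.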
Here’s a great example from the Journal of Family Practice that I found in a superb review article from a group at the Center for Adaptive Behavior and Cognition at the Max Planck Institute for Human Development in Germany.
In 1997, Drs. Green and Mehr, at the family practice service of a rural hospital in Michigan, tried to introduce a complex algorithm (the Heart Disease Predictive Instrument, HDPI) to residents deciding whether to admit a patient to the cardiac care unit or the regular hospital floor. While this expert system, which relied on residents entering probabilities and variables into a calculator, did lead to better allocation decisions than before the tool was introduced, the physicians found it cumbersome.
Drs. Green and Mehr went on to develop a simple tree based on only three yes-or-no questions: Did the patient have ST changes? Did he have chest pain? Were any of five associated EKG changes present?
The simple heuristic led to far better medical decisions: more patients were appropriately assigned to the coronary unit even though the heuristic used a fraction of the available information. Here is a chart showing the outcomes from before the issue was examined, from when the HDPI was used, and from when the simple heuristic was introduced. The heuristic performed better on sensitivity and false positive (1 minus specificity) rates than either the probability algorithm or unaided decisions.
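The three-question rule can be sketched as a fast-and-frugal decision tree. This is my simplified reading of the structure described above, not the exact published instrument, and the patients below are entirely hypothetical:

```python
def assign_ccu(patient):
    """Return True if the patient should go to the coronary care unit.

    A sketch in the spirit of the Green & Mehr rule: ST changes send the
    patient straight to the CCU; otherwise chest pain as the chief
    complaint plus any of five further cues is required.
    """
    if patient["st_changes"]:
        return True
    if not patient["chest_pain_chief_complaint"]:
        return False
    return patient["other_cues"] > 0  # any of the five associated cues present

# Hypothetical patients; "infarct" marks the true outcome.
patients = [
    {"st_changes": True,  "chest_pain_chief_complaint": True,  "other_cues": 2, "infarct": True},
    {"st_changes": False, "chest_pain_chief_complaint": True,  "other_cues": 1, "infarct": True},
    {"st_changes": False, "chest_pain_chief_complaint": False, "other_cues": 0, "infarct": False},
    {"st_changes": False, "chest_pain_chief_complaint": True,  "other_cues": 0, "infarct": False},
]

true_pos = sum(assign_ccu(p) and p["infarct"] for p in patients)
false_pos = sum(assign_ccu(p) and not p["infarct"] for p in patients)
total_pos = sum(p["infarct"] for p in patients)
total_neg = sum(not p["infarct"] for p in patients)

sensitivity = true_pos / total_pos       # infarcts correctly sent to the CCU
false_pos_rate = false_pos / total_neg   # non-infarcts sent there anyway
print(sensitivity, false_pos_rate)
```

Three questions, no probabilities, no calculator: exactly the kind of rule a resident can apply at the bedside in seconds.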
I don’t know how improved big data tools would fare today. It may be that the HDPI wasn’t as advanced as the predictive algorithms in use today. But it may also be that simple tools, intuition, and experience led to better and more timely decisions than any computer could. These physician heuristics represent the “adaptive unconscious” that Malcolm Gladwell, in his excellent book Blink, argues often leads to surprisingly good and rapid decisions.
The challenge, going forward, will be to benefit from big data without becoming a slave to it. The implicit promise of better clinical pictures through more and more pixels may simply be false.
Images: Idaho National Laboratory, Flickr via cc. Cartoon from Randall Munroe, What If? Serious Scientific Answers to Absurd Hypothetical Questions