Machine learning (ML) programs computers to learn the way we do – through the continual assessment of data and identification of patterns based on past outcomes. ML can quickly pick out trends in big datasets, operate with little to no human interaction and improve its predictions over time. Due to these abilities, it is rapidly finding its way into medical research.
People with breast cancer may soon be diagnosed through ML faster than through a biopsy. Those suffering from depression might be able to predict mood changes through smartphone recordings of daily activities, such as the time they wake up and the amount of time they spend exercising. ML may also help paralyzed people regain autonomy using prosthetics controlled by patterns identified in brain scan data. ML research promises these and many other possibilities to help people lead healthier lives.
But while the number of ML studies grows, the actual use of ML in doctors’ offices has not expanded much past simple functions such as converting voice to text for note-taking.
The limitations lie in medical research’s small sample sizes and unique datasets. Such small datasets make it hard for machines to identify meaningful patterns. The more data available, the more accurate ML diagnoses and predictions become. Many diagnostic uses would require thousands of subjects, but most studies include only dozens.
But there are ways to find significant results from small datasets if you know how to manipulate the numbers. Running statistical tests over and over again on different subsets of your data can produce an apparently significant result that in reality reflects nothing more than random outliers.
This tactic, known as p-hacking, or feature hacking in ML, leads to the creation of predictive models that are too limited to be useful in the real world. What looks good on paper doesn’t translate to a doctor’s ability to diagnose or treat us.
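The effect of repeated subset testing can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a reconstruction of any study: both groups are drawn from the same distribution (so there is no real effect), the standard deviation is assumed known so a simple z-test applies, and the group size, subset size, and number of re-analyses are hypothetical values chosen for the example.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def z_test_p(a, b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming known sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def study_finds_effect(rng, n_subsets, n_per_group=20, subset_size=12, alpha=0.05):
    """Simulate one study with NO real effect; True if any analysis has p < alpha."""
    patients = [rng.gauss(0, 1) for _ in range(n_per_group)]
    controls = [rng.gauss(0, 1) for _ in range(n_per_group)]
    if n_subsets == 1:
        # Honest analysis: one pre-planned test on all of the data.
        return z_test_p(patients, controls) < alpha
    # Hacked analysis: keep re-testing random subsets until one "works".
    for _ in range(n_subsets):
        a = rng.sample(patients, subset_size)
        b = rng.sample(controls, subset_size)
        if z_test_p(a, b) < alpha:
            return True
    return False

def false_positive_rate(n_subsets, n_studies=2000, seed=0):
    """Fraction of simulated null studies that report a 'significant' result."""
    rng = random.Random(seed)
    hits = sum(study_finds_effect(rng, n_subsets) for _ in range(n_studies))
    return hits / n_studies

honest = false_positive_rate(n_subsets=1)
hacked = false_positive_rate(n_subsets=50)
print(f"one planned test:   {honest:.1%} of null studies look significant")
print(f"best of 50 subsets: {hacked:.1%} of null studies look significant")
```

With a single planned test, the false-positive rate should sit near the nominal 5 percent. When the best of many subset analyses is reported, a far larger share of studies with no real effect appear significant.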
These statistical mistakes, often made unknowingly, can lead to dangerous conclusions.
To help scientists avoid these mistakes and push ML applications forward, Konrad Kording, Nathan Francis Mossell University Professor with appointments in the Departments of Bioengineering and Computer and Information Science in Penn Engineering and the Department of Neuroscience at Penn’s Perelman School of Medicine, is leading an aspect of a large, NIH-funded program known as CENTER – Creating an Educational Nexus for Training in Experimental Rigor. Kording will lead Penn’s cohort by creating the Community for Rigor, which will provide open-access resources on conducting sound science. Members of this inclusive scientific community will be able to engage with ML simulations and discussion-based courses.
“The reason for the lack of ML in real-world scenarios is due to statistical misuse rather than the limitations of the tool itself,” says Kording. “If a study publishes a claim that seems too good to be true, it usually is, and many times we can track that back to their use of statistics.”
Such studies, when they make their way into peer-reviewed journals, contribute to misinformation and mistrust in science, and they are more common than one might expect.
A recent publication grabbed Kording’s attention. The study, which used machine learning on data from MRI scans of the brain, claimed to have created a model that could detect suicidal ideation with an accuracy of 91 percent – a model that would surely transform certain diagnostic procedures.
But upon repeating the data analysis of this study, Kording and colleague Tim Verstynen, Associate Professor of Psychology at the Neuroscience Institute at Carnegie Mellon University, found many instances of feature hacking that would have led the researchers to cherry-pick data points to create a highly specific predictive model.
“With only 34 patients, their study started off with too small of a sample size to result in sound science,” says Kording. “The data they used were a combination of words relating to mortality and corresponding regions of the brain that lit up in MRI scans. Instead of using all of the data from each patient, they chose specific words and regions.”
Those choices led to the creation of a very high-performing model when used and tested against that specific dataset. But, if used to predict suicidal ideation in real patients, it would not be accurate. Research based on the manipulated data points of 34 people won’t serve mental health practitioners looking for diagnostic tools. After Kording’s re-analysis, the study was retracted from the journal.
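Why hand-picked features can look so good on a small dataset can be sketched in code. The example below is a hypothetical illustration, not a reconstruction of the retracted study's data or method: it assumes 34 subjects, 1,000 features of pure noise, labels unrelated to those features, and a simple nearest-centroid classifier scored with leave-one-out testing. All function names and parameter values are invented for the example. The only difference between the two runs is whether the "best" features are chosen with every subject's label visible or chosen again inside each test fold.

```python
import random

def top_features(X, y, k):
    """Indices of the k features whose class means differ the most."""
    def gap(j):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def predict(X_train, y_train, x, features):
    """Nearest-centroid classifier restricted to the chosen features."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in features]
    def dist(center):
        return sum((x[j] - c) ** 2 for j, c in zip(features, center))
    return 1 if dist(centroid(1)) < dist(centroid(0)) else 0

def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy, with or without leaking labels into selection."""
    leaked = top_features(X, y, k)  # chosen with every subject's label visible
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        if select_inside_fold:
            feats = top_features(X_train, y_train, k)  # held-out subject unseen
        else:
            feats = leaked
        correct += predict(X_train, y_train, X[i], feats) == y[i]
    return correct / len(X)

rng = random.Random(0)
n_subjects, n_features, k = 34, 1000, 10
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
y = [1] * 17 + [0] * 17  # labels carry no information about the noise features

leaky = loo_accuracy(X, y, k, select_inside_fold=False)
honest = loo_accuracy(X, y, k, select_inside_fold=True)
print(f"features picked on all data:     {leaky:.0%} accuracy")
print(f"features picked inside each fold: {honest:.0%} accuracy")
```

Because the features are noise, any accuracy well above chance in the first run comes from the selection step alone. Repeating the selection inside each fold removes that advantage, which is why the second number is the one that indicates how a model would fare on new patients.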
To make meaningful advancements in the field of ML in biomedical research, it will be necessary to raise awareness of these issues, help researchers understand how to identify them and limit them, and create a stronger culture around scientific rigor in the research community.
Kording aims to communicate that although incorporating machine learning into biomedical research can introduce room for bias, scientists need not avoid it. They just need to understand how to use it in a meaningful way.
The Community for Rigor aims to address challenges of the field with specific plans to create a module on machine learning in biomedical research that will guide participants through datasets and statistical tests and pinpoint exact locations where bias is commonly introduced.
The Community is still in its infancy, but Kording and colleagues plan to publish resources as soon as the fall. One of the first ways to get involved in this effort is to follow The Community for Rigor on Twitter and join the conversation by anonymously sharing your own scientific rigor mistakes and challenges.
“While it would be extremely helpful to have easy and accurate ways of diagnosing and treating medical conditions, our own human bias can get in the way of what the data is actually saying or not saying,” warns Kording. “That’s what this community aims to improve.”