    Science Saturday: New standards and open access can help natural language processing

Clinical notes in medical records are rich sources of data about human health. But tapping them for medical research can be challenging because these data come from various sources—and they all look different.

"There's no standardization in how data is organized and classified across medical records systems," says Sunyang Fu, Ph.D., a Mayo Clinic biomedical informatics researcher.

Even the language people use to talk about health can introduce discrepancies in how data are recorded. "If a patient had a recent fall, they might say they 'went down,' 'flipped backwards,' 'tripped on a rug,' or 'hit the back of their head,'" says Dr. Fu. Scientists are learning how to make sense of such widely varied data using natural language processing, known as NLP. NLP is a discipline related to artificial intelligence (AI) that teaches computers to interpret human language. Scientists design NLP algorithms to transform disparate information, such as the many ways of describing a fall, into structured data in a standardized format that can be analyzed.
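To make the idea concrete, here is a minimal sketch of how varied wording might be mapped to a single structured record. It is an illustration only, not the method used in any study described here: the phrase list is taken from Dr. Fu's examples, while the function name `extract_fall_mentions`, the concept label `FALL` and the output fields are assumptions made for this example.

```python
import re

# Hypothetical lexicon: surface phrases that may indicate a fall.
# The phrases come from the examples quoted in the article; a real
# system would use a much larger, clinically validated lexicon.
FALL_PATTERNS = [
    r"\bwent down\b",
    r"\bflipped backwards?\b",
    r"\btripped on a rug\b",
    r"\bhit the back of (?:his|her|their) head\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in FALL_PATTERNS]


def extract_fall_mentions(note: str) -> dict:
    """Turn free text into one structured record with a fixed shape."""
    evidence = []
    for pattern in _COMPILED:
        for match in pattern.finditer(note):
            evidence.append(match.group(0).lower())
    return {
        "concept": "FALL",          # assumed normalized label
        "present": bool(evidence),  # True if any phrase matched
        "evidence": evidence,       # matched phrases, in lexicon order
    }


if __name__ == "__main__":
    note = "Patient tripped on a rug at home and hit the back of her head."
    print(extract_fall_mentions(note))
```

Every note, however it is worded, yields a record with the same three fields, which is what allows the results to be counted and compared. A phrase list this simple would also misfire (for example, "blood pressure went down"), which is one reason the lexicon and normalization steps a study uses need to be reported.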

Studies that use NLP have shown promise for benefiting patients, says Dr. Fu, but there’s a problem. When publishing their NLP research, scientists don’t always share all the "how to" instructions, sometimes because algorithms are protected as intellectual property. This makes it difficult for other scientists to validate or reproduce a study, and reproducibility is one of the hallmarks of good science.

A Troubling Trend

"The absence of ‘how to’ instructions is becoming a troubling trend," observes Dr. Fu.

Dr. Fu is the first author of a recently published review that found a wide range of inconsistent reporting practices in NLP-assisted observational research. In journal articles published from January 2009 to September 2021, the reviewers found that many studies did not report the methodology used to develop an algorithm, nor did they report the study’s evaluation design. In addition, more than half of the studies failed to report the type of dictionary, lexicon or other language model used, and nearly three-quarters did not report the techniques used to normalize data to improve data integrity and reduce redundancy.

Dr. Fu and his co-authors noticed that most studies used NLP to extract information on medical risk factors and outcomes, looking for statistical associations. "Invalid results in this type of research can lead to systemic bias, measurement error and misclassification, which negatively affects the validity of the research and any clinical guidelines that might be derived from it," says Dr. Fu.

Ultimately, they assert, these problems can damage the integrity of the research and erode public trust in science.

The 'Open' Path Forward

"There is a huge potential in the leveraging of NLP, AI, and real-world data to advance clinical research and transform health care," says Hongfang Liu, Ph.D., director of Biomedical Informatics in Mayo Clinic's Center for Clinical and Translational Science and senior author of the study. However, she warns, "The potential of this work to benefit patients heavily depends on the ability to ensure the research is scientifically rigorous, ethical and transparent."

The researchers advocate for the development and wide adoption of people-centric, value-added and evidence-based NLP standards. They also make recommendations for the field that focus on transparency and scientific rigor. For instance, to address reporting inconsistencies, one recommendation is for scientists to specify the type of language model and data normalization techniques used, and to provide references and access to any generic text processor or statistical models.
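As a rough illustration of what such reporting could look like in practice, the sketch below records the items named in this article and lists any that are left blank. It is not the researchers' checklist: the class `NLPReport`, its field names and the helper `missing_items` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class NLPReport:
    """Hypothetical record of the reporting items mentioned in the article."""
    language_model: str = ""          # type of dictionary, lexicon or model
    normalization: List[str] = field(default_factory=list)  # techniques used
    development_method: str = ""      # how the algorithm was developed
    evaluation_design: str = ""       # how the algorithm was evaluated
    resource_references: List[str] = field(default_factory=list)  # tools, models


def missing_items(report: NLPReport) -> List[str]:
    """Return the names of fields that were left empty."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


if __name__ == "__main__":
    report = NLPReport(
        language_model="rule-based lexicon",
        evaluation_design="comparison against manual chart review",
    )
    print(missing_items(report))
```

A check like this cannot judge the quality of what is reported; it only flags that an item is absent.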

The research team also encourages the development of open NLP communities and a team-based approach to science. To protect the integrity of NLP-assisted observational research, Dr. Liu says that more of it needs to occur within open, collaborative, trustworthy environments. 

The researchers point to the success of open, collaborative efforts, such as the National COVID Cohort Collaborative, which have enabled teams of researchers to mobilize tools and best practices quickly to address urgent public health needs. Mayo Clinic has been a leader in many such efforts.

Dr. Liu also supports open collaboration as a necessary step to reduce health disparities.

"Open collaboration ensures access to these resources for everyone," says Dr. Liu. "This fosters research that equitably advances health for all people."