Combining Regular Expressions and Supervised Algorithms for Clinical Text Classification

Flores, Christopher; Verschae-Tannenbaum, Rodrigo Andres

Abstract

Clinical text classification allows assigning labels to content-based data using machine learning algorithms. However, unlike other study domains, clinical texts present complex linguistic diversity, including abbreviations, typos, and numerical patterns that are difficult to represent by the most-used classification algorithms. In this sense, sequences of character strings and symbols, known as Regular Expressions (RegExs), offer an alternative to represent complex patterns from the texts and could be used jointly with the most commonly used classification algorithms for accurate text classification. Thus, a classification algorithm can label test texts when RegExs produce no matches. This work proposes a method that combines automatically generated RegExs and supervised algorithms for classifying clinical texts. RegExs are automatically generated using alignment algorithms in a supervised manner, filtering out those that do not meet a minimum confidence threshold and do not contain specific keywords for the classification problem. At prediction time, our method assigns the class of the most confident RegEx that matches a test text. When no RegEx matches a test text, a supervised algorithm assigns a class. Three clinical datasets with textual information on obesity and smoking habits were used to assess the performance of four classifiers based on Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), and Bidirectional Encoder Representations from Transformers (BERT). Classification results indicate that our method, on average, improved the classifiers’ performance by up to 12% in all performance metrics. These results show the ability of our method to generate confident RegExs that capture representative patterns from the texts for use with supervised algorithms.
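The prediction rule described in the abstract (use the most confident matching RegEx, otherwise defer to a supervised classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The names `ScoredRegex`, `filter_regexes`, and `predict` are hypothetical, and so are the example patterns and confidence values. The filter assumes a RegEx is kept only when it reaches the threshold and its source contains a keyword, which is one possible reading of the abstract. The alignment-based RegEx generation step is not shown, and the fallback is any callable standing in for a trained RF, SVM, NB, or BERT classifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredRegex:
    pattern: str       # regular expression source (hypothetical examples below)
    label: str         # class assigned when the pattern matches
    confidence: float  # assumed score in [0, 1], e.g. estimated on training texts


def filter_regexes(candidates: List[ScoredRegex],
                   min_confidence: float,
                   keywords: List[str]) -> List[ScoredRegex]:
    """Keep patterns that reach the threshold and mention a keyword.

    Assumption: the keyword test is a plain substring check on the
    pattern source; the paper may define it differently.
    """
    return [
        r for r in candidates
        if r.confidence >= min_confidence
        and any(k in r.pattern.lower() for k in keywords)
    ]


def predict(text: str,
            regexes: List[ScoredRegex],
            fallback: Callable[[str], str]) -> str:
    """Return the label of the most confident matching pattern.

    If no pattern matches, defer to the supervised fallback classifier.
    """
    matching = [r for r in regexes
                if re.search(r.pattern, text, flags=re.IGNORECASE)]
    if matching:
        return max(matching, key=lambda r: r.confidence).label
    return fallback(text)


# Hypothetical candidate patterns with made-up confidence values.
candidates = [
    ScoredRegex(r"\bbmi\s*(of|:)?\s*(3\d|4\d)", "obese", 0.95),
    ScoredRegex(r"\bobes\w*", "obese", 0.80),
    ScoredRegex(r"\bpatient\b", "obese", 0.40),
    ScoredRegex(r"\bno\s+obes\w*", "not_obese", 0.90),
]
kept = filter_regexes(candidates, min_confidence=0.7, keywords=["bmi", "obes"])
```

With these made-up patterns, a text such as "Patient has no obesity." matches two of the kept RegExs, and the more confident one decides the label. A text that matches none is passed to the fallback classifier.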

More information

Scopus ID: 85177809834
Journal title: Lecture Notes in Computer Science
Volume: 14404 LNCS
Publisher: Springer
Publication date: 2023
Start page: 381
End page: 392
DOI: 10.1007/978-3-031-48232-8_35

Notes: SCOPUS