Active Learning in Biomedical Text Classification Using a Bag-of-Regular-Expressions Approach

Flores C.A.; Verschae R.

Keywords: regular expressions, active learning, biomedical text classification

Abstract

Biomedical text classification requires data annotated by experts, a costly and time-consuming process. To reduce annotation effort, Active Learning (AL) offers a way to select the most informative texts from a large collection of unlabeled data. Biomedical texts, in turn, are characterized by complex patterns, such as numerical features, abbreviations, and typos, which could be captured effectively with a Bag-of-Words (BoW) representation or its extensions, such as Term Frequency-Inverse Document Frequency (TF-IDF), rather than with embedding representations, provided the appropriate features are extracted. In this context, character sequences known as Regular Expressions (RegExes) could be used to generate a feature space representative of the texts. We propose analyzing the AL process in traditional classifiers using a Bag-of-RegExes. The performance of BoW-based classifiers built on Naïve Bayes (NB), Random Forest (RF), eXtreme Gradient Boosting (XGB), and Support Vector Machine (SVM) was evaluated on biomedical texts labeled for obesity, obesity types, and smoking classification problems. The results indicate that the classifiers performed better with a feature space based on RegExes. Notably, XGB achieved an F-measure (F1) of over 90% while requiring only 51% to 66% of the total training samples. These results highlight the potential of a Bag-of-RegExes to represent complex patterns, improve interpretability, and outperform state-of-the-art classifiers such as Bidirectional Encoder Representations from Transformers (BERT) and Sentence Transformer Fine-tuning (SetFit) while reducing the need for labeled data. © 2025 IEEE.
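
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Bag-of-RegExes can be approximated by counting matches of a fixed pattern set, and it assumes a least-confidence query strategy for the AL step; the abstract does not state which patterns or which query strategy the paper uses. The pattern names, the patterns themselves, and the helper functions `bag_of_regexes` and `least_confident` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; they are not taken from the paper.
PATTERNS = {
    "bmi_value": re.compile(r"\bbmi\s*(?:of|:|=)?\s*\d{2}(?:\.\d+)?", re.IGNORECASE),
    "obese_term": re.compile(r"\bobes\w*", re.IGNORECASE),
    "smoking_term": re.compile(r"\b(?:smok\w*|tobacco|cigarettes?)\b", re.IGNORECASE),
    "pack_years": re.compile(r"\d+\s*(?:pack[- ]?years?|ppd)\b", re.IGNORECASE),
}


def bag_of_regexes(text):
    """Return one match count per pattern, in the fixed order of PATTERNS."""
    return [len(pattern.findall(text)) for pattern in PATTERNS.values()]


def least_confident(probabilities, k):
    """Return the indices of the k samples whose top class probability is lowest.

    This is an assumed query strategy (least-confidence uncertainty sampling);
    `probabilities` holds one list of class probabilities per unlabeled sample.
    """
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:k]


# Made-up clinical-style notes, used only to exercise the functions.
notes = [
    "Pt is obese, BMI: 34.2. Denies tobacco use.",
    "Smoker, 20 pack-years. Quit smoking in 2010.",
]
features = [bag_of_regexes(note) for note in notes]

# Made-up classifier outputs for three unlabeled samples.
pool_probabilities = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40]]
to_annotate = least_confident(pool_probabilities, k=2)
```

In a full AL loop, the feature vectors would feed any of the classifiers named in the abstract, and the selected indices would be sent to an expert for labeling before the classifier is retrained. Because each feature corresponds to a named pattern, feature importances can be read directly against the patterns, which is one plausible reading of the interpretability claim.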

More information

Title according to WOS: Active Learning in Biomedical Text Classification Using a Bag-of-Regular-Expressions Approach
Title according to SCOPUS: Active Learning in Biomedical Text Classification Using a Bag-of-Regular-Expressions Approach
Journal title: Proceedings - IEEE Symposium on Computer-Based Medical Systems
Publisher: Institute of Electrical and Electronics Engineers Inc.
Publication date: 2025
Start page: 208
End page: 213
Language: English
DOI: 10.1109/CBMS65348.2025.00051

Notes: ISI, SCOPUS