Active Learning in Biomedical Text Classification Using a Bag-of-Regular-Expressions Approach
Keywords: regular expressions, active learning, biomedical text classification
Abstract
Biomedical text classification requires data annotated by experts, a costly and time-consuming process. Active Learning (AL) reduces the annotation effort by selecting the most informative texts from a large collection of unlabeled data. Biomedical texts, in turn, are characterized by complex patterns, such as numerical features, abbreviations, and typos. If the appropriate features are extracted, these patterns could be captured effectively with a Bag-of-Words (BoW) representation or one of its extensions, such as Term Frequency-Inverse Document Frequency (TF-IDF), instead of embedding representations. In this context, character sequences known as Regular Expressions (RegExes) could be used to generate a feature space representative of the texts. We propose analyzing the AL process in traditional classifiers using a Bag-of-RegExes. The performance of BoW-based classifiers built on Naïve Bayes (NB), Random Forest (RF), eXtreme Gradient Boosting (XGB), and Support Vector Machine (SVM) was evaluated on biomedical texts labeled for obesity, obesity types, and smoking classification problems. The results indicate that classifier performance improved with a feature space based on RegExes. Notably, XGB achieved an F-measure (F1) of over 90% while requiring only 51% to 66% of the total training samples. These results highlight the potential of a Bag-of-RegExes to represent complex patterns, improve interpretability, and outperform state-of-the-art classifiers such as Bidirectional Encoder Representations from Transformers (BERT) and Sentence Transformer Fine-tuning (SetFit) while reducing the need for labeled data. © 2025 IEEE.
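To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a Bag-of-RegExes counts the matches of each pattern in a text and that AL queries the samples with the least confident predictions; the abstract does not specify either. The patterns, the names `PATTERNS`, `bag_of_regexes`, and `least_confident`, and the example note are all hypothetical.

```python
import re

# Hypothetical patterns for illustration only; they are not the paper's RegExes.
PATTERNS = {
    "bmi_value": re.compile(r"\bBMI\s*(?:of|:|=)?\s*\d{2}(?:\.\d+)?", re.IGNORECASE),
    "obesity_term": re.compile(r"\bobes(?:e|ity)\b", re.IGNORECASE),
    "smoking_term": re.compile(
        r"\b(?:smok(?:er|es|ed|ing)|tobacco|cigarettes?)\b", re.IGNORECASE
    ),
    "pack_years": re.compile(r"\b\d+\s*pack[- ]years?\b", re.IGNORECASE),
}


def bag_of_regexes(text):
    """Return one match count per pattern, in the fixed order of PATTERNS."""
    return [len(pattern.findall(text)) for pattern in PATTERNS.values()]


def least_confident(probabilities, k):
    """Return the indices of the k samples whose top class probability is lowest.

    `probabilities` holds one list of class probabilities per unlabeled sample,
    as produced by any probabilistic classifier (assumed, not from the paper).
    """
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:k]


# Invented clinical-style note, used only to show the feature vector.
note = "Pt is obese, BMI of 34.2. Former smoker, 20 pack-years."
features = bag_of_regexes(note)

# Invented class probabilities for three unlabeled samples.
query = least_confident([[0.9, 0.1], [0.55, 0.45], [0.6, 0.4]], k=2)
```

Because each feature corresponds to a named, human-readable pattern, a vector such as `features` can be traced back to the exact text spans that produced it, which is one plausible reading of the interpretability benefit mentioned in the abstract.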
More information
| Title according to WOS: | Active Learning in Biomedical Text Classification Using a Bag-of-Regular-Expressions Approach |
| Title according to SCOPUS: | Active Learning in Biomedical Text Classification Using a Bag-of-Regular-Expressions Approach |
| Journal title: | Proceedings - IEEE Symposium on Computer-Based Medical Systems |
| Publisher: | Institute of Electrical and Electronics Engineers Inc. |
| Publication date: | 2025 |
| Start page: | 208 |
| End page: | 213 |
| Language: | English |
| DOI: | 10.1109/CBMS65348.2025.00051 |
| Notes: | ISI, SCOPUS |