Active Learning for Biomedical Text Classification Based on Automatically Generated Regular Expressions

Flores, Christopher A.; Figueroa, Rosa L.; Pezoa, Jorge E.

Abstract

Biomedical text classification algorithms, which currently support clinical decision-making processes, require training texts that are expensive to obtain, owing to the scarcity of labeled corpora and the cost of manual annotation by specialized professionals. The active learning (AL) approach to classification substantially lowers this cost by reducing the number of labeled documents required to reach a specified performance. This article introduces a query strategy and a stopping criterion that transform CREGEX, a regular-expressions-based text classification algorithm, into an AL biomedical text classifier. The query strategy samples the training dataset, trading off the greedy learning driven by the classification precision of the regular expressions against the conservative learning induced by classification based on text sequence alignment. The sustained reduction in the variance of the query strategy scores is used as a stopping criterion. The AL classifier was compared with Support Vector Machine (SVM), Naive Bayes (NB), and a classifier based on Bidirectional Encoder Representations from Transformers (BERT), using three datasets with biomedical information in Spanish on smoking habits, obesity, and obesity types. The learning curve results indicate that AL in CREGEX needed fewer training examples than the other classifiers to reach equal performance, with areas under the learning curve greater than 85% in all cases. With the stopping criterion, the AL process used, on average, approximately 32% to 50% of the total training examples, while performance stayed within 2% of the maximum value of the learning curve. This performance demonstrates the effectiveness of using AL in a biomedical text classifier based on regular expressions, which is attributable to the ability of such expressions to represent intricate sequential patterns in the training texts considered most informative.
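The abstract states only that a sustained reduction in the variance of the query strategy scores serves as the stopping criterion; it does not give the exact rule. The following is a minimal sketch of one possible reading, not the authors' implementation. The function name `should_stop`, the `patience` parameter, and the rule "variance strictly decreases for `patience` consecutive iterations" are all assumptions made for illustration.

```python
from statistics import pvariance
from typing import Sequence


def should_stop(score_batches: Sequence[Sequence[float]], patience: int = 3) -> bool:
    """Hypothetical stopping rule for an active learning loop.

    score_batches holds, for each AL iteration so far, the query strategy
    scores computed in that iteration. The rule (an assumption, not taken
    from the paper) stops once the variance of those scores has strictly
    decreased for `patience` consecutive iterations.
    """
    # We need patience + 1 variances to observe `patience` consecutive drops.
    if len(score_batches) < patience + 1:
        return False

    # Population variance of the scores in each iteration.
    variances = [pvariance(batch) for batch in score_batches]

    # Compare each of the last `patience` variances with its predecessor.
    recent = variances[-(patience + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Illustrative scores only (made up for this sketch, not experimental data).
history = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
stop_now = should_stop(history, patience=3)
```

In an AL loop, such a check would run after each query round, and labeling would end the first time it returns `True`. A larger `patience` makes the rule more conservative, at the price of more labeled examples.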

More information

Title according to WOS: Active Learning for Biomedical Text Classification Based on Automatically Generated Regular Expressions
Journal title: IEEE ACCESS
Volume: 9
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication date: 2021
Start page: 38767
End page: 38777
DOI: 10.1109/ACCESS.2021.3064000

Notes: ISI