CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions

Flores, C. A.; Figueroa, R. L.; Pezoa, J. E.

Abstract

High-accuracy text classifiers are now used to organize large amounts of biomedical information and to support clinical decision-making. In medical informatics, regular-expression-based classifiers have emerged as an alternative to traditional discriminative classification algorithms because they can model sequential patterns. This article presents CREGEX (Classifier Regular Expression), a biomedical text classifier built on an automatically generated feature space of regular expressions. We devised an algorithm that automatically constructs an informative and discriminative regular-expression feature space suitable for both binary and multiclass discrimination problems. Regular expressions are generated from training texts using a coarse-to-fine text alignment method, which balances capturing the lexical variants of words (gender and grammatical number) against producing a feature space with a large number of noisy features. CREGEX performs feature selection by filtering keywords and computes a confidence metric to classify test texts. Three de-identified datasets in Spanish, covering smoking habits, obesity, and obesity types, were used to assess the performance of CREGEX. For comparison, Support Vector Machine (SVM) and Naïve Bayes (NB) supervised classifiers were trained with consecutive sequences of tokens (n-grams) as features. Results show that, on all the evaluation datasets, CREGEX not only outperformed both the SVM and NB classifiers in accuracy and F-measure (p-value < 0.05) but also required fewer training examples to reach the same performance. This superior performance is attributed to the regular expressions' ability to represent complex text patterns.
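The general idea of classifying with regular expressions as features can be illustrated with a minimal Python sketch. This is not the CREGEX algorithm: the patterns below are hand-written rather than generated by text alignment, and the weighting scheme (the share of matched training texts per class) and the normalized score used as a confidence value are assumptions chosen for illustration. All names (`PATTERNS`, `regex_features`, `fit_weights`, `classify`) and the toy Spanish sentences are hypothetical.

```python
import re
from collections import Counter

# Hypothetical hand-written patterns. The first one covers gender and
# number variants of "fumador" (fumadora, fumadores, fumadoras).
PATTERNS = [
    r"\bfumador(a|es|as)?\b",
    r"\bno\s+fuma\b",
    r"\bniega\b.*\btabaco\b",
]

def regex_features(text, patterns):
    """Binary feature vector: 1 if the pattern matches the text, else 0."""
    return [1 if re.search(p, text, flags=re.IGNORECASE) else 0 for p in patterns]

def fit_weights(texts, labels, patterns):
    """For each pattern, estimate the share of matched training texts per class."""
    classes = sorted(set(labels))
    weights = []
    for p in patterns:
        counts = Counter(
            lab for t, lab in zip(texts, labels)
            if re.search(p, t, flags=re.IGNORECASE)
        )
        total = sum(counts.values())
        weights.append({c: (counts[c] / total if total else 0.0) for c in classes})
    return classes, weights

def classify(text, patterns, classes, weights):
    """Sum the weights of matching patterns; return (label, confidence)."""
    scores = {c: 0.0 for c in classes}
    for hit, w in zip(regex_features(text, patterns), weights):
        if hit:
            for c in classes:
                scores[c] += w[c]
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no pattern matched: abstain
    best = max(classes, key=lambda c: scores[c])
    return best, scores[best] / total

# Toy training set (invented for this sketch, not taken from the paper).
TRAIN_TEXTS = [
    "paciente fumador activo",
    "paciente fumadora desde 2010",
    "paciente no fuma",
    "niega consumo de tabaco",
]
TRAIN_LABELS = ["smoker", "smoker", "non_smoker", "non_smoker"]

CLASSES, WEIGHTS = fit_weights(TRAIN_TEXTS, TRAIN_LABELS, PATTERNS)
label, confidence = classify(
    "Paciente fumadora, 10 cigarrillos al dia", PATTERNS, CLASSES, WEIGHTS
)
print(label, confidence)
```

A text that matches no pattern yields `(None, 0.0)`, so a caller can treat a low confidence value as a signal to abstain or to fall back on another classifier.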

More information

Title according to WOS: CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions
Title according to SCOPUS: CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions
Journal title: IEEE ACCESS
Volume: 8
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication date: 2020
Start page: 29270
End page: 29280
Language: English
DOI: 10.1109/ACCESS.2020.2972205

Notes: ISI, SCOPUS