CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions
Abstract
High-accuracy text classifiers are now used to organize large amounts of biomedical information and to support clinical decision-making. In medical informatics, regular-expression-based classifiers have emerged as an alternative to traditional discriminative classification algorithms because of their ability to model sequential patterns. This article presents CREGEX (Classifier Regular Expression), a biomedical text classifier based on an automatically generated regular-expression feature space. We conceived an algorithm for automatically constructing an informative and discriminative regular-expression-based feature space, suitable for binary and multiclass discrimination problems. Regular expressions are automatically generated from training texts using a coarse-to-fine text alignment method, which balances capturing the lexical variants of words, in terms of gender and grammatical number, against generating a feature space that contains a large number of noisy features. CREGEX carries out feature selection by filtering keywords and also computes a confidence metric to classify test texts. Three de-identified datasets in Spanish, with information on smoking habits, obesity, and obesity types, were used to assess the performance of CREGEX. For comparison, Support Vector Machine (SVM) and Naïve Bayes (NB) supervised classifiers were also trained with consecutive sequences of tokens (n-grams) as features. Results show that, in all the datasets used for evaluation, CREGEX not only outperformed both the SVM and NB classifiers in terms of accuracy and F-measure (p-value < 0.05) but also needed fewer training examples to achieve the same performance. This superior performance is attributed to the regular expressions' ability to represent complex text patterns.
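The general idea of classifying with regular expressions as features and a per-class confidence score can be illustrated with a minimal Python sketch. This is not the CREGEX algorithm: the patterns below are hand-written stand-ins for automatically generated ones, the toy Spanish sentences are invented for illustration, and the confidence score (the share of matching training texts per class) is an assumed, simplified metric. The names `TRAIN`, `PATTERNS`, `fit`, and `classify` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Invented toy training set (not taken from the paper's datasets).
TRAIN = [
    ("paciente fumador desde hace 10 años", "smoker"),
    ("paciente fumadora de 5 cigarrillos al día", "smoker"),
    ("no fuma ni ha fumado nunca", "non-smoker"),
    ("niega ser fumadora", "non-smoker"),
]

# Hand-written patterns standing in for automatically generated ones.
# The optional suffix group covers gender and number variants.
PATTERNS = [
    r"\bpaciente\s+fumador(a|es|as)?\b",
    r"\b(no|niega)\s+(\w+\s+)?fum\w+",
]


def fit(train, patterns):
    """For each pattern, estimate the class distribution of the texts it matches."""
    model = []
    for pattern in patterns:
        rx = re.compile(pattern)
        counts = Counter(label for text, label in train if rx.search(text))
        total = sum(counts.values())
        weights = {label: n / total for label, n in counts.items()} if total else {}
        model.append((rx, weights))
    return model


def classify(model, text):
    """Sum the class weights of every matching pattern; return (label, confidence)."""
    scores = defaultdict(float)
    for rx, weights in model:
        if rx.search(text):
            for label, weight in weights.items():
                scores[label] += weight
    if not scores:
        return None, 0.0  # no pattern matched, so no prediction is made
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


model = fit(TRAIN, PATTERNS)
print(classify(model, "el paciente fumador refiere tos"))
print(classify(model, "niega fumar"))
```

In this sketch a text that matches no pattern is left unclassified rather than forced into a class, which is one possible design choice and not necessarily the one made in the paper.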
More information
| Title according to WOS: | CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions |
| Title according to SCOPUS: | CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions |
| Journal title: | IEEE ACCESS |
| Volume: | 8 |
| Publisher: | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
| Publication date: | 2020 |
| First page: | 29270 |
| Last page: | 29280 |
| Language: | English |
| DOI: | 10.1109/ACCESS.2020.2972205 |
| Notes: | ISI, SCOPUS |