Academic text classification based on lexical-semantic content / Clasificación de textos académicos en función de su contenido léxico-semántico

Venegas, R

Keywords: Academic discourse; Naïve Bayes; Support Vector Machine; Vectorial model

Abstract

This research aims to classify the academic texts in the PUCV-2006 Corpus, compiled as part of the Fondecyt 1060440 research project, by applying and comparing two automatic classification methods. The methods rely on shared lexical-semantic content words found in a corpus of academic texts used in four professional degree programs at the Pontificia Universidad Católica de Valparaíso, Chile. The research corpus currently comprises 652 texts totaling 96,288,874 words. For our purposes, we use a sample of 216 texts (30,886,081 words), distributed as follows: 26 used in Construction Engineering, 31 in Chemistry, 64 in Social Work, and 95 in Psychology. The classification methods compared are Multinomial Naïve Bayes and Support Vector Machine. Both identify a small group of shared words that, according to their statistical weights, make it possible to classify a new text into one of the four disciplinary areas. The results allow us to establish that Support Vector Machine classifies academic texts efficiently, with high precision and recall values. With this method we can automatically identify the disciplinary domain of a new academic text submitted as a query, with high accuracy (93.9%). We plan to use this method as part of a more detailed multidimensional analysis of the PUCV-2006 Corpus.
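
As an illustration of the word-weight idea described in the abstract, the following is a minimal sketch of a multinomial Naïve Bayes classifier with Laplace smoothing, written in plain Python. It is not the author's implementation: the function names `train_multinomial_nb` and `classify`, the smoothing value, and the toy training texts are all assumptions made for this example, and none of the data comes from the PUCV-2006 Corpus.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents.

    alpha is the Laplace smoothing constant (an assumed default).
    """
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in docs_per_class.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        model[label] = {
            # Log prior: share of training documents in this class.
            "log_prior": math.log(n_docs / len(docs)),
            # Smoothed log likelihood (the "statistical weight") of each word.
            "log_like": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model


def classify(model, doc):
    """Return the class with the highest log score; unseen words are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_like"][w] for w in doc if w in params["log_like"]
        )
    return max(model, key=score)


# Invented toy texts, one per disciplinary area named in the abstract.
train_docs = [
    "concrete beam load structure".split(),
    "reaction molecule acid solution".split(),
    "community family intervention welfare".split(),
    "cognition behavior therapy memory".split(),
]
train_labels = ["Construction Engineering", "Chemistry", "Social Work", "Psychology"]

model = train_multinomial_nb(train_docs, train_labels)
print(classify(model, "acid reaction in a solution".split()))  # prints "Chemistry"
```

In this sketch the query shares three words with the Chemistry training text, so that class receives the highest score. A real experiment would use many texts per area and a held-out test set to estimate precision, recall, and accuracy.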

More information

Title according to SCOPUS: Academic text classification based on lexical-semantic content / Clasificación de textos académicos en función de su contenido léxico-semántico
Journal title: REVISTA SIGNOS
Volume: 40
Issue: 63
Publisher: PONTIFICIA UNIVERSIDAD CATÓLICA DE VALPARAÍSO, INSTITUTO DE LITERATURA Y CIENCIAS DEL LENGUAJE
Publication date: 2007
Start page: 239
End page: 271
Language: eng
Notes: SCOPUS