Embedded feature selection for spam and phishing filtering using support vector machines

Maldonado S.; L'Huillier G.

Keywords: binary, information, selection, support, recognition, classification, machines, sets, gradient, extraction, messages, errors, pattern, computer, data, formulations, internet, nonlinear, crime, vector, methods, dissemination, descent, results, dual, feature, (of, information), minimal, Embedded, Spam, High-dimensional, anisotropic, Content-based, RBF, Phishing, End-users

Abstract

Today, the Internet is full of harmful and wasteful elements, such as phishing and spam messages, which must be properly classified before reaching end-users. This issue has attracted the attention of the pattern recognition community and motivated efforts to determine which strategies achieve the best classification results. Several methods use as many features as the data set has content-based properties, which leads to a high-dimensional classification problem. In this context, this paper presents a feature selection approach that simultaneously determines a nonlinear classification function with minimal error and minimizes the number of features by penalizing their use in the dual formulation of binary Support Vector Machines (SVM). The method optimizes the width of an anisotropic RBF kernel via successive gradient descent steps, eliminating features that have low relevance for the model. Experiments with two real-world spam and phishing data sets demonstrate that our approach achieves the best performance compared to well-known feature selection methods while consistently using a small number of features.
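
To make the idea of an anisotropic RBF kernel concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the kernel form K(x, z) = exp(-sum_j sigma_j^2 (x_j - z_j)^2), with one width per feature, and a simple threshold rule for discarding features whose width has shrunk close to zero. The function names `anisotropic_rbf` and `drop_low_relevance`, the threshold `eps`, and the toy data are all hypothetical; the gradient descent update of the widths described in the abstract is not shown.

```python
import numpy as np

def anisotropic_rbf(X, Z, sigma):
    """Assumed kernel form: K[i, s] = exp(-sum_j sigma[j]**2 * (X[i, j] - Z[s, j])**2)."""
    Xs = X * sigma  # scale each feature by its own width
    Zs = Z * sigma
    # Pairwise squared Euclidean distances between the scaled rows.
    sq = (Xs**2).sum(axis=1)[:, None] + (Zs**2).sum(axis=1)[None, :] - 2.0 * Xs @ Zs.T
    return np.exp(-np.maximum(sq, 0.0))  # clip tiny negatives from rounding

def drop_low_relevance(sigma, eps=1e-3):
    """Hypothetical elimination rule: set widths below eps to zero.

    A zero width removes that feature from the kernel entirely.
    """
    sigma = sigma.copy()
    sigma[np.abs(sigma) < eps] = 0.0
    return sigma

# Toy data: the third feature differs a lot between the two rows,
# but its width is nearly zero, so it is eliminated.
X = np.array([[0.0, 1.0, 5.0],
              [1.0, 1.0, -3.0]])
sigma = drop_low_relevance(np.array([1.0, 0.5, 1e-5]))
K = anisotropic_rbf(X, X, sigma)
```

In this toy case the third feature no longer influences `K`, so the similarity between the two rows depends only on the first two features. In a full method, the widths would be updated iteratively against a penalized SVM objective before each elimination step.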

More information

Journal title: Unknown (9789898425980)
Volume: 2
Publisher: Unknown
Publication date: 2012
Start page: 445
End page: 450
URL: http://www.scopus.com/inward/record.url?eid=2-s2.0-84862194866&partnerID=40&md5=29722b83b32d2d18d4c526630ca9e367