Advanced techniques in web data pre-processing and cleaning

Dell R.F.; Roman P.E.; Velásquez J.D.

Abstract

Central to successful e-business is the construction of web sites that attract users, capture their preferences, and entice them to make a purchase. Web mining applies diverse data mining techniques to categorize both the content and structure of web sites, with the goal of aiding e-business. It requires knowledge of the web site structure (the hyperlink graph), the web content (the vector model), and user sessions (the sequence of pages each user visits on a site). Much of the data for web mining can be noisy. The noise comes from many sources, for example undocumented changes to the web site structure and content, differing interpretations of text and media semantics, and web logs without individual user identification. There may also be no record of how many times a specific page was visited in a session, because the page is stored in a proxy or web browser cache. Such noise presents a challenge for web mining. This chapter presents issues with, and approaches for, cleaning web data in preparation for web mining analysis. © 2010 Springer-Verlag Berlin Heidelberg.

More information

Journal title: INTELLIGENT DISTRIBUTED COMPUTING VII
Volume: 311
Publisher: SPRINGER-VERLAG BERLIN
Publication date: 2010
Start page: 19
End page: 48
URL: http://www.scopus.com/inward/record.url?eid=2-s2.0-77956537722&partnerID=q2rCbXpz