Measuring the Effects of Summarization in Cluster-based Information Retrieval

Curiel, Arturo; Gutierrez-Soto, Claudio; Soto-Borquez, Pablo-Nicolas; Galdames, Patricio A.


Summarization is an integral part of the modern Internet. In social networks, which have become primary information sources, users have grown accustomed to condensing their writing. Content providers routinely publish short textual excerpts to these platforms as well. However, as larger quantities of small documents become available, search engines have less data per document with which to index, classify, and retrieve relevant information. In this regard, more research is needed to show how reliable current Information Retrieval (IR) algorithms are when confronted with collections made up exclusively of short documents, such as those arising from social media.

This paper explores the semantic proximity between human summaries and queries through cluster analysis, and how it relates to IR. In short, the k-means algorithm was used to cluster two collections of summaries, one in English and one in Spanish, by their semantic similarity, in order to measure how summarization may affect information content in cluster-based IR. The same algorithm was also used to measure how documents grouped around a set of artificially generated queries.

The results show that, regardless of the language, providing the algorithm with prior category knowledge may help increase the accuracy of cluster-based document classification. Furthermore, some evidence points to an effect of summary quality on retrievability: summaries created by specialized summarizers induced more distinguishable clusters than summaries created by university students. Future work in this area may serve to adapt existing algorithms to large collections of short documents, improving IR performance in cases where machine learning techniques are not available.
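The idea of giving k-means prior category knowledge can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the text representation or the similarity measure, so bag-of-words counts and cosine similarity are assumed here, and the function names (`vectorize`, `cosine`, `mean_vector`, `seeded_kmeans`) and the toy documents are hypothetical. Prior knowledge is modeled, again as an assumption, by initializing each centroid from a short seed text for its category instead of a random point.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words counts: an assumed stand-in for a semantic representation.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_vector(vectors):
    # Component-wise mean of a non-empty list of sparse vectors.
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

def seeded_kmeans(docs, seeds, iterations=10):
    # k-means-style loop whose initial centroids come from seed texts
    # (the assumed form of "prior category knowledge").
    vectors = [vectorize(d) for d in docs]
    centroids = [vectorize(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels

# Toy short documents, invented purely for illustration.
docs = [
    "stock markets fell as investors sold shares",
    "the team won the football match",
    "investors bought shares after markets rose",
    "the football team lost the final match",
]
seeds = ["markets shares investors", "football match team"]
labels = seeded_kmeans(docs, seeds)
print(labels)  # [0, 1, 0, 1]
```

With seeded centroids, the cluster indices line up with the known categories from the start, which makes cluster-based classification straightforward to score. A query could be vectorized the same way and compared against the final centroids, though that step is omitted from this sketch.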

More information

Publication date: 2020
Start/End Date: November 16-20, 2020
First page: 1
Last page: 8