Indexes for highly repetitive document collections

Claude F.; Fariña A.; Martinez-Prieto M.A.; Navarro G.

Keywords: compression, information, list, management, length, knowledge, magnitude, collections, data, collection, methods, retrieval, orders, of, Repetitive, Inverted, highly, Indices, Document, Versioning, Grammar-based, Run

Abstract

We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We also introduce compressed self-indexes in the comparison. We show that techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower. © 2011 ACM.

Más información

Título de la Revista: 1604-2004: SUPERNOVAE AS COSMOLOGICAL LIGHTHOUSES
Editorial: ASTRONOMICAL SOC PACIFIC
Fecha de publicación: 2011
Página de inicio: 463
Página final: 468
URL: http://www.scopus.com/inward/record.url?eid=2-s2.0-83055168142&partnerID=q2rCbXpz