Efficient Compressed Indexing for Approximate Top-k String Retrieval

Ferrada, Hector; Navarro, Gonzalo; Moura, E; Crochemore, M

Abstract

Given a collection of strings (called documents), the top-k document retrieval problem is that of, given a string pattern p, finding the k documents where p appears most often. This is a basic task in most information retrieval scenarios. The best current implementations require 20-30 bits per character (bpc) and k to 4k microseconds per query, or 12-24 bpc and 1-10 milliseconds per query. We introduce a Lempel-Ziv compressed data structure that occupies 5-10 bpc to answer queries in around k microseconds. The drawback is that the answer is approximate, but we show that its quality improves asymptotically with the size of the collection, reaching over 85% of the accumulated term frequency of the real answer already for patterns of length 4-6 on rather small collections, and improving for larger ones.

Más información

Título según WOS: Efficient Compressed Indexing for Approximate Top-k String Retrieval
Título de la Revista: BIO-INSPIRED SYSTEMS AND APPLICATIONS: FROM ROBOTICS TO AMBIENT INTELLIGENCE, PT II
Volumen: 8799
Editorial: SPRINGER INTERNATIONAL PUBLISHING AG
Fecha de publicación: 2014
Página de inicio: 18
Página final: 30
Notas: ISI