New space/time tradeoffs for top-k document retrieval on sequences
Abstract
We address the problem of indexing a collection D = {T-1, T-2,..., T-D)} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queries in optimal O (p + k) time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document T-i), an index occupying vertical bar CSA vertical bar + o(n) bits answers the query in time O (t(search)(p) + klg(2) klg(epsilon) n), where CSA is a compressed suffix array indexing D, t(search)(p) is its time to find the suffix array interval of P, and epsilon > 0 is any constant. (2) With the same measure of relevance, an index occupying vertical bar CSA vertical bar + n lg D o(n lg sigma + n lg D) bits answers the query in time O (t(search) (p) + k lg* k), where lg* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying vertical bar CSA vertical bar + O (n lg lg n) bits answers the query in O (t(search) (p) + k t(SA)) time, where t(SA) is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest. (C) 2014 Elsevier B.V. All rights reserved.
Más información
Título según WOS: | New space/time tradeoffs for top-k document retrieval on sequences |
Título según SCOPUS: | New space/time tradeoffs for top-k document retrieval on sequences |
Título de la Revista: | THEORETICAL COMPUTER SCIENCE |
Volumen: | 542 |
Número: | C |
Editorial: | ELSEVIER SCIENCE BV |
Fecha de publicación: | 2014 |
Página de inicio: | 83 |
Página final: | 97 |
Idioma: | English |
DOI: |
10.1016/j.tcs.2014.05.005 |
Notas: | ISI, SCOPUS |