Compressed representations of sequences and full-text indexes
Keywords: binary, compression, information, management, tree, sequences, trees, time, indexes, text, data, transform, transformations, wavelet, query, methods, processing, mathematical, burrows, Computational, Indexing, (of, information), (mathematics), counting, Wheeler, Integers
Abstract
Given a sequence S = s1 s2 ... sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can retrieve any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S, and nH0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Specifically, we design a variant of the FM-index that indexes a string T[1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the kth-order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log_|Σ| n, constant 0 < α < 1, and |Σ| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^(1+ε) n) time for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log^(1+ε) n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependence from all query times; in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space as the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |Σ| = O(n^β), for any 0 < β < 1, by paying o(n log |Σ|) extra space and multiplying all query times by O(log |Σ| / log log n). © 2007 ACM.
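Illustrative sketch (not part of the original record): the abstract relies on three notions that a small example may make concrete, namely the zero-order empirical entropy H0(S), rank/select queries on a sequence, and pattern counting over the Burrows-Wheeler transform by backward search, as done in FM-indexes. The Python below is a deliberately naive version under stated assumptions: rank and select are linear scans rather than the constant-time compressed structures of the article, the BWT is built by sorting rotations, and all function names (`h0`, `rank`, `select`, `bwt`, `count`) are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def h0(seq):
    """Zero-order empirical entropy H0(S), in bits per symbol."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


def rank(seq, c, i):
    """Number of occurrences of c in seq[0:i] (naive linear scan)."""
    return sum(1 for x in seq[:i] if x == c)


def select(seq, c, j):
    """0-based position of the j-th occurrence of c (j >= 1), or -1."""
    seen = 0
    for pos, x in enumerate(seq):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return -1


def bwt(text):
    """BWT via sorted rotations; assumes '\\0' does not occur in text."""
    t = text + "\0"  # unique sentinel, smaller than every other symbol
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def count(bwt_str, pattern):
    """Count occurrences of pattern in the text using backward search."""
    # C[c] = number of symbols in the text strictly smaller than c.
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_str)  # half-open range of sorted rotations
    for c in reversed(pattern):  # one step per pattern symbol
        if c not in C:
            return 0
        lo = C[c] + rank(bwt_str, c, lo)
        hi = C[c] + rank(bwt_str, c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Backward search performs two rank queries per pattern symbol, so if rank took constant time, counting would take O(p) time for a pattern of length p; in this sketch each rank is a linear scan, so it only illustrates the query pattern, not the stated bounds.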
More information
Title according to SCOPUS: | Compressed representations of sequences and full-text indexes |
Journal title: | ACM TRANSACTIONS ON ALGORITHMS |
Volume: | 3 |
Issue: | 2 |
Publisher: | ASSOC COMPUTING MACHINERY |
Publication date: | 2007 |
Language: | eng |
URL: | http://www.scopus.com/inward/record.url?eid=2-s2.0-34250171723&partnerID=q2rCbXpz |
DOI: | 10.1145/1240233.1240243 |
Notes: | SCOPUS |