Toward a Definitive Compressibility Measure for Repetitive Sequences
Abstract
While the k th order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of k and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate repetitiveness. The size b ? z of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute, and it is not monotone upon appending symbols. Recently, a more principled measure, the size ? of the smallest string attractor, was introduced. The measure ? ? b lower-bounds all the previous relevant ones, while length- n strings can be represented and efficiently indexed within space O(? log n/?), which also upper-bounds many measures, including z. Although ? is arguably a better measure of repetitiveness than b, it is also NP-complete to compute and not monotone, and it is unknown if one can represent all strings in o(? log n) space. In this paper, we study an even smaller measure, ? ? ?, which can be computed in linear time, is monotone, and allows encoding every string in O(? log n/?) space because z = O(? log n/?). We argue that ? better captures the compressibility of repetitive strings. Concretely, we show that (1) ? can be strictly smaller than ?, by up to a logarithmic factor; (2) there are string families needing ? (? log n/?) space to be encoded, so this space is optimal for every n and ?; (3) one can build run-length context-free grammars of size O(? log n/?), whereas the smallest (non-run-length) grammar can be up to ?(log n/log log n) times larger; and (4) within O(? log n/?) space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size ? of the string, showing that ?(? log n log ?/? log n) space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries. © 1963-2012 IEEE.
Más información
| Título según WOS: | Toward a Definitive Compressibility Measure for Repetitive Sequences |
| Título según SCOPUS: | Toward a Definitive Compressibility Measure for Repetitive Sequences |
| Título de la Revista: | IEEE Transactions on Information Theory |
| Volumen: | 69 |
| Número: | 4 |
| Editorial: | Institute of Electrical and Electronics Engineers Inc. |
| Fecha de publicación: | 2023 |
| Página de inicio: | 2074 |
| Página final: | 2092 |
| Idioma: | English |
| DOI: |
10.1109/TIT.2022.3224382 |
| Notas: | ISI, SCOPUS |