Toward a Definitive Compressibility Measure for Repetitive Sequences

Abstract

While the k th order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of k and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate repetitiveness. The size b ≤ z of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute, and it is not monotone upon appending symbols. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ ≤ b lower-bounds all the previous relevant ones, while length- n strings can be represented and efficiently indexed within space O(γ log n/γ), which also upper-bounds many measures, including z. Although γ is arguably a better measure of repetitiveness than b, it is also NP-complete to compute and not monotone, and it is unknown if one can represent all strings in o(γ log n) space. In this paper, we study an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log n/δ) space because z = O(δ log n/δ). We argue that δ better captures the compressibility of repetitive strings. Concretely, we show that (1) δ can be strictly smaller than γ, by up to a logarithmic factor; (2) there are string families needing Ω (δ log n/δ) space to be encoded, so this space is optimal for every n and δ; (3) one can build run-length context-free grammars of size O(δ log n/δ), whereas the smallest (non-run-length) grammar can be up to Θ(log n/log log n) times larger; and (4) within O(δ log n/δ) space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size σ of the string, showing that Θ(δ log n log σ/δ log n) space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries.

Más información

Título según WOS: Toward a Definitive Compressibility Measure for Repetitive Sequences
Título según SCOPUS: Toward a Definitive Compressibility Measure for Repetitive Sequences
Título de la Revista: IEEE Transactions on Information Theory
Volumen: 69
Número: 4
Editorial: Institute of Electrical and Electronics Engineers Inc.
Fecha de publicación: 2023
Página de inicio: 2074
Página final: 2092
Idioma: English
DOI:

10.1109/TIT.2022.3224382

Notas: ISI, SCOPUS