Toward a Definitive Compressibility Measure for Repetitive Sequences

Abstract

While the kth-order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of k and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate repetitiveness. The size b <= z of the smallest bidirectional macro scheme better captures what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotone upon appending symbols. Recently, a more principled measure, the size gamma of the smallest string attractor, was introduced. The measure gamma <= b lower-bounds all the previous relevant ones, while length-n strings can be represented and efficiently indexed within space O(gamma log(n/gamma)), which also upper-bounds many measures, including z. Although gamma is arguably a better measure of repetitiveness than b, it is also NP-complete to compute and not monotone, and it is unknown if one can represent all strings in o(gamma log n) space. In this paper, we study an even smaller measure, delta <= gamma, which can be computed in linear time, is monotone, and allows encoding every string in O(delta log(n/delta)) space because z = O(delta log(n/delta)). We argue that delta better captures the compressibility of repetitive strings.
Concretely, we show that (1) delta can be strictly smaller than gamma, by up to a logarithmic factor; (2) there are string families needing Omega(delta log(n/delta)) space to be encoded, so this space is optimal for every n and delta; (3) one can build run-length context-free grammars of size O(delta log(n/delta)), whereas the smallest (non-run-length) grammar can be up to Theta(log n / log log n) times larger; and (4) within O(delta log(n/delta)) space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size sigma of the string, showing that Theta(delta log((n log sigma) / (delta log n))) space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries.
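
The abstract does not spell out how delta is defined. As background, and as an assumption taken from the literature on substring complexity rather than from the text above, delta is commonly defined as the maximum over k of d_k / k, where d_k is the number of distinct length-k substrings of the string. The sketch below computes delta and the Lempel-Ziv parse size z by brute force on small inputs, only to illustrate the relation delta <= z. The function names are hypothetical, and the code is far slower than the linear-time computation the abstract refers to.

```python
def substring_complexity(s: str) -> float:
    """Naive delta: max over k of (number of distinct length-k substrings) / k.

    Assumed definition (not stated in the abstract). Brute force, not linear time.
    """
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best


def lz_parse_size(s: str) -> int:
    """Naive z: number of phrases in a greedy left-to-right Lempel-Ziv parse.

    Each phrase is the longest prefix of the remaining text that also occurs
    starting at an earlier position (overlaps allowed), or a single symbol
    if no such prefix exists.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 0
        # Grow the phrase while s[i:i+length+1] has an occurrence starting before i.
        # Limiting the search window to s[0:i+length] forces the start to be < i.
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)
        phrases += 1
    return phrases


if __name__ == "__main__":
    for text in ["abababab", "aaaa", "abcd", "abracadabra"]:
        print(text, substring_complexity(text), lz_parse_size(text))
```

On a highly repetitive input such as "abababab", both values stay small while the length grows, which is the behaviour the measures are meant to capture.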

More information

Title according to WOS: Toward a Definitive Compressibility Measure for Repetitive Sequences
Journal title: IEEE TRANSACTIONS ON INFORMATION THEORY
Volume: 69
Issue: 4
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication date: 2023
Start page: 2074
End page: 2092
DOI: 10.1109/TIT.2022.3224382

Notes: ISI