PDI - Search Result

Abstract

While the kth-order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of k and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate repetitiveness. The size b ≤ z of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute, and it is not monotone upon appending symbols. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ ≤ b lower-bounds all the previous relevant ones, while length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper-bounds many measures, including z. Although γ is arguably a better measure of repetitiveness than b, it is also NP-complete to compute and not monotone, and it is unknown if one can represent all strings in o(γ log n) space. In this paper, we study an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space because z = O(δ log(n/δ)). We argue that δ better captures the compressibility of repetitive strings.
Concretely, we show that (1) δ can be strictly smaller than γ, by up to a logarithmic factor; (2) there are string families needing Ω(δ log(n/δ)) space to be encoded, so this space is optimal for every n and δ; (3) one can build run-length context-free grammars of size O(δ log(n/δ)), whereas the smallest (non-run-length) grammar can be up to Θ(log n / log log n) times larger; and (4) within O(δ log(n/δ)) space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size σ of the string, showing that Θ(δ log((n log σ)/(δ log n))) space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries.
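
The abstract does not spell out how δ is defined. In the paper it is the substring-complexity measure: the maximum over k of d_k/k, where d_k is the number of distinct length-k substrings of the string. The following is a minimal Python sketch under that definition. The function names `substring_complexity` and `delta` are hypothetical, and the code runs in quadratic time or worse purely for illustration; it is not the linear-time computation the abstract refers to.

```python
from fractions import Fraction


def substring_complexity(s: str) -> list[int]:
    """Return a list d where d[k - 1] is the number of distinct
    length-k substrings of s, for k = 1..len(s)."""
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def delta(s: str) -> Fraction:
    """Naive delta(s) = max over k of d_k / k.

    Illustrative brute force only (assumption: delta is defined via
    substring complexity); not the linear-time algorithm.
    """
    if not s:
        return Fraction(0)
    return max(
        Fraction(d_k, k)
        for k, d_k in enumerate(substring_complexity(s), start=1)
    )


if __name__ == "__main__":
    for text in ["aaaa", "abab", "abcd", "0001011100"]:
        print(text, delta(text))
```

On a highly repetitive string such as "aaaa" the value stays at 1, while a string with no repeated symbols such as "abcd" reaches its full length. Exact fractions are used so that comparisons between d_k/k values for different k are not affected by floating-point rounding.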

More information

Title according to WOS:	Toward a Definitive Compressibility Measure for Repetitive Sequences
Title according to SCOPUS:	Toward a Definitive Compressibility Measure for Repetitive Sequences
Journal title:	IEEE Transactions on Information Theory
Volume:	69
Issue:	4
Publisher:	Institute of Electrical and Electronics Engineers Inc.
Publication date:	2023
Start page:	2074
End page:	2092
Language:	English
DOI:	10.1109/TIT.2022.3224382
Notes:	ISI, SCOPUS
