REmatch: a novel regex engine for finding all matches

Riveros, Cristian; Jan, Nicolas Van Sint; Vrgoc, Domagoj

Abstract

In this paper, we present the RE match system for information extraction. REmatch is based on a recently proposed enumeration algorithm for evaluating regular expressions with capture variables supporting the all-match semantics. It tells a story of what it takes to make a theoretically optimal algorithm work in practice. As we show here, a naive implementation of the original algorithm would have a hard time dealing with realistic workloads. We thus develop a new algorithm and a series of optimizations that make REmatch as fast or faster than many popular RegEx engines while at the same time being able to return all the outputs: a task that most other engines tend to struggle with.

Más información

Título según WOS: ID WOS:001059181900010 Not found in local WOS DB
Título de la Revista: PROCEEDINGS OF THE VLDB ENDOWMENT
Volumen: 16
Editorial: ASSOC COMPUTING MACHINERY
Fecha de publicación: 2023
Página de inicio: 2792
Página final: 2804
DOI:

10.14778/3611479.3611488

Notas: ISI