A framework for annotating CSV-like data

Arenas, Marcelo; Maturana, Francisco; Riveros, Cristian; Vrgoc, Domagoj

Abstract

In this paper, we propose a simple and expressive framework for adding metadata to CSV documents and their noisy variants. The framework is based on annotating parts of the document that can later be used to read, query, or exchange the data. The core of our framework is a language based on extended regular expressions that are used for selecting data. These expressions are then combined using a set of rules in order to annotate the data. We study the computational complexity of implementing our framework and present an efficient evaluation algorithm that runs in time proportional to its output and linear in its input. As a proof of concept, we test an implementation of our framework against a large number of real-world datasets and show that it can be efficiently used in practice.

More information

Journal title: PROCEEDINGS OF THE VLDB ENDOWMENT
Volume: 9
Issue: 11
Publisher: ASSOC COMPUTING MACHINERY
Publication date: 2016
Start page: 876
End page: 887
DOI: 10.14778/2983200.2983204

Notes: WOS-ESCI