BDS-Analytics: Towards a PySpark Library for a Preliminary Exploratory Big Data Analysis
Abstract
Data observability is the ability to monitor and understand data quality and lineage as a whole, so that data issues can be identified and addressed early. It is becoming increasingly important as organizations collect and store more data and as both the ingestion process and the data itself grow more complex. This paper presents the first iteration of BDS-Analytics, a PySpark library born from the empirical experience of data engineers and data scientists in projects carried out by BDS S.P.A, a Chilean Big Data consulting firm. We describe the initial scenario found across different Big Data initiatives, in which data quality and data observability criteria, as well as the difficulty of implementing them, were discussed. The proposed library includes PySpark functions that address common requirements of exploratory data analysis and basic data quality checks, and it can be extended with new features or tools for a deeper study of the data. In addition, this research presents a qualitative evaluation, based on surveys of professionals in the area, that assesses Effort Estimation, Usability, and Quality. The main contributions of this research are (1) the development of a PySpark library and its key capabilities, and (2) the evaluation of the library in a real industrial environment.
More information
Publisher: | SPRINGER SINGAPORE PTE LTD |
Publication date: | 2025 |
First page: | 369 |
Last page: | 379 |
Language: | English |
URL: | https://doi.org/10.1007/978-981-96-0235-3_30 |