Overview
Deequ is an open-source library developed by AWS Labs that runs on top of Apache Spark to measure data quality in large datasets. Users define data quality unit tests, which are constraints or checks on data attributes, and Deequ translates them into Spark jobs that compute metrics on the data and compare those metrics against the declared constraints. This catches data errors before the data reaches consuming systems or machine learning algorithms, reducing the risk of application crashes or incorrect outputs. Deequ works on tabular data, including CSV files, database tables, logs, and flattened JSON files, and operates directly on Spark DataFrames. Available checks include completeness, uniqueness, range constraints, and custom assertions. The library also supports persisting and querying computed metrics, data profiling, anomaly detection, and automatic constraint suggestion.
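
To make the idea of a data quality unit test concrete, here is a minimal sketch in plain Python. It does not use Deequ or Spark, and none of the names in it (completeness, uniqueness, compliance, run_checks) are Deequ's API; they are hypothetical helpers, and the exact metric definitions are assumptions chosen for illustration. The sketch only shows the general pattern described above: compute a metric over a column, then assert a constraint on that metric.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows whose value in `column` is not None."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)


def uniqueness(rows: List[Row], column: str) -> float:
    """Fraction of non-null values in `column` that occur exactly once."""
    values = [r.get(column) for r in rows if r.get(column) is not None]
    if not values:
        return 1.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def compliance(rows: List[Row], predicate: Callable[[Row], bool]) -> float:
    """Fraction of rows for which `predicate` holds."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if predicate(r)) / len(rows)


# A check is (name, metric function, assertion on the metric value).
Check = Tuple[str, Callable[[List[Row]], float], Callable[[float], bool]]


def run_checks(rows: List[Row], checks: List[Check]) -> Dict[str, Tuple[float, bool]]:
    """Compute each metric once and record whether its constraint passed."""
    results = {}
    for name, metric, assertion in checks:
        value = metric(rows)
        results[name] = (value, assertion(value))
    return results


# Small made-up table: one missing name, one duplicate id, one negative count.
rows: List[Row] = [
    {"id": 1, "name": "a", "views": 10},
    {"id": 2, "name": None, "views": 0},
    {"id": 3, "name": "c", "views": -5},
    {"id": 3, "name": "d", "views": 7},
]

checks: List[Check] = [
    ("id is complete", lambda rs: completeness(rs, "id"), lambda v: v == 1.0),
    ("id is unique", lambda rs: uniqueness(rs, "id"), lambda v: v == 1.0),
    ("name is mostly complete", lambda rs: completeness(rs, "name"), lambda v: v >= 0.7),
    ("views are non-negative", lambda rs: compliance(rs, lambda r: r["views"] >= 0), lambda v: v == 1.0),
]

results = run_checks(rows, checks)
for name, (value, passed) in results.items():
    print(f"{name}: metric={value:.2f} passed={passed}")
```

In this sketch the duplicate id and the negative view count each fail their constraint, while the single missing name still passes because its constraint tolerates some missing values. In Deequ itself, the corresponding metrics would be computed by Spark jobs over a DataFrame rather than by Python loops over a list.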
