What is the main entry point for defining data quality checks in Deequ?

The main entry point is the VerificationSuite, from which you can add Checks that define constraints on attributes of the data.
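As a rough illustration of that entry point, the sketch below runs one Check with a few constraints against a tiny in-memory DataFrame. The object name, column names, sample rows, and the local Spark session are assumptions made for the example; it expects Deequ and Spark on the classpath.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names and data are illustrative only.
object BasicCheckSketch {

  // Runs a single Check on a three-row DataFrame and returns the overall status.
  def run(spark: SparkSession): CheckStatus.Value = {
    import spark.implicits._

    val df = Seq((1, "Thingy A"), (2, "Thingy B"), (3, "Thingy C"))
      .toDF("id", "name")

    val result = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "basic checks")
          .hasSize(_ == 3)      // the dataset should contain exactly 3 rows
          .isComplete("id")     // no null values in id
          .isUnique("id")       // no duplicate ids
          .isComplete("name"))  // no null values in name
      .run()

    result.status
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (an assumption of this sketch).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deequ-basic-check-sketch")
      .getOrCreate()

    val status = run(spark)
    println(if (status == CheckStatus.Success) "All constraints passed" else s"Status: $status")
    spark.stop()
  }
}
```

The returned status summarizes all checks; per-constraint results are available on the verification result's `checkResults` if a failure needs to be investigated.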

Can Deequ persist metrics?

Yes, Deequ supports persistence and querying of computed metrics over time with a MetricsRepository.
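A minimal sketch of how that might look, assuming the in-memory repository that ships with Deequ (a file-system-backed repository also exists for durable storage). The object name, tag, column names, and sample data are hypothetical.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; names and data are illustrative only.
object MetricsRepositorySketch {

  // Runs a verification, stores its metrics under a key, then queries them back.
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq((1, "Thingy A"), (2, "Thingy B")).toDF("id", "name")

    // Metrics live only for the lifetime of this object in the in-memory variant.
    val repository = new InMemoryMetricsRepository()

    // A key is a timestamp plus free-form tags (the tag here is an assumption).
    val resultKey = ResultKey(System.currentTimeMillis(), Map("tag" -> "sketch-run"))

    VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "persisted checks")
          .hasSize(_ > 0)
          .isComplete("id"))
      .useRepository(repository)
      .saveOrAppendResult(resultKey)
      .run()

    // Query every stored metric whose timestamp is later than the epoch.
    repository.load()
      .after(0L)
      .getSuccessMetricsAsDataFrame(spark)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deequ-repository-sketch")
      .getOrCreate()

    run(spark).show(truncate = false)
    spark.stop()
  }
}
```

Storing each run under its own key is what allows metrics to be compared across runs later on.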

Deequ

Overview

Deequ is an open-source library developed by AWS Labs, designed to work on top of Apache Spark to facilitate data quality measurement in large datasets. It allows users to define data quality unit tests, which are essentially constraints or checks on data attributes. These tests are translated into Spark jobs to compute metrics on the data. The library enables early detection of data errors before feeding data into consuming systems or machine learning algorithms, reducing the risk of application crashes or incorrect outputs. It supports tabular data formats, including CSV files, database tables, logs, and flattened JSON files. Deequ integrates with Spark DataFrames, offering functionalities like completeness checks, uniqueness validation, range constraints, and custom assertions. The library also supports persistence and querying of computed metrics, data profiling, anomaly detection, and automatic constraint suggestion.
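To give a feel for the automatic constraint suggestion mentioned above, here is a hedged sketch that profiles a small DataFrame and prints the constraints Deequ proposes. The object name and sample data are assumptions; the suggestions produced depend on the data and on the rule set chosen.

```scala
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names and data are illustrative only.
object ConstraintSuggestionSketch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deequ-suggestion-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "Thingy A", "high"),
      (2, "Thingy B", "low"),
      (3, "Thingy C", "high")
    ).toDF("id", "name", "priority")

    // Profile the data and apply the default suggestion rules.
    val suggestionResult = ConstraintSuggestionRunner()
      .onData(df)
      .addConstraintRules(Rules.DEFAULT)
      .run()

    // Each suggestion carries a description and the Scala code for the constraint.
    suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
      suggestions.foreach { suggestion =>
        println(s"$column: ${suggestion.description}")
        println(s"  code: ${suggestion.codeForConstraint}")
      }
    }

    spark.stop()
  }
}
```

Suggested constraints are a starting point; they are usually reviewed by hand before being added to a Check.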

Common tasks

Data Quality Measurement, Data Validation, Data Profiling, Constraint Suggestion, Unit Testing for Data, Schema Validation, Data Anomaly Detection, Data Completeness Checks

FAQ

What is Deequ?

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

What are the dependencies for Deequ?

Deequ depends on Java 8 and Apache Spark 3.1 or a compatible version.

How do I add Deequ as a dependency to my project?

You can add Deequ as a dependency using Maven or SBT. Example for Spark 3.1.x, using Maven:

<dependency>
  <groupId>com.amazon.deequ</groupId>
  <artifactId>deequ</artifactId>
  <version>2.0.0-spark-3.1</version>
</dependency>
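For SBT, the same coordinates would typically go into build.sbt as shown in this sketch. The version string is the one given for Spark 3.1.x above; other Spark versions need a matching Deequ release.

```scala
// build.sbt fragment: single % because the artifact name carries no Scala-version suffix
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
```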

What types of data sources can Deequ work with?

Deequ works on tabular data, such as CSV files, database tables, logs, and flattened JSON files, that can be loaded into a Spark DataFrame.
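As an illustration of the loading step, the sketch below writes a small CSV file to a temporary location and reads it into a Spark DataFrame. The file contents, column names, and object name are made up for the example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; file contents and names are illustrative only.
object CsvLoadSketch {

  // Writes a two-row CSV to a temp file and loads it with a header and inferred types.
  def load(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("items", ".csv")
    Files.write(path, "id,name\n1,Thingy A\n2,Thingy B\n".getBytes(StandardCharsets.UTF_8))

    spark.read
      .option("header", "true")       // first line holds the column names
      .option("inferSchema", "true")  // let Spark guess column types
      .csv(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-load-sketch")
      .getOrCreate()

    val df = load(spark)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

The resulting DataFrame can then be handed to `VerificationSuite().onData(...)` like any other.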
