
Deequ

A library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Deequ is an open-source library developed by AWS Labs that runs on top of Apache Spark to measure data quality in large datasets. Users define data quality unit tests as constraints on data attributes, and Deequ translates them into Spark jobs that compute the metrics needed to evaluate each constraint. Catching data errors before the data reaches consuming systems or machine learning algorithms reduces the risk of application crashes or incorrect outputs. Deequ works on tabular data that fits into a Spark DataFrame, such as CSV files, database tables, logs, and flattened JSON files, and offers completeness checks, uniqueness validation, range constraints, and custom assertions. It also supports persisting and querying computed metrics, data profiling, anomaly detection, and automatic constraint suggestion.
Allows persistence and querying of computed metrics over time, enabling trend analysis and historical comparisons.
Offers automated data profiling to understand the characteristics of large datasets, including statistics, distributions, and patterns.
Applies statistical techniques to detect anomalies in data quality metrics over time, alerting users to potential data degradation.
Suggests constraints for large datasets based on data profiling results, reducing the effort required to define data quality rules.
Supports incremental computation of metrics on new or updated data, minimizing the processing time for large datasets.
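The incremental computation and anomaly detection features described above can be illustrated without Spark. The following is a minimal plain-Python sketch of the idea, not Deequ's API: the names CompletenessState, completeness_state, and is_anomalous, the sample rows, and the 10% drop threshold are all assumptions made for illustration. It keeps a small mergeable state per column so that new batches can be folded in without rescanning old data, then compares the latest metric with the previously stored one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletenessState:
    """Mergeable summary from which a completeness metric can be derived."""
    non_null: int
    total: int

    def merge(self, other):
        # States combine by addition, so earlier batches are never rescanned.
        return CompletenessState(self.non_null + other.non_null,
                                 self.total + other.total)

    def metric(self):
        # Treating an empty dataset as fully complete is an assumption of this sketch.
        return self.non_null / self.total if self.total else 1.0


def completeness_state(rows, column):
    """Scan one batch of rows (dicts) and summarise a single column."""
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return CompletenessState(non_null, len(rows))


def is_anomalous(history, new_value, max_relative_drop=0.1):
    """Flag new_value if it fell by more than max_relative_drop versus the last stored value."""
    if not history or history[-1] == 0:
        return False
    previous = history[-1]
    return (previous - new_value) / previous > max_relative_drop


# Hypothetical batches arriving on two consecutive days.
day1 = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
day2 = [{"id": 3, "email": None}, {"id": 4, "email": None}]

state = completeness_state(day1, "email")
history = [state.metric()]  # stored metric history: [1.0]

# Fold in only the new batch; day1 is not read again.
state = state.merge(completeness_state(day2, "email"))
latest = state.metric()

print(latest, is_anomalous(history, latest))  # prints: 0.5 True
history.append(latest)
```

The design choice worth noting is that the state (two counters) is stored rather than only the final ratio, since a ratio alone cannot be merged correctly with a new batch.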
Install Java 8 or later.
Ensure Apache Spark 3.1 or a compatible version is installed.
Add Deequ as a dependency to your project using Maven or SBT (group ID com.amazon.deequ, artifact ID deequ).
Import necessary Deequ classes in your Scala or Java code.
Create a Spark DataFrame from your data source.
Define data quality constraints as Check objects and add them to a VerificationSuite.
Run the VerificationSuite to execute the checks and compute metrics.
Inspect the VerificationResult to identify data quality issues.
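The last four steps (load data, define checks, run them, inspect the result) follow a pattern that can be sketched in plain Python. This is an illustration of the workflow only and does not use Deequ or Spark: the sample CSV, the metric functions, the CHECKS list, and run_checks are hypothetical stand-ins for a Spark DataFrame, Deequ checks, the VerificationSuite run, and the VerificationResult respectively. The uniqueness definition used here (share of rows whose value occurs exactly once) is an assumption of the sketch.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data with deliberate quality problems.
RAW = """id,productName,priority,numViews
1,Thingy A,high,0
2,Thingy B,,0
3,,low,5
4,Thingy D,low,10
4,Thingy E,high,12
"""


def load_rows(text):
    """Parse CSV text into dicts, turning empty strings into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in reader]


def completeness(rows, column):
    """Fraction of rows whose value in the column is not None."""
    return sum(row[column] is not None for row in rows) / len(rows)


def uniqueness(rows, column):
    """Fraction of rows whose value in the column occurs exactly once."""
    counts = Counter(row[column] for row in rows)
    return sum(1 for row in rows if counts[row[column]] == 1) / len(rows)


def minimum(rows, column):
    """Smallest numeric value in the column, ignoring missing values."""
    return min(float(row[column]) for row in rows if row[column] is not None)


# Each check: (description, metric function, column, assertion on the metric).
CHECKS = [
    ("id is complete", completeness, "id", lambda m: m == 1.0),
    ("id is unique", uniqueness, "id", lambda m: m == 1.0),
    ("productName is complete", completeness, "productName", lambda m: m == 1.0),
    ("numViews is non-negative", minimum, "numViews", lambda m: m >= 0),
]


def run_checks(rows, checks):
    """Compute each metric, apply its assertion, and collect the outcomes."""
    results = []
    for description, metric_fn, column, assertion in checks:
        value = metric_fn(rows, column)
        results.append({"check": description, "metric": value, "passed": assertion(value)})
    return results


rows = load_rows(RAW)
results = run_checks(rows, CHECKS)
for r in results:
    status = "ok" if r["passed"] else "FAILED"
    print(f'{status:6} {r["check"]} (metric={r["metric"]:.2f})')
```

With this sample data, the uniqueness check on id and the completeness check on productName fail, while the other two pass. Each constraint is kept separate from the metric it asserts on, so the same metric can back several constraints.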
