A curated list of awesome tools for testing and monitoring data quality - typically at the data warehouse/lake or within running data pipelines.
If you want to contribute to this list (please do), send me a pull request or contact me.
Open source tools
- elementary - Data monitoring and observability tailored to dbt.
- mobydq - Tool for data engineering teams to run and automate data quality checks on their data pipelines.
- ydata-quality - Python library for assessing data quality throughout the stages of data pipeline development.
- great-expectations - Tool for data testing, documentation, and profiling (see the sketch after this list).
- deequ - Library by Amazon for defining unit tests for data, with a focus on large datasets. Based on Apache Spark (see the PySpark sketch after this list).
- soda - Enables data testing through extended SQL queries.
- dqm - Data quality monitoring tool implemented with Spark.
- owl-sanitizer - Lightweight Spark-based data validation framework.
- griffin - Data quality solution for distributed data systems at any scale, in both streaming and batch data contexts.
- drunken-data-quality
- DataQuality for BigData
- TopNotch
- Phasor Data Quality Tracker
- DataCleaner
- data-quality
- deepchecks - Tool for validating machine learning models and data, with test suites tailored to ML datasets, models, and their outputs.
- evidently - Analyze and track data and ML model output quality.
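
To give a sense of what declarative data testing looks like, here is a minimal sketch using great-expectations with its classic Pandas-backed API (pre-1.0 releases; newer versions expose a different, context-based API). The column names and thresholds are made up for illustration.

```python
# Minimal sketch of the classic Pandas-backed Great Expectations API (pre-1.0).
import pandas as pd
import great_expectations as ge

df = pd.DataFrame({"user_id": [1, 2, 3, None], "age": [25, 40, 130, 31]})

# Wrap the DataFrame so expectation methods become available on it.
gdf = ge.from_pandas(df)

# Each expectation returns a result object with a boolean `success` flag.
print(gdf.expect_column_values_to_not_be_null("user_id").success)    # False (one NULL)
print(gdf.expect_column_values_to_be_between("age", 0, 120).success)  # False (130 is out of range)

# Aggregate all expectations run so far into a single validation report.
print(gdf.validate().success)
```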
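And a minimal PySpark sketch of a deequ check via its pydeequ Python wrapper. It assumes the matching deequ jar can be resolved when the Spark session starts (recent pydeequ releases also read a SPARK_VERSION environment variable to pick the right artifact); the example columns are hypothetical.

```python
# Minimal sketch of a deequ verification run through the pydeequ wrapper.
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0), (3, None, -5.0)],
    ["id", "name", "amount"])

check = (Check(spark, CheckLevel.Error, "basic integrity checks")
         .isComplete("name")        # no NULLs allowed
         .isUnique("id")            # primary-key style uniqueness
         .isNonNegative("amount"))  # no negative amounts

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# One row per constraint, with its status and a message if it failed.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```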
Commercial tools
Offerings range from data testing to pipeline testing, with a focus on real-time monitoring, automated test creation and threshold setting, and additional enterprise features.
- Bigeye
- Soda
- Databand
- Monte Carlo
- Great Expectations
- Sifflet
- Validio
- Lightup
- Lantern
- Metaplane
- Datafold
- Acceldata
- Anomalo
- Marquez
TODOs
- Add tools for unstructured data (Arthur, Robust)