Soda Programmatic Checks

Introduction

Key Benefits of Soda’s Approach

🚀 Automatic Enforcement: Apply standard quality checks across your entire data stack without manual intervention.
🔄 Effortless Schema Evolution: Handle changes in schema seamlessly.
✅ Empty Dataset Verification: Ensure datasets are not empty with automated checks.
📊 Anomaly Detection: Identify irregularities in row counts to maintain data accuracy.
🔍 Missing Data Identification: Pinpoint and address issues with missing data effectively.
⚙️ Scalability: Automatically generate SodaCL for every dataset to scale effortlessly.
📚 Programmatic Integration: Leverage the Soda Library for large-scale operations across your organization.

Results

Soda’s approach has enabled customers to rapidly and successfully adopt data quality checks across their organizations.

How Does It Work?

1. Automated Dataset Discovery

Automatically detect and discover tables in the configured schema of your data sources, including:
- PostgreSQL
- Snowflake
- Databricks
- BigQuery
- Redshift
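
As a rough illustration of the discovery step, the sketch below lists the tables in one schema. It is not taken from the repository: `list_tables` and `DISCOVERY_SQL` are hypothetical names, the query assumes a PostgreSQL-style `information_schema`, and the `%s` placeholder assumes a DB-API driver such as psycopg2. Other data sources may expose table metadata differently.

```python
from typing import Any, List

# Assumption: the source exposes the standard information_schema views,
# as PostgreSQL does. The %s placeholder follows psycopg2's paramstyle.
DISCOVERY_SQL = (
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = %s AND table_type = 'BASE TABLE' "
    "ORDER BY table_name"
)


def list_tables(cursor: Any, schema: str) -> List[str]:
    """Return the names of all base tables in `schema` (hypothetical helper)."""
    cursor.execute(DISCOVERY_SQL, (schema,))
    return [row[0] for row in cursor.fetchall()]
```

Passing the schema as a query parameter rather than formatting it into the SQL string keeps the helper safe against malformed schema names.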

2. Auto-Generate SodaCL

Apply basic quality check coverage across all tables and columns:
- Schema evolution tracking.
- Verify row count > 0.
- Detect anomalies in row counts.
- Null checks for each column.
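
The four baseline checks listed above can be sketched as a small generator that emits SodaCL text for one table. This is a minimal sketch, not the repository's implementation: `generate_sodacl` is a hypothetical name, and the exact anomaly-check syntax (`anomaly detection for row_count`) is an assumption that may differ between Soda versions.

```python
from typing import Iterable


def generate_sodacl(table: str, columns: Iterable[str]) -> str:
    """Build SodaCL text with the four baseline checks for one table."""
    lines = [
        f"checks for {table}:",
        # Schema evolution tracking: warn on any schema change.
        "  - schema:",
        "      warn:",
        "        when schema changes: any",
        # Empty dataset verification.
        "  - row_count > 0",
        # Row count anomaly check (syntax assumed; verify for your version).
        "  - anomaly detection for row_count",
    ]
    # One null check per column.
    lines.extend(f"  - missing_count({column}) = 0" for column in columns)
    return "\n".join(lines) + "\n"


print(generate_sodacl("orders", ["id", "customer_id"]))
```

Generating the text per table keeps the output easy to inspect or write to a checks file before any scan runs.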

3. Automatically Run Soda Scans

Run scans automatically on your data sources.
Push the results to Soda Cloud for easy monitoring and insights.
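
A scan can be triggered from Python through the Soda Library `Scan` class. The sketch below shows one way to wire it up, assuming Soda Library is installed and the configuration file contains a `soda_cloud` block so that results are sent to Soda Cloud. `run_scan`, its parameters, and the scan definition naming scheme are hypothetical and not taken from the repository.

```python
from typing import Any, Optional


def run_scan(data_source: str, config_path: str, sodacl: str,
             scan_cls: Optional[Any] = None) -> int:
    """Run one scan and return its exit code (hypothetical helper)."""
    if scan_cls is None:
        # Imported lazily so this module loads without Soda installed.
        from soda.scan import Scan as scan_cls

    scan = scan_cls()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file(file_path=config_path)
    scan.add_sodacl_yaml_str(sodacl)
    # Assumption: one scan definition per data source; the name is arbitrary.
    scan.set_scan_definition_name(f"programmatic_checks_{data_source}")
    exit_code = scan.execute()
    print(scan.get_logs_text())
    return exit_code
```

The optional `scan_cls` argument exists only so the helper can be exercised with a stand-in class in tests; in normal use it is left unset.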

How to Deploy

Pull the GitHub Repository
Clone the Soda Programmatic Checks repo from GitHub (sodadata/soda-programmatic-checks).

Install Python Requirements
Run the following command to install dependencies:
```
pip install -r /path/to/requirements.txt
```

Provide Data Source Connections
Add connections to your data sources using the Soda data source YAML format. Here is an example configuration:

```
# Please find all supported data sources on the Soda Docs: https://docs.soda.io/soda/connect-athena.html

data_source XXXX:
  type: postgres
  connection:
    host: XXXX
    port: XXXX
    username: XXXX
    password: XXXX
    database: XXXX
  schema: XXXX

soda_cloud:
  host: cloud.soda.io # or cloud.us.soda.io
  api_key_id: XXXX
  api_key_secret: XXXX
```

Add Soda Cloud API Keys
Include your Soda Cloud API keys in every data source configuration:

```
soda_cloud:
  host: cloud.soda.io # or cloud.us.soda.io
  api_key_id: XXXX
  api_key_secret: XXXX
```
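
Because the same `soda_cloud` block must appear in every data source configuration, it may be convenient to append it programmatically. The sketch below is one possible approach, not part of the repository: `with_soda_cloud` is a hypothetical helper, and the environment variable names `SODA_API_KEY_ID` and `SODA_API_KEY_SECRET` are assumptions chosen for illustration.

```python
import os


def with_soda_cloud(config_text: str, host: str = "cloud.soda.io") -> str:
    """Append a soda_cloud block to a data source config if it lacks one."""
    has_block = any(
        line.strip().startswith("soda_cloud:")
        for line in config_text.splitlines()
    )
    if has_block:
        return config_text
    # Assumption: keys live in these hypothetical environment variables.
    key_id = os.environ["SODA_API_KEY_ID"]
    key_secret = os.environ["SODA_API_KEY_SECRET"]
    block = (
        "soda_cloud:\n"
        f"  host: {host}\n"
        f"  api_key_id: {key_id}\n"
        f"  api_key_secret: {key_secret}\n"
    )
    return config_text.rstrip("\n") + "\n\n" + block
```

Reading the keys from the environment keeps secrets out of the configuration files kept in version control.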

Run the Scripts
Execute the main script to start the programmatic checks:
```
python main.py
```

Schedule the Script
Use a cron job (Linux/macOS) or Task Scheduler (Windows) to run the script automatically at the desired frequency. Example cron entry that runs the script daily at midnight:
```
0 0 * * * python /path/to/main.py
```

Supported Data Sources

🟦 Databricks SQL
❄️ Snowflake
🐘 PostgreSQL
📊 BigQuery
🏢 SQL Server

Need Help?

If you encounter any issues, please:

Log a ticket: support.soda.io
Contact us: [email protected]
