v2.0 is live: Native Snowflake Support

The Statistical Firewall for your Data Pipeline.

Treat datasets like code. Generate 5KB statistical signatures of massive datasets to monitor drift, prevent model decay, and automate data integrity—all without moving your data.


Open-sourced under the Apache 2.0 License.


❯ statgit check incoming_data.parquet --baseline v1

Analyzing 4.2B rows... Done (0.8s via Zero-ETL)

Feature	Drift (PSI)	Type	Status
user_age	0.02	float64	PASS
income_bracket	0.45 ↑	category	DRIFT DETECTED
email_address	-	string	PII LEAK
transaction_vol	0.08	float64	PASS

✖ 2 checks failed. Action: quarantine triggered via GitHub webhook.

Security Action: --anonymize flag suggested to redact email_address sample.
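
For readers unfamiliar with the Drift (PSI) column: the Population Stability Index compares the share of rows in each bucket between the baseline and the incoming data. The sketch below shows the calculation for a categorical feature. The function name, the sample data, and the epsilon clamp are illustrative assumptions, not StatGit's implementation; the 0.2 cutoff is the drift threshold quoted on this page.

```python
import math
from collections import Counter


def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two samples of a categorical feature."""
    categories = set(baseline) | set(current)
    b_counts, c_counts = Counter(baseline), Counter(current)
    total = 0.0
    for cat in categories:
        # Clamp proportions away from zero so the logarithm is always defined.
        p = max(b_counts[cat] / len(baseline), eps)
        q = max(c_counts[cat] / len(current), eps)
        total += (q - p) * math.log(q / p)
    return total


# Hypothetical income_bracket samples: the mix shifts from "low" toward "high".
baseline = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20
current = ["low"] * 20 + ["mid"] * 30 + ["high"] * 50

score = psi(baseline, current)
status = "DRIFT DETECTED" if score > 0.2 else "PASS"
print(f"income_bracket PSI={score:.2f} {status}")
```

Identical distributions give a PSI of 0; the further the bucket shares move apart, the larger the score.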

Zero-ETL Integrations

AWS S3

GCS

Snowflake

BigQuery

Four Pillars of Data Integrity

StatGit replaces fragile DAGs and heavy ETL pipelines with a lightweight, cloud-native validation engine.

Zero-ETL Cloud Native

Snapshot data directly from S3, GCS, Snowflake, or BigQuery. Computation happens natively in your cloud warehouse using push-down queries, so only the 5KB statistical signature is stored locally.

s3://bucket → .statgit/v1 (5KB)
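
To illustrate the push-down idea, the sketch below lets the database engine do the aggregation and returns only a handful of bin counts to the client. It is an assumption-laden stand-in: an in-memory SQLite database plays the role of the cloud warehouse, and the `events` table, the sample ages, and the 20-year bin width are all hypothetical. It does not show StatGit's actual queries or signature format.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_age REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [(age,) for age in (18, 22, 25, 31, 38, 44, 47, 59, 63, 71)],
)

# The GROUP BY runs inside the engine; only one row per bin leaves it.
rows = conn.execute(
    "SELECT CAST(user_age / 20 AS INTEGER) * 20 AS bin_start, COUNT(*) "
    "FROM events GROUP BY bin_start ORDER BY bin_start"
).fetchall()

# A tiny histogram "signature": bin start -> row count.
signature = dict(rows)
print(signature)
```

The raw rows never reach the client; the size of the result depends on the number of bins, not on the number of rows scanned.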

Privacy-First Diffing

Automated PII scanning blocks sensitive columns. Differential Privacy (DP) noise injection means you share metrics, not raw data.

user_email BLOCKED
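
One standard way to inject differential-privacy noise is the Laplace mechanism, sketched below for a category histogram. This page does not say which mechanism StatGit uses, so treat this as an illustration under that assumption; the function names and the epsilon value are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_histogram(values, epsilon, rng):
    """Return noisy category counts using the Laplace mechanism."""
    counts = Counter(values)
    # Adding or removing one record changes a single count by 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return {cat: n + laplace_noise(1 / epsilon, rng) for cat, n in counts.items()}


rng = random.Random(42)  # seeded only so the sketch is repeatable
noisy = private_histogram(["a", "a", "a", "b"], epsilon=1.0, rng=rng)
print(noisy)
```

Smaller epsilon values add more noise and give stronger privacy; larger values keep the shared metrics closer to the true counts.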

Model-Aware Sensitivity

Prioritize alerts based on feature importance. Stop waking up for noise in unused columns; start acting on what matters.

ALERT
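
A minimal sketch of importance-aware alerting, assuming per-feature drift scores and feature importances are already available as dictionaries. The function, the importance values, the `legacy_flag` column, and the `min_importance` cutoff are hypothetical; the other drift scores reuse the sample terminal output on this page.

```python
def prioritized_alerts(drift, importance, threshold=0.2, min_importance=0.05):
    """Keep drifted features the model actually relies on, most urgent first."""
    alerts = []
    for feature, score in drift.items():
        weight = importance.get(feature, 0.0)
        # Ignore drift in columns the model barely uses.
        if score > threshold and weight >= min_importance:
            alerts.append((feature, score, weight))
    # Rank by drift weighted by importance.
    return sorted(alerts, key=lambda a: a[1] * a[2], reverse=True)


drift = {"user_age": 0.02, "income_bracket": 0.45,
         "transaction_vol": 0.08, "legacy_flag": 0.60}
importance = {"user_age": 0.30, "income_bracket": 0.40,
              "transaction_vol": 0.30, "legacy_flag": 0.0}

alerts = prioritized_alerts(drift, importance)
print(alerts)
```

Here `legacy_flag` drifts the most but carries no importance, so it raises no alert, while `income_bracket` does.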

Automated Remediation

Connect StatGit to your CI/CD. Automatically quarantine bad data, block model deployments, and trigger Airflow retraining workflows via Webhooks or GitHub Actions.

Action: retrain-xgboost

Check Data Integrity

Drift threshold exceeded (>0.2)

Triggering model_retrain.yml...

The 2-Minute Setup

No heavy infrastructure. Just a CLI designed for the modern ML stack.

Chart: Global Average vs. Regional Drift (Global Metric: Stable; Segment (EU): Drifted)

Expose Hidden Drift with Subpopulation Slicing.

Global averages lie. When you average a feature across a massive dataset, opposing drifts in different segments can cancel each other out, a phenomenon related to Simpson's Paradox.

StatGit automatically slices your data across categorical dimensions (like Region or Age Group) and runs statistical tests on every segment.

Prevent Silent Failures: Catch the 10% performance drop in Europe before it hits the bottom line, even if Global metrics look perfectly stable.
Automated Cohort Discovery: You don't need to specify slices manually. StatGit finds the dimensions with the highest variance automatically.
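
The cancellation effect is easy to reproduce. In the sketch below, with made-up data and hypothetical helper names, the global mean is unchanged between baseline and current data while each region has moved by 10 points in opposite directions. It illustrates slicing by one known dimension only, not StatGit's cohort discovery.

```python
from collections import defaultdict
from statistics import mean


def mean_by_segment(rows, key, value):
    """Average `value` within each distinct `key` segment."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {segment: mean(vals) for segment, vals in groups.items()}


# Made-up scores: EU drops by 10 while US rises by 10.
baseline = [{"region": "EU", "score": 80}] * 100 + [{"region": "US", "score": 80}] * 100
current = [{"region": "EU", "score": 70}] * 100 + [{"region": "US", "score": 90}] * 100

global_shift = mean(r["score"] for r in current) - mean(r["score"] for r in baseline)

base_seg = mean_by_segment(baseline, "region", "score")
curr_seg = mean_by_segment(current, "region", "score")
segment_shift = {seg: curr_seg[seg] - base_seg[seg] for seg in base_seg}

print("global shift:", global_shift)    # 0: the global metric looks stable
print("segment shift:", segment_shift)  # EU: -10, US: 10
```

A check that only watches the global number would pass here; a per-segment check flags both regions.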
