v2.0 is live: Native Snowflake Support

The Statistical Firewall for your Data Pipeline.

Treat datasets like code. Generate 5KB statistical signatures of massive datasets to monitor drift, prevent model decay, and automate data integrity—all without moving your data.

Open-sourced under the Apache 2.0 License.

❯ statgit check incoming_data.parquet --baseline v1
Analyzing 4.2B rows... Done (0.8s via Zero-ETL)
Feature          Drift (PSI)  Type      Status
user_age         0.02         float64   PASS
income_bracket   0.45 ↑       category  DRIFT DETECTED
email_address    -            string    PII LEAK
transaction_vol  0.08         float64   PASS
✖ 2 checks failed. Action: quarantine triggered via GitHub webhook.
Security Action: --anonymize flag suggested to redact email_address sample.
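
The Drift (PSI) column above reports a Population Stability Index per feature. As a rough illustration of how such a score can be computed from two binned histograms, here is a minimal Python sketch. It is not StatGit's implementation: the function name `psi`, the `eps` floor, and the sample counts are assumptions made for this example. The 0.2 cutoff mirrors the drift threshold shown elsewhere on this page.

```python
import math

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        # Floor proportions at eps so empty bins do not produce log(0).
        p = max(b / b_total, eps)
        q = max(c / c_total, eps)
        score += (q - p) * math.log(q / p)
    return score

# Illustrative counts only (not real data).
baseline = [250, 250, 250, 250]
shifted = [100, 150, 300, 450]

DRIFT_THRESHOLD = 0.2  # matches the ">0.2" threshold shown on this page
print("unchanged:", psi(baseline, baseline) > DRIFT_THRESHOLD)
print("shifted:  ", psi(baseline, shifted) > DRIFT_THRESHOLD)
```

Identical histograms score exactly zero, and the score grows as mass moves between bins, which is why a single threshold can separate PASS from DRIFT DETECTED.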

Zero-ETL Integrations

AWS S3
GCS
Snowflake
BigQuery

Four Pillars of Data Integrity

StatGit replaces fragile DAGs and heavy ETL pipelines with a lightweight, cloud-native validation engine.

Zero-ETL Cloud Native

Snapshot data directly from S3, GCS, or Snowflake. Computation happens natively in your cloud warehouse using push-down queries. Only the 5KB mathematical signature stays local.

s3://bucket → .statgit/v1 (5KB)
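
To make the push-down idea concrete, the sketch below runs the aggregation inside the database and brings back only one row per histogram bin. An in-memory SQLite database stands in for the cloud warehouse here. The table name `users`, the helper `histogram_signature`, and the signature layout are hypothetical and are not StatGit's actual format.

```python
import json
import sqlite3

def histogram_signature(conn, table, column, bin_width):
    """Compute a binned histogram inside the database and return it as a small dict."""
    # The GROUP BY runs in the database engine; only one row per bin comes back.
    # Identifiers are interpolated directly, so they must come from trusted config.
    query = (
        f"SELECT CAST({column} / {bin_width} AS INTEGER) AS bin, COUNT(*) "
        f"FROM {table} GROUP BY bin ORDER BY bin"
    )
    bins = {int(b): n for b, n in conn.execute(query)}
    return {"column": column, "bin_width": bin_width, "bins": bins}

# SQLite stands in for the warehouse; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [(age,) for age in (23, 27, 31, 35, 38, 42, 67)],
)

signature = histogram_signature(conn, "users", "user_age", 10)
print(json.dumps(signature))
```

The size of the result depends on the number of bins, not the number of rows, which is what lets a signature stay small however large the source table is.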

Privacy-First Diffing

Automated PII scanning blocks sensitive columns. Differential Privacy (DP) noise injection means you share metrics, not raw data.

user_email BLOCKED
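
The two ideas in this section can be sketched in a few lines of Python: a pattern check that flags columns whose sampled values look like email addresses, and Laplace noise added to histogram counts before they are shared. This is an illustration under stated assumptions, not StatGit's scanner or DP mechanism. The regex, the 0.5 hit-rate threshold, and the function names are hypothetical, and the noise scale of 1/epsilon assumes each row contributes to exactly one bin.

```python
import random
import re

# Deliberately simple email pattern for illustration; real PII scanners use more rules.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_pii(sample_values, threshold=0.5):
    """Flag a column when at least `threshold` of sampled values look like emails."""
    if not sample_values:
        return False
    hits = sum(1 for v in sample_values if EMAIL_RE.match(str(v)))
    return hits / len(sample_values) >= threshold

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def privatize_counts(counts, epsilon, rng=None):
    """Add Laplace noise with scale 1/epsilon to each histogram count."""
    if rng is None:
        rng = random.Random()
    return [c + laplace_noise(1 / epsilon, rng) for c in counts]

print(looks_like_pii(["ana@example.com", "bo@example.org"]))  # email-like column
print(looks_like_pii([42, 17, 63]))                           # numeric column
print(privatize_counts([120, 340, 95], epsilon=1.0, rng=random.Random(7)))
```

A smaller epsilon means a larger noise scale, so the shared counts reveal less about any single row at the cost of less precise drift metrics.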

Model-Aware Sensitivity

Prioritize alerts based on feature importance. Stop waking up for noise in unused columns; start acting on what matters.
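
One simple way to picture importance-weighted alerting is to scale each feature's drift score by its importance in the downstream model and to suppress alerts for features the model barely uses. The sketch below assumes that approach. The function name, the `min_importance` cutoff, the `legacy_flag` feature, and the importance values are all hypothetical.

```python
def prioritize_alerts(drift_scores, importances, threshold=0.2, min_importance=0.05):
    """Return drifted features, ordered by drift weighted by feature importance."""
    alerts = []
    for feature, score in drift_scores.items():
        # Features missing from the importance map are treated as unused.
        weight = importances.get(feature, 0.0)
        if score > threshold and weight >= min_importance:
            alerts.append((feature, score * weight))
    ranked = sorted(alerts, key=lambda item: item[1], reverse=True)
    return [feature for feature, _ in ranked]

# Illustrative values: legacy_flag drifts heavily but the model ignores it.
drift = {"income_bracket": 0.45, "legacy_flag": 0.60, "user_age": 0.02}
importance = {"income_bracket": 0.40, "legacy_flag": 0.0, "user_age": 0.35}

print(prioritize_alerts(drift, importance))
```

Here only `income_bracket` raises an alert: `legacy_flag` has the largest raw drift but zero importance, and `user_age` is important but stable.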


Automated Remediation

Connect StatGit to your CI/CD. Automatically quarantine bad data, block model deployments, and trigger Airflow retraining workflows via Webhooks or GitHub Actions.

Action: retrain-xgboost
Check Data Integrity
Drift threshold exceeded (>0.2)
Triggering model_retrain.yml...
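
As a sketch of what a webhook-style trigger might look like, the snippet below builds a request for GitHub's `repository_dispatch` REST endpoint, which a workflow such as `model_retrain.yml` could listen for. How StatGit actually triggers workflows is not documented here, so this is an assumption. The event name `drift-detected`, the payload fields, and the owner and repository names are hypothetical. The request is built but never sent.

```python
import json
import urllib.request

def build_dispatch_request(owner, repo, token, failed_features, threshold=0.2):
    """Build (but do not send) a repository_dispatch request for GitHub's REST API."""
    payload = {
        "event_type": "drift-detected",  # hypothetical event name
        "client_payload": {
            "failed_features": failed_features,
            "threshold": threshold,
        },
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# Placeholder owner, repo, and token for illustration.
request = build_dispatch_request(
    "example-org", "ml-pipeline", "TOKEN_PLACEHOLDER", ["income_bracket"]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` with a valid token would fire the event. A workflow that declares `on: repository_dispatch` could then read the failed features from the event payload.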

The 2-Minute Setup

No heavy infrastructure. Just a CLI designed for the modern ML stack.

[Interactive chart: Global Average vs. Regional Drift. Global metric: stable. Segment (EU): drifted.]

Expose Hidden Drift with Subpopulation Slicing.

Global averages lie. When you average a feature across a massive dataset, opposing drifts in different segments can cancel each other out, an effect related to Simpson's Paradox.

StatGit automatically slices your data across categorical dimensions (like Region or Age Group) and runs statistical tests on every segment.

  • Prevent Silent Failures: Catch the 10% performance drop in Europe before it hits the bottom line, even if Global metrics look perfectly stable.
  • Automated Cohort Discovery: You don't need to specify slices manually. StatGit finds the dimensions with the highest variance automatically.
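
The cancellation effect is easy to reproduce. In the Python sketch below, the global mean of a feature is identical in the baseline and current snapshots, yet one region has moved sharply. A per-segment mean difference with a fixed tolerance stands in for the statistical tests mentioned above. The field names (`region`, `spend`), the tolerance, and the data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def segment_means(rows, segment_key, value_key):
    """Group rows by a categorical column and average the value in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_key]].append(row[value_key])
    return {segment: mean(values) for segment, values in groups.items()}

def drifted_segments(baseline, current, segment_key, value_key, tolerance):
    """Return segments whose mean moved by more than `tolerance`."""
    base = segment_means(baseline, segment_key, value_key)
    cur = segment_means(current, segment_key, value_key)
    return sorted(
        seg for seg in base
        if seg in cur and abs(cur[seg] - base[seg]) > tolerance
    )

def make_rows(values_by_region):
    # Two rows per region keeps the segments equally weighted.
    return [
        {"region": region, "spend": value}
        for region, value in values_by_region.items()
        for _ in range(2)
    ]

baseline_rows = make_rows({"EU": 100, "US": 100, "APAC": 100})
current_rows = make_rows({"EU": 70, "US": 115, "APAC": 115})

global_before = mean(r["spend"] for r in baseline_rows)
global_after = mean(r["spend"] for r in current_rows)
print("global mean unchanged:", global_before == global_after)
print("drifted segments:", drifted_segments(
    baseline_rows, current_rows, "region", "spend", tolerance=20
))
```

The EU drop is offset by smaller rises elsewhere, so a global check passes while the per-segment check flags EU.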