v2.0 is live: Native Snowflake Support

The Statistical Firewall for your Data Pipeline.

Treat datasets like code. Generate 5KB statistical signatures of massive datasets to monitor drift, prevent model decay, and automate data integrity—all without moving your data.

StatGit's code is strictly Source-Available. It is 100% Free and Open for individuals, academic researchers, and non-commercial projects under the StatGit Dual License.

❯ statgit check incoming_data.parquet --baseline v1
Analyzing 4.2B rows... Done (0.8s via Zero-ETL)
Metrics: PSI | Jensen-Shannon | Wasserstein/EMD | Mahalanobis

Feature          PSI      JS Div   Status
user_age         0.021    0.004    PASS
income_bracket   0.184 ↑  0.051 ↑  DRIFT DETECTED
email_address    -        -        PII LEAK
transaction_vol  0.080    0.012    PASS
✖ 2 checks failed. Action: quarantine triggered via GitHub webhook.
Security Action: --anonymize flag suggested to redact email_address sample.
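
For readers who want to see what the PSI and JS Div columns measure, here is a minimal sketch of both metrics over pre-binned counts. It is not StatGit's implementation: the function names, the epsilon floor, and the sample counts are assumptions made for illustration.

```python
import math


def _proportions(counts, eps=1e-6):
    """Convert raw bin counts to proportions, flooring at eps so logs stay finite."""
    total = sum(counts)
    floored = [max(c / total, eps) for c in counts]
    norm = sum(floored)
    return [p / norm for p in floored]


def psi(baseline_counts, current_counts):
    """Population Stability Index over bins shared by both snapshots."""
    p = _proportions(baseline_counts)
    q = _proportions(current_counts)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def js_divergence(baseline_counts, current_counts):
    """Jensen-Shannon divergence in base 2, so the result lies in [0, 1]."""
    p = _proportions(baseline_counts)
    q = _proportions(current_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical histogram counts for one feature in two snapshots.
baseline = [120, 340, 310, 160, 70]
current = [80, 260, 330, 220, 110]
print(f"PSI={psi(baseline, current):.3f}  JS={js_divergence(baseline, current):.3f}")
```

Both metrics are zero for identical distributions and grow as the histograms diverge. JS divergence is symmetric and bounded, which makes it easier to compare across features.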

Zero-ETL Integrations

AWS S3
GCS
Snowflake
BigQuery

Four Pillars of Data Integrity

StatGit replaces fragile DAGs and heavy ETL pipelines with a lightweight, cloud-native validation engine.

Zero-ETL Cloud Native

Snapshot data directly from S3, GCS, Snowflake, or BigQuery. Computation runs inside your cloud environment via push-down queries, and only the 5KB statistical signature is stored locally.

s3://bucket → .statgit/v1 (5KB)
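
To give a feel for why a signature can stay small no matter how many rows it summarizes, here is a rough sketch of a fixed-size column summary. The signature format StatGit actually uses is not described here, so `build_signature` and its fields are hypothetical. The sketch also runs locally, whereas the product pushes the computation down to the warehouse.

```python
import json
import random
import statistics


def build_signature(values, bins=20):
    """Summarize a numeric column as fixed-width histogram counts plus moments."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the max value into the last bin
        counts[idx] += 1
    return {
        "n": len(values),
        "min": lo,
        "max": hi,
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
        "hist": counts,
    }


random.seed(0)
column = [random.gauss(40, 12) for _ in range(100_000)]  # synthetic stand-in data
signature = build_signature(column)
payload = json.dumps(signature).encode("utf-8")
print(len(payload), "bytes for", signature["n"], "rows")
```

The payload size depends on the number of bins and summary fields, not on the row count, which is the property that keeps a signature in the kilobyte range.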

Privacy-First Diffing

Automated PII scanning blocks sensitive columns. Differential Privacy (DP) noise injection means you share metrics, not raw data.

user_email BLOCKED
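
As an illustration of DP noise injection, the sketch below applies the standard Laplace mechanism to histogram counts. It assumes a sensitivity of 1 (adding or removing one row changes one bin by at most 1). The function names and the epsilon value are illustrative, not StatGit settings.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def privatize_histogram(counts, epsilon, rng=None):
    """Add Laplace noise calibrated to sensitivity 1, then clamp at zero."""
    rng = rng or random.Random()
    scale = 1.0 / epsilon  # sensitivity / epsilon, with sensitivity assumed to be 1
    # Clamping is post-processing, so it does not weaken the privacy guarantee.
    return [max(0.0, c + laplace_noise(scale, rng)) for c in counts]


true_counts = [120, 340, 310, 160, 70]  # hypothetical bin counts
noisy_counts = privatize_histogram(true_counts, epsilon=0.5, rng=random.Random(42))
print([round(c, 1) for c in noisy_counts])
```

A smaller epsilon means more noise and stronger privacy. Drift metrics computed from the noisy counts can then be shared without exposing exact row-level information.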

Model-Aware Sensitivity

Prioritize alerts based on feature importance. Stop waking up for noise in unused columns; start acting on what matters.

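
One simple way to make alerting model-aware is to weight each feature's drift score by its importance to the model. The sketch below assumes that approach; the weighting scheme, the threshold, and the `legacy_flag` feature are hypothetical and do not describe StatGit's internal scoring.

```python
def prioritize_alerts(drift_scores, importances, threshold=0.02):
    """Rank features by drift weighted by normalized feature importance."""
    total = sum(importances.values())
    weighted = {
        name: score * importances.get(name, 0.0) / total
        for name, score in drift_scores.items()
    }
    alerts = [(name, w) for name, w in weighted.items() if w >= threshold]
    return sorted(alerts, key=lambda item: item[1], reverse=True)


# PSI values per feature; legacy_flag is a made-up column the model ignores.
drift = {"user_age": 0.021, "income_bracket": 0.184, "legacy_flag": 0.400}
# Hypothetical importances, e.g. exported from a trained model.
importance = {
    "user_age": 0.30,
    "income_bracket": 0.65,
    "legacy_flag": 0.0,
    "transaction_vol": 0.05,
}

for name, score in prioritize_alerts(drift, importance):
    print(f"ALERT {name}: weighted drift {score:.3f}")
```

Here `legacy_flag` has the largest raw drift but zero importance, so it never raises an alert, while `income_bracket` does.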

Automated Remediation

Connect StatGit to your CI/CD. Automatically quarantine bad data, block model deployments, and trigger Airflow retraining workflows via Webhooks or GitHub Actions.

Action: retrain-xgboost
Check Data Integrity
Drift threshold exceeded (>0.2)
Triggering model_retrain.yml...
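
A CI gate of this kind can be as small as a function that maps drift scores to an exit code. The sketch below is an assumption about how such a step might look, not StatGit's GitHub Action; the 0.2 threshold mirrors the sample log, and the scores are made up.

```python
DRIFT_THRESHOLD = 0.2  # mirrors the ">0.2" threshold in the sample action log


def gate(psi_by_feature, threshold=DRIFT_THRESHOLD):
    """Return (exit_code, offending_features) for use in a CI step."""
    offenders = sorted(f for f, v in psi_by_feature.items() if v > threshold)
    return (1 if offenders else 0), offenders


# Hypothetical per-feature PSI values produced by an earlier check step.
scores = {"user_age": 0.021, "income_bracket": 0.31}
code, offenders = gate(scores)
if code:
    print(f"Drift threshold exceeded (>{DRIFT_THRESHOLD}) for: {', '.join(offenders)}")
else:
    print("Data integrity check passed.")
# In a real CI job, pass `code` to sys.exit() so a non-zero status fails the step.
```

A non-zero exit status is what lets the CI system block a deployment or start a retraining workflow downstream.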

First-Class Python SDK

Built for the Modern ML Stack.

Integrate StatGit effortlessly into your existing orchestration tools like Airflow, Dagster, or custom CI/CD scripts.

Native Integration

Control your entire statistical firewall directly from Python. Generate synthetic data, validate pipelines, and query history.

Dynamic Triggers

Gate pipeline execution automatically on the granular drift telemetry returned by StatGitClient.

pipeline_dag.py
from statgit import StatGitClient


def trigger_retraining_dag():
    """Placeholder hook: swap in your orchestrator's trigger (e.g. an Airflow DAG run)."""
    print("Retraining requested.")


# Connects to your local or remote (S3/GCS) registry
client = StatGitClient()

# Generate privacy-safe synthetic data
df = client.synthesize("reqs.yaml", rows=5000)

# Validate incoming data pipeline
res = client.check("s3://incoming/today.csv", "reqs.yaml")
if not res["passed"]:
    print(f"Pipeline blocked. {len(res['issues'])} issues.")

# Compare two datasets and act on the structured drift report
drift_data = client.diff("baseline.csv", "current.csv")
if drift_data["features"]["income"]["status"] == "DRIFTED":
    trigger_retraining_dag()

statgit serve

Observability Dashboard

Instantly deploy an interactive Streamlit application. Visually browse the SignatureRegistry, compare any two historical snapshots side-by-side, and view rich marginal distribution overlays.

statgit trend
statgit diff

Statistical Version Control (SVC)

Track continuous feature decay across your entire registry history using statgit trend, or compare current data distributions against any historical checkpoint with statgit diff.
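
To make "continuous feature decay" concrete, one option is to fit a line through a feature's drift score across successive snapshots and look at the slope. The sketch below assumes that approach with plain least squares. The PSI history is invented, and nothing here reflects how statgit trend is implemented.

```python
def drift_slope(psi_history):
    """Least-squares slope of PSI against snapshot index (needs 2+ points)."""
    n = len(psi_history)
    mean_x = (n - 1) / 2
    mean_y = sum(psi_history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(psi_history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Hypothetical PSI of one feature measured against the baseline at each snapshot.
history = [0.01, 0.03, 0.04, 0.07, 0.09]
print(f"PSI grows by about {drift_slope(history):.3f} per snapshot")
```

A steadily positive slope points to gradual decay, which a single pairwise comparison between adjacent snapshots could miss.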

The 2-Minute Setup

No heavy infrastructure. Just a CLI designed for the modern ML stack.

[Interactive chart: Global Average vs. Regional Drift. Global Metric: Stable; Segment (EU): Drifted]

Expose Hidden Drift with Subpopulation Slicing.

Global averages lie. When you average features across a massive dataset, drifts in opposite directions can cancel each other out, an aggregation effect related to Simpson's Paradox.

StatGit automatically slices your data across categorical dimensions (like Region or Age Group) and runs statistical tests on every segment.

  • Prevent Silent Failures: Catch the 10% performance drop in Europe before it hits the bottom line, even if global metrics look perfectly stable.
  • Automated Cohort Discovery: You don't need to specify slices manually. StatGit finds the dimensions with the highest variance automatically.