Treat datasets like code. Generate 5KB statistical signatures of massive datasets to monitor drift, prevent model decay, and automate data integrity—all without moving your data.
StatGit's code is strictly Source-Available. It is 100% Free and Open for individuals, academic researchers, and non-commercial projects under the StatGit Dual License.
| Feature | PSI | JS Div | Status |
|---|---|---|---|
| user_age | 0.021 | 0.004 | PASS |
| income_bracket | 0.184 ↑ | 0.051 ↑ | DRIFT DETECTED |
| email_address | - | - | PII LEAK |
| transaction_vol | 0.080 | 0.012 | PASS |
--anonymize flag suggested to redact the email_address sample.
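The PSI and JS Div columns in the report are standard histogram comparisons. As a rough illustration of what such scores measure, here is a minimal sketch that assumes both datasets were already binned into the same buckets. The function names, the smoothing constant, and the bin counts are illustrative assumptions, not StatGit's internal implementation.

```python
import math

def _normalize(counts, eps=1e-6):
    """Turn raw bin counts into proportions, flooring at eps to avoid log(0)."""
    total = sum(counts)  # assumes at least one non-zero count
    probs = [max(c / total, eps) for c in counts]
    norm = sum(probs)
    return [p / norm for p in probs]

def psi(expected_counts, actual_counts):
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    e = _normalize(expected_counts)
    a = _normalize(actual_counts)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence in bits (log base 2), so it lies in [0, 1]."""
    p = _normalize(p_counts)
    q = _normalize(q_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up bin counts for one feature in a baseline and a current snapshot.
baseline = [120, 300, 380, 200]
current = [100, 250, 400, 250]
print(round(psi(baseline, current), 4), round(js_divergence(baseline, current), 4))
```

Both scores are zero for identical histograms and grow as the distributions move apart. The threshold at which a feature is flagged as drifted is a policy choice that this sketch does not cover.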
Zero-ETL Integrations
StatGit replaces fragile DAGs and heavy ETL pipelines with a lightweight, cloud-native validation engine.
Snapshot data directly from S3, GCS, or Snowflake. Computation happens natively in your cloud warehouse using push-down queries. Only the 5KB mathematical signature stays local.
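To make the push-down idea concrete, here is a small sketch in which an in-memory SQLite database stands in for the warehouse. The aggregation runs inside the database and only a handful of bin counts come back to the caller. The helper name `histogram_signature`, the table, and the fixed-width binning are assumptions for illustration; a real warehouse would use its own SQL dialect and StatGit's actual queries may look different.

```python
import sqlite3

def histogram_signature(conn, table, column, bin_width):
    """Ask the database for per-bin counts; only the small summary comes back."""
    # Table and column names are interpolated directly into the SQL,
    # so they must come from trusted configuration, never from user input.
    query = (
        f"SELECT CAST({column} / ? AS INTEGER) AS bin, COUNT(*) AS n "
        f"FROM {table} WHERE {column} IS NOT NULL "
        f"GROUP BY bin ORDER BY bin"
    )
    return dict(conn.execute(query, (bin_width,)).fetchall())

# Hypothetical table with a few rows, including one missing value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [(age,) for age in [21, 25, 34, 37, 38, 52, None]],
)

signature = histogram_signature(conn, "users", "user_age", 10)
print(signature)  # maps bin index to row count
```

The raw rows never leave the database. The caller only sees one entry per non-empty bin, which is what keeps the signature small regardless of table size.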
Automated PII scanning blocks sensitive columns. Differential Privacy (DP) noise injection means you share metrics, not raw data.
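One common way to inject differential-privacy noise is the Laplace mechanism, sketched below for histogram counts. This is an assumption about the general technique, not a description of StatGit's DP implementation. The function names and the epsilon value are illustrative, and the sensitivity of 1 assumes that neighboring datasets differ by adding or removing a single row.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_counts(counts, epsilon, rng=None):
    """Add Laplace noise with scale = sensitivity / epsilon, sensitivity = 1."""
    if rng is None:
        rng = random.Random()
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

# Made-up bin counts; a fixed seed keeps the example reproducible.
true_counts = [120, 300, 380, 200]
noisy = private_counts(true_counts, epsilon=1.0, rng=random.Random(0))
print([round(value, 2) for value in noisy])
```

A smaller epsilon means more noise and stronger privacy. A larger epsilon keeps the shared counts closer to the true values.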
Prioritize alerts based on feature importance. Stop waking up for noise in unused columns; start acting on what matters.
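One simple way to prioritize by feature importance is to weight each drift score by the feature's importance in the model and drop anything below a cutoff. The sketch below assumes that approach. The function name, the cutoff, the importance values, and the `legacy_flag` feature are hypothetical, and StatGit's own ranking may work differently.

```python
def prioritize_alerts(drift_scores, importances, min_priority=0.01):
    """Rank features by drift score times importance, dropping low-priority ones."""
    alerts = []
    for feature, score in drift_scores.items():
        # Features missing from the importance map count as unused (weight 0).
        priority = score * importances.get(feature, 0.0)
        if priority >= min_priority:
            alerts.append((feature, round(priority, 4)))
    return sorted(alerts, key=lambda item: item[1], reverse=True)

# Drift scores per feature; legacy_flag is a made-up column the model ignores.
drift_scores = {"income_bracket": 0.184, "user_age": 0.021, "legacy_flag": 0.400}
# Made-up importances, for example taken from a trained model.
importances = {"income_bracket": 0.45, "user_age": 0.30, "legacy_flag": 0.0}

alerts = prioritize_alerts(drift_scores, importances)
print(alerts)
```

Here `legacy_flag` has the largest raw drift but produces no alert because the model does not use it, while `income_bracket` rises to the top.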
Connect StatGit to your CI/CD. Automatically quarantine bad data, block model deployments, and trigger Airflow retraining workflows via Webhooks or GitHub Actions.
Integrate StatGit effortlessly into your existing orchestration tools like Airflow, Dagster, or custom CI/CD scripts.
StatGitClient.
from statgit import StatGitClient

# Connects to your local or remote (S3/GCS) registry
client = StatGitClient()

# Generate privacy-safe synthetic data
df = client.synthesize("reqs.yaml", rows=5000)

# Validate incoming data pipeline
res = client.check("s3://incoming/today.csv", "reqs.yaml")
if not res["passed"]:
    print(f"Pipeline blocked. {len(res['issues'])} issues.")

# Query structured drift history
drift_data = client.diff("baseline.csv", "current.csv")
if drift_data["features"]["income"]["status"] == "DRIFTED":
    trigger_retraining_dag()  # your own function that starts the retraining workflow
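A check result shaped like `res` above can be turned into a CI gate by mapping it to a process exit code. The sketch below assumes only the `passed` and `issues` keys shown in the example. The `gate` helper and the sample issue strings are hypothetical and not part of the StatGit API.

```python
def gate(result):
    """Return (exit_code, message) for a check result with 'passed' and 'issues'."""
    if result.get("passed", False):
        return 0, "Data check passed."
    issues = result.get("issues", [])
    return 1, f"Pipeline blocked. {len(issues)} issues."

# Stand-in for the dict returned by client.check(...); the issue strings are made up.
sample = {
    "passed": False,
    "issues": ["income_bracket drifted", "email_address looks like PII"],
}
code, message = gate(sample)
print(message)
# In a CI job, ending the script with sys.exit(code) makes a non-zero code fail the step.
```

Most CI systems, GitHub Actions included, treat a non-zero exit code as a failed step, which is enough to block a downstream deployment job.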
statgit serve
Instantly deploy an interactive Streamlit application. Visually browse the
SignatureRegistry, compare any two historical snapshots side-by-side, and view rich
marginal distribution overlays.
statgit trend
statgit diff
Track feature decay across your entire registry history with statgit trend, or use statgit diff to compare current data distributions against any historical checkpoint.
No heavy infrastructure. Just a CLI designed for the modern ML stack.
Global averages lie. When you aggregate a feature across a massive dataset, drifts that move in opposite directions in different segments can cancel each other out, an effect related to Simpson's Paradox.
StatGit automatically slices your data across categorical dimensions (like Region or Age Group) and
runs statistical tests on every segment.
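The cancellation effect is easy to reproduce. In the sketch below, a plain difference of means stands in for a real statistical test, and the `region` and `spend` columns, the helper names, and the data are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def segment_means(rows, segment_key, value_key):
    """Group rows by a categorical column and average the value per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_key]].append(row[value_key])
    return {segment: mean(values) for segment, values in groups.items()}

def mean_shift_by_segment(baseline, current, segment_key, value_key):
    """Per-segment change in the mean, for segments present in both snapshots."""
    base = segment_means(baseline, segment_key, value_key)
    curr = segment_means(current, segment_key, value_key)
    return {seg: curr[seg] - base[seg] for seg in base if seg in curr}

baseline = [
    {"region": "EU", "spend": 100}, {"region": "EU", "spend": 100},
    {"region": "US", "spend": 100}, {"region": "US", "spend": 100},
]
current = [
    {"region": "EU", "spend": 140}, {"region": "EU", "spend": 140},
    {"region": "US", "spend": 60}, {"region": "US", "spend": 60},
]

global_shift = mean(r["spend"] for r in current) - mean(r["spend"] for r in baseline)
per_segment = mean_shift_by_segment(baseline, current, "region", "spend")
print(global_shift, per_segment)
```

The global mean does not move at all, yet each region shifts by 40 in opposite directions. Only the per-segment view exposes the change.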