Dev Data Seeding
Create realistic but safe development datasets using sample and redact.
The Problem
Production data is realistic but contains sensitive PII. Random test data is safe but lacks realistic relationships and edge cases.
The Solution
- Sample production data to reduce size
- Redact PII to anonymize
- Validate the result
Step-by-Step
1. Analyze Production Dump

       sql-splitter analyze prod.sql.gz --progress

   Review table sizes to plan sampling.
2. Sample with FK Preservation

       sql-splitter sample prod.sql.gz \
         -o sampled.sql \
         --percent 10 \
         --preserve-relations \
         --seed 42 \
         --progress

   Key options:
   - `--preserve-relations`: Ensures FK integrity
   - `--seed 42`: Reproducible results
3. Generate Redaction Config

       sql-splitter redact sampled.sql \
         --generate-config \
         -o redact.yaml

   Review and customize redact.yaml:

       seed: 12345
       locale: en
       rules:
         - column: "*.email"
           strategy: hash
         - column: "*.name"
           strategy: fake
           generator: name
         - column: "*.ssn"
           strategy: null
         - column: "*.phone"
           strategy: fake
           generator: phone
         - column: "*.credit_card"
           strategy: mask
           pattern: "****-****-****-XXXX"
       skip_tables:
         - schema_migrations
         - ar_internal_metadata
4. Apply Redaction

       sql-splitter redact sampled.sql \
         -o dev.sql \
         --config redact.yaml \
         --progress
5. Validate Final Output

       sql-splitter validate dev.sql --strict
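To make the sampling step (Sample with FK Preservation) more concrete, here is a minimal Python sketch of what seeded, relation-preserving sampling means. It is not sql-splitter's implementation. The function name `sample_with_relations`, the `users` and `orders` tables, and the rule of keeping every child row of a sampled parent are all assumptions chosen for illustration.

```python
import random


def sample_with_relations(parents, children, fk, percent, seed):
    """Sample roughly `percent`% of parent rows, then keep only the child
    rows whose foreign key points at a sampled parent."""
    rng = random.Random(seed)  # private RNG: same seed gives the same sample
    k = max(1, round(len(parents) * percent / 100))
    kept_parents = rng.sample(parents, k)
    kept_ids = {row["id"] for row in kept_parents}
    # Dropping children of unsampled parents avoids dangling references.
    kept_children = [row for row in children if row[fk] in kept_ids]
    return kept_parents, kept_children


# Hypothetical data: 100 users, each with exactly 3 orders.
users = [{"id": i} for i in range(1, 101)]
orders = [{"id": n, "user_id": (n % 100) + 1} for n in range(300)]

sampled_users, sampled_orders = sample_with_relations(
    users, orders, fk="user_id", percent=10, seed=42
)
print(len(sampled_users), len(sampled_orders))
```

Sampling each table independently at 10% would leave most orders pointing at users that are no longer present, which is the problem a relation-aware sampler is meant to avoid.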
One-Liner Approach
For quick anonymization without a config file:

    sql-splitter sample prod.sql.gz --percent 10 --preserve-relations -o - | \
      sql-splitter redact - \
        --hash "*.email" \
        --fake "*.name,*.phone" \
        --null "*.ssn,*.password" \
        -o dev.sql

Choosing a Redaction Strategy
| Data Type | Strategy | Why |
|---|---|---|
| Emails | hash | Preserves FK relationships (same input → same output) |
| Names | fake | Realistic looking data |
| SSN/Passwords | null | Remove entirely |
| Credit Cards | mask | Keep format for testing |
| Timestamps | skip | Usually not PII |
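
The strategies in the table can be pictured with a short Python sketch. This is an illustration of the ideas only, not how sql-splitter works internally. The helper names (`hash_value`, `mask_card`, `redact_rows`), the salt, the truncated SHA-256 digest, and the tiny `FAKE_NAMES` pool are all hypothetical.

```python
import hashlib
import random

FAKE_NAMES = ["Alex Doe", "Sam Roe", "Kim Poe", "Lee Moe"]  # hypothetical pool


def hash_value(value, salt="12345"):
    """hash: deterministic, so equal inputs stay equal across tables."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_card(number):
    """mask: hide everything except the last four digits, keep the grouping."""
    digits = [c for c in number if c.isdigit()]
    return "****-****-****-" + "".join(digits[-4:])


def redact_rows(rows, seed):
    """Apply one strategy per column; a seeded RNG makes the fake names
    repeatable from run to run."""
    rng = random.Random(seed)
    redacted = []
    for row in rows:
        redacted.append({
            **row,
            "email": hash_value(row["email"]),             # hash
            "name": rng.choice(FAKE_NAMES),                # fake
            "ssn": None,                                   # null
            "credit_card": mask_card(row["credit_card"]),  # mask
        })
    return redacted


# Hypothetical rows that share an email, as a join key might.
rows = [
    {"email": "ann@example.com", "name": "Ann Real",
     "ssn": "123-45-6789", "credit_card": "4111 1111 1111 1234"},
    {"email": "ann@example.com", "name": "Ann R.",
     "ssn": "123-45-6789", "credit_card": "4111 1111 1111 1234"},
]
redacted = redact_rows(rows, seed=12345)
print(redacted[0]["email"] == redacted[1]["email"])
```

Because the hash depends only on the input and the salt, the two rows still match on `email` after redaction, which is the property the table describes for `hash`.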
Reproducible Builds
Use `--seed` for deterministic results:

    # Same seed = same fake data every time
    sql-splitter sample prod.sql --percent 10 --seed 42 -o - | \
      sql-splitter redact - --config redact.yaml --seed 42 -o dev.sql

CI/CD Integration
    jobs:
      seed-dev:
        runs-on: ubuntu-latest
        steps:
          - name: Download production backup
            run: aws s3 cp s3://backups/prod.sql.gz .

          # github.run_id differs on every workflow run, so each run draws a
          # different sample. Use a fixed seed if every run should produce
          # the same dataset.
          - name: Create dev dataset
            run: |
              sql-splitter sample prod.sql.gz \
                --percent 10 \
                --preserve-relations \
                --seed ${{ github.run_id }} \
                -o - | \
              sql-splitter redact - \
                --config redact.yaml \
                -o dev.sql

          - name: Validate
            run: sql-splitter validate dev.sql --strict

          - name: Upload
            run: aws s3 cp dev.sql s3://dev-data/
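
As a complement to the Validate step, a pipeline could also scan the redacted dump for values that should no longer be there. The Python sketch below looks for SSN-shaped strings, which the example config sets to null. The function name `find_ssn_like`, the regular expression, and the sample statements are assumptions for illustration, and a pattern match like this can miss values or flag harmless ones.

```python
import re

# US SSN shape: three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(sql_text):
    """Return (line_number, matched_text) pairs for SSN-shaped strings."""
    hits = []
    for lineno, line in enumerate(sql_text.splitlines(), start=1):
        for match in SSN_PATTERN.finditer(line):
            hits.append((lineno, match.group()))
    return hits


# Hypothetical dump fragment: the second row was not redacted.
sample_dump = (
    "INSERT INTO users (id, name, ssn) VALUES (1, 'Alex Doe', NULL);\n"
    "INSERT INTO users (id, name, ssn) VALUES (2, 'Sam Roe', '123-45-6789');\n"
)

leaks = find_ssn_like(sample_dump)
for lineno, value in leaks:
    print(f"line {lineno}: possible SSN {value}")
```

In CI, a non-empty result could be used to fail the job before the upload step.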