Skip to content

redact

Replace sensitive data with anonymized values for safe sharing, development, or compliance.

Alias: rd (e.g., sql-splitter rd dump.sql --hash "*.email" -o safe.sql)

  • GDPR/CCPA compliance - Remove or anonymize PII before sharing dumps with external parties
  • Vendor data sharing - Provide realistic data to contractors or consultants without exposing customer information
  • Development environment seeding - Create anonymized copies of production data for local development
  • Testing data pipelines - Validate ETL processes with structurally identical but safe data
  • Compliance audits - Demonstrate PII handling practices with sanitized sample exports

Use sample first if you also need to reduce the dataset size.

Redaction runs as a streaming single-pass operation:

  1. Parse statements - Reads each SQL statement (INSERT, COPY) from the input
  2. Match columns - Compares column names against your patterns (e.g., *.email matches users.email, orders.customer_email)
  3. Apply strategy - Transforms matching values using the specified strategy (null, hash, fake, mask, etc.)
  4. Write output - Emits transformed statements immediately (no buffering of entire file)

Memory characteristics: Constant ~10MB regardless of file size. Only the current statement is held in memory.

Hashing behavior: The hash strategy uses SHA-256 and is deterministic—the same input value always produces the same hash. This preserves foreign key relationships: if users.email and audit_log.user_email both contain alice@example.com, they both hash to the same value.

Fake data with --seed: When using fake strategy, the same --seed value produces identical fake data across runs. The fake value is derived from the original value’s hash, so alice@example.com always becomes the same fake email (e.g., jessica.smith@example.com) with a given seed.

Exit codes:

  • 0 - Success
  • 1 - Validation errors or --strict mode warnings
  • 2 - Runtime error (file not found, parse failure)
Terminal window
sql-splitter redact <INPUT> [OPTIONS]
Terminal window
# Hash emails, null SSNs, fake names
sql-splitter redact dump.sql -o safe.sql \
--null "*.ssn" \
--hash "*.email" \
--fake "*.name"
# Mask credit cards (show last 4 digits)
sql-splitter redact dump.sql -o safe.sql \
--mask "****-****-****-XXXX=*.credit_card"
# Replace passwords with a constant value
sql-splitter redact dump.sql -o safe.sql \
--constant "*.password=REDACTED"
Terminal window
# Use YAML config for complex rules
sql-splitter redact dump.sql -o safe.sql --config redact.yaml
# Auto-generate config by analyzing column names
sql-splitter redact dump.sql --generate-config -o redact.yaml
# Validate config without processing
sql-splitter redact dump.sql --config redact.yaml --validate
Terminal window
# Same seed = identical fake data across runs
sql-splitter redact dump.sql -o safe.sql --fake "*.name" --seed 42
# Useful for CI: consistent test fixtures
sql-splitter redact prod.sql -o fixtures/test-data.sql \
--config redact.yaml --seed 12345
Terminal window
# Only redact the users and orders tables
sql-splitter redact dump.sql -o safe.sql \
--hash "*.email" --tables users,orders
# Redact everything except audit_logs
sql-splitter redact dump.sql -o safe.sql \
--hash "*.email" --exclude audit_logs
FlagShortDescriptionDefault
--output-oOutput SQL filestdout
--dialect-dSQL dialectauto-detect
--config-cYAML config file-
--generate-configAnalyze input and generate YAML configfalse
--nullColumns to set to NULL (glob patterns)-
--hashColumns to hash with SHA256-
--fakeColumns to replace with fake data-
--maskColumns to mask (format: pattern=column)-
--constantColumn=value pairs-
--seedRandom seed for reproducibilityrandom
--localeLocale for fake dataen
--tables-tOnly redact specific tablesall
--exclude-xExclude specific tablesnone
--strictFail on warningsfalse
--progress-pShow progress barfalse
--dry-runPreview without writingfalse
--jsonOutput as JSONfalse
--validateValidate config onlyfalse
StrategyDescriptionUse Case
nullReplace with NULLRemove data entirely
hashSHA256 hash (deterministic)Preserve FK relationships
fakeGenerate realistic fake dataRealistic test data
maskPartial maskingCredit cards, SSNs
constantFixed valuePlaceholder values
shuffleRedistribute valuesPreserve distribution
skipNo redactionPassthrough

Patterns use table.column format with * as a wildcard:

PatternMatches
*.emailemail column in any table (users.email, orders.contact_email)
users.emailOnly the email column in the users table
users.*All columns in the users table
*.ssnAny column named ssn in any table
audit_*.created_bycreated_by in tables starting with audit_

Patterns are matched in order—first matching rule wins.

Terminal window
# Keep last 4 digits of credit card
sql-splitter redact dump.sql -o safe.sql --mask "****-****-****-XXXX=*.credit_card"

Pattern symbols:

  • * - Replace with asterisk
  • X - Keep original character
  • # - Random digit

25+ generators available:

GeneratorExample Output
emailjessica.smith@example.com
nameRobert Johnson
first_nameSarah
last_nameWilliams
phone+1 (555) 234-5678
address123 Oak Street, Springfield, IL
cityPortland
companyAcme Corporation
ip192.168.1.100
uuid550e8400-e29b-41d4-...
date1985-07-23
credit_card4532015112830366
ssn123-45-6789

The --generate-config flag analyzes column names to detect common PII patterns:

Terminal window
sql-splitter redact dump.sql --generate-config -o redact.yaml

Detected patterns include:

  • Email: *email*, *e_mail*
  • Phone: *phone*, *mobile*, *tel*
  • Names: *name*, *first_name*, *last_name*
  • Financial: *ssn*, *social_security*, *credit_card*, *card_number*
  • Addresses: *address*, *street*, *zip*, *postal*
  • Auth: *password*, *secret*, *token*

Review and adjust the generated config before use—automated detection may miss custom column names or flag non-sensitive columns.

seed: 12345
locale: en
defaults:
strategy: skip
rules:
- column: "*.ssn"
strategy: null
- column: "*.email"
strategy: hash
- column: "*.name"
strategy: fake
generator: name
- column: "*.credit_card"
strategy: mask
pattern: "****-****-****-XXXX"
skip_tables:
- schema_migrations

Reduce dataset size before anonymizing for faster processing:

Terminal window
sql-splitter sample prod.sql -o sampled.sql --percent 10 --preserve-relations
sql-splitter redact sampled.sql -o dev.sql --config redact.yaml

Ensure redacted output is valid SQL:

Terminal window
sql-splitter redact dump.sql -o safe.sql --config redact.yaml
sql-splitter validate safe.sql --strict

Create a complete anonymized dev dataset:

Terminal window
# 1. Sample 5% with FK preservation
sql-splitter sample prod.sql -o sampled.sql \
--percent 5 --preserve-relations --seed 42
# 2. Redact PII
sql-splitter redact sampled.sql -o dev.sql \
--config redact.yaml --seed 42
# 3. Validate and reorder for clean import
sql-splitter validate dev.sql --strict && \
sql-splitter order dev.sql -o dev-ordered.sql

Convert and Redact for Different Environments

Section titled “Convert and Redact for Different Environments”
Terminal window
# Redact MySQL dump, convert to PostgreSQL for testing
sql-splitter redact mysql.sql -o safe.sql --config redact.yaml
sql-splitter convert safe.sql -o postgres.sql --to postgres

Hashed values break foreign key relationships

Section titled “Hashed values break foreign key relationships”

Symptom: FK errors when importing redacted dump; parent/child relationships are broken.

Cause: Only one side of a relationship was hashed, or different patterns matched parent and child columns.

Solutions:

  1. Ensure both sides use the same pattern: --hash "*.user_id" matches in all tables
  2. Use explicit patterns for both: --hash "users.id" --hash "orders.user_id"
  3. Verify with: sql-splitter validate redacted.sql

Fake data repeats or is inconsistent across runs

Section titled “Fake data repeats or is inconsistent across runs”

Symptom: Different fake values each time you run redaction, or unexpected repetition.

Cause: No --seed specified (random seed each run), or same original values produce same fake values (by design).

Solutions:

  1. Use --seed for reproducible output: --seed 42
  2. Same original value → same fake value is intentional (preserves consistency)
  3. For truly random fake data each run, omit --seed

Symptom: Sensitive data appears in output despite having a matching rule.

Cause: Pattern doesn’t match the actual table.column name, or an earlier rule with skip strategy matched first.

Solutions:

  1. Check exact column names: sql-splitter analyze dump.sql shows all table/column names
  2. Use --dry-run to preview what would be redacted
  3. Remember patterns match in order—more specific patterns should come before *.column wildcards
  4. Verify pattern syntax: users.email not users->email

Symptom: “Invalid config” or “Unknown strategy” errors.

Cause: YAML syntax errors, typos in strategy names, or invalid pattern format.

Solutions:

  1. Validate before running: --validate flag
  2. Check strategy spelling: null, hash, fake, mask, constant, shuffle, skip
  3. Ensure patterns are quoted in YAML: column: "*.email" not column: *.email
  4. Use --generate-config as a starting template

Symptom: Warnings about rules that didn’t match any columns.

Cause: Pattern doesn’t match any columns in the dump (typo or columns don’t exist).

Solutions:

  1. Run sql-splitter analyze dump.sql to see actual column names
  2. Use --strict to fail on unmatched rules (catches config errors early)
  3. Check for schema changes since config was created