redact

Replace sensitive data with anonymized values for safe sharing, development, or compliance.

Alias: rd (e.g., sql-splitter rd dump.sql --hash "*.email" -o safe.sql)

When to Use This

GDPR/CCPA compliance - Remove or anonymize PII before sharing dumps with external parties
Vendor data sharing - Provide realistic data to contractors or consultants without exposing customer information
Development environment seeding - Create anonymized copies of production data for local development
Testing data pipelines - Validate ETL processes with structurally identical but safe data
Compliance audits - Demonstrate PII handling practices with sanitized sample exports

Use sample first if you also need to reduce the dataset size.

How It Works

Redaction runs as a streaming single-pass operation:

Parse statements - Reads each SQL statement (INSERT, COPY) from the input
Match columns - Compares column names against your patterns (e.g., *.email matches users.email, orders.customer_email)
Apply strategy - Transforms matching values using the specified strategy (null, hash, fake, mask, etc.)
Write output - Emits transformed statements immediately (no buffering of entire file)

Memory characteristics: Constant ~10MB regardless of file size. Only the current statement is held in memory.

Hashing behavior: The hash strategy uses SHA-256 and is deterministic—the same input value always produces the same hash. This preserves foreign key relationships: if users.email and audit_log.user_email both contain alice@example.com, they both hash to the same value.

Fake data with --seed: When using fake strategy, the same --seed value produces identical fake data across runs. The fake value is derived from the original value’s hash, so alice@example.com always becomes the same fake email (e.g., jessica.smith@example.com) with a given seed.

Exit codes:

0 - Success
1 - Validation errors or --strict mode warnings
2 - Runtime error (file not found, parse failure)

Usage

sql-splitter redact <INPUT> [OPTIONS]

Examples

Basic Redaction with Inline Patterns

# Hash emails, null SSNs, fake names
sql-splitter redact dump.sql -o safe.sql \
  --null "*.ssn" \
  --hash "*.email" \
  --fake "*.name"

# Mask credit cards (show last 4 digits)
sql-splitter redact dump.sql -o safe.sql \
  --mask "****-****-****-XXXX=*.credit_card"

# Replace passwords with a constant value
sql-splitter redact dump.sql -o safe.sql \
  --constant "*.password=REDACTED"

Using Configuration Files

# Use YAML config for complex rules
sql-splitter redact dump.sql -o safe.sql --config redact.yaml

# Auto-generate config by analyzing column names
sql-splitter redact dump.sql --generate-config -o redact.yaml

# Validate config without processing
sql-splitter redact dump.sql --config redact.yaml --validate

Reproducible Redaction

# Same seed = identical fake data across runs
sql-splitter redact dump.sql -o safe.sql --fake "*.name" --seed 42

# Useful for CI: consistent test fixtures
sql-splitter redact prod.sql -o fixtures/test-data.sql \
  --config redact.yaml --seed 12345

Targeting Specific Tables

# Only redact the users and orders tables
sql-splitter redact dump.sql -o safe.sql \
  --hash "*.email" --tables users,orders

# Redact everything except audit_logs
sql-splitter redact dump.sql -o safe.sql \
  --hash "*.email" --exclude audit_logs

Options

Flag	Short	Description	Default
`--output`	`-o`	Output SQL file	stdout
`--dialect`	`-d`	SQL dialect	auto-detect
`--config`	`-c`	YAML config file	-
`--generate-config`		Analyze input and generate YAML config	false
`--null`		Columns to set to NULL (glob patterns)	-
`--hash`		Columns to hash with SHA256	-
`--fake`		Columns to replace with fake data	-
`--mask`		Columns to mask (format: `pattern=column`)	-
`--constant`		Column=value pairs	-
`--seed`		Random seed for reproducibility	random
`--locale`		Locale for fake data	`en`
`--tables`	`-t`	Only redact specific tables	all
`--exclude`	`-x`	Exclude specific tables	none
`--strict`		Fail on warnings	false
`--progress`	`-p`	Show progress bar	false
`--dry-run`		Preview without writing	false
`--json`		Output as JSON	false
`--validate`		Validate config only	false

Redaction Strategies

Strategy	Description	Use Case
`null`	Replace with NULL	Remove data entirely
`hash`	SHA256 hash (deterministic)	Preserve FK relationships
`fake`	Generate realistic fake data	Realistic test data
`mask`	Partial masking	Credit cards, SSNs
`constant`	Fixed value	Placeholder values
`shuffle`	Redistribute values	Preserve distribution
`skip`	No redaction	Passthrough

Column Pattern Matching

Patterns use table.column format with * as a wildcard:

Pattern	Matches
`*.email`	`email` column in any table (`users.email`, `orders.contact_email`)
`users.email`	Only the `email` column in the `users` table
`users.*`	All columns in the `users` table
`*.ssn`	Any column named `ssn` in any table
`audit_*.created_by`	`created_by` in tables starting with `audit_`

Patterns are matched in order—first matching rule wins.

Mask Patterns

# Keep last 4 digits of credit card
sql-splitter redact dump.sql -o safe.sql --mask "****-****-****-XXXX=*.credit_card"

Pattern symbols:

* - Replace with asterisk
X - Keep original character
# - Random digit

Fake Data Generators

25+ generators available:

Generator	Example Output
`email`	`jessica.smith@example.com`
`name`	`Robert Johnson`
`first_name`	`Sarah`
`last_name`	`Williams`
`phone`	`+1 (555) 234-5678`
`address`	`123 Oak Street, Springfield, IL`
`city`	`Portland`
`company`	`Acme Corporation`
`ip`	`192.168.1.100`
`uuid`	`550e8400-e29b-41d4-...`
`date`	`1985-07-23`
`credit_card`	`4532015112830366`
`ssn`	`123-45-6789`

Auto-Generating Configuration

The --generate-config flag analyzes column names to detect common PII patterns:

sql-splitter redact dump.sql --generate-config -o redact.yaml

Detected patterns include:

Email: *email*, *e_mail*
Phone: *phone*, *mobile*, *tel*
Names: *name*, *first_name*, *last_name*
Financial: *ssn*, *social_security*, *credit_card*, *card_number*
Addresses: *address*, *street*, *zip*, *postal*
Auth: *password*, *secret*, *token*

Review and adjust the generated config before use—automated detection may miss custom column names or flag non-sensitive columns.

YAML Config

seed: 12345
locale: en

defaults:
  strategy: skip

rules:
  - column: "*.ssn"
    strategy: null
  - column: "*.email"
    strategy: hash
  - column: "*.name"
    strategy: fake
    generator: name
  - column: "*.credit_card"
    strategy: mask
    pattern: "****-****-****-XXXX"

skip_tables:
  - schema_migrations

Composing with Other Tools

Sample Then Redact

Reduce dataset size before anonymizing for faster processing:

sql-splitter sample prod.sql -o sampled.sql --percent 10 --preserve-relations
sql-splitter redact sampled.sql -o dev.sql --config redact.yaml

Validate After Redaction

Ensure redacted output is valid SQL:

sql-splitter redact dump.sql -o safe.sql --config redact.yaml
sql-splitter validate safe.sql --strict

Full Dev Environment Pipeline

Create a complete anonymized dev dataset:

# 1. Sample 5% with FK preservation
sql-splitter sample prod.sql -o sampled.sql \
  --percent 5 --preserve-relations --seed 42

# 2. Redact PII
sql-splitter redact sampled.sql -o dev.sql \
  --config redact.yaml --seed 42

# 3. Validate and reorder for clean import
sql-splitter validate dev.sql --strict && \
  sql-splitter order dev.sql -o dev-ordered.sql

Convert and Redact for Different Environments

# Redact MySQL dump, convert to PostgreSQL for testing
sql-splitter redact mysql.sql -o safe.sql --config redact.yaml
sql-splitter convert safe.sql -o postgres.sql --to postgres

Troubleshooting

Hashed values break foreign key relationships

Symptom: FK errors when importing redacted dump; parent/child relationships are broken.

Cause: Only one side of a relationship was hashed, or different patterns matched parent and child columns.

Solutions:

Ensure both sides use the same pattern: --hash "*.user_id" matches in all tables
Use explicit patterns for both: --hash "users.id" --hash "orders.user_id"
Verify with: sql-splitter validate redacted.sql

Fake data repeats or is inconsistent across runs

Symptom: Different fake values each time you run redaction, or unexpected repetition.

Cause: No --seed specified (random seed each run), or same original values produce same fake values (by design).

Solutions:

Use --seed for reproducible output: --seed 42
Same original value → same fake value is intentional (preserves consistency)
For truly random fake data each run, omit --seed

Column not being redacted

Symptom: Sensitive data appears in output despite having a matching rule.

Cause: Pattern doesn’t match the actual table.column name, or an earlier rule with skip strategy matched first.

Solutions:

Check exact column names: sql-splitter analyze dump.sql shows all table/column names
Use --dry-run to preview what would be redacted
Remember patterns match in order—more specific patterns should come before *.column wildcards
Verify pattern syntax: users.email not users->email

Config validation errors

Symptom: “Invalid config” or “Unknown strategy” errors.

Cause: YAML syntax errors, typos in strategy names, or invalid pattern format.

Solutions:

Validate before running: --validate flag
Check strategy spelling: null, hash, fake, mask, constant, shuffle, skip
Ensure patterns are quoted in YAML: column: "*.email" not column: *.email
Use --generate-config as a starting template

”Unknown column” warnings

Symptom: Warnings about rules that didn’t match any columns.

Cause: Pattern doesn’t match any columns in the dump (typo or columns don’t exist).

Solutions:

Run sql-splitter analyze dump.sql to see actual column names
Use --strict to fail on unmatched rules (catches config errors early)
Check for schema changes since config was created

redact

When to Use This

How It Works

Usage

Examples

Basic Redaction with Inline Patterns

Using Configuration Files

Reproducible Redaction

Targeting Specific Tables

Options

Redaction Strategies

Column Pattern Matching

Mask Patterns

Fake Data Generators

Auto-Generating Configuration

YAML Config

Composing with Other Tools

Sample Then Redact

Validate After Redaction

Full Dev Environment Pipeline

Convert and Redact for Different Environments

Troubleshooting

Hashed values break foreign key relationships

Fake data repeats or is inconsistent across runs

Column not being redacted

Config validation errors

”Unknown column” warnings

See Also