# sample

Create smaller test datasets while preserving foreign key relationships.

Alias: `sa` (e.g., `sql-splitter sa dump.sql --percent 10 -o dev.sql`)
## When to Use This

- Creating dev/test datasets - Generate smaller databases for local development from production dumps
- Reducing CI test times - Smaller datasets mean faster test runs without losing coverage
- Reproducing production bugs - Extract a consistent slice of data that maintains relationships between tables
- Size-limited backups - Create smaller backups for quick restores or archival purposes
- Demo environments - Build realistic demo data that maintains referential integrity
Use `shard` instead if you need to extract data for a specific tenant from a multi-tenant database.
## How It Works

The `sample` command operates in one of two modes:
### Simple Random Sampling (default)

Without `--preserve-relations`, each table is sampled independently using reservoir sampling. Rows are selected randomly based on `--percent` or `--rows`. This is fast but doesn't guarantee foreign key relationships remain valid.
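Reservoir sampling (Algorithm R) keeps a uniform k-row sample in a single pass without knowing the table's row count up front. A minimal Python sketch of the idea (illustrative only, not the tool's actual code):

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Keep a uniform random sample of k rows from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)      # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)      # row i survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = row
    return reservoir

sample = reservoir_sample(range(1_000_000), 1000, seed=42)
print(len(sample))  # exactly 1000 rows, in one pass, with O(k) memory
```

The single pass and O(k) memory are what make this mode fast: no table ever needs to fit in memory, only the sample itself.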
### FK-Preserving Sampling (`--preserve-relations`)

When FK preservation is enabled, the algorithm:
- Splits the dump into per-table files in a temp directory
- Builds a dependency graph from CREATE TABLE and ALTER TABLE statements
- Identifies root tables - tables with no outgoing foreign keys (or explicitly specified via `--root-tables`)
- Processes tables in topological order - parents before children
- Samples root tables first using the specified percentage or row count
- Follows FK chains - for each sampled row, tracks its primary key hash. When processing child tables, only rows whose FK values match a sampled parent’s PK are eligible
- Outputs a consistent dataset where all FK relationships are satisfied
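The parent-then-child filtering can be sketched in a few lines (a toy illustration with made-up tables and a 10% rate, not the tool's actual implementation):

```python
import random

rng = random.Random(42)

# Toy tables: users is a root table; orders references it via user_id
users = [{"id": i} for i in range(100)]
orders = [{"id": i, "user_id": rng.randrange(100)} for i in range(500)]

# 1. Sample the root table first (10%)
sampled_users = [u for u in users if rng.random() < 0.10]

# 2. Track each sampled row's PK as a 64-bit hash
mask = (1 << 64) - 1
sampled_pks = {hash(u["id"]) & mask for u in sampled_users}

# 3. When processing the child table, only rows whose FK hash matches a
#    sampled parent PK are eligible
sampled_orders = [o for o in orders if hash(o["user_id"]) & mask in sampled_pks]

# FK integrity holds: every sampled order's user is in the sample
assert all(o["user_id"] in {u["id"] for u in sampled_users} for o in sampled_orders)
```

With a deeper schema the same filtering repeats down the topological order, so grandchildren are filtered against the already-filtered children.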
Memory usage: PK values are stored as 64-bit hashes in a `HashSet` (8 bytes per row). For 10 million sampled rows across all tables, expect ~80MB of memory for PK tracking. Use `--max-total-rows` to cap memory growth.
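The ~80MB figure is simple arithmetic (it ignores the `HashSet`'s own per-entry overhead, so real usage runs somewhat higher):

```python
sampled_rows = 10_000_000
bytes_per_pk_hash = 8  # one 64-bit hash per sampled row

mb = sampled_rows * bytes_per_pk_hash / 1_000_000
print(f"{mb:.0f} MB")  # 80 MB of raw hash storage
```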
Fractional rows: When a percentage results in a fractional row count (e.g., 10% of 15 rows = 1.5), reservoir sampling naturally handles this: you'll get approximately the target percentage over many runs. Use `--seed` if you need the same rows selected every time.
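To see why a fractional target is only hit on average, model percentage sampling as a per-row coin flip (a simplification of reservoir sampling, for illustration only):

```python
import random

rows, percent = 15, 10  # 10% of 15 rows -> a fractional target of 1.5

counts = []
for seed in range(1000):
    rng = random.Random(seed)
    counts.append(sum(1 for _ in range(rows) if rng.random() < percent / 100))

print(sum(counts) / len(counts))  # hovers near the exact target of 1.5
```

Any single run yields a whole number of rows (0, 1, 2, ...); only the long-run average matches the fractional target.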
```sh
sql-splitter sample <INPUT> [OPTIONS]
```

## Examples

### Basic Sampling

```sh
# Sample 10% of rows from each table
sql-splitter sample dump.sql -o sampled.sql --percent 10
```
```sh
# Sample fixed number of rows per table
sql-splitter sample dump.sql -o sampled.sql --rows 1000
```

### FK-Preserving Samples
```sh
# Preserve FK relationships - ensures if you sample an order, the user exists
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations
```
```sh
# Fail if any FK integrity issues are detected
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations --strict-fk
```

### Reproducible Sampling
```sh
# Same seed = identical sample every time (useful for CI/debugging)
sql-splitter sample dump.sql -o dev.sql --percent 10 --seed 42
```

### Table Filtering
```sh
# Sample only specific tables
sql-splitter sample dump.sql -o dev.sql --percent 10 --tables users,orders,products
```
```sh
# Exclude large or irrelevant tables
sql-splitter sample dump.sql -o dev.sql --percent 10 --exclude logs,events,audit_trail
```

### Using YAML Config for Per-Table Settings
```sh
# Use a config file for complex sampling requirements
sql-splitter sample dump.sql -o dev.sql --config sample.yaml
```

## Options
| Flag | Short | Description | Default |
|---|---|---|---|
| `--output` | `-o` | Output SQL file | stdout |
| `--dialect` | `-d` | SQL dialect | auto-detect |
| `--percent` | | Sample percentage (1-100) | - |
| `--rows` | | Sample fixed number of rows per table | - |
| `--preserve-relations` | | Preserve FK relationships | false |
| `--tables` | `-t` | Only sample these tables (comma-separated) | all |
| `--exclude` | `-e` | Exclude these tables (comma-separated) | none |
| `--root-tables` | | Explicit root tables for sampling | auto |
| `--include-global` | | Global table handling: none, lookups, all | lookups |
| `--seed` | | Random seed for reproducibility | random |
| `--config` | `-c` | YAML config file for per-table settings | - |
| `--max-total-rows` | | Maximum total rows to sample | unlimited |
| `--no-limit` | | Disable row limit | false |
| `--strict-fk` | | Fail if any FK integrity issues | false |
| `--no-schema` | | Exclude CREATE TABLE statements | false |
| `--progress` | `-p` | Show progress bar | false |
| `--dry-run` | | Preview without writing | false |
| `--json` | | Output as JSON | false |
## Root Tables and `--root-tables`

Root tables are tables with no outgoing foreign keys; they don't reference other tables. The `sample` command auto-detects these as starting points for FK-preserving sampling.
Override auto-detection when:
- Your FKs are implicit (not declared in the schema)
- You want to start sampling from specific tables
- Auto-detection picks the wrong tables
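Auto-detection itself is easy to picture: build the FK graph and keep the tables with no outgoing edges. A sketch, with a hypothetical `fks` map standing in for what gets parsed from CREATE TABLE and ALTER TABLE statements:

```python
# fks maps each table to the set of tables it references (toy data)
fks = {
    "orders": {"users", "products"},
    "order_items": {"orders", "products"},
    "users": set(),
    "products": set(),
}

# A root table has no outgoing foreign keys
roots = sorted(t for t, parents in fks.items() if not parents)
print(roots)  # ['products', 'users']
```

If your FKs live only in application code, `fks` would be empty for every table, which is exactly the situation where `--root-tables` is needed.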
```sh
# Explicitly specify which tables to start sampling from
sql-splitter sample dump.sql -o dev.sql --percent 10 --preserve-relations --root-tables users,organizations
```

## Global and Lookup Tables
The `--include-global` option controls how lookup/reference tables are handled:
| Value | Behavior |
|---|---|
| `none` | Skip global/lookup tables entirely |
| `lookups` | Include small lookup tables in full (default) |
| `all` | Include all global tables in full |
What counts as a lookup table?
Tables are auto-classified as lookups if they match common patterns:
- `countries`, `states`, `provinces`, `cities`
- `currencies`, `languages`, `timezones`
- `permissions`, `roles`, `settings`
You can override classification in a YAML config file.
System tables (migrations, jobs, cache, sessions) are automatically skipped.
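The classification described above amounts to name matching. A sketch (the pattern lists here are taken from the examples on this page, not the tool's exact rules):

```python
import re

# Illustrative pattern lists; the actual classifier may differ
LOOKUP = re.compile(r"^(countries|states|provinces|cities|currencies|"
                    r"languages|timezones|permissions|roles|settings)$")
SYSTEM = re.compile(r"^(migrations|jobs|cache|sessions)$")

def classify(table: str) -> str:
    if SYSTEM.match(table):
        return "system"   # skipped automatically
    if LOOKUP.match(table):
        return "lookup"   # included in full by default
    return "data"         # sampled normally

print(classify("currencies"), classify("migrations"), classify("orders"))
# lookup system data
```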
## YAML Configuration File

For complex sampling requirements, use a YAML config file:
```yaml
default:
  percent: 10

classification:
  # Include these tables fully
  global:
    - permissions
    - roles
  # Skip these tables
  system:
    - migrations
    - failed_jobs
    - telescope_entries
  # Classify as lookup tables
  lookup:
    - currencies
    - countries
  # Explicitly mark as root tables
  root:
    - organizations

tables:
  users:
    rows: 500              # Fixed row count for this table
  posts:
    percent: 5             # Different percentage for this table
  sessions:
    skip: true             # Skip entirely
  audit_logs:
    classification: system # Override classification
```

Then run:
```sh
sql-splitter sample dump.sql -o dev.sql --config sample.yaml --preserve-relations
```

## Composing with Other Tools
### Create anonymized dev database
```sh
# Sample 10% with FK preservation, then redact PII
sql-splitter sample prod.sql --percent 10 --preserve-relations -o sampled.sql
sql-splitter redact sampled.sql --config redact.yaml -o dev.sql
```

### Validate sampled output
```sh
# Ensure the sample maintains FK integrity
sql-splitter sample dump.sql --percent 10 --preserve-relations -o dev.sql
sql-splitter validate dev.sql --strict
```

### Visualize dependencies before sampling
```sh
# Understand the FK graph before deciding on root tables
sql-splitter graph dump.sql --format mermaid -o schema.md
sql-splitter sample dump.sql --percent 10 --preserve-relations --root-tables users,orgs -o dev.sql
```

### Sample then query
```sh
# Create sample, then run ad-hoc queries against it
sql-splitter sample dump.sql --percent 5 --preserve-relations -o sample.sql
sql-splitter query sample.sql "SELECT COUNT(*) FROM users"
```

## Troubleshooting
### Sample is larger than expected
Cause: FK preservation pulls in all related rows. If you sample 10% of orders, you get 100% of the users who placed those orders.
Solutions:
- Use `--max-total-rows` to cap the total: `--max-total-rows 100000`
- Sample from tables higher in the dependency chain (use `--root-tables`)
- Use `--dry-run` to preview sizes before sampling
```sh
# Preview what would be sampled
sql-splitter sample dump.sql --percent 10 --preserve-relations --dry-run --json | jq '.table_stats'
```

### Missing related rows (broken FKs)
Cause: Foreign keys weren't detected in the schema (implicit FKs, or FKs defined in application code).
Solutions:
- Specify root tables explicitly: `--root-tables users,organizations`
- Add FK constraints to your schema dump
- Use `--strict-fk` to fail early when FK issues are detected
### Out of memory

Cause: `--preserve-relations` tracks all sampled PK values in memory (8 bytes per row).
Solutions:
- Reduce sample size: `--percent 1` instead of `--percent 10`
- Use `--max-total-rows 500000` to cap memory usage
- Disable FK preservation for very large dumps (accept potential broken FKs)
- Sample fewer tables: `--tables critical_table1,critical_table2`
### Non-deterministic results

Cause: Without `--seed`, sampling uses a random seed each run.
Solution: Add a fixed seed for reproducible results:
```sh
sql-splitter sample dump.sql --percent 10 --seed 42 -o dev.sql
```

### "Warning: Reached max_total_rows limit"
Cause: The `--max-total-rows` guard stopped sampling to prevent runaway growth.
Solutions:
- Increase the limit if you have sufficient memory
- Reduce `--percent` to sample fewer rows initially
- Review which tables are pulling in the most related rows with `--dry-run --json`
## Exit Codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (I/O, parse error, FK integrity failure with `--strict-fk`) |
| 2 | Invalid arguments |
See Exit Codes for usage in scripts.