Skip to content

Glossary

This glossary explains database and SQL terminology used in sql-splitter documentation. Terms are grouped by category for easier reference.

A column (or set of columns) whose values uniquely identify each row in a table. Primary keys cannot be NULL and are often referenced by foreign keys in other tables.

A constraint that requires values in one table’s column(s) to match existing values in another table’s primary key. Foreign keys are the main mechanism for representing relationships between tables.

A rule the database enforces on table data or structure. Common constraints include:

  • PRIMARY KEY - Unique row identifier
  • FOREIGN KEY - Reference to another table
  • UNIQUE - No duplicate values allowed
  • NOT NULL - Value required
  • CHECK - Custom validation (e.g., age >= 0)

The guarantee that relationships between tables remain consistent. For example, you can’t have an orders.user_id that references a users.id that doesn’t exist. Foreign keys enforce referential integrity.

A referential action on a foreign key that automatically propagates changes. For example, ON DELETE CASCADE deletes child rows when the parent row is deleted. Similarly, ON UPDATE CASCADE updates child foreign key values when the parent key changes.

Constraints that are checked at the end of a transaction instead of immediately after each statement. This allows temporarily “inconsistent” states while loading data, as long as everything is valid by commit time. Useful for loading tables with circular dependencies.

-- PostgreSQL syntax
SET CONSTRAINTS ALL DEFERRED;

The organized definition of database objects—tables, columns, types, constraints, indexes, etc. In PostgreSQL, “schema” also refers to a namespace (like public), but in sql-splitter docs, “schema” typically means the overall database structure.

SQL statements that define or change database structure:

  • CREATE TABLE, ALTER TABLE, DROP TABLE
  • CREATE INDEX, DROP INDEX
  • CREATE VIEW

DDL describes what the database looks like.

SQL statements that work with data:

  • INSERT, UPDATE, DELETE
  • COPY ... FROM stdin (PostgreSQL bulk load)

DML describes what data is in the database.

The specific flavor of SQL supported by a database engine. sql-splitter supports four dialects:

DialectKey Features
MySQLBacktick quoting, AUTO_INCREMENT, ENGINE=InnoDB
PostgreSQLDouble-quote identifiers, SERIAL, COPY FROM stdin
SQLiteINTEGER PRIMARY KEY for auto-increment
MSSQLIDENTITY, GO batch separators, NVARCHAR

A column attribute that automatically generates increasing numeric values. Commonly used for surrogate primary keys.

id INT AUTO_INCREMENT PRIMARY KEY

A convenience type that creates an integer column backed by an auto-incrementing sequence. Modern PostgreSQL prefers IDENTITY.

id SERIAL PRIMARY KEY
-- Equivalent to: id INTEGER DEFAULT nextval('tablename_id_seq')

A standards-aligned way to define auto-generated numeric columns.

-- PostgreSQL
id INT GENERATED BY DEFAULT AS IDENTITY
-- MSSQL
id INT IDENTITY(1,1)

A PostgreSQL bulk-loading command that reads table data from the client connection. In pg_dump output, it appears as:

COPY users (id, name, email) FROM stdin;
1 Alice alice@example.com
2 Bob bob@example.com
\.

The \. marks the end of data. sql-splitter’s convert command transforms this to INSERT statements for other dialects.

A directed graph where nodes represent tables and edges represent foreign key relationships. Used to determine safe ordering for creating tables and loading data.

An ordering of tables such that every dependency (parent table) appears before the table that depends on it (child table). In SQL dumps, this means parent tables are created and populated before child tables, satisfying foreign key constraints during import.

See order for reordering dumps topologically.

A situation where following foreign key relationships leads back to the starting table. For example: orders → payments → invoices → orders. Cycles prevent simple topological ordering and require special handling:

  • Defer constraints during import
  • Disable FK checks temporarily
  • Split data loading into phases

See graph --cycles-only to detect and visualize cycles.

A table with no outbound foreign keys (it doesn’t reference other tables). Root tables can be loaded first since they have no dependencies.

The referenced table in a foreign key relationship—the table that “owns” the primary key being referenced. Parent tables must generally exist before child tables can reference them.

The referencing table in a foreign key relationship—it contains the foreign key column(s). Child rows depend on parent rows existing.

Processing a dump file incrementally as it’s read, rather than loading the entire file into memory.

Commands that stream (constant ~50MB memory):

  • split, merge, convert, analyze, redact

Commands that require more memory:

  • query - Loads data into DuckDB (use --disk for large files)
  • diff - Tracks PK hashes for comparison (use --schema-only to reduce)
  • validate - Tracks PKs/FKs when checking integrity (use --no-fk-checks to skip)
  • order, graph - Build dependency graph (memory scales with number of tables/FKs)
  • sample, shard - Track selected rows when preserving FK relationships

See Memory & Streaming for detailed memory profiles per command.

Given the same input and configuration, the tool produces identical output every run. Determinism is important for CI pipelines, debugging, and reproducible data workflows. Use --seed with sampling/redaction operations for deterministic “random” results.

A fixed value used to initialize a pseudo-random number generator. Using the same seed produces the same “random” results every time, making operations reproducible.

Terminal window
sql-splitter sample dump.sql --percent 10 --seed 42

Data that can identify a person directly or indirectly: names, email addresses, phone numbers, IP addresses, government IDs, etc. Many regulations require protecting PII, including GDPR (EU) and CCPA (California).

Removing or masking sensitive values entirely, typically replacing them with NULL, asterisks, or a placeholder like [REDACTED]. Redaction prioritizes safety over data usefulness.

Transforming data to reduce identification risk while keeping it useful for testing or development. Techniques include replacing real names with fake names, hashing emails, or shuffling values.

Converting a value into a fixed-length digest (e.g., SHA-256) that cannot be reversed. Hashing preserves equality (same input → same hash), which is useful for maintaining foreign key relationships during anonymization.

Terminal window
sql-splitter redact dump.sql --hash "*.email" -o safe.sql

A customer, account, or organization whose data is logically separated from others within the same database. Multi-tenant systems store multiple tenants’ data in shared infrastructure.

An architecture where a single database serves multiple tenants, typically with row-level isolation via a tenant identifier column. This is common in SaaS applications.

In sql-splitter context, extracting a subset of data belonging to a specific tenant from a multi-tenant database dump. See shard.

A column (typically tenant_id, org_id, company_id, account_id) present in tables that indicates which tenant owns each row. sql-splitter auto-detects common tenant column names.

A diagram showing tables, their columns, and the relationships (foreign keys) between them. ERDs help visualize database structure and identify dependency cycles.

See graph for generating ERDs from SQL dumps.

When database structure differs across environments (development, staging, production) or differs from expected migrations. Drift can cause import failures or data inconsistencies.

See diff for detecting schema differences between dumps.

A simple pattern syntax for matching file paths:

PatternMatches
*.sqlAll SQL files in current directory
**/*.sqlAll SQL files recursively
dump_*.sql.gzFiles starting with dump_ ending in .sql.gz
backups/202[3-4]*.sqlBackups from 2023-2024

Most sql-splitter commands accept glob patterns:

Terminal window
sql-splitter validate "backups/**/*.sql" --fail-fast