Glossary

This glossary explains database and SQL terminology used in sql-splitter documentation. Terms are grouped by category for easier reference.

Keys, Constraints, and Integrity

Primary Key (PK)

A column (or set of columns) whose values uniquely identify each row in a table. Primary keys cannot be NULL and are often referenced by foreign keys in other tables.

Foreign Key (FK)

A constraint that requires values in one table’s column(s) to match existing values in another table’s primary key. Foreign keys are the main mechanism for representing relationships between tables.

Constraint

A rule the database enforces on table data or structure. Common constraints include:

PRIMARY KEY - Unique row identifier
FOREIGN KEY - Reference to another table
UNIQUE - No duplicate values allowed
NOT NULL - Value required
CHECK - Custom validation (e.g., age >= 0)

Referential Integrity

The guarantee that relationships between tables remain consistent. For example, you can’t have an orders.user_id that references a users.id that doesn’t exist. Foreign keys enforce referential integrity.

CASCADE

A referential action on a foreign key that automatically propagates changes. For example, ON DELETE CASCADE deletes child rows when the parent row is deleted. Similarly, ON UPDATE CASCADE updates child foreign key values when the parent key changes.

Deferred Constraints

Constraints that are checked at the end of a transaction instead of immediately after each statement. This allows temporarily “inconsistent” states while loading data, as long as everything is valid by commit time. Useful for loading tables with circular dependencies.

-- PostgreSQL syntax
SET CONSTRAINTS ALL DEFERRED;

SQL Structure

Schema

The organized definition of database objects—tables, columns, types, constraints, indexes, etc. In PostgreSQL, “schema” also refers to a namespace (like public), but in sql-splitter docs, “schema” typically means the overall database structure.

DDL (Data Definition Language)

SQL statements that define or change database structure:

CREATE TABLE, ALTER TABLE, DROP TABLE
CREATE INDEX, DROP INDEX
CREATE VIEW

DDL describes what the database looks like.

DML (Data Manipulation Language)

SQL statements that work with data:

INSERT, UPDATE, DELETE
COPY ... FROM stdin (PostgreSQL bulk load)

DML describes what data is in the database.

Dialect-Specific Syntax

SQL Dialect

The specific flavor of SQL supported by a database engine. sql-splitter supports four dialects:

Dialect	Key Features
MySQL	Backtick quoting, `AUTO_INCREMENT`, `ENGINE=InnoDB`
PostgreSQL	Double-quote identifiers, `SERIAL`, `COPY FROM stdin`
SQLite	`INTEGER PRIMARY KEY` for auto-increment
MSSQL	`IDENTITY`, `GO` batch separators, `NVARCHAR`

AUTO_INCREMENT (MySQL)

A column attribute that automatically generates increasing numeric values. Commonly used for surrogate primary keys.

id INT AUTO_INCREMENT PRIMARY KEY

SERIAL (PostgreSQL)

A convenience type that creates an integer column backed by an auto-incrementing sequence. Modern PostgreSQL prefers IDENTITY.

id SERIAL PRIMARY KEY
-- Equivalent to: id INTEGER DEFAULT nextval('tablename_id_seq')

IDENTITY (PostgreSQL/MSSQL)

A standards-aligned way to define auto-generated numeric columns.

-- PostgreSQL
id INT GENERATED BY DEFAULT AS IDENTITY

-- MSSQL
id INT IDENTITY(1,1)

COPY FROM stdin (PostgreSQL)

A PostgreSQL bulk-loading command that reads table data from the client connection. In pg_dump output, it appears as:

COPY users (id, name, email) FROM stdin;
1  Alice  alice@example.com
2  Bob  bob@example.com
\.

The \. marks the end of data. sql-splitter’s convert command transforms this to INSERT statements for other dialects.

Dependency Graphs and Ordering

Dependency Graph

A directed graph where nodes represent tables and edges represent foreign key relationships. Used to determine safe ordering for creating tables and loading data.

Topological Order / Topological Sort

An ordering of tables such that every dependency (parent table) appears before the table that depends on it (child table). In SQL dumps, this means parent tables are created and populated before child tables, satisfying foreign key constraints during import.

See order for reordering dumps topologically.

Circular Dependency / Cycle

A situation where following foreign key relationships leads back to the starting table. For example: orders → payments → invoices → orders. Cycles prevent simple topological ordering and require special handling:

Defer constraints during import
Disable FK checks temporarily
Split data loading into phases

See graph --cycles-only to detect and visualize cycles.

Root Table

A table with no outbound foreign keys (it doesn’t reference other tables). Root tables can be loaded first since they have no dependencies.

Parent Table

The referenced table in a foreign key relationship—the table that “owns” the primary key being referenced. Parent tables must generally exist before child tables can reference them.

Child Table

The referencing table in a foreign key relationship—it contains the foreign key column(s). Child rows depend on parent rows existing.

Data Processing

Streaming

Processing a dump file incrementally as it’s read, rather than loading the entire file into memory.

Commands that stream (constant ~50MB memory):

split, merge, convert, analyze, redact

Commands that require more memory:

query - Loads data into DuckDB (use --disk for large files)
diff - Tracks PK hashes for comparison (use --schema-only to reduce)
validate - Tracks PKs/FKs when checking integrity (use --no-fk-checks to skip)
order, graph - Build dependency graph (memory scales with number of tables/FKs)
sample, shard - Track selected rows when preserving FK relationships

See Memory & Streaming for detailed memory profiles per command.

Deterministic

Given the same input and configuration, the tool produces identical output every run. Determinism is important for CI pipelines, debugging, and reproducible data workflows. Use --seed with sampling/redaction operations for deterministic “random” results.

Seed

A fixed value used to initialize a pseudo-random number generator. Using the same seed produces the same “random” results every time, making operations reproducible.

sql-splitter sample dump.sql --percent 10 --seed 42

Data Privacy and Anonymization

PII (Personally Identifiable Information)

Data that can identify a person directly or indirectly: names, email addresses, phone numbers, IP addresses, government IDs, etc. Many regulations require protecting PII, including GDPR (EU) and CCPA (California).

Redaction

Removing or masking sensitive values entirely, typically replacing them with NULL, asterisks, or a placeholder like [REDACTED]. Redaction prioritizes safety over data usefulness.

Anonymization

Transforming data to reduce identification risk while keeping it useful for testing or development. Techniques include replacing real names with fake names, hashing emails, or shuffling values.

Hashing

Converting a value into a fixed-length digest (e.g., SHA-256) that cannot be reversed. Hashing preserves equality (same input → same hash), which is useful for maintaining foreign key relationships during anonymization.

sql-splitter redact dump.sql --hash "*.email" -o safe.sql

Multi-Tenant Concepts

Tenant

A customer, account, or organization whose data is logically separated from others within the same database. Multi-tenant systems store multiple tenants’ data in shared infrastructure.

Multi-Tenant

An architecture where a single database serves multiple tenants, typically with row-level isolation via a tenant identifier column. This is common in SaaS applications.

Sharding

In sql-splitter context, extracting a subset of data belonging to a specific tenant from a multi-tenant database dump. See shard.

Tenant Column

A column (typically tenant_id, org_id, company_id, account_id) present in tables that indicates which tenant owns each row. sql-splitter auto-detects common tenant column names.

Visualization and Documentation

ERD (Entity Relationship Diagram)

A diagram showing tables, their columns, and the relationships (foreign keys) between them. ERDs help visualize database structure and identify dependency cycles.

See graph for generating ERDs from SQL dumps.

Schema Drift

When database structure differs across environments (development, staging, production) or differs from expected migrations. Drift can cause import failures or data inconsistencies.

See diff for detecting schema differences between dumps.

File Patterns

Glob Pattern

A simple pattern syntax for matching file paths:

Pattern	Matches
`*.sql`	All SQL files in current directory
`*/.sql`	All SQL files recursively
`dump_*.sql.gz`	Files starting with `dump_` ending in `.sql.gz`
`backups/202[3-4]*.sql`	Backups from 2023-2024

Most sql-splitter commands accept glob patterns:

sql-splitter validate "backups/**/*.sql" --fail-fast