This article covers Write-Ahead Logging (WAL) in the Geode graph database. WAL is a fundamental technique for ensuring data durability and enabling crash recovery: every database modification is recorded in a sequential log file before being applied to the database itself.

Introduction to Write-Ahead Logging

Write-Ahead Logging (WAL) is the cornerstone of database durability. The principle is elegantly simple: before modifying any data in the database, first write a description of the change to a sequential log file on persistent storage. This ensures that even if the system crashes mid-operation, the database can recover to a consistent state by replaying the log.

WAL has been a standard component of database systems since the 1970s, used by virtually every production database including PostgreSQL, Oracle, MySQL, and now Geode. The technique solves a fundamental problem: random writes to database pages are expensive (requiring disk seeks and scattered I/O), but sequential writes to a log file are fast (leveraging sequential disk throughput).

By committing transactions via WAL, Geode gains several capabilities:

  • Fast commits: Write sequential log records instead of scattered database pages
  • Crash recovery: Replay the WAL to reconstruct lost in-memory state
  • Point-in-time recovery: Restore the database to any moment covered by the retained WAL
  • Replication: Ship WAL records to replicas for synchronization

Geode’s WAL implementation provides full ACID durability guarantees while maintaining high write throughput—critical for production graph workloads.

Core WAL Concepts

The WAL Protocol

The fundamental WAL protocol consists of three rules:

  1. Write-Ahead Rule: Log records must reach persistent storage before data pages
  2. Force-Log-at-Commit: All log records for a transaction must be durable before commit returns
  3. No-Force Policy: Data pages may remain in memory; only WAL must be forced to disk

These rules ensure that:

  • Committed transactions survive crashes (durability)
  • Uncommitted transactions can be rolled back (atomicity)
  • The database can be reconstructed from the WAL (recoverability)
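The three rules above can be sketched in a few lines. This is a deliberately minimal illustration, not Geode's actual storage engine: `MiniWAL` and its JSON record format are hypothetical, but the ordering (append, fsync, then apply) is exactly the write-ahead protocol.

```python
import json
import os

class MiniWAL:
    """Minimal illustration of the write-ahead rule: log first, apply second."""

    def __init__(self, path):
        self.log = open(path, "ab")
        self.store = {}  # stands in for in-memory data pages

    def commit(self, txid, key, value):
        record = json.dumps({"tx": txid, "key": key, "value": value})
        self.log.write(record.encode() + b"\n")  # 1. write-ahead: log record first
        self.log.flush()
        os.fsync(self.log.fileno())              # 2. force-log-at-commit: durable before return
        self.store[key] = value                  # 3. no-force: the page itself may stay in memory
```

If the process dies between steps 2 and 3, the change is still recoverable: the durable log record describes it, so replay can reapply it.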

Log Sequence Numbers (LSN)

Every WAL record is assigned a monotonically increasing Log Sequence Number (LSN). LSNs serve multiple purposes:

LSN: 64-bit integer representing log position
Example: 0x0000000123456789

Components:
- High 32 bits: WAL file number
- Low 32 bits: Offset within file

LSNs enable:

  • Ordering of operations
  • Tracking which records have been applied
  • Identifying the recovery start point
  • Coordinating replication
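Following the layout above (high 32 bits = WAL file number, low 32 bits = offset), an LSN can be packed and unpacked with simple bit arithmetic. These helper names are illustrative, not part of Geode's API:

```python
def make_lsn(file_no: int, offset: int) -> int:
    """Pack a WAL file number and in-file offset into a 64-bit LSN."""
    assert 0 <= file_no < 2**32 and 0 <= offset < 2**32
    return (file_no << 32) | offset

def split_lsn(lsn: int) -> tuple[int, int]:
    """Recover (file_no, offset) from a 64-bit LSN."""
    return lsn >> 32, lsn & 0xFFFFFFFF

# LSNs compare in log order: a later position always has a larger value,
# which is what makes them usable for ordering and recovery start points.
lsn = make_lsn(0x1, 0x23456789)
assert lsn == 0x0000000123456789  # the example LSN from above
assert split_lsn(lsn) == (0x1, 0x23456789)
```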

WAL Record Structure

Each WAL record contains:

WAL Record Format:
[LSN][Transaction ID][Operation Type][Before Image][After Image][Checksum]

Example:
[0x123456][tx-98765][UPDATE_NODE][{name: "Alice"}][{name: "Alice", age: 30}][crc32: 0xABCD]

Components:

  • LSN: Unique identifier for this record
  • Transaction ID: Which transaction generated this operation
  • Operation Type: INSERT, UPDATE, DELETE, BEGIN, COMMIT, ABORT
  • Before Image: Data state before modification (for undo)
  • After Image: Data state after modification (for redo)
  • Checksum: Integrity verification
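A sketch of this record layout using JSON for the body and CRC32 as the checksum (Geode's real encoding is binary; this just shows how the checksum guards the whole record):

```python
import json
import zlib

def encode_record(lsn, txid, op, before, after):
    """Serialize a WAL record and append a CRC32 checksum over the body."""
    body = json.dumps([lsn, txid, op, before, after]).encode()
    return body + zlib.crc32(body).to_bytes(4, "big")

def decode_record(raw):
    """Verify the checksum, then deserialize; raises on corruption."""
    body, crc = raw[:-4], int.from_bytes(raw[-4:], "big")
    if zlib.crc32(body) != crc:
        raise ValueError("WAL record checksum mismatch")
    return json.loads(body)
```

During recovery, a checksum mismatch is how a torn or partial write at the tail of the log is detected and discarded.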

Checkpointing

Periodically, Geode performs a checkpoint:

  1. Flush all dirty pages to disk
  2. Write a checkpoint record to WAL
  3. Record the LSN in a known location

During recovery, Geode only needs to replay WAL records after the last checkpoint, dramatically reducing recovery time.
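The three checkpoint steps can be sketched as follows. The `pager` and `wal` helpers are hypothetical stand-ins for Geode's internals; the notable detail is the atomic rename, so the control file always holds a complete checkpoint LSN even if the process dies mid-write:

```python
import json
import os

def checkpoint(pager, wal, control_path):
    """Flush dirty pages, log a checkpoint record, persist its LSN atomically."""
    pager.flush_dirty_pages()                  # 1. all dirty pages reach disk
    lsn = wal.append({"op": "CHECKPOINT"})     # 2. checkpoint record in the WAL
    wal.fsync()
    tmp = control_path + ".tmp"                # 3. record the LSN in a known location,
    with open(tmp, "w") as f:                  #    using write-then-rename for atomicity
        json.dump({"checkpoint_lsn": lsn}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, control_path)
    return lsn
```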

How WAL Works in Geode

Write Path

When a transaction modifies data:

  1. Generate WAL Record: Create log record describing the change
  2. Append to WAL Buffer: Write to in-memory buffer (fast)
  3. Modify In-Memory Data: Update in-memory database structures
  4. Flush WAL on Commit: Call fsync() to ensure WAL reaches disk
  5. Return Success: Commit completes only after WAL is durable

BEGIN TRANSACTION;
MATCH (p:Person {id: 123})
SET p.age = 30;
-- At this point, WAL record exists in memory buffer

COMMIT;
-- fsync() forces WAL to disk, then commit returns

Crash Recovery

On startup after a crash, Geode performs WAL replay:

  1. Find Last Checkpoint: Read checkpoint LSN from control file
  2. Read WAL: Start reading from checkpoint LSN
  3. Identify Transactions: Separate committed from uncommitted transactions
  4. Redo Pass: Replay committed transactions to reconstruct state
  5. Undo Pass: Roll back uncommitted transactions
  6. Resume Operation: Database is now consistent and ready

Recovery Example:
Checkpoint at LSN 1000
Crash at LSN 1500

WAL Records:
[1001][tx-1][BEGIN]
[1002][tx-1][INSERT Node(Person, id=1)]
[1003][tx-1][COMMIT]
[1004][tx-2][BEGIN]
[1005][tx-2][INSERT Node(Person, id=2)]
[CRASH - no commit for tx-2]

Recovery:
- Replay tx-1: INSERT Node(Person, id=1) - COMMITTED
- Rollback tx-2: Undo INSERT Node(Person, id=2) - UNCOMMITTED
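The redo/undo passes over the example above can be sketched like this, with records as plain dicts (Geode's real replay operates on binary records and uses before images for undo):

```python
def recover(records, store):
    """Replay committed transactions; discard effects of uncommitted ones."""
    committed = {r["tx"] for r in records if r["op"] == "COMMIT"}

    # Redo pass: reapply operations from committed transactions in LSN order.
    for r in records:
        if r["op"] == "INSERT" and r["tx"] in committed:
            store[r["key"]] = r["value"]

    # Undo pass: remove effects of uncommitted transactions, newest first
    # (their pages may have been flushed to disk before the crash).
    for r in reversed(records):
        if r["op"] == "INSERT" and r["tx"] not in committed:
            store.pop(r["key"], None)
    return store
```

Running it on the example log leaves `person:1` (tx-1 committed) and removes `person:2` (tx-2 never committed).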

WAL Archival and Rotation

WAL files grow continuously. Geode manages this through:

WAL Rotation: When a WAL file reaches a size limit (default 64MB), start a new file

wal/
├── 00000001.wal (complete, archived)
├── 00000002.wal (complete, archived)
├── 00000003.wal (complete, archived)
└── 00000004.wal (current, active)
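Given the LSN layout described earlier (high 32 bits = file number), the segment filename containing any LSN follows directly. This helper is illustrative and assumes decimal segment numbering as in the listing above:

```python
def wal_filename(lsn: int) -> str:
    """Name of the WAL segment containing this LSN: zero-padded file number."""
    return f"{lsn >> 32:08d}.wal"

assert wal_filename((4 << 32) | 0x1000) == "00000004.wal"
```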

WAL Archival: Old WAL files are moved to archive storage for:

  • Point-in-time recovery
  • Replication to new replicas
  • Audit and compliance

WAL Recycling: After checkpoint, archived WAL files older than the retention period are deleted or recycled.

Integration with MVCC

WAL works hand-in-hand with Multi-Version Concurrency Control:

  • Before Images: WAL records contain old version data for rollback
  • After Images: WAL records contain new version data for redo
  • Version Chains: WAL enables reconstruction of version chains after crash
  • Garbage Collection: WAL archival coordinated with MVCC GC

Use Cases and Benefits

Crash Recovery

WAL enables automatic recovery from crashes:

# Server crashes during transaction
$ geode serve
[CRASH - power failure]

# Restart automatically recovers
$ geode serve
[2026-01-24 10:00:00] Starting crash recovery...
[2026-01-24 10:00:01] Reading WAL from LSN 123456789
[2026-01-24 10:00:02] Replaying 1523 committed transactions
[2026-01-24 10:00:03] Rolling back 2 uncommitted transactions
[2026-01-24 10:00:04] Recovery complete
[2026-01-24 10:00:05] Database ready

Point-in-Time Recovery

WAL enables time-travel recovery:

# Restore database to specific timestamp
$ geode restore --time "2026-01-24T09:00:00Z"
[2026-01-24 10:05:00] Restoring from checkpoint at 08:00:00
[2026-01-24 10:05:01] Replaying WAL up to 09:00:00
[2026-01-24 10:05:02] Restored to 2026-01-24T09:00:00Z

This is invaluable for:

  • Recovering from accidental data deletion
  • Investigating historical states for debugging
  • Compliance and audit requirements

Replication

WAL enables efficient replication:

Primary Server                 Replica Server
    |                                |
    |----[WAL Record 1001]---------->|
    |----[WAL Record 1002]---------->|
    |----[WAL Record 1003]---------->|
    |                                |
    |                          [Apply WAL]

Replicas apply WAL records to stay synchronized with the primary, enabling:

  • High availability (fail over to replica)
  • Read scalability (distribute reads across replicas)
  • Geographic distribution (replicas in multiple regions)

Audit Logging

WAL provides a complete audit trail:

-- Query WAL for specific transaction
CALL dbms.audit.wal.query({
  transaction_id: "tx-12345",
  time_range: {start: "2026-01-24T00:00:00Z", end: "2026-01-24T23:59:59Z"}
}) YIELD lsn, operation, before_image, after_image, timestamp
RETURN lsn, operation, before_image, after_image, timestamp;

This supports:

  • Compliance (SOC 2, HIPAA, GDPR)
  • Forensic investigation
  • Change tracking

Best Practices

WAL Storage Configuration

  1. Use dedicated storage: Place WAL on separate disks from data files

    $ geode serve --wal-dir /mnt/wal-ssd --data-dir /mnt/data-hdd
    
  2. Use SSDs for WAL: Commit latency is dominated by fsync(), which SSDs cut by orders of magnitude

    WAL on HDD: ~10ms fsync latency → 100 commits/sec
    WAL on SSD: ~0.1ms fsync latency → 10,000 commits/sec
    
  3. RAID configuration: Use RAID 1 or RAID 10 for WAL reliability

Checkpoint Tuning

Configure checkpoint frequency to balance:

  • Recovery time: More frequent checkpoints → faster recovery
  • Performance: Less frequent checkpoints → less I/O overhead

# geode.yaml
wal:
  checkpoint_interval: 5m  # Checkpoint every 5 minutes
  checkpoint_timeout: 30s  # Max time for checkpoint
  max_wal_size: 1GB        # Force checkpoint if WAL exceeds 1GB

WAL Archival Strategy

# geode.yaml
wal:
  archive_mode: enabled
  archive_command: "cp %p /mnt/wal-archive/%f"
  archive_timeout: 60s
  retention_days: 30

Configure archival based on requirements:

  • Compliance: Long retention (90+ days)
  • Replication: Keep WAL until all replicas apply it
  • Storage cost: Balance retention with storage costs

Monitoring WAL Health

Track critical WAL metrics:

-- WAL metrics
CALL dbms.monitor.wal.stats()
YIELD current_lsn, checkpoint_lsn, wal_files, wal_size, flush_rate;

-- WAL lag (replication)
CALL dbms.monitor.replication.lag()
YIELD replica, wal_lag_bytes, wal_lag_seconds;

-- Checkpoint statistics
CALL dbms.monitor.checkpoint.stats()
YIELD last_checkpoint, duration, pages_written;

Alert on:

  • WAL accumulation: WAL files growing without bound
  • Slow fsync: Commit latency increasing
  • Replication lag: Replicas falling behind

Troubleshooting

WAL Disk Full

Symptom: Commits fail with “WAL disk full” error
Cause: WAL directory ran out of space
Solution:

  1. Add storage to WAL directory
  2. Reduce WAL retention
  3. Increase checkpoint frequency
  4. Archive old WAL files to cheaper storage

Slow Commits

Symptom: High commit latency
Cause: Slow fsync() calls
Solution:

  1. Move WAL to faster storage (SSD)
  2. Check disk I/O contention
  3. Tune filesystem mount options (noatime, data=writeback)
  4. Consider group commit optimization

Long Recovery Times

Symptom: Database takes minutes to recover after crash
Cause: Large amount of WAL to replay
Solution:

  1. Increase checkpoint frequency
  2. Reduce transaction size
  3. Add more memory for in-memory buffers
  4. Use parallel WAL replay (if available)

Corrupted WAL

Symptom: Recovery fails with checksum mismatch
Cause: Disk corruption, partial write, or bit flip
Solution:

  1. Restore from backup
  2. Recover to last valid LSN (may lose recent commits)
  3. Use WAL archive if available
  4. Check hardware (bad disk, bad memory)

Performance Considerations

WAL Write Throughput

Optimize WAL write performance:

  1. Batch commits: Group multiple transactions into single fsync

    wal:
      group_commit_delay: 10ms  # Wait up to 10ms to batch commits
      group_commit_size: 100     # Or batch up to 100 transactions
    
  2. Asynchronous commit: Trade durability for throughput (use cautiously)

    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED, COMMIT MODE ASYNC;
    -- Transaction commits without waiting for fsync
    COMMIT;
    
  3. WAL compression: Reduce WAL size with compression

    wal:
      compression: lz4  # Fast compression algorithm
    

Recovery Performance

Optimize recovery speed:

  1. Parallel replay: Replay independent transactions concurrently
  2. Incremental checkpoints: Spread checkpoint I/O over time
  3. Hot standby: Keep replica always ready for instant failover


Advanced WAL Features

Parallel WAL Writing

Geode uses parallel WAL writers for improved throughput:

Transaction Commit Pipeline:

Thread 1: Txn A → WAL Buffer → Flush Queue
Thread 2: Txn B → WAL Buffer → Flush Queue
Thread 3: Txn C → WAL Buffer → Flush Queue
                              ↓
                      WAL Writer Thread
                         fsync() call
                      (batches A, B, C)
                              ↓
                      All 3 txns commit

Benefit: Single fsync() commits multiple transactions
Result: 10,000+ commits/sec on NVMe SSD

Group Commit Optimization

Batch commits to amortize fsync() cost:

# geode.yaml
wal:
  group_commit:
    enabled: true
    max_delay_ms: 10      # Wait up to 10ms to batch
    max_batch_size: 100   # Or commit when 100 txns queued

# Result:
# - Individual commit latency: 0.1ms (WAL append)
# - Actual fsync() latency: 10ms (once per batch)
# - Effective throughput: 10,000 commits/sec vs 100 commits/sec
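The batching behaviour can be sketched with one writer thread that fsyncs once per batch. This is a simplified model (hypothetical class, and it omits the max-delay timer — the writer simply drains whatever is already queued):

```python
import os
import queue
import threading

class GroupCommitWAL:
    """Many committers share one fsync(): append, enqueue, wait for the flush."""

    def __init__(self, path, max_batch=100):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.q = queue.Queue()
        self.max_batch = max_batch
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: bytes):
        done = threading.Event()
        with self.lock:
            self.f.write(record + b"\n")   # cheap buffered append
        self.q.put(done)
        done.wait()                        # returns once a batched fsync covers us

    def _writer(self):
        while True:
            batch = [self.q.get()]         # block for the first committer
            while len(batch) < self.max_batch:
                try:                       # then grab anything else already queued
                    batch.append(self.q.get_nowait())
                except queue.Empty:
                    break
            with self.lock:
                self.f.flush()
                os.fsync(self.f.fileno())  # one fsync durably commits the whole batch
            for done in batch:
                done.set()
```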

WAL Compression

Reduce WAL storage and I/O:

wal:
  compression:
    algorithm: lz4  # Fast compression
    level: 1        # Low CPU overhead
    min_size: 1024  # Only compress records >1KB

# Results:
# - 60-80% compression ratio for typical graph operations
# - <5% CPU overhead for compression
# - Reduced network bandwidth for replication
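A sketch of size-gated record compression, using stdlib zlib as a stand-in for lz4 (which is not in Python's standard library). A flag byte tells the reader whether the payload was compressed, mirroring the `min_size` setting above:

```python
import zlib

MIN_SIZE = 1024  # mirror min_size: small records are not worth compressing

def compress_record(raw: bytes) -> bytes:
    """Compress large records; prefix a flag byte so the reader can tell."""
    if len(raw) < MIN_SIZE:
        return b"\x00" + raw
    return b"\x01" + zlib.compress(raw, level=1)  # low level ≈ low CPU overhead

def decompress_record(stored: bytes) -> bytes:
    flag, payload = stored[:1], stored[1:]
    return zlib.decompress(payload) if flag == b"\x01" else payload
```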

Incremental Checkpointing

Spread checkpoint I/O over time:

Traditional Checkpoint:
Time: ────────────[CHECKPOINT]────────────────────
I/O:               ████████████ (burst)

Incremental Checkpoint:
Time: ────────────────────────────────────────────
I/O:               ██ ██ ██ ██ ██ ██ (spread out)

Benefits:
- No I/O spikes
- Consistent query performance
- Faster recovery (shorter WAL replay)

Configuration:

wal:
  checkpoint:
    mode: incremental
    target_duration: 300s     # Spread over 5 minutes
    max_dirty_pages: 100000   # Trigger at 100k dirty pages
    completion_target: 0.9    # Aim to complete 90% of checkpoint interval

WAL Replication Strategies

Streaming Replication

Ship WAL records to replicas in real-time:

Primary Server                        Replica Server
     │                                      │
     │ [Txn Commit]                         │
     │      ↓                                │
     │ [WAL Write]                           │
     │      ↓                                │
     │ [Stream WAL Record]───────────────>  │
     │                                    [Apply WAL]
     │                                       ↓
     │ <─────────────[ACK]──────────────  [Update State]
     │                                       │
     │ [Commit Txn]                          │

Implementation:

replication:
  mode: streaming
  sync_mode: async  # or 'sync' for synchronous replication
  max_lag_bytes: 10485760  # 10MB max lag
  max_lag_seconds: 10      # 10 seconds max time lag

  replicas:
    - host: replica1.example.com
      port: 7000
      sync_priority: 1
    - host: replica2.example.com
      port: 7000
      sync_priority: 2

Synchronous vs. Asynchronous Replication

-- Synchronous replication (wait for replica ACK)
BEGIN TRANSACTION REPLICATION sync;
CREATE (:User {id: 123, name: 'Alice'});
COMMIT;  -- Blocks until replica confirms WAL applied

-- Asynchronous replication (don't wait)
BEGIN TRANSACTION REPLICATION async;
CREATE (:User {id: 456, name: 'Bob'});
COMMIT;  -- Returns immediately, replica applies eventually

Trade-offs:

Synchronous:

  • Pro: Zero data loss on failover
  • Pro: Strong consistency across replicas
  • Con: Higher commit latency (network round-trip)
  • Con: Availability depends on replica health

Asynchronous:

  • Pro: Low commit latency
  • Pro: Primary unaffected by replica failures
  • Con: Potential data loss on failover (uncommitted WAL)
  • Con: Eventual consistency

Cascading Replication

Replicate across multiple tiers:

      Primary (US-East)
           
           ├───> Replica 1 (US-East)
           
           └───> Replica 2 (US-West)
                     
                     ├───> Replica 3 (EU-West)
                     
                     └───> Replica 4 (AP-South)

Benefits:
- Reduced load on primary
- Geographic distribution
- Lower cross-region bandwidth

WAL-Based Logical Replication

Replicate specific subsets of data:

-- Create logical replication slot
CREATE REPLICATION SLOT analytics_slot LOGICAL;

-- Define publication (what to replicate)
CREATE PUBLICATION analytics_pub FOR
  LABELS (User, Order, Product)
  WHERE region = 'us-east';

-- Subscriber consumes filtered WAL stream
SUBSCRIBE TO PUBLICATION analytics_pub
  FROM SLOT analytics_slot;

Use cases:

  • Replicate subset of data to analytics database
  • Multi-tenant replication (one tenant per replica)
  • Cross-database replication (Geode → PostgreSQL)

Disaster Recovery with WAL

Continuous WAL Archival

#!/bin/bash
# WAL archival script

# Geode calls this when WAL segment complete
WAL_FILE=$1
WAL_PATH="/var/lib/geode/wal/$WAL_FILE"

# Archive to S3
aws s3 cp "$WAL_PATH" "s3://geode-wal-archive/$(date +%Y%m%d)/$WAL_FILE" \
  --storage-class GLACIER

# Verify upload
if [ $? -eq 0 ]; then
  echo "Archived $WAL_FILE to S3"
else
  echo "Failed to archive $WAL_FILE" >&2
  exit 1
fi

Configuration:

wal:
  archive:
    enabled: true
    command: "/usr/local/bin/archive_wal.sh %p"
    timeout: 300  # 5 minute timeout
    retention_days: 30

Point-in-Time Recovery (PITR)

Restore database to specific timestamp:

# Stop Geode server
systemctl stop geode

# Restore base backup
tar -xzf /backups/base-backup-20260124.tar.gz -C /var/lib/geode/data

# Configure recovery
cat > /var/lib/geode/data/recovery.conf <<EOF
restore_command = 'aws s3 cp s3://geode-wal-archive/%f %p'
recovery_target_time = '2026-01-24 14:30:00'
recovery_target_action = promote
EOF

# Start Geode (recovery mode)
systemctl start geode

# Geode replays WAL up to target time, then becomes operational

Recovery process:

  1. Restore base backup
  2. Replay WAL files from archive
  3. Stop at recovery target time
  4. Apply final checkpoint
  5. Database ready at target timestamp
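Steps 2-3 of this process (replay archived WAL, stop at the recovery target time) reduce to a bounded replay loop, sketched here with hypothetical record dicts whose timestamps are comparable (e.g. ISO-8601 strings):

```python
def replay_until(records, target_time, apply):
    """Apply WAL records in LSN order until one is newer than the PITR target."""
    last_lsn = None
    for r in records:                    # records assumed sorted by LSN
        if r["timestamp"] > target_time:
            break                        # stop: this record is after the target moment
        apply(r)
        last_lsn = r["lsn"]
    return last_lsn                      # recovery point at which the database is promoted
```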

WAL Retention for Compliance

Configure long-term WAL retention:

wal:
  archive:
    enabled: true
    retention:
      short_term:
        days: 7
        storage: local  # /var/lib/geode/wal_archive
      long_term:
        days: 2555  # 7 years for compliance
        storage: s3_glacier
        bucket: geode-compliance-archive

Monitoring WAL Health

Critical WAL Metrics

from prometheus_client import Counter, Gauge, Histogram

# WAL metrics
wal_current_lsn = Gauge('geode_wal_current_lsn', 'Current WAL LSN')
wal_files_count = Gauge('geode_wal_files_count', 'Number of WAL files')
wal_bytes_written = Counter('geode_wal_bytes_written_total', 'Total bytes written to WAL')
wal_sync_duration = Histogram('geode_wal_sync_seconds', 'WAL fsync duration')

# Replication lag
replication_lag_bytes = Gauge(
    'geode_replication_lag_bytes',
    'Replication lag in bytes',
    ['replica']
)
replication_lag_seconds = Gauge(
    'geode_replication_lag_seconds',
    'Replication lag in seconds',
    ['replica']
)

# Checkpoint metrics
checkpoint_duration = Histogram('geode_checkpoint_duration_seconds', 'Checkpoint duration')
checkpoint_pages_written = Counter('geode_checkpoint_pages_written_total', 'Pages written during checkpoint')

WAL Health Checks

async def check_wal_health(client):
    """Monitor WAL health (client and alert() are assumed helpers)"""
    # Check WAL disk usage
    wal_stats, _ = await client.query("""
        SELECT SUM(size_bytes) AS total_size
        FROM SYSTEM.wal_files
    """)

    if wal_stats.bindings[0]['total_size'] > 10 * 1024 * 1024 * 1024:  # >10GB
        alert("WAL directory growing too large")

    # Check replication lag
    lag, _ = await client.query("""
        SELECT replica_name,
               current_lsn - replica_lsn AS lag_bytes,
               extract(epoch from now() - last_replay_time) AS lag_seconds
        FROM SYSTEM.replication_status
    """)

    for replica in lag.bindings:
        if replica['lag_bytes'] > 100 * 1024 * 1024:  # >100MB
            alert(f"Replica {replica['replica_name']} falling behind")

    # Check WAL archival
    unarchived, _ = await client.query("""
        SELECT COUNT(*) AS count
        FROM SYSTEM.wal_files
        WHERE archived = false
          AND created_at < NOW() - INTERVAL '1 hour'
    """)

    if unarchived.bindings[0]['count'] > 10:
        alert("WAL archival falling behind")

Grafana Dashboard Queries

# WAL write rate
rate(geode_wal_bytes_written_total[5m])

# Average fsync duration
rate(geode_wal_sync_seconds_sum[5m]) /
rate(geode_wal_sync_seconds_count[5m])

# Replication lag (seconds)
geode_replication_lag_seconds{replica="replica1"}

# WAL disk usage
geode_wal_files_count * 64 * 1024 * 1024  # Assuming 64MB segments

Further Reading

Geode’s Write-Ahead Logging implementation provides rock-solid durability guarantees while maintaining high write throughput—essential for production graph workloads where data loss is unacceptable.

