Database Recovery

This guide covers database recovery for Geode, from automatic crash recovery to manual disaster recovery procedures. Understanding these mechanisms helps you minimize data loss and restore service quickly.

Overview

Geode provides multiple recovery mechanisms to handle various failure scenarios:

Recovery Type	Use Case	RTO	RPO
Automatic Crash Recovery	Server restart after crash	Seconds	Zero
WAL Replay	Incomplete transactions	Seconds	Zero
Point-in-Time Recovery	Restore to specific timestamp	Minutes	Configurable
Full Restore	Complete database restoration	Minutes	Last backup
Disaster Recovery	Site failure, data corruption	Minutes-Hours	Last archived WAL segment (last backup without WAL archiving)

Key Terms:

RTO (Recovery Time Objective): Maximum acceptable downtime
RPO (Recovery Point Objective): Maximum acceptable data loss

Automatic Crash Recovery

How It Works

Geode uses Write-Ahead Logging (WAL) to ensure durability. When the server restarts after a crash:

WAL Analysis: Scans WAL to identify incomplete transactions
Redo Phase: Replays committed transactions not yet flushed to disk
Undo Phase: Rolls back uncommitted transactions
Checkpoint: Creates recovery checkpoint
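
The analysis step described above can be illustrated with a toy model. This sketch is not Geode's implementation or its WAL format: the one-record-per-line format (`<txid> <BEGIN|WRITE|COMMIT>`) and the function name `classify_wal` are assumptions invented for illustration. It only shows how a single scan can separate committed transactions (to redo) from incomplete ones (to undo).

```shell
# Toy model of WAL analysis. Reads records on stdin, one per line:
#   <txid> <BEGIN|WRITE|COMMIT>
# This format is hypothetical and is NOT Geode's on-disk WAL format.
classify_wal() {
  awk '
    $2 == "BEGIN"  { state[$1] = "open"; order[++n] = $1 }   # remember first-seen order
    $2 == "COMMIT" { state[$1] = "committed" }
    END {
      for (i = 1; i <= n; i++) {
        tx = order[i]
        if (state[tx] == "committed") redo = redo " " tx     # committed: replay
        else                          undo = undo " " tx     # never committed: roll back
      }
      print "redo:" redo
      print "undo:" undo
    }
  '
}
```

Feeding it the records `t1 BEGIN`, `t1 WRITE`, `t2 BEGIN`, `t1 COMMIT`, `t2 WRITE` lists `t1` under redo and `t2` under undo, because `t2` never reached a commit record.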

Crash Recovery Process

Server Crash
    │
    ▼
Server Restart
    │
    ▼
┌─────────────────────────────────────┐
│  1. Load last checkpoint            │
│  2. Scan WAL from checkpoint LSN    │
│  3. Build transaction table         │
│  4. Redo committed transactions     │
│  5. Undo uncommitted transactions   │
│  6. Create new checkpoint           │
└─────────────────────────────────────┘
    │
    ▼
Server Ready

Verifying Crash Recovery

After the server restarts, verify that recovery completed:

# Check server logs for recovery messages
geode logs --tail 100 | grep -i recovery

# Expected output:
# [INFO] Starting crash recovery from checkpoint LSN: 12345678
# [INFO] WAL replay: 156 transactions to redo
# [INFO] WAL replay: 3 transactions to undo
# [INFO] Crash recovery completed in 1.2s
# [INFO] Server ready on port 3141

# Verify database integrity
geode verify --data-dir /var/lib/geode/data

# Run health check query
geode query "RETURN 1 AS health_check"
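
When the log check is scripted, it can help to reduce the recovery messages to a single summary line. The following is a minimal sketch that assumes log lines look exactly like the expected output shown above (an `[INFO]` prefix and no timestamp); the function name `recovery_summary` is hypothetical, and the field positions would need adjusting for a different log layout.

```shell
# Summarize crash-recovery log lines read from stdin.
# Assumes the line layout shown in the expected output above, e.g.
#   [INFO] WAL replay: 156 transactions to redo
# Exits non-zero if no "Crash recovery completed" line is found.
recovery_summary() {
  awk '
    /WAL replay: [0-9]+ transactions to redo/ { redo = $4 }
    /WAL replay: [0-9]+ transactions to undo/ { undo = $4 }
    /Crash recovery completed in/             { took = $NF }
    END {
      if (took == "") { print "recovery not confirmed"; exit 1 }
      print "redo=" redo " undo=" undo " duration=" took
    }
  '
}
```

A possible use is `geode logs --tail 100 | recovery_summary`, which fails when the completion message is missing from the tail of the log.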

Recovery Configuration

# geode.yaml
recovery:
  # WAL configuration
  wal:
    enabled: true
    sync_mode: fsync        # Options: none, fsync, fdatasync
    buffer_size: 16MB
    segment_size: 64MB

  # Checkpoint configuration
  checkpoint:
    interval: 5m            # Checkpoint frequency
    min_wal_size: 256MB     # Minimum WAL before checkpoint

  # Recovery limits
  max_recovery_time: 10m    # Abort if recovery exceeds this duration
  parallel_redo: true       # Parallel transaction replay
  redo_threads: 4           # Number of redo threads

Point-in-Time Recovery (PITR)

PITR allows restoring the database to any point in time within the retention period.

Prerequisites

WAL archiving enabled
Periodic full backups
Archived WAL segments

Enable WAL Archiving

# geode.yaml
backup:
  wal_archiving:
    enabled: true
    archive_path: /var/lib/geode/wal_archive
    # Or archive to S3 instead of a local path
    s3:
      enabled: true
      bucket: geode-wal-archive
      prefix: production/wal
    retention_hours: 168     # 7 days

PITR Restore Procedure

Step 1: Stop the Server

sudo systemctl stop geode

Step 2: Identify Target Time

Determine the point in time to restore to:

# List available recovery points
geode backup --dest s3://geode-backups/production --list-recovery-points

# Output:
# Recovery Points Available:
# ─────────────────────────────────────────────────────────
# Base Backup ID    Timestamp               Earliest PITR
# 1738012345        2026-01-23 02:00:00     2026-01-23 02:00:00
# WAL Segments: 2026-01-23 02:00:00 to 2026-01-28 10:30:00
# ─────────────────────────────────────────────────────────
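
Before running a restore, it is worth confirming that the chosen target time actually falls inside the window reported above. The sketch below is one way to do that check in plain shell; the function names `ts_key` and `in_pitr_window` are hypothetical, and it assumes all three timestamps use the `YYYY-MM-DD HH:MM:SS` layout and the same timezone.

```shell
# Convert "YYYY-MM-DD HH:MM:SS" into the sortable integer YYYYMMDDHHMMSS.
ts_key() (
  digits=$(printf '%s' "$1" | tr -cd '0-9')
  [ "${#digits}" -eq 14 ] || { echo "bad timestamp: $1" >&2; exit 2; }
  echo "$digits"
)

# in_pitr_window EARLIEST LATEST TARGET
# Succeeds when EARLIEST <= TARGET <= LATEST (all in the same timezone).
in_pitr_window() (
  earliest=$(ts_key "$1") || exit 2
  latest=$(ts_key "$2")   || exit 2
  target=$(ts_key "$3")   || exit 2
  [ "$earliest" -le "$target" ] && [ "$target" -le "$latest" ]
)
```

With the window from the sample output, `in_pitr_window "2026-01-23 02:00:00" "2026-01-28 10:30:00" "2026-01-28 10:30:00"` succeeds, while a target of `2026-01-28 11:00:00` is rejected because it lies past the last archived WAL.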

Step 3: Restore Base Backup

# Move the existing data directory aside so the restore writes into a fresh one
sudo mv /var/lib/geode/data /var/lib/geode/data.old

# Restore base backup
geode restore \
  --source s3://geode-backups/production \
  --backup-id 1738012345 \
  --target /var/lib/geode/data

Step 4: Apply WAL to Target Time

geode recover \
  --data-dir /var/lib/geode/data \
  --wal-archive s3://geode-wal-archive/production/wal \
  --target-time "2026-01-28 10:30:00" \
  --timezone UTC

# Output:
# Starting point-in-time recovery
# Base backup: 1738012345 (2026-01-23 02:00:00)
# Target time: 2026-01-28 10:30:00 UTC
# Applying WAL segments:
#   - segment-0001.wal (2026-01-23 02:05:00 - 2026-01-24 02:00:00)
#   - segment-0002.wal (2026-01-24 02:00:00 - 2026-01-25 02:00:00)
#   - segment-0003.wal (2026-01-25 02:00:00 - 2026-01-26 02:00:00)
#   - segment-0004.wal (2026-01-26 02:00:00 - 2026-01-27 02:00:00)
#   - segment-0005.wal (2026-01-27 02:00:00 - 2026-01-28 02:00:00)
#   - segment-0006.wal (2026-01-28 02:00:00 - 2026-01-28 11:00:00)
# Stopping at: 2026-01-28 10:30:00 UTC
# Recovery complete
# Transactions applied: 12,456
# Final LSN: 0x1A2B3C4D

Step 5: Verify and Start

# Verify data integrity
geode verify --data-dir /var/lib/geode/data

# Start server
sudo systemctl start geode

# Verify recovery
geode query "MATCH (n) RETURN count(n) AS node_count"

PITR to Named Recovery Point

If you created named recovery points:

# Recover to named point
geode recover \
  --data-dir /var/lib/geode/data \
  --wal-archive s3://geode-wal-archive/production/wal \
  --target-name "pre-migration-backup"

Full Database Restore

Use a full restore to rebuild the complete database from a backup.

Standard Restore

# Stop server
sudo systemctl stop geode

# Backup current data (safety)
sudo mv /var/lib/geode/data /var/lib/geode/data.backup-$(date +%Y%m%d)

# List available backups
geode backup --dest s3://geode-backups/production --list

# Restore latest backup
LATEST_BACKUP=$(geode backup --dest s3://geode-backups/production --list --latest --quiet)

geode restore \
  --source s3://geode-backups/production \
  --backup-id "$LATEST_BACKUP" \
  --target /var/lib/geode/data

# Verify integrity
geode verify --data-dir /var/lib/geode/data

# Start server
sudo systemctl start geode

Restore with Incremental Backups

When using an incremental backup strategy:

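# Restore the full backup chain:
#   --backup-id           the base full backup
#   --apply-incrementals  apply all incremental backups taken after it
# (Comments cannot follow a trailing backslash, so they are listed here.)
geode restore \
  --source s3://geode-backups/production \
  --backup-id 1738012345 \
  --apply-incrementals \
  --target /var/lib/geode/data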

# Output:
# Restoring base backup 1738012345...
# Applying incremental 1738098745 (2026-01-24)...
# Applying incremental 1738185145 (2026-01-25)...
# Applying incremental 1738271545 (2026-01-26)...
# Applying incremental 1738357945 (2026-01-27)...
# Restore complete

Restore to Different Location

# Restore to alternate directory
geode restore \
  --source s3://geode-backups/production \
  --backup-id 1738012345 \
  --target /var/lib/geode/data-restored

# Start with alternate data directory
geode serve \
  --data-dir /var/lib/geode/data-restored \
  --listen 0.0.0.0:3142

Disaster Recovery

Complete Site Failure

When the primary site is unavailable:

Step 1: Provision New Infrastructure

# Deploy new Geode server
# (Use your provisioning tool: Terraform, Ansible, etc.)

# Install Geode
curl -fsSL https://geodedb.com/install.sh | sudo bash

Step 2: Configure Cloud Storage Access

export AWS_ACCESS_KEY_ID='your-access-key'
export AWS_SECRET_ACCESS_KEY='your-secret-key'
export AWS_REGION='us-west-2'

Step 3: Restore from Off-Site Backup

# List available backups
geode backup --dest s3://geode-backups/production --list

# Restore latest backup
geode restore \
  --source s3://geode-backups/production \
  --backup-id latest \
  --target /var/lib/geode/data

# Replay archived WAL up to the latest available point to minimize data loss
geode recover \
  --data-dir /var/lib/geode/data \
  --wal-archive s3://geode-wal-archive/production/wal \
  --target-time "latest"

Step 4: Verify and Start

# Verify data
geode verify --data-dir /var/lib/geode/data

# Start server
sudo systemctl start geode

# Run comprehensive health check
geode query "MATCH (n) RETURN labels(n) AS label, count(n) AS count"

Data Corruption Recovery

If data files are corrupted:

Step 1: Identify Corruption

# Run integrity check
geode verify --data-dir /var/lib/geode/data --full

# Output:
# Checking nodes.db... ERROR: Checksum mismatch at block 12345
# Checking edges.db... OK
# Checking indexes/... OK
# Verification FAILED: 1 corrupted file(s)

Step 2: Attempt Repair

# Try automatic repair (minor corruption)
geode repair --data-dir /var/lib/geode/data --auto

# Output:
# Repairing nodes.db...
# Block 12345: Reconstructing from WAL
# Block 12345: Repair successful
# Repair complete

Step 3: If Repair Fails, Restore from Backup

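# Stop server
sudo systemctl stop geode

# Move the corrupted data aside (keeps it available for later analysis)
sudo mv /var/lib/geode/data /var/lib/geode/data.corrupted-$(date +%Y%m%d)

# Restore from last known good backup
geode restore \
  --source s3://geode-backups/production \
  --backup-id last-verified \
  --target /var/lib/geode/data

# Verify integrity
geode verify --data-dir /var/lib/geode/data

# Start server
sudo systemctl start geode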

Accidental Data Deletion Recovery

If data was accidentally deleted:

Option 1: Point-in-Time Recovery (Preferred)

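# Restore the base backup into a separate recovery directory
geode restore \
  --source s3://geode-backups/production \
  --backup-id 1738012345 \
  --target /var/lib/geode/data-recovery

# Replay WAL to a time before the deletion (here, the deletion happened at 10:00)
geode recover \
  --data-dir /var/lib/geode/data-recovery \
  --wal-archive s3://geode-wal-archive/production/wal \
  --target-time "2026-01-28 09:00:00" \
  --timezone UTC

# Start a temporary instance on an alternate port (run in a separate terminal)
geode serve \
  --data-dir /var/lib/geode/data-recovery \
  --listen 0.0.0.0:3142

# Extract deleted data from the recovered instance
geode query "MATCH (n:DeletedLabel) RETURN n" \
  --server localhost:3142 \
  --format json > deleted_data.json

# Re-import to production
geode import --file deleted_data.json --server localhost:3141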

Option 2: Partial Restore

# Export specific data from backup
geode backup-extract \
  --source s3://geode-backups/production \
  --backup-id 1738012345 \
  --query "MATCH (n:ImportantLabel) RETURN n" \
  --output recovered_data.json

# Import to production
geode import --file recovered_data.json

Recovery Verification

Post-Recovery Checks

After any recovery, run these verification steps:

#!/bin/bash
# /usr/local/bin/geode-recovery-verify.sh

echo "=== Geode Recovery Verification ==="

# 1. Check server is running
echo "1. Checking server status..."
systemctl is-active geode || { echo "FAILED: Server not running"; exit 1; }

# 2. Verify data integrity
echo "2. Verifying data integrity..."
geode verify --data-dir /var/lib/geode/data --quick || { echo "FAILED: Integrity check failed"; exit 1; }

# 3. Check node count
echo "3. Checking node count..."
NODE_COUNT=$(geode query "MATCH (n) RETURN count(n)" --format json | jq -r '.result.rows[0][0]')
echo "   Node count: $NODE_COUNT"

# 4. Check relationship count
echo "4. Checking relationship count..."
REL_COUNT=$(geode query "MATCH ()-[r]->() RETURN count(r)" --format json | jq -r '.result.rows[0][0]')
echo "   Relationship count: $REL_COUNT"

# 5. Test transactions
echo "5. Testing transactions..."
geode query "BEGIN; CREATE (:RecoveryTest {ts: datetime()}); ROLLBACK" || { echo "FAILED: Transaction test failed"; exit 1; }

# 6. Check replication (if cluster)
echo "6. Checking cluster status..."
geode cluster status 2>/dev/null || echo "   Single-node mode (no cluster)"

echo ""
echo "=== Recovery Verification PASSED ==="
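
Steps 3 and 4 of the script above print whatever `jq` returns, so a failed query or an unexpected JSON shape (for example a `null` result) would go unnoticed. A small guard like the one below could close that gap. It is a sketch, not part of Geode, and the function name `is_count` is hypothetical.

```shell
# Succeeds only when the argument is a non-empty string of digits,
# i.e. something that can plausibly be a node or relationship count.
is_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, "null", negative, or otherwise non-numeric
    *)           return 0 ;;
  esac
}
```

In the script, a line such as `is_count "$NODE_COUNT" || { echo "FAILED: Node count query failed"; exit 1; }` placed after the assignment would turn a silent failure into a hard stop.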

Data Comparison

Compare recovered data with expected state:

# Export node counts by label
geode query "
  MATCH (n)
  RETURN labels(n) AS label, count(n) AS count
  ORDER BY label
" --format json > current_counts.json

# Compare with expected (from before incident)
diff expected_counts.json current_counts.json
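
A raw `diff` of JSON files is sensitive to formatting and ordering. As an alternative sketch, the counts can first be flattened to plain `label count` lines (how to do that depends on the JSON layout of `geode query`, which is not assumed here) and then compared per label. The function name `compare_counts` and the two-column file format are assumptions made for this example.

```shell
# compare_counts EXPECTED_FILE CURRENT_FILE
# Each file holds one "label count" pair per line (hypothetical format).
# Prints one line per difference and returns non-zero if any exist.
# Note: EXPECTED_FILE must not be empty for the two-file awk idiom to work.
compare_counts() (
  diffs=$(awk '
    NR == FNR { expected[$1] = $2; next }        # first file: remember expected counts
    {
      seen[$1] = 1
      if (!($1 in expected))       print $1 ": unexpected (current " $2 ")"
      else if (expected[$1] != $2) print $1 ": expected " expected[$1] ", current " $2
    }
    END {
      for (label in expected)
        if (!(label in seen)) print label ": missing (expected " expected[label] ")"
    }
  ' "$1" "$2" | LC_ALL=C sort)
  [ -z "$diffs" ] && exit 0
  printf '%s\n' "$diffs"
  exit 1
)
```

Unlike a plain `diff`, this reports which labels lost or gained nodes, which is usually the first question after a restore.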

Recovery Runbook

Quick Reference: Recovery Decision Tree

Server Down?
    │
    ├── Server crash → Restart server (auto recovery)
    │
    └── Data issue?
            │
            ├── Minor corruption → Run geode repair
            │
            ├── Major corruption → Restore from backup
            │
            └── Accidental deletion?
                    │
                    ├── Need specific time → PITR
                    │
                    └── Need latest data → Restore + WAL replay
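
For on-call tooling, the decision tree above can be encoded as a simple lookup. This is only a sketch: the function name `recovery_strategy` and the failure-type keywords are hypothetical labels chosen for this example, and the returned text mirrors the branches of the tree.

```shell
# Map a failure type (hypothetical keywords) to the strategy from the decision tree.
recovery_strategy() {
  case "$1" in
    server-crash)           echo "Restart server (auto recovery)" ;;
    minor-corruption)       echo "Run geode repair" ;;
    major-corruption)       echo "Restore from backup" ;;
    deletion-specific-time) echo "PITR" ;;
    deletion-latest-data)   echo "Restore + WAL replay" ;;
    *) echo "unknown failure type: $1" >&2; return 2 ;;
  esac
}
```

For example, `recovery_strategy minor-corruption` prints `Run geode repair`, and an unrecognized keyword returns a non-zero status so a wrapper script can stop and ask for human input.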

Emergency Recovery Checklist

Immediate Actions:

Assess the situation (type of failure)
Check server logs: geode logs --tail 500
Preserve evidence: cp -r /var/lib/geode/data /var/lib/geode/data.incident-$(date +%s)
Notify stakeholders

Recovery Actions:

Determine recovery strategy (auto, PITR, full restore)
Execute recovery procedure
Run verification checks
Monitor server stability

Post-Recovery:

Document incident timeline
Identify root cause
Update runbooks if needed
Schedule post-incident review

Best Practices

Prevention

Regular Backups: Daily incremental, weekly full
WAL Archiving: Continuous to enable PITR
Monitoring: Alert on backup failures, disk space
Testing: Monthly recovery drills

Configuration Recommendations

# geode.yaml - Production recovery settings
recovery:
  wal:
    enabled: true
    sync_mode: fsync
    segment_size: 64MB

  checkpoint:
    interval: 5m

backup:
  wal_archiving:
    enabled: true
    retention_hours: 168  # 7 days

  schedule:
    full_backup: '0 2 * * 0'      # Weekly Sunday
    incremental_backup: '0 2 * * 1-6'  # Daily, Monday-Saturday

  retention_days: 90
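
The two cron expressions in the schedule above split the week by day-of-week field: `0` (Sunday) runs the full backup and `1-6` (Monday through Saturday) run incrementals, all at 02:00. The sketch below mirrors that split so a monitoring script can tell which backup type to expect on a given day; the function name `backup_type_for_weekday` is hypothetical.

```shell
# Given a weekday number as printed by `date +%w` (0 = Sunday .. 6 = Saturday),
# print which backup type the schedule above runs on that day.
backup_type_for_weekday() {
  case "$1" in
    0)     echo full ;;
    [1-6]) echo incremental ;;
    *)     echo "weekday must be 0-6 (0 = Sunday)" >&2; return 2 ;;
  esac
}
```

Calling `backup_type_for_weekday "$(date +%w)"` reports the type expected today, which can be compared against the most recent entry in the backup list.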

Recovery Time Targets

Scenario	Target RTO	Target RPO
Server crash	< 1 minute	0 (no data loss)
Minor corruption	< 5 minutes	0
Full restore	< 15 minutes	Last backup
PITR	< 30 minutes	Target timestamp
Disaster recovery	< 1 hour	Last WAL archive
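
During recovery drills, the targets in this table can be checked mechanically. The following sketch converts each target RTO to seconds and compares it with a measured recovery time. The scenario keywords and the function names `rto_target_seconds` and `check_rto` are hypothetical; the thresholds are taken from the table above.

```shell
# Target RTO in seconds for each scenario in the table above.
rto_target_seconds() {
  case "$1" in
    server-crash)      echo 60 ;;     # < 1 minute
    minor-corruption)  echo 300 ;;    # < 5 minutes
    full-restore)      echo 900 ;;    # < 15 minutes
    pitr)              echo 1800 ;;   # < 30 minutes
    disaster-recovery) echo 3600 ;;   # < 1 hour
    *) echo "unknown scenario: $1" >&2; return 2 ;;
  esac
}

# check_rto SCENARIO ELAPSED_SECONDS
# Prints a verdict; returns 0 if the target was met, 1 if missed, 2 on bad input.
check_rto() (
  scenario=$1
  elapsed=$2
  target=$(rto_target_seconds "$scenario") || exit 2
  if [ "$elapsed" -lt "$target" ]; then
    echo "OK: $scenario recovered in ${elapsed}s (target < ${target}s)"
  else
    echo "MISSED: $scenario took ${elapsed}s (target < ${target}s)"
    exit 1
  fi
)
```

For instance, a PITR drill that took 20 minutes would be checked with `check_rto pitr 1200`, which reports the target as met; a 75-second crash recovery checked with `check_rto server-crash 75` is reported as missed.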

Related Documentation

Backup Automation - Automated backup configuration
Server Configuration - Server settings
Troubleshooting - Common issues and solutions
Deployment Patterns - High availability setup
Distributed Architecture - Cluster recovery

Summary

Geode provides robust recovery mechanisms:

Automatic crash recovery with WAL ensures zero data loss for committed transactions
Point-in-time recovery enables restoration to any moment within the retention period
Full backup restore provides complete database reconstruction
Disaster recovery procedures handle site-wide failures

Key Takeaways:

Enable WAL archiving for PITR capability
Test recovery procedures regularly (monthly)
Document and practice recovery runbooks
Monitor backup health continuously

Last Updated: January 28, 2026
Geode Version: v0.1.3+
Status: Production Ready