Storage Engine in Geode
The storage engine is the foundation of any database system, responsible for durably persisting data, managing memory, and providing efficient access patterns. Geode’s storage engine is purpose-built for graph workloads, optimizing for relationship traversals, flexible schemas, and ACID transactions.
This guide explores Geode’s storage architecture, configuration options, and best practices for optimal storage performance.
Storage Architecture Overview
Design Principles
Geode’s storage engine is built around five key principles:
Graph-Native Design: Optimized for adjacency lookups and traversals, not just point queries
Memory-Mapped I/O: Efficient use of OS page cache for frequently accessed data
Write-Ahead Logging: Durability without sacrificing write performance
Copy-on-Write MVCC: Non-blocking reads during concurrent modifications
Tiered Storage: Hot/warm/cold data management for cost efficiency
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│                        Query Engine                         │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                     Transaction Manager                     │
│                 (MVCC, Locking, Isolation)                  │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                         Buffer Pool                         │
│             (Page Cache, Dirty Page Management)             │
└──────────────────────────────┬──────────────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         ▼                     ▼                     ▼
┌─────────────────┐     ┌─────────────┐     ┌─────────────────┐
│   Index Files   │     │ Data Files  │     │    WAL Files    │
│    (B-tree)     │     │   (Pages)   │     │  (Sequential)   │
└─────────────────┘     └─────────────┘     └─────────────────┘
Data File Organization
Page Structure
Data is organized into fixed-size pages (default 16 KB):
Page Layout (16 KB):
┌──────────────────────────────────────────────────────┐
│ Page Header (64 bytes)                               │
│   - Page ID (8 bytes)                                │
│   - Page Type (2 bytes): DATA, INDEX, OVERFLOW, FREE │
│   - LSN (8 bytes): Log Sequence Number               │
│   - Checksum (4 bytes)                               │
│   - Free Space Offset (2 bytes)                      │
│   - Item Count (2 bytes)                             │
│   - Flags (2 bytes)                                  │
│   - Reserved (36 bytes)                              │
├──────────────────────────────────────────────────────┤
│ Item Pointers (variable)                             │
│   - Offset (2 bytes) + Length (2 bytes) per item     │
├──────────────────────────────────────────────────────┤
│                                                      │
│                      Free Space                      │
│                                                      │
├──────────────────────────────────────────────────────┤
│ Items (variable)                                     │
│   - Node records, edge records, or property data     │
└──────────────────────────────────────────────────────┘
Page Configuration:
[storage.pages]
size_kb = 16 # Page size (8, 16, or 32 KB)
alignment = 4096 # Disk alignment (usually 4K)
checksum = "crc32c" # Checksum algorithm
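To make the header layout concrete, here is a minimal Python sketch that packs and unpacks the 64-byte header. Only the field names and sizes come from the diagram above; the field order, little-endian encoding, and the numeric page-type codes are illustrative assumptions.

```python
import struct
import zlib

# id(8) type(2) lsn(8) checksum(4) free_offset(2) item_count(2) flags(2)
# Field order, endianness, and type codes are assumptions, not Geode's format.
HEADER_FMT = "<QHQIHHH"
HEADER_SIZE = 64  # 28 bytes of fields + 36 reserved bytes
PAGE_TYPES = {"DATA": 0, "INDEX": 1, "OVERFLOW": 2, "FREE": 3}

def pack_header(page_id, page_type, lsn, checksum, free_offset, item_count, flags=0):
    fixed = struct.pack(HEADER_FMT, page_id, PAGE_TYPES[page_type], lsn,
                        checksum, free_offset, item_count, flags)
    return fixed + b"\x00" * (HEADER_SIZE - len(fixed))  # reserved (36 bytes)

def unpack_header(buf):
    names = ("page_id", "page_type", "lsn", "checksum",
             "free_space_offset", "item_count", "flags")
    return dict(zip(names, struct.unpack_from(HEADER_FMT, buf)))

hdr = pack_header(page_id=42, page_type="DATA", lsn=1001,
                  checksum=zlib.crc32(b"page payload"),
                  free_offset=64, item_count=0)
assert len(hdr) == HEADER_SIZE
assert unpack_header(hdr)["page_id"] == 42
```

Note how the 28 bytes of declared fields plus the 36 reserved bytes account for the full 64-byte header in the diagram.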
Node Storage
Nodes are stored with their labels and properties:
Node Record:
┌────────────────────────────────────────────────────┐
│ Node ID (8 bytes)                                  │
│ Label Bitmap (8 bytes) - Up to 64 labels per node  │
│ Property Count (2 bytes)                           │
│ First Edge Pointer (8 bytes) - Outgoing edges      │
│ Properties (variable):                             │
│   - Key ID (4 bytes)                               │
│   - Type (1 byte)                                  │
│   - Value (variable)                               │
│   - ...                                            │
│ Overflow Pointer (8 bytes) - For large properties  │
└────────────────────────────────────────────────────┘
Edge Storage
Edges connect nodes with relationship data:
Edge Record:
┌──────────────────────────────────────────────┐
│ Edge ID (8 bytes)                            │
│ Source Node ID (8 bytes)                     │
│ Target Node ID (8 bytes)                     │
│ Relationship Type ID (4 bytes)               │
│ Next Outgoing Edge (8 bytes) - Linked list   │
│ Next Incoming Edge (8 bytes) - Linked list   │
│ Property Count (2 bytes)                     │
│ Properties (variable)                        │
└──────────────────────────────────────────────┘
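The First Edge Pointer on a node and the Next Outgoing/Incoming pointers on edges form per-node linked lists, which makes neighbor expansion a pointer chase rather than an index lookup. A toy sketch of walking one such chain; the dict-based records and the 0 null sentinel are illustrative stand-ins for the on-disk records, not Geode's actual format:

```python
# Walking the "Next Outgoing Edge" chain described in the record layouts above.
NIL = 0  # null pointer sentinel (an assumption for this sketch)

nodes = {1: {"first_edge": 10}}
edges = {
    10: {"source": 1, "target": 2, "type": "FOLLOWS", "next_out": 11},
    11: {"source": 1, "target": 3, "type": "FOLLOWS", "next_out": NIL},
}

def outgoing_edges(node_id):
    """Yield edges by following the chain from the node's First Edge Pointer."""
    eid = nodes[node_id]["first_edge"]
    while eid != NIL:
        edge = edges[eid]
        yield edge
        eid = edge["next_out"]

targets = [e["target"] for e in outgoing_edges(1)]
assert targets == [2, 3]
```

Traversal cost is therefore proportional to a node's degree, independent of the total edge count.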
File Layout
Data Directory Structure:
/var/lib/geode/data/
├── base/ # Main database files
│ ├── nodes.dat # Node data pages
│ ├── edges.dat # Edge data pages
│ ├── properties.dat # Large property overflow
│ └── free_space.map # Free space tracking
├── indexes/ # Index files
│ ├── node_labels.idx # Label index
│ ├── edge_types.idx # Relationship type index
│ └── property_*.idx # Property indexes
├── wal/ # Write-ahead log
│ ├── 000000010000000000000001
│ ├── 000000010000000000000002
│ └── ...
├── system/ # System catalog
│ ├── schema.dat # Schema definitions
│ ├── statistics.dat # Query statistics
│ └── config.dat # Runtime configuration
└── temp/ # Temporary files
└── sort_*.tmp # Sort spill files
Write-Ahead Logging (WAL)
WAL Architecture
All modifications are written to the WAL before they are applied to the data files:
Write Flow:
1. Transaction begins
2. Modifications logged to WAL buffer
3. WAL buffer flushed to disk (fsync)
4. Changes applied to buffer pool (memory)
5. Transaction commits
6. Background: dirty pages flushed to data files
WAL Record Format:
WAL Record:
┌────────────────────────────────────────────────────────┐
│ LSN (8 bytes) - Log Sequence Number                    │
│ Transaction ID (8 bytes)                               │
│ Record Type (1 byte): INSERT, UPDATE, DELETE, COMMIT   │
│ Table ID (4 bytes)                                     │
│ Record Length (4 bytes)                                │
│ Before Image (variable) - For rollback                 │
│ After Image (variable) - For replay                    │
│ Checksum (4 bytes)                                     │
└────────────────────────────────────────────────────────┘
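A sketch of encoding and verifying a record in this layout. The little-endian field order and the before-image length prefix inside the payload are assumptions; the diagram does not specify how the two images are framed, and Geode's actual encoding may differ:

```python
import struct
import zlib

REC_TYPES = {"INSERT": 0, "UPDATE": 1, "DELETE": 2, "COMMIT": 3}

def encode_record(lsn, txn_id, rec_type, table_id, before=b"", after=b""):
    # The 2-byte before-image length lets replay split the two images;
    # that framing detail is an assumption, not documented above.
    payload = struct.pack("<H", len(before)) + before + after
    header = struct.pack("<QQBI", lsn, txn_id, REC_TYPES[rec_type], table_id)
    header += struct.pack("<I", len(payload))
    checksum = zlib.crc32(header + payload)  # trailing checksum guards replay
    return header + payload + struct.pack("<I", checksum)

def verify_record(buf):
    body, (stored,) = buf[:-4], struct.unpack("<I", buf[-4:])
    return zlib.crc32(body) == stored

rec = encode_record(lsn=1, txn_id=7, rec_type="INSERT", table_id=3,
                    before=b"", after=b'{"name":"ada"}')
assert verify_record(rec)
```

During crash recovery, any tail record whose checksum fails is treated as a torn write and replay stops there.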
WAL Configuration
[storage.wal]
enabled = true
directory = "/var/lib/geode/wal"
# Segment size (WAL files)
segment_size_mb = 64
# Sync mode
sync_mode = "fsync" # fsync, fdatasync, or async
# Buffer size
buffer_size_mb = 16
# Checkpoint settings
checkpoint_interval_seconds = 300
checkpoint_threshold_mb = 1024
# Archiving
archive_enabled = true
archive_command = "/etc/geode/archive_wal.sh %f %p"
WAL Sync Modes
fsync (default): Full durability, slightly slower
[storage.wal]
sync_mode = "fsync"
fdatasync: Faster, skips metadata sync
[storage.wal]
sync_mode = "fdatasync"
async: Best performance, risk of data loss on crash
[storage.wal]
sync_mode = "async"
sync_interval_ms = 100 # Periodic sync
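The three modes map onto the standard POSIX flush calls. A minimal sketch of the difference (os.fdatasync is POSIX/Linux-only; the async branch is left as a comment because it relies on a background timer):

```python
import os
import tempfile

def append_wal(fd, data, sync_mode="fsync"):
    """Append a record, then make it durable according to sync_mode."""
    os.write(fd, data)
    if sync_mode == "fsync":
        os.fsync(fd)        # flush data and file metadata before returning
    elif sync_mode == "fdatasync":
        os.fdatasync(fd)    # flush data; the metadata flush may be skipped
    # "async": return immediately; a background task syncs every
    # sync_interval_ms, so a crash can lose up to that window of commits.

fd, path = tempfile.mkstemp()
append_wal(fd, b"record-1|", "fsync")
append_wal(fd, b"record-2|", "fdatasync")
os.close(fd)
with open(path, "rb") as f:
    contents = f.read()
os.unlink(path)
assert contents == b"record-1|record-2|"
```

Because WAL segments are preallocated at a fixed size, appends rarely change file metadata, which is why fdatasync is usually safe and measurably faster than fsync.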
Monitoring WAL
-- WAL statistics
SELECT
current_wal_lsn,
last_checkpoint_lsn,
wal_size_mb,
wal_write_rate_mb_per_sec,
checkpoint_in_progress
FROM system.wal_stats;
-- WAL file status
SELECT
segment_name,
size_mb,
start_lsn,
end_lsn,
archived,
archived_at
FROM system.wal_segments
ORDER BY start_lsn DESC
LIMIT 10;
Buffer Pool Management
Buffer Pool Architecture
The buffer pool caches data pages in memory:
Buffer Pool:
┌────────────────────────────────────────────────────────┐
│ Hash Table (Page ID -> Buffer Index)                   │
├────────────────────────────────────────────────────────┤
│ Buffer Frames:                                         │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│   │ Frame 0  │  │ Frame 1  │  │ Frame 2  │  ...        │
│   │ Page 42  │  │ Page 17  │  │ Page 891 │             │
│   │ Dirty    │  │ Clean    │  │ Dirty    │             │
│   │ Pinned:2 │  │ Pinned:0 │  │ Pinned:1 │             │
│   └──────────┘  └──────────┘  └──────────┘             │
├────────────────────────────────────────────────────────┤
│ LRU List (eviction candidates)                         │
├────────────────────────────────────────────────────────┤
│ Dirty Page List (checkpoint candidates)                │
└────────────────────────────────────────────────────────┘
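The moving parts above can be miniaturized into a toy pool: a page table (hash map), LRU ordering, and pin counts that veto eviction. This is a didactic sketch only; a real pool also latches frames and writes dirty victims back before dropping them:

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: page table + LRU order + pin counts."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # fetches a page from disk on a miss
        self.frames = OrderedDict()  # page_id -> {"data", "pins", "dirty"}

    def fetch(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)  # mark most recently used
        else:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = {"data": self.read_page(page_id),
                                    "pins": 0, "dirty": False}
        frame = self.frames[page_id]
        frame["pins"] += 1                    # caller must unpin()
        return frame

    def unpin(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame["pins"] -= 1
        frame["dirty"] |= dirty

    def _evict(self):
        for pid, frame in self.frames.items():  # LRU order, oldest first
            if frame["pins"] == 0:
                # A real pool would flush the page here if it were dirty.
                del self.frames[pid]
                return
        raise RuntimeError("all frames pinned")

pool = BufferPool(capacity=2, read_page=lambda pid: b"page-%d" % pid)
pool.fetch(42); pool.unpin(42)
pool.fetch(17); pool.unpin(17, dirty=True)
pool.fetch(891); pool.unpin(891)  # pool full: page 42 (LRU, unpinned) is evicted
assert 42 not in pool.frames and pool.frames[17]["dirty"]
```

Pinned frames are skipped during eviction because an in-flight query still holds a pointer into them, mirroring the Pinned counts in the diagram.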
Configuration
[storage.buffer_pool]
# Total buffer pool size
size_mb = 4096 # 4 GB
# Percentage for different purposes
data_cache_percent = 70
index_cache_percent = 25
temp_buffer_percent = 5
# Eviction policy
eviction_policy = "lru" # lru, lru-k, or clock
# Background writer
background_writer_enabled = true
background_writer_interval_ms = 100
background_writer_batch_size = 64
# Prefetching
prefetch_enabled = true
prefetch_distance = 32 # Pages to read ahead
Page Replacement
When configured with the lru-k eviction policy, Geode tracks the last K references to each page, so a single large scan cannot flush the hot working set the way plain LRU can:
-- Buffer pool statistics
SELECT
total_pages,
used_pages,
dirty_pages,
hit_ratio,
evictions_per_sec,
reads_per_sec,
writes_per_sec
FROM system.buffer_pool_stats;
-- Per-table buffer usage
SELECT
table_name,
cached_pages,
cached_mb,
hit_ratio
FROM system.buffer_pool_by_table
ORDER BY cached_mb DESC
LIMIT 10;
Index Storage
B-Tree Indexes
Primary index structure for most lookups:
B-Tree Structure:
                  ┌─────────────┐
                  │  Root Node  │
                  │  [50, 100]  │
                  └──────┬──────┘
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
  ┌─────────┐       ┌─────────┐       ┌─────────┐
  │  < 50   │       │ 50-100  │       │  > 100  │
  │ [10,25] │       │ [75,90] │       │[125,150]│
  └────┬────┘       └────┬────┘       └────┬────┘
      ...               ...               ...
       ▼                 ▼                 ▼
  ┌─────────┐       ┌─────────┐       ┌─────────┐
  │Leaf Page│       │Leaf Page│       │Leaf Page│
  │Key→Value│       │Key→Value│       │Key→Value│
  └─────────┘       └─────────┘       └─────────┘
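A point lookup descends from the root by comparing the search key against the separator keys until it reaches a leaf. A sketch over the tree pictured above, using illustrative dict nodes rather than on-disk pages:

```python
from bisect import bisect_right

# Dict-based stand-ins for the index pages shown above: internal nodes
# hold separator keys and child pointers; leaves hold key -> value pairs.
root = {
    "keys": [50, 100],
    "children": [
        {"leaf": True, "items": {10: "a", 25: "b"}},
        {"leaf": True, "items": {75: "c", 90: "d"}},
        {"leaf": True, "items": {125: "e", 150: "f"}},
    ],
}

def search(node, key):
    while not node.get("leaf"):
        # bisect_right picks the child subtree whose key range contains key
        node = node["children"][bisect_right(node["keys"], key)]
    return node["items"].get(key)

assert search(root, 75) == "c"   # falls into the 50-100 subtree
assert search(root, 10) == "a"   # falls into the < 50 subtree
assert search(root, 99) is None  # absent key
```

Each internal node visited is one page read, so lookup cost grows with tree height (logarithmic in key count), not with data size.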
B-Tree Configuration:
[storage.indexes.btree]
# Fill factor for leaf pages
fill_factor = 90 # Percent
# Node split strategy
split_strategy = "balanced" # balanced or right_biased
# Bulk loading optimization
bulk_load_factor = 95
Index File Format
Index Page:
┌──────────────────────────────────────────────────┐
│ Page Header                                      │
│   - Page Type: INDEX_INTERNAL or INDEX_LEAF      │
│   - Level: 0 for leaf, 1+ for internal           │
│   - Key Count                                    │
│   - Right Sibling Pointer                        │
├──────────────────────────────────────────────────┤
│ Keys and Pointers:                               │
│   Internal: [Key1][Ptr1][Key2][Ptr2]...[PtrN]    │
│   Leaf:     [Key1][Value1][Key2][Value2]...      │
└──────────────────────────────────────────────────┘
Vector Indexes (HNSW)
For similarity search on embeddings:
[storage.indexes.hnsw]
# HNSW parameters
m = 16 # Connections per layer
ef_construction = 200 # Construction time quality
ef_search = 50 # Search time quality
# Memory mapping
mmap_enabled = true
preload = false # Load into memory on startup
# Quantization for memory efficiency
quantization = "none" # none, pq, or sq
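The quantization option trades recall for memory. As an illustration of the "sq" (scalar quantization) idea, the generic min/max scheme below maps each float32 component to a single byte, roughly a 4x reduction; it is a sketch of the general technique, not necessarily the encoder Geode uses:

```python
# Generic scalar quantization: learn a [lo, hi] range, then store each
# vector component as one uint8 step within that range.
def sq_train(vectors):
    flat = [x for v in vectors for x in v]
    lo, hi = min(flat), max(flat)
    step = (hi - lo) / 255.0
    return lo, step if step > 0 else 1.0

def sq_encode(vector, lo, step):
    return bytes(min(255, max(0, round((x - lo) / step))) for x in vector)

def sq_decode(code, lo, step):
    return [lo + b * step for b in code]

vecs = [[0.1, -0.5, 0.9], [0.4, 0.0, -0.2]]
lo, step = sq_train(vecs)
code = sq_encode(vecs[0], lo, step)       # 3 bytes instead of 3 float32s
approx = sq_decode(code, lo, step)
assert all(abs(a - b) <= step for a, b in zip(approx, vecs[0]))
```

The reconstruction error is bounded by half a quantization step, which is why recall degrades gracefully rather than collapsing.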
Compaction and Maintenance
Background Compaction
Geode performs continuous compaction to reclaim space:
[storage.compaction]
enabled = true
# Compaction triggers
dead_tuple_threshold = 20 # Percent dead tuples
size_amplification_threshold = 1.5
# Compaction schedule
schedule = "continuous" # continuous or scheduled
scheduled_time = "03:00"
# Resource limits
max_concurrent_compactions = 2
throttle_mb_per_sec = 100
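The two thresholds above can be read as a simple predicate: a table becomes a compaction candidate when either its dead-tuple percentage or its size amplification (bytes on disk versus live bytes) crosses its limit. A sketch of that decision, with hypothetical argument names:

```python
# Compaction trigger check mirroring dead_tuple_threshold and
# size_amplification_threshold from the config above.
def needs_compaction(live_tuples, dead_tuples, disk_bytes, live_bytes,
                     dead_tuple_threshold=20, size_amplification_threshold=1.5):
    total = live_tuples + dead_tuples
    dead_pct = 100.0 * dead_tuples / total if total else 0.0
    amplification = disk_bytes / live_bytes if live_bytes else 1.0
    return (dead_pct >= dead_tuple_threshold
            or amplification >= size_amplification_threshold)

assert needs_compaction(800, 200, 1_000, 900)       # 20% dead tuples
assert needs_compaction(990, 10, 1_600, 1_000)      # 1.6x size amplification
assert not needs_compaction(990, 10, 1_000, 900)    # healthy table
```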
Manual Maintenance
# Trigger compaction
./geode admin compact --graph social_network
# Analyze statistics
./geode admin analyze --graph social_network
# Vacuum dead tuples
./geode admin vacuum --graph social_network
# Rebuild indexes
./geode admin reindex --index user_email_idx
Via GQL:
-- Compact a specific table
CALL system.compact('User');
-- Update statistics
ANALYZE User;
-- Vacuum dead tuples
VACUUM User;
-- Rebuild index
REINDEX INDEX user_email_idx;
-- Check fragmentation
SELECT
table_name,
live_tuples,
dead_tuples,
dead_tuple_ratio,
last_vacuum,
last_analyze
FROM system.table_stats;
Checkpointing
Checkpoints write dirty pages to disk and advance the recovery point:
[storage.checkpoint]
# Checkpoint triggers
interval_seconds = 300
wal_size_mb = 1024
dirty_page_percent = 50
# Checkpoint behavior
spread_writes = true # Spread I/O over time
spread_duration_seconds = 60
# Monitoring
log_checkpoints = true
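The three triggers combine with OR semantics: whichever condition fires first starts a checkpoint. A sketch using the defaults from the config above:

```python
# Checkpoint trigger check: any one of the three configured conditions
# (elapsed interval, WAL growth, dirty page percentage) starts a checkpoint.
def should_checkpoint(seconds_since_last, wal_mb, dirty_pages, total_pages,
                      interval_seconds=300, wal_size_mb=1024,
                      dirty_page_percent=50):
    dirty_pct = 100.0 * dirty_pages / total_pages if total_pages else 0.0
    return (seconds_since_last >= interval_seconds
            or wal_mb >= wal_size_mb
            or dirty_pct >= dirty_page_percent)

assert should_checkpoint(301, 10, 0, 1_000)       # interval elapsed
assert should_checkpoint(10, 2_048, 0, 1_000)     # WAL grew past 1 GB
assert not should_checkpoint(10, 10, 100, 1_000)  # only 10% dirty
```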
-- Checkpoint status
SELECT
checkpoint_start_time,
checkpoint_end_time,
duration_seconds,
pages_written,
wal_segments_removed
FROM system.checkpoint_log
ORDER BY checkpoint_start_time DESC
LIMIT 5;
-- Force checkpoint
CHECKPOINT;
Storage Monitoring
Key Metrics
# Prometheus metrics
curl http://localhost:3141/metrics | grep -E "geode_storage|geode_buffer|geode_wal"
# Example metrics
geode_storage_data_size_bytes 13421772800
geode_storage_index_size_bytes 2147483648
geode_storage_wal_size_bytes 536870912
geode_buffer_pool_hits_total 8472938
geode_buffer_pool_misses_total 234789
geode_buffer_pool_dirty_pages 1234
geode_wal_writes_total 847293
geode_wal_bytes_written_total 2147483648
geode_checkpoint_duration_seconds_sum 45.7
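The hit and miss counters are monotonically increasing totals, so the hit ratio is derived rather than exported directly. Using the sample values above:

```python
# Deriving the buffer pool hit ratio from the sample counters above.
hits, misses = 8_472_938, 234_789  # geode_buffer_pool_{hits,misses}_total
hit_ratio = hits / (hits + misses)
assert round(hit_ratio, 3) == 0.973  # comfortably above a 0.9 alert threshold
```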
Storage Health Queries
-- Overall storage statistics
SELECT
data_size_gb,
index_size_gb,
wal_size_gb,
free_space_gb,
total_size_gb,
fragmentation_percent
FROM system.storage_overview;
-- Per-graph storage
SELECT
graph_name,
node_count,
edge_count,
data_size_mb,
index_size_mb
FROM system.graph_storage
ORDER BY data_size_mb DESC;
-- Disk I/O statistics
SELECT
reads_per_sec,
writes_per_sec,
read_bytes_per_sec,
write_bytes_per_sec,
avg_read_latency_ms,
avg_write_latency_ms
FROM system.disk_io_stats;
Alerting Rules
# Prometheus alerts for storage
groups:
- name: geode_storage_alerts
rules:
- alert: DiskSpaceLow
expr: geode_storage_free_bytes / geode_storage_total_bytes < 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space below 10%"
- alert: DiskSpaceCritical
expr: geode_storage_free_bytes / geode_storage_total_bytes < 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space below 5%"
- alert: BufferPoolHitRateLow
expr: geode_buffer_pool_hit_ratio < 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "Buffer pool hit rate below 90%"
- alert: WALGrowthHigh
expr: rate(geode_wal_bytes_written_total[5m]) > 100000000
for: 10m
labels:
severity: warning
annotations:
summary: "High WAL write rate"
- alert: CheckpointTooLong
expr: geode_checkpoint_duration_seconds > 300
for: 5m
labels:
severity: warning
annotations:
summary: "Checkpoint taking too long"
Storage Configuration Best Practices
Hardware Recommendations
SSDs: Strongly recommended for production
[storage]
# SSD-optimized settings
disk_type = "ssd"
page_size_kb = 16
read_ahead_kb = 256
NVMe: Best performance for write-heavy workloads
[storage]
disk_type = "nvme"
page_size_kb = 16
io_depth = 32
HDDs: Only for archival/cold storage
[storage]
disk_type = "hdd"
page_size_kb = 32
read_ahead_kb = 1024
sequential_read_threshold = 64
Memory Sizing
# Rule of thumb: buffer pool = 50-75% of available RAM
[storage.buffer_pool]
size_mb = 32768 # 32 GB for 48 GB system
# Working set should fit in buffer pool
# Monitor hit ratio and adjust
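The rule of thumb can be made explicit. The helper below is illustrative; its default fraction of 2/3 reproduces the 32 GB example for a 48 GB host:

```python
# Apply the 50-75% rule of thumb from the comment above.
def buffer_pool_size_mb(total_ram_mb, fraction=2/3):
    if not 0.5 <= fraction <= 0.75:
        raise ValueError("stay within the recommended 50-75% band")
    return int(round(total_ram_mb * fraction))

# 48 GB host -> 32768 MB buffer pool, matching the example config
assert buffer_pool_size_mb(48 * 1024) == 32768
```

Leave the remaining RAM for the OS page cache, query working memory, and connection overhead.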
File System Settings
# Linux ext4 recommended settings
mkfs.ext4 -O ^has_journal /dev/sdb1 # Disable journal (WAL handles durability)
# Mount options
mount -o noatime,nodiratime,data=writeback /dev/sdb1 /var/lib/geode
# Disable transparent huge pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Increase file descriptors
ulimit -n 65535
Configuration Template
# Production storage configuration
[storage]
data_directory = "/var/lib/geode/data"
temp_directory = "/var/lib/geode/temp"
[storage.pages]
size_kb = 16
checksum = "crc32c"
[storage.buffer_pool]
size_mb = 32768
eviction_policy = "lru"
background_writer_enabled = true
[storage.wal]
directory = "/var/lib/geode/wal"
segment_size_mb = 64
sync_mode = "fsync"
checkpoint_interval_seconds = 300
[storage.compaction]
enabled = true
max_concurrent_compactions = 2
throttle_mb_per_sec = 100
[storage.indexes]
fill_factor = 90
Related Topics
- Performance - Performance optimization
- Caching - Caching strategies
- Backup - Backup procedures
- Recovery - Recovery procedures
- Indexing - Index management
- Configuration - Server configuration
Further Reading
- Storage Engine Architecture Deep Dive
- WAL Configuration Guide
- Buffer Pool Tuning
- Index Selection and Maintenance
- Storage Capacity Planning
- I/O Performance Optimization