Query Optimization

Geode implements a sophisticated cost-based query optimizer (CBO) that automatically selects efficient execution plans based on data statistics, index availability, and query patterns.

Overview

The optimizer transforms logical query plans into optimized physical execution plans:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Logical Plan   │────▶│   Optimizer     │────▶│ Physical Plan   │
│                 │     │                 │     │                 │
│ - What to do    │     │ - Statistics    │     │ - How to do it  │
│ - Declarative   │     │ - Cost model    │     │ - Imperative    │
└─────────────────┘     │ - Index info    │     └─────────────────┘
                        └─────────────────┘

Cost Model

Cost Components

The optimizer considers three primary cost factors:

  1. I/O Cost: Disk reads and writes
  2. CPU Cost: Computation overhead
  3. Memory Cost: Buffer pool utilization

// Relative weights for combining cost components
// (illustrative values; tuned per deployment).
const IO_WEIGHT: f64 = 1.0;
const CPU_WEIGHT: f64 = 0.01;
const MEMORY_WEIGHT: f64 = 0.001;

pub const Cost = struct {
    io_cost: f64,      // Estimated page reads
    cpu_cost: f64,     // Estimated operations
    memory_cost: f64,  // Estimated memory usage

    pub fn total(self: Cost) f64 {
        return self.io_cost * IO_WEIGHT +
               self.cpu_cost * CPU_WEIGHT +
               self.memory_cost * MEMORY_WEIGHT;
    }
};

Cost Estimation

| Operation  | Cost Formula                                      |
|------------|---------------------------------------------------|
| Full Scan  | pages * io_cost + rows * cpu_cost                 |
| Index Seek | log(n) * io_cost + selectivity * rows * cpu_cost  |
| Hash Join  | build_cost + probe_cost + memory_cost             |
| Sort       | n * log(n) * cpu_cost + spill_io_cost             |
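The full-scan and index-seek formulas can be compared numerically to see where a seek pays off. The weights and table statistics below are illustrative stand-ins, not Geode's actual constants:

```python
import math

# Illustrative per-unit weights (not Geode's actual constants).
IO_COST = 4.0    # relative cost of one page read
CPU_COST = 0.01  # relative cost of processing one row

def full_scan_cost(pages: int, rows: int) -> float:
    return pages * IO_COST + rows * CPU_COST

def index_seek_cost(rows: int, selectivity: float) -> float:
    # log(n) page reads to descend the index, then CPU over matching rows.
    return math.log2(rows) * IO_COST + selectivity * rows * CPU_COST

# 1M rows across 10K pages: at 1% selectivity the seek is far cheaper.
print(full_scan_cost(10_000, 1_000_000))   # 50000.0
print(index_seek_cost(1_000_000, 0.01))    # ~179.7
```

As selectivity approaches 1.0 the seek's row-processing term dominates and the full scan wins, which is why the selectivity estimate matters so much.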

Optimization Strategies

1. Predicate Pushdown

Move filters closer to data sources to reduce intermediate result sizes:

Before:                         After:
  Filter (age > 30)               Expand (KNOWS)
    │                               │
    └─▶ Expand (KNOWS)              └─▶ Filter (age > 30)
           │                               │
           └─▶ Scan (Person)               └─▶ Scan (Person)

Example:

-- Original query
MATCH (p:Person)-[:KNOWS]->(f:Person)
WHERE p.age > 30
RETURN f.name

-- Optimized: Filter pushed before expansion
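The rewrite itself can be sketched on a toy plan tree. The tuple-based `(op, arg, child)` representation below is purely illustrative, not Geode's internal plan structure, and the legality check is simplified to a comment:

```python
# Plan nodes as (op, arg, child) tuples; a toy representation, not Geode's.
def push_down_filter(plan):
    op, arg, child = plan
    if op == "Filter" and child is not None and child[0] == "Expand":
        expand_op, expand_arg, expand_child = child
        # Only legal when the predicate refers solely to variables
        # bound before the expansion (p here, not f).
        return (expand_op, expand_arg, ("Filter", arg, expand_child))
    return plan

before = ("Filter", "p.age > 30", ("Expand", "KNOWS", ("Scan", "Person", None)))
after = push_down_filter(before)
# after == ("Expand", "KNOWS", ("Filter", "p.age > 30", ("Scan", "Person", None)))
```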

2. Index Selection

The optimizer automatically selects the most efficient index:

fn selectBestIndex(
    scan: *ScanOperator,
    available_indexes: []Index,
    stats: *Statistics
) ?Index {
    var best_index: ?Index = null;
    var best_cost = std.math.inf(f64);

    for (available_indexes) |index| {
        if (!index.covers(scan.predicates)) continue;

        const selectivity = estimateSelectivity(scan.predicates, index, stats);
        const cost = estimateIndexCost(index, selectivity, stats);

        if (cost < best_cost) {
            best_cost = cost;
            best_index = index;
        }
    }

    return best_index;
}

Index Selection Rules:

  • Use equality indexes for exact matches
  • Use range indexes for comparison predicates
  • Consider covering indexes to avoid lookups
  • Prefer composite indexes matching multiple predicates
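The selection loop above can be mirrored in a few lines of Python. The `Index` shape and the selectivity/cost callbacks here are illustrative stand-ins for Geode's internals:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Index:
    name: str
    columns: tuple

    def covers(self, predicates):
        return all(col in self.columns for col in predicates)

def select_best_index(predicates, indexes, estimate_selectivity, estimate_cost):
    # Keep the cheapest index that covers every predicate, as in selectBestIndex.
    best, best_cost = None, float("inf")
    for idx in indexes:
        if not idx.covers(predicates):
            continue
        sel = estimate_selectivity(predicates, idx)
        cost = estimate_cost(idx, sel)
        if cost < best_cost:
            best, best_cost = idx, cost
    return best

indexes = [Index("person_email", ("email",)), Index("person_age", ("age",))]
pick = select_best_index(
    ["email"],
    indexes,
    lambda preds, idx: 0.0001,                       # near-unique match
    lambda idx, sel: 4.0 + sel * 1_000_000 * 0.01,   # seek + matching-row cost
)
print(pick.name)  # person_email
```

Indexes that do not cover the predicates are skipped outright, matching the `continue` in the Zig version.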

3. Join Ordering

Optimal join order minimizes intermediate result sizes:

Poor Order:                    Optimal Order:
  Join (1M × 10K = 10B)          Join (10 × 10K = 100K)
    │                              │
    ├─▶ Scan A (1M rows)           ├─▶ Filter (10 rows)
    │                              │      │
    └─▶ Scan B (10K rows)          │      └─▶ Scan A
                                   └─▶ Scan B (10K rows)

Join Order Algorithm:

fn optimizeJoinOrder(joins: []Join, stats: *Statistics) []Join {
    // Use dynamic programming for small join counts
    if (joins.len <= 6) {
        return dpJoinOrdering(joins, stats);
    }
    // Use greedy heuristic for larger join counts
    return greedyJoinOrdering(joins, stats);
}
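The dynamic-programming branch can be sketched as a subset DP over left-deep plans. The cost model below is a toy (each join costs the product of its input cardinalities, with no selectivities), used only to show the shape of the search:

```python
from itertools import combinations
from math import prod

def dp_join_order(card: dict):
    """card maps relation name -> row count. Returns a left-deep join order."""
    rels = tuple(card)
    # best[S] = (cost of cheapest left-deep plan joining subset S, its order)
    best = {frozenset([r]): (0, [r]) for r in rels}
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            # Toy cost: every plan for S materializes prod(S) rows at this step.
            out_rows = prod(card[r] for r in subset)
            for last in subset:
                prev_cost, prev_order = best[subset - {last}]
                cost = prev_cost + out_rows
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, prev_order + [last])
    return best[frozenset(rels)][1]

# The two small relations get joined first; the 1M-row table comes last.
order = dp_join_order({"A": 1_000_000, "B": 10_000, "C": 10})
```

The DP explores O(2^n) subsets, which is why the code above caps it at 6 joins and falls back to a greedy heuristic beyond that.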

4. Join Type Selection

Choose the optimal join algorithm based on input sizes:

| Join Type   | Best For                    | Cost            |
|-------------|-----------------------------|-----------------|
| Nested Loop | Small inner, indexed        | O(n) with index |
| Hash Join   | Equality joins, fits memory | O(n + m)        |
| Merge Join  | Pre-sorted inputs           | O(n + m)        |
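The table translates into a simple dispatch heuristic. The thresholds below are illustrative, not Geode's actual cutoffs:

```python
def choose_join(left_rows, right_rows, sorted_inputs, has_index, mem_budget_rows):
    # Heuristic dispatch matching the table above; thresholds are illustrative.
    if sorted_inputs:
        return "merge"        # pre-sorted inputs: O(n + m), no build phase
    if has_index and min(left_rows, right_rows) <= 1_000:
        return "nested_loop"  # tiny outer side driving index lookups
    if min(left_rows, right_rows) <= mem_budget_rows:
        return "hash"         # build a hash table on the smaller input
    return "merge"            # sort both sides, then merge

print(choose_join(1_000_000, 500, sorted_inputs=False, has_index=True,
                  mem_budget_rows=100_000))  # nested_loop
```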

5. Aggregation Pushdown

Push aggregations to reduce data movement:

Before:                         After:
  Aggregate (COUNT)               Aggregate (SUM of counts)
    │                               │
    └─▶ Union All                   └─▶ Union All
           │                               │
           ├─▶ Shard 1                     ├─▶ Local COUNT (Shard 1)
           └─▶ Shard 2                     └─▶ Local COUNT (Shard 2)
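The COUNT-to-SUM rewrite in the diagram is easy to verify concretely: each shard counts locally, and the coordinator sums one integer per shard instead of shipping every row. A minimal sketch with made-up shard data:

```python
shards = [["alice", "bob"], ["carol"], ["dave", "erin", "frank"]]

# Before: ship every row to the coordinator, then count.
naive = len([row for shard in shards for row in shard])

# After: each shard counts locally; only one integer per shard moves.
local_counts = [len(shard) for shard in shards]
pushed = sum(local_counts)

assert naive == pushed == 6
```

The same decomposition works for SUM (sum of sums) and AVG (sum of sums over sum of counts), but not for all aggregates, e.g. exact medians cannot be combined this way.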

Statistics Collection

Automatic Statistics

Geode automatically maintains statistics for optimization:

statistics:
  auto_update: true
  update_threshold: 10000  # Update after N modifications
  sample_size: 1000        # Rows sampled for histograms

Statistics Types

| Statistic      | Description                             |
|----------------|-----------------------------------------|
| Cardinality    | Number of rows per label/type           |
| Selectivity    | Predicate filtering ratio               |
| Histogram      | Value distribution for range estimation |
| Distinct Count | Number of unique values                 |
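Range selectivity can be estimated from a sorted sample of the configured `sample_size`, which is essentially what a histogram compresses. A minimal sketch:

```python
import bisect

def selectivity_gt(sample, x):
    # P(value > x) estimated as the rank-based fraction of sampled values above x.
    s = sorted(sample)
    return (len(s) - bisect.bisect_right(s, x)) / len(s)

ages = list(range(100))          # a uniform sample, ages 0..99
print(selectivity_gt(ages, 30))  # 0.69 -> ~69% of rows have age > 30
```

A real histogram stores only bucket boundaries rather than the full sample, trading accuracy for space, which is why skewed distributions need more buckets.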

Manual Statistics Update

-- Update all statistics
CALL db.stats.update()

-- Update statistics for specific label
CALL db.stats.update('Person')

-- View current statistics
CALL db.stats.show()

Adaptive Query Execution

Geode can adjust execution plans at runtime:

Runtime Statistics

pub const RuntimeStats = struct {
    rows_processed: u64,
    actual_selectivity: f64,
    memory_used: usize,

    pub fn shouldReplan(self: RuntimeStats, estimated: Cost) bool {
        const actual_cost = self.computeActualCost();
        return actual_cost > estimated.total() * REPLAN_THRESHOLD;
    }
};

Mid-Query Optimization

When actual cardinalities differ significantly from estimates:

  1. Pause execution
  2. Re-optimize with actual statistics
  3. Resume with new plan
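The replan trigger mirrors `RuntimeStats.shouldReplan` above: execution pauses only when the observed cost overshoots the estimate by a factor. The threshold value here is illustrative:

```python
REPLAN_THRESHOLD = 2.0  # illustrative; tolerate up to 2x estimation error

def should_replan(actual_cost: float, estimated_total: float) -> bool:
    # Replan only when the actual cost overshoots the estimate by the threshold,
    # so mild misestimates don't pay the pause/re-optimize overhead.
    return actual_cost > estimated_total * REPLAN_THRESHOLD

print(should_replan(350.0, 100.0))  # True  -> pause and re-optimize
print(should_replan(150.0, 100.0))  # False -> keep the current plan
```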

Query Hints

Override optimizer decisions when needed:

-- Force specific index
MATCH (p:Person)
USING INDEX p:Person(email)
WHERE p.email = 'alice@example.com'
RETURN p

-- Force join order
MATCH (a:Account)-[:OWNS]->(t:Transaction)
USING JOIN ON a
WHERE a.balance > 10000
RETURN t

EXPLAIN and PROFILE

EXPLAIN

View the query plan without execution:

EXPLAIN
MATCH (p:Person)-[:KNOWS]->(f:Person)
WHERE p.age > 30
RETURN f.name

Output:

| plan                          | estimated_rows |
|-------------------------------|----------------|
| Project (f.name)              | 1000           |
|   Expand (KNOWS, OUTGOING)    | 1000           |
|     Filter (p.age > 30)       | 500            |
|       IndexSeek (Person.age)  | 500            |

PROFILE

Execute and show actual statistics:

PROFILE
MATCH (p:Person)-[:KNOWS]->(f:Person)
WHERE p.age > 30
RETURN f.name

Output:

| plan                          | rows  | db_hits | time_ms |
|-------------------------------|-------|---------|---------|
| Project (f.name)              | 847   | 847     | 0.5     |
|   Expand (KNOWS, OUTGOING)    | 847   | 1694    | 12.3    |
|     Filter (p.age > 30)       | 423   | 423     | 1.2     |
|       IndexSeek (Person.age)  | 423   | 846     | 3.1     |
| Total                         |       | 3810    | 17.1    |

Best Practices

1. Create Appropriate Indexes

-- Index frequently filtered properties
CREATE INDEX ON :Person(email)
CREATE INDEX ON :Person(age)

-- Composite index for common predicates
CREATE INDEX ON :Transaction(account_id, timestamp)

2. Write Selective Predicates First

-- Good: Most selective predicate first
MATCH (p:Person)
WHERE p.email = 'alice@example.com'  -- Unique
  AND p.age > 30                      -- Less selective
RETURN p

-- The optimizer handles this, but it helps readability

3. Avoid Cartesian Products

-- Bad: Cartesian product
MATCH (p:Person), (c:Company)
RETURN p.name, c.name

-- Good: Use relationships
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
RETURN p.name, c.name

4. Use LIMIT for Large Results

-- Always limit result sets
MATCH (p:Person)
RETURN p
ORDER BY p.name
LIMIT 100
