The Performance and Scaling category provides comprehensive guidance on optimizing Geode for maximum throughput, minimal latency, and efficient resource utilization. Whether you’re tuning a single-node deployment or planning a distributed architecture, these resources help you achieve production-ready performance.

Understanding Graph Database Performance

Graph database performance differs fundamentally from that of relational databases. Traditional databases excel at scanning large tables but struggle with multi-hop joins. Graphs invert this trade-off: traversing relationships is fast regardless of depth, but scanning all nodes is expensive. Understanding this is essential for effective optimization.

Index-based lookups provide O(log n) access to starting points for graph traversals. Relationship traversal operates in O(k) time where k is the number of adjacent relationships, regardless of total graph size. Pattern matching combines these primitives, making proper index usage critical for query performance.
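
This cost model can be illustrated with a toy adjacency-list sketch in plain Python (not Geode's actual storage engine): the cost of visiting a node's neighbors depends only on k, however large the rest of the graph grows.

```python
from collections import defaultdict

# Toy adjacency list: traversal cost depends on k, the number of adjacent
# relationships, not on the total number of nodes in the graph.
graph = defaultdict(list)

def add_edge(src, dst):
    graph[src].append(dst)

def neighbors(node):
    # O(k): touches only the edges adjacent to `node`
    return graph[node]

add_edge("alice", "bob")
add_edge("alice", "carol")
for i in range(100_000):  # a large, irrelevant rest of the graph
    add_edge(f"n{i}", f"n{i + 1}")

print(neighbors("alice"))  # → ['bob', 'carol']
```

An index provides the O(log n) jump to "alice"; everything after that is proportional to her k = 2 relationships.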

Indexing Strategies

Indexes are the foundation of graph query performance. Without indexes, Geode must scan all nodes to find pattern matches. With proper indexes, queries execute in milliseconds even on large graphs.

Label and Property Indexes

Create indexes on frequently queried node labels and properties:

-- Index for user lookups by email
CREATE INDEX user_email_idx ON Person(email);

-- Composite index for multi-property queries
CREATE INDEX product_category_price_idx ON Product(category, price);

-- Index on relationship properties
CREATE INDEX transaction_timestamp_idx ON TRANSACTION(timestamp);

When to index:

  • Properties used in WHERE clauses
  • Properties used for ORDER BY
  • Properties used in equality comparisons
  • Frequently traversed relationship types

When not to index:

  • Properties with very low cardinality (e.g., boolean flags)
  • Properties rarely queried
  • Write-heavy workloads where index maintenance overhead outweighs read benefits
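
The low-cardinality caveat can be quantified with a rough selectivity estimate (distinct values divided by total rows). This is an illustrative sketch, not a Geode API:

```python
def selectivity(values):
    """Fraction of distinct values -- a rough proxy for how well an
    equality index on this property narrows a lookup."""
    return len(set(values)) / len(values)

emails = [f"user{i}@example.com" for i in range(10_000)]  # unique per row
flags = [i % 2 == 0 for i in range(10_000)]               # boolean flag

print(selectivity(emails))  # → 1.0    (a lookup narrows to ~1 row)
print(selectivity(flags))   # → 0.0002 (a lookup still matches ~half the rows)
```

An index on `emails` resolves a lookup to a single row; an index on `flags` still leaves roughly half the table to scan while adding write overhead.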

Index Selection and Monitoring

Geode’s query planner automatically selects optimal indexes using cost-based optimization. Use EXPLAIN to verify index usage:

EXPLAIN
MATCH (u:User {email: $email})-[:PURCHASED]->(p:Product)
WHERE p.price > 100
RETURN p.name, p.price
ORDER BY p.price DESC

The execution plan shows:

  • Which indexes were considered and chosen
  • Estimated cardinalities at each step
  • Scan types (index seek vs. full scan)
  • Join strategies

Monitor index effectiveness in production:

import geode_client

client = geode_client.open_database("localhost:3141")

async with client.connection() as conn:
    # Query index statistics (use a separate name so the connection
    # doesn't shadow `client`)
    stats, _ = await conn.query("""
        SHOW INDEX STATISTICS
    """)

    for idx in stats:
        print(f"Index: {idx['name']}")
        print(f"  Reads: {idx['read_count']}")
        print(f"  Selectivity: {idx['selectivity']}")
        print(f"  Size: {idx['size_bytes']}")

Query Optimization

Pattern Matching Optimization

Order patterns from most selective to least selective. The query planner uses cardinality estimates, but you can guide it with explicit ordering:

-- Inefficient: starts with broad pattern
MATCH (u:User)-[:FOLLOWS]->(friend:User)
WHERE u.email = 'alice@example.com'
RETURN friend.name

-- Efficient: starts with indexed lookup
MATCH (u:User {email: 'alice@example.com'})-[:FOLLOWS]->(friend:User)
RETURN friend.name

Variable-length paths can be expensive. Limit the maximum depth when possible:

-- Unbounded depth (potentially expensive)
MATCH path = (a:Person)-[:KNOWS*]-(b:Person {name: 'Bob'})
RETURN path

-- Bounded depth (more predictable)
MATCH path = (a:Person)-[:KNOWS*1..4]-(b:Person {name: 'Bob'})
RETURN path
LIMIT 10

Aggregation Optimization

Push filters before aggregations to reduce data volumes:

-- Less efficient: aggregates then filters
MATCH (u:User)-[:PURCHASED]->(p:Product)
WITH u, COUNT(p) AS purchase_count
WHERE purchase_count > 10
RETURN u.name, purchase_count

-- More efficient: filters then aggregates
MATCH (u:User)-[:PURCHASED]->(p:Product)
WHERE p.price > 50
WITH u, COUNT(p) AS expensive_purchases
WHERE expensive_purchases > 10
RETURN u.name, expensive_purchases

Use indexes on aggregation keys:

-- Ensure index exists for grouping
CREATE INDEX transaction_user_idx ON TRANSACTION(user_id);

-- Efficient grouped aggregation
MATCH (u:User)-[t:TRANSACTION]->()
RETURN u.id, SUM(t.amount) AS total
GROUP BY u.id

Prepared Statements

Prepared statements improve performance by caching query plans:

client = geode_client.open_database("localhost:3141")
async with client.connection() as conn:
    # Prepare once
    stmt = await conn.prepare("""
        MATCH (u:User {id: $user_id})-[:PURCHASED]->(p:Product)
        RETURN p.name, p.price
        ORDER BY p.price DESC
    """)

    # Execute many times with different parameters
    for user_id in range(1000):
        result, _ = await stmt.execute({"user_id": user_id})
        process_results(result)

Benefits:

  • Query parsing and planning occur once
  • Parameter values can be efficiently bound
  • Network roundtrips reduced in some protocols
  • Query plan cache improves resource utilization

Profiling and Diagnostics

Query Profiling

Use PROFILE to measure actual query execution:

PROFILE
MATCH (u:User)-[:FOLLOWS*2..3]->(recommendation:User)
WHERE NOT EXISTS {
    MATCH (u)-[:FOLLOWS]->(recommendation)
}
RETURN recommendation.name, COUNT(*) AS mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10

Profile output includes:

  • Execution time: Actual wall-clock time per operator
  • Rows processed: Actual cardinalities vs. estimates
  • Cache hits: Index and data cache effectiveness
  • Memory usage: Peak memory per operator

Compare estimated vs. actual cardinalities. Large discrepancies indicate:

  • Missing or outdated statistics
  • Complex predicates the planner can’t estimate
  • Correlated data patterns
  • Opportunities for manual optimization

Performance Metrics

Monitor key metrics in production:

# Query performance metrics
metrics, _ = await client.query("""
    SELECT
        query_id,
        execution_time_ms,
        rows_read,
        rows_written,
        cache_hit_ratio
    FROM system.query_log
    WHERE execution_time_ms > 1000
    ORDER BY execution_time_ms DESC
    LIMIT 20
""")

Critical metrics:

  • Query latency: p50, p95, p99 percentiles
  • Throughput: Queries per second
  • Cache hit rates: Index cache, data cache, query plan cache
  • Resource utilization: CPU, memory, disk I/O
  • Connection pool: Active connections, wait times, errors
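
The latency percentiles above can be computed from raw samples with a nearest-rank calculation; a minimal sketch (in practice, export these from the metrics endpoint rather than recomputing client-side):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in (0, 100]) of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 240, 14, 13, 980, 16, 12, 18]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Note how a few slow outliers dominate p95/p99 while leaving p50 untouched; this is why mean latency alone hides tail problems.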

Continuous Profiling

Enable query logging for production systems:

# Server configuration
./geode serve \
    --query-log /var/log/geode/queries.log \
    --slow-query-threshold 100ms \
    --metrics-export prometheus \
    --metrics-port 9090

Integrate with observability platforms:

# Prometheus configuration
scrape_configs:
  - job_name: 'geode'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'

Connection Management

Connection Pooling

Client libraries provide automatic connection pooling:

import (
    "context"
    "time"

    "geodedb.com/geode"
)

// Create connection pool
pool, err := geode.NewPool(geode.PoolConfig{
    MinSize: 10,
    MaxSize: 100,
    MaxIdleTime: 5 * time.Minute,
    MaxLifetime: 30 * time.Minute,
})

// Pool automatically manages connections
ctx := context.Background()
result, err := pool.Query(ctx, "MATCH (n:Node) RETURN count(n)")

Pool sizing guidelines:

  • MinSize: Keep connections warm for low-latency response
  • MaxSize: Match available server connection capacity
  • Idle timeout: Balance resource utilization vs. connection overhead
  • Lifetime: Rotate connections to prevent resource leaks
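
As a rough starting point for MaxSize, Little's Law gives the expected number of in-flight queries: concurrency ≈ throughput × mean latency. A sketch (the 1.5× headroom factor is an arbitrary assumption, not a Geode recommendation):

```python
import math

def pool_size_estimate(qps, mean_latency_s, headroom=1.5):
    """Little's Law: in-flight queries ≈ arrival rate × mean service time.
    `headroom` pads the estimate to absorb bursts."""
    return math.ceil(qps * mean_latency_s * headroom)

# e.g. 2,000 queries/s at 20 ms mean latency
print(pool_size_estimate(2000, 0.020))  # → 60
```

Treat the result as a starting point to refine against measured pool wait times, and keep it under the server's connection capacity.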

Concurrent Queries

Geode’s MVCC architecture enables high concurrency:

import asyncio

async def concurrent_queries():
    client = geode_client.open_database("localhost:3141")
    async with client.connection() as conn:
        # Execute 100 queries concurrently
        tasks = [
            conn.execute("MATCH (n:Node {id: $id}) RETURN n", {"id": i})
            for i in range(100)
        ]
        results = await asyncio.gather(*tasks)
        return results

Concurrency limits:

  • Read queries: Limited only by system resources (CPU, memory)
  • Write queries: SSI isolation may cause conflicts in high-contention scenarios
  • Transactions: Use optimistic concurrency; retry on conflicts
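
The retry-on-conflict pattern can be wrapped in a small helper. `ConflictError` here is a stand-in; the actual exception type depends on the client library:

```python
import asyncio
import random

class ConflictError(Exception):
    """Stand-in for the driver's serialization-conflict exception."""

async def with_retry(txn_fn, max_attempts=5, base_delay=0.05):
    """Run an async transaction function, retrying SSI conflicts with
    exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await txn_fn()
        except ConflictError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            await asyncio.sleep(delay)
```

Keep transactions short so conflicts stay rare and retries cheap; jitter prevents retrying transactions from colliding again in lockstep.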

Memory Management

Cache Tuning

Geode uses multiple caches for performance:

# Server configuration
./geode serve \
    --index-cache-size 2GB \
    --data-cache-size 8GB \
    --query-plan-cache-size 256MB

Index cache: Stores index B-tree nodes. Size based on index working set.

Data cache: Stores frequently accessed nodes and relationships. Size based on hot data working set.

Query plan cache: Stores compiled query plans. Usually small (100-500MB).

Monitor cache effectiveness:

SELECT
    cache_name,
    size_bytes,
    entry_count,
    hit_rate,
    eviction_count
FROM system.cache_statistics

Adjust sizes based on hit rates. Target 90%+ hit rate for index cache, 70%+ for data cache.

Memory-Bounded Operations

Limit memory usage for large operations:

-- Use LIMIT to bound result sets
MATCH (n:Node)
RETURN n
LIMIT 10000

-- Use pagination for large exports
MATCH (n:Node)
WHERE n.created > $checkpoint
ORDER BY n.created
LIMIT 10000

For analytical workloads, consider materialized views or batch processing.
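
The checkpoint query above needs a driver loop that advances the checkpoint after each page. A sketch, parameterized over a `fetch_page(checkpoint, limit)` coroutine so it isn't tied to any particular client API:

```python
async def export_all(fetch_page, start_checkpoint, page_size=10_000):
    """Drain a keyset-paginated query. `fetch_page(checkpoint, limit)` must
    return rows with created > checkpoint, ordered by `created` ascending."""
    checkpoint = start_checkpoint
    while True:
        rows = await fetch_page(checkpoint, page_size)
        if not rows:
            return  # no rows past the checkpoint: export complete
        for row in rows:
            yield row
        checkpoint = rows[-1]["created"]  # advance past the last row seen
```

Keyset pagination like this keeps memory bounded at one page and, unlike OFFSET-based paging, does not reread earlier rows on each request.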

Horizontal Scaling

Read Replicas

Distribute read load across multiple nodes:

# Primary node (accepts writes)
./geode serve --role primary --listen 0.0.0.0:3141

# Read replica (read-only)
./geode serve \
    --role replica \
    --primary primary.example.com:3141 \
    --listen 0.0.0.0:3141

Client routing:

# Primary for writes
primary = geode_client.open_database("primary.example.com:3141")
# Replicas for reads
replicas = geode_client.Pool([
    "replica1.example.com:3141",
    "replica2.example.com:3141",
    "replica3.example.com:3141"
])

# Route appropriately
await primary.execute("CREATE (n:Node {id: 123})")
result, _ = await replicas.query("MATCH (n:Node) RETURN count(n)")

Sharding Strategies

Partition large graphs across multiple databases:

  • Geographic sharding: Route based on location
  • Functional sharding: Separate by entity type
  • Hash sharding: Distribute by consistent hash

Implement application-level routing:

import hashlib

def get_shard(user_id):
    """Route to the appropriate shard based on a stable hash of the user ID."""
    # Python's built-in hash() is randomized per process for strings
    # (PYTHONHASHSEED), so use a deterministic hash for routing.
    digest = hashlib.sha256(str(user_id).encode()).digest()
    shard_id = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return shard_connections[shard_id]

# Route query to correct shard
shard = get_shard(user_id)
result, _ = await shard.query("MATCH (u:User {id: $id}) RETURN u", {"id": user_id})

Best Practices

  1. Index strategically: Cover common query patterns without over-indexing
  2. Profile before optimizing: Measure actual bottlenecks, don’t guess
  3. Use prepared statements: For repeated queries with different parameters
  4. Leverage connection pooling: Reuse connections efficiently
  5. Monitor continuously: Track performance metrics in production
  6. Test at scale: Performance characteristics change with data volume
  7. Plan for growth: Design with 10x future capacity in mind
  8. Document optimization decisions: Explain why indexes and tuning choices were made
