Overview

This guide teaches you how to analyze and optimize query performance in Geode using EXPLAIN, PROFILE, indexes, and the cost-based optimizer. You’ll learn to diagnose slow queries, understand execution plans, and apply optimization strategies.

Performance Tuning Process:

  1. Measure - Use PROFILE to identify slow queries
  2. Analyze - Use EXPLAIN to understand execution plans
  3. Optimize - Apply indexes, rewrite queries, tune configuration
  4. Verify - Re-measure to confirm improvements

Quick Links:


Using EXPLAIN

EXPLAIN shows the query execution plan without running the query. Use it to understand how Geode will execute your query before committing resources.

Basic EXPLAIN Syntax

EXPLAIN <statement>

Understanding Execution Plans

EXPLAIN output shows logical operators Geode will use:

OperatorDescriptionPerformance Impact
NodeScanScans all nodes (or by label)O(n) - can be slow for large graphs
IndexSeekUses index for direct lookupO(log n) - fast
IndexScanScans index rangeO(k log n) - moderate
ExpandRelationshipsTraverses relationshipsO(degree) per node
FilterApplies WHERE predicatesO(n) on input rows
ProjectSelects/transforms columnsO(n) - lightweight
SortOrders results (ORDER BY)O(n log n) - expensive
LimitRestricts output rowsO(1) - cheap
AggregateGroups and aggregatesO(n) - moderate

EXPLAIN Examples

Example 1: Simple Node Scan

EXPLAIN MATCH (p:Person) RETURN p.name;

Output:

| plan              |
|-------------------|
| EXPLAIN           |
| MATCH             |
|   NodeScan        |    ← Scans all Person nodes
| RETURN            |
|   Project         |    ← Selects name property

Analysis: Full node scan - acceptable for small graphs, slow for large.


Example 2: Filtered Query

EXPLAIN MATCH (p:Person) WHERE p.age > 25 RETURN p.name, p.age;

Output:

| plan              |
|-------------------|
| EXPLAIN           |
| MATCH             |
|   NodeScan        |    ← Scans all Person nodes
|   Filter          |    ← Filters age > 25
| RETURN            |
|   Project         |

Analysis: Scans all nodes, then filters. Optimization needed if large dataset.


Example 3: Relationship Traversal

EXPLAIN MATCH (p:Person)-[:WORKS_FOR]->(c:Company)
RETURN p.name, c.name;

Output:

| plan                    |
|-------------------------|
| EXPLAIN                 |
| MATCH                   |
|   NodeScan              |    ← Start with Person nodes
|   ExpandRelationships   |    ← Traverse WORKS_FOR edges
| RETURN                  |
|   Project               |

Analysis: Scans all Persons, expands relationships. Consider index on Company if filtering by company properties.


Example 4: Complex Query with Aggregation

EXPLAIN MATCH (p:Person)-[:WORKS_FOR]->(c:Company)
WHERE p.age > 25
RETURN c.name, count(p) AS employee_count
ORDER BY employee_count DESC
LIMIT 10;

Output:

| plan                    |
|-------------------------|
| EXPLAIN                 |
| MATCH                   |
|   NodeScan              |
|   ExpandRelationships   |
|   Filter                |
| RETURN                  |
|   Aggregate             |    ← Group by company, count employees
|   Sort                  |    ← Sort by count (expensive!)
|   Limit                 |
|   Project               |

Analysis: Full scan → filter → aggregate → sort (expensive). Optimize filter with index.


Reading EXPLAIN Output

🚩 Red Flags (Slow Patterns):

  • NodeScan without subsequent Filter on large graphs
  • Sort on large result sets
  • Multiple ExpandRelationships without indexes
  • Aggregate on unindexed properties

✅ Good Patterns (Fast):

  • IndexSeek for direct lookups
  • Filter early in the plan (before expensive operations)
  • Limit without Sort for large datasets
  • Index-backed Aggregate operations

Using PROFILE

PROFILE executes the query and returns real execution metrics. Use it to measure actual performance.

Basic PROFILE Syntax

PROFILE <statement>

PROFILE Metrics

MetricDescription
rows_returnedNumber of rows returned
columnsNumber of columns in result
execution_time_msExecution time in milliseconds

PROFILE Examples

Example 1: Measure Simple Query

PROFILE MATCH (p:Person) RETURN p.name;

Output:

| metric              | value |
|---------------------|-------|
| rows_returned       | 42    |
| columns             | 1     |
| execution_time_ms   | 1     |

Analysis: Returned 42 rows in 1ms - acceptable.


Example 2: Measure Query with Filter

PROFILE MATCH (p:Person) WHERE p.age > 30 RETURN p.name, p.age;

Output:

| metric              | value |
|---------------------|-------|
| rows_returned       | 15    |
| columns             | 2     |
| execution_time_ms   | 2     |

Analysis: 15 results in 2ms - good. If time is higher, add index on age.


Example 3: Before/After Optimization

Before (no index):

PROFILE MATCH (p:Person) WHERE p.email = '[email protected]' RETURN p;

Output:

| metric              | value |
|---------------------|-------|
| rows_returned       | 1     |
| execution_time_ms   | 50    |    ← Slow: full scan

Create Index:

CREATE INDEX person_email ON Person(email);

After (with index):

PROFILE MATCH (p:Person) WHERE p.email = '[email protected]' RETURN p;

Output:

| metric              | value |
|---------------------|-------|
| rows_returned       | 1     |
| execution_time_ms   | 2     |    ← Fast: index seek!

Result: Significant speedup with index.


Performance Measurement Workflow

Step 1: Establish Baseline

PROFILE MATCH (p:Person)-[:KNOWS]->(friend:Person)
WHERE p.age > 30
RETURN p.name, count(friend) AS friend_count
ORDER BY friend_count DESC;

Step 2: Analyze with EXPLAIN

EXPLAIN MATCH (p:Person)-[:KNOWS]->(friend:Person)
WHERE p.age > 30
RETURN p.name, count(friend) AS friend_count
ORDER BY friend_count DESC;

Step 3: Apply Optimization (add index, rewrite query)

CREATE INDEX person_age ON Person(age);

Step 4: Re-measure with PROFILE

PROFILE MATCH (p:Person)-[:KNOWS]->(friend:Person)
WHERE p.age > 30
RETURN p.name, count(friend) AS friend_count
ORDER BY friend_count DESC;

Step 5: Compare Results

  • Before: execution_time_ms = 150
  • After: execution_time_ms = 20
  • Speedup: 7.5x

Index Optimization

Index Types in Geode

Index TypeUse CaseCreation Syntax
B-tree (default)Exact matches, range queries, sortingCREATE INDEX name ON Label(property)
HashExact lookups only (faster than B-tree)CREATE INDEX name ON Label(property) USING hash
FulltextText search, CONTAINS queriesCREATE FULLTEXT INDEX name ON Label(property)
SpatialGeographic queries (distance, bounding box)CREATE SPATIAL INDEX name ON Label(property)
VectorSimilarity search (embeddings, recommendations)CREATE VECTOR INDEX name ON Label(property)

Creating Indexes

B-tree Index (Default)

-- For exact matches and range queries
CREATE INDEX person_age ON Person(age);
CREATE INDEX product_price ON Product(price);

-- Use for queries like:
MATCH (p:Person) WHERE p.age > 30 RETURN p;
MATCH (prod:Product) WHERE prod.price < 100 RETURN prod;

Hash Index

-- For exact lookups only (faster than B-tree for equality)
CREATE INDEX person_email ON Person(email) USING hash;

-- Use for queries like:
MATCH (p:Person {email: '[email protected]'}) RETURN p;
MATCH (p:Person) WHERE p.email = '[email protected]' RETURN p;

Fulltext Index (BM25-Optimized)

-- For text search with relevance ranking
CREATE FULLTEXT INDEX article_content ON Article(content);
CREATE FULLTEXT INDEX doc_abstract ON Document(abstract);

-- Use for queries like:
MATCH (a:Article) WHERE a.content CONTAINS 'graph database' RETURN a;
MATCH (d:Document)
WHERE d.abstract CONTAINS 'machine learning'
RETURN d.title, bm25_score(d.abstract, 'machine learning') AS relevance
ORDER BY relevance DESC;

Spatial Index

-- For geographic queries
CREATE SPATIAL INDEX location_coords ON Location(coordinates);

-- Use for queries like:
MATCH (loc:Location)
WHERE distance(loc.coordinates, point(37.7749, -122.4194)) < 10000  -- 10km
RETURN loc.name, distance(loc.coordinates, point(37.7749, -122.4194)) AS dist
ORDER BY dist;

Vector Index (HNSW)

-- For similarity search (embeddings)
CREATE VECTOR INDEX doc_embedding ON Document(embedding);

-- Use for queries like:
MATCH (doc:Document)
WHERE vector_cosine(doc.embedding, $query_embedding) > 0.7
RETURN doc.title, vector_cosine(doc.embedding, $query_embedding) AS similarity
ORDER BY similarity DESC
LIMIT 10;

When to Create Indexes

✅ Create Indexes For:

  1. Frequently filtered properties

    -- Queries like WHERE p.status = 'active' are common
    CREATE INDEX user_status ON User(status);
    
  2. Properties used in ORDER BY

    -- Queries like ORDER BY created_at DESC
    CREATE INDEX post_created ON Post(created_at);
    
  3. Properties in JOIN conditions

    -- Foreign-key-like lookups
    CREATE INDEX order_customer_id ON Order(customer_id);
    
  4. Unique identifiers

    -- Email, username, SKU, etc.
    CREATE INDEX user_email ON User(email) USING hash;
    CREATE INDEX product_sku ON Product(sku) USING hash;
    

❌ Don’t Create Indexes For:

  1. Rarely queried properties - Index overhead not worth it
  2. High-cardinality properties with rare lookups - E.g., free-text descriptions
  3. Properties that change frequently - Index maintenance overhead
  4. Small datasets - Full scans are fast enough (<1000 nodes)

Viewing Indexes

-- Show all indexes
SHOW INDEXES;

-- Output:
| index_name       | label   | property    | type      |
|------------------|---------|-------------|-----------|
| person_email     | Person  | email       | hash      |
| person_age       | Person  | age         | btree     |
| article_content  | Article | content     | fulltext  |
| doc_embedding    | Document| embedding   | vector    |

Dropping Indexes

-- Drop unused or redundant indexes
DROP INDEX person_age;
DROP INDEX article_content;

BM25 Full-Text Search Optimization

Geode’s cost-based optimizer automatically uses BM25 relevance ranking for fulltext searches, providing enterprise-grade search quality.

BM25 Scoring Formula

score(q,d) = Σ IDF(qi) × f(qi,d) × (k1 + 1) / (f(qi,d) + k1 × (1 - b + b × |d| / avgdl))

Where:

  • IDF(qi) = Inverse document frequency (rare terms score higher)
  • f(qi,d) = Term frequency in document
  • |d| = Document length in words
  • avgdl = Average document length in collection
  • k1 = 1.2 = Term frequency saturation parameter
  • b = 0.75 = Length normalization parameter

BM25 Optimization Examples

Example 1: Simple Text Search

-- Create fulltext index
CREATE FULLTEXT INDEX article_content ON Article(content);

-- Query automatically uses BM25
MATCH (a:Article)
WHERE a.content CONTAINS 'machine learning algorithms'
RETURN a.title, a.author
ORDER BY a.relevance_score DESC;

Optimizer Behavior:

  • Base cost: 25.0 (vs 20.0 for standard fulltext)
  • Query complexity: 1.0 + (3 terms - 1) × 0.3 = 1.6x multiplier
  • Corpus scaling: Logarithmic with document count
  • Result: Relevance-ranked results, 40-60% better quality

Example 2: Multi-Term Search with Ranking

MATCH (doc:Document)
WHERE doc.abstract CONTAINS 'artificial intelligence'
  AND doc.keywords CONTAINS 'neural networks'
RETURN doc.title,
       bm25_score(doc.abstract, 'artificial intelligence neural networks') AS relevance
ORDER BY relevance DESC
LIMIT 10;

Performance:

  • BM25 considers document length and term frequency
  • Shorter documents with exact matches score higher
  • Longer documents are penalized unless terms are frequent
  • Typical speedup over naive text search: 2-5x with better results

Example 3: Corpus-Aware Optimization

-- Technical documentation (high vocabulary density, long docs)
MATCH (tech_doc:TechnicalDocument)
WHERE tech_doc.content CONTAINS 'distributed systems architecture'
RETURN tech_doc.title, tech_doc.complexity_score;

-- Optimizer adjusts:
-- - Vocabulary density > 100  1.2x cost multiplier
-- - Average document length > 1000 words  length normalization
-- - Result: Accurate cost estimation for complex corpus
-- Social media posts (low vocabulary, short docs)
MATCH (post:SocialPost)
WHERE post.text CONTAINS 'climate change'
RETURN post.text, post.engagement_score
ORDER BY post.timestamp DESC;

-- Optimizer adjusts:
-- - Vocabulary density < 20  0.9x cost multiplier (simpler)
-- - Average document length < 200 words  reduced length penalty
-- - Result: Faster execution for short-form content

BM25 vs Standard Fulltext Comparison

MetricStandard FulltextBM25 EnhancedBenefit
Base cost20.025.025% overhead for ranking
Query complexity20% per term30% per termBetter multi-term accuracy
Corpus scalingLinearLogarithmicBetter large-scale performance
Search qualityTerm matchingRelevance ranking40-60% better results
Cost accuracyHeuristicStatistics-based25-35% more accurate

Query Rewriting Patterns

Pattern 1: Push Filters Early

❌ Bad (Filter Late):

MATCH (p:Person)-[:KNOWS]->(friend:Person)
WHERE p.age > 30 AND friend.age > 25
RETURN p.name, friend.name;

✅ Good (Filter Early):

MATCH (p:Person {age: 30})  -- Filter immediately
MATCH (p)-[:KNOWS]->(friend:Person)
WHERE friend.age > 25       -- Filter on expansion
RETURN p.name, friend.name;

Why: Reduces nodes to expand from, fewer relationship traversals.


Pattern 2: Avoid Cartesian Products

❌ Bad (Cartesian Product):

MATCH (p:Person)
MATCH (c:Company)
WHERE p.company_id = c.id
RETURN p.name, c.name;
-- If 1000 people × 100 companies = 100,000 intermediate rows!

✅ Good (Connected Pattern):

MATCH (p:Person)-[:WORKS_AT]->(c:Company)
RETURN p.name, c.name;
-- Direct relationship traversal

Why: Explicit relationships prevent combinatorial explosion.


Pattern 3: Use Limits Early

❌ Bad (Limit After Sort):

MATCH (p:Person)
RETURN p.name, p.age
ORDER BY p.age DESC;  -- Sorts ALL people
-- Then application takes only first 10

✅ Good (Limit in Query):

MATCH (p:Person)
RETURN p.name, p.age
ORDER BY p.age DESC
LIMIT 10;  -- Database returns only 10

Why: Database can use top-K optimization, avoid sorting/returning all rows.


Pattern 4: Denormalize for Read Performance

❌ Bad (Join on Every Query):

MATCH (order:Order)-[:CONTAINS]->(:LineItem)-[:OF_PRODUCT]->(product:Product)
WHERE order.id = 12345
RETURN sum(product.price) AS total;
-- Complex join every time

✅ Good (Precompute Total):

-- Store total on Order when creating
MATCH (order:Order {id: 12345})
RETURN order.total;  -- Single property lookup

-- Update total when items change (trade write speed for read speed)
MATCH (order:Order {id: 12345})-[:CONTAINS]->(li:LineItem)
SET order.total = sum(li.subtotal);

Why: Read-heavy workloads benefit from denormalization.


Pattern 5: Use OPTIONAL MATCH Sparingly

❌ Bad (Multiple OPTIONAL):

MATCH (p:Person)
OPTIONAL MATCH (p)-[:WORKS_AT]->(c:Company)
OPTIONAL MATCH (p)-[:LIVES_IN]->(city:City)
OPTIONAL MATCH (p)-[:STUDIED_AT]->(school:School)
RETURN p, c, city, school;
-- Many optional branches slow execution

✅ Good (Filter Required First):

-- If most people work somewhere, make it required
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
OPTIONAL MATCH (p)-[:LIVES_IN]->(city:City)
RETURN p, c, city;

Why: Required MATCH is faster than OPTIONAL.


Common Performance Issues

Issue 1: Full Node Scans

Symptom: Queries slow on large graphs, EXPLAIN shows NodeScan.

Example:

MATCH (p:Person) WHERE p.email = '[email protected]' RETURN p;
-- EXPLAIN shows: NodeScan  Filter

Solution: Create index on filtered property.

CREATE INDEX person_email ON Person(email);
-- Now EXPLAIN shows: IndexSeek

Impact: Significant improvement on large datasets.


Issue 2: Expensive Sorts

Symptom: Queries with ORDER BY are slow, high memory usage.

Example:

MATCH (p:Person)
RETURN p.name, p.created_at
ORDER BY p.created_at DESC;
-- Sorts ALL people in memory

Solution 1: Create index on sort property.

CREATE INDEX person_created ON Person(created_at);
-- Index provides sorted iteration

Solution 2: Add LIMIT if you don’t need all results.

MATCH (p:Person)
RETURN p.name, p.created_at
ORDER BY p.created_at DESC
LIMIT 100;  -- Only sort top 100

Impact: 5-50x speedup depending on dataset size.


Issue 3: Supernode Expansion

Symptom: Queries traversing high-degree nodes are slow.

Example:

-- "Popular" node with 1,000,000 followers
MATCH (celebrity:Person {name: "Celebrity"})-[:FOLLOWED_BY]->(follower:Person)
RETURN follower.name;
-- Expands 1M relationships!

Solution 1: Add LIMIT and pagination.

MATCH (celebrity:Person {name: "Celebrity"})-[:FOLLOWED_BY]->(follower:Person)
RETURN follower.name
ORDER BY follower.id
SKIP 1000
LIMIT 100;

Solution 2: Denormalize with cached counts.

MATCH (celebrity:Person {name: "Celebrity"})
RETURN celebrity.follower_count;  -- Precomputed property

Solution 3: Partition relationships.

-- Instead of one celebrity node with 1M edges:
-- Create follower buckets
(celebrity)-[:HAS_FOLLOWER_BUCKET]->(bucket:FollowerBucket {range: "A-F"})
(bucket)-[:CONTAINS_FOLLOWER]->(follower:Person)

Impact: Avoid O(n) expansion on supernodes.


Issue 4: Inefficient Aggregations

Symptom: COUNT, SUM, AVG queries are slow.

Example:

MATCH (p:Person)
RETURN count(p);
-- Scans all Person nodes

Solution 1: Maintain counts in metadata node.

MATCH (stats:Statistics {type: "Person"})
RETURN stats.total_count;

-- Update on create/delete
MATCH (stats:Statistics {type: "Person"})
SET stats.total_count = stats.total_count + 1;

Solution 2: Use index-backed aggregation (if supported).

CREATE INDEX person_label ON Person(label);
-- Future: Optimizer uses index stats for count

Impact: Constant time vs O(n) for large datasets.


Issue 5: Unindexed JOINs

Symptom: Queries joining on properties without indexes are slow.

Example:

MATCH (order:Order), (customer:Customer)
WHERE order.customer_id = customer.id
RETURN order.id, customer.name;
-- Cartesian product: orders × customers!

Solution 1: Use explicit relationships.

MATCH (order:Order)-[:PLACED_BY]->(customer:Customer)
RETURN order.id, customer.name;

Solution 2: If relationships don’t exist, create indexes.

CREATE INDEX order_customer_id ON Order(customer_id);
CREATE INDEX customer_id ON Customer(id);

Impact: Avoids expensive Cartesian products.


Performance Checklist

Before Query Execution

  • Use EXPLAIN to review execution plan
  • Check for NodeScan - add indexes if needed
  • Verify index usage - IndexSeek is ideal
  • Look for expensive sorts - consider index on ORDER BY property
  • Identify Cartesian products - use relationships instead

After Query Execution

  • Use PROFILE to measure actual performance
  • Compare execution_time_ms to expectations
  • Check rows_returned - is it what you expected?
  • Verify index hit - did optimizer use your index?
  • Re-measure after optimization - confirm improvement

Index Strategy

  • Index frequently filtered properties
  • Index ORDER BY properties
  • Use hash indexes for exact lookups
  • Use fulltext indexes for text search
  • Use vector indexes for similarity search
  • Don’t over-index - each index adds write overhead

Advanced Optimization Techniques

Materialized Views

For expensive aggregations run frequently, precompute results:

-- Instead of computing on every query:
MATCH (p:Person)-[:WORKS_FOR]->(c:Company {name: "Acme"})
RETURN count(p) AS employee_count;

-- Maintain a cached value:
MATCH (c:Company {name: "Acme"})
RETURN c.employee_count;  -- Updated on hire/termination

Query Result Caching

Cache frequent query results at application level:

from functools import lru_cache

@lru_cache(maxsize=128)
def get_company_employees(company_name):
    return db.query(
        "MATCH (p:Person)-[:WORKS_FOR]->(c:Company {name: $name}) RETURN p",
        {"name": company_name}
    )

Batch Operations

Instead of N queries, use one batched query:

❌ Bad (N+1 Query Problem):

for order_id in order_ids:  # 1000 orders
    result = db.query("MATCH (o:Order {id: $id}) RETURN o", {"id": order_id})
    # 1000 individual queries!

✅ Good (Batch Query):

result = db.query(
    "MATCH (o:Order) WHERE o.id IN $ids RETURN o",
    {"ids": order_ids}
)
# Single query

Monitoring Query Performance

Enable Query Logging

# geode.yaml
logging:
  level: 'info'
  slow_query_threshold_ms: 1000  # Log queries > 1s

Prometheus Metrics

# Query duration histogram
geode_query_duration_seconds_bucket{le="0.1"}
geode_query_duration_seconds_bucket{le="1.0"}
geode_query_duration_seconds_bucket{le="10.0"}

# Query rate
rate(geode_queries_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[5m]))

Grafana Dashboard Queries

# Top 5 slowest query types
topk(5, avg(rate(geode_query_duration_seconds_sum[5m])) by (query_type))

# Query throughput
sum(rate(geode_queries_total[5m]))

# Error rate
rate(geode_query_errors_total[5m]) / rate(geode_queries_total[5m])

Next Steps