Index Optimization Tutorial
Learn how to optimize query performance in Geode through strategic index design, EXPLAIN analysis, and performance tuning. This tutorial covers B-tree, HNSW vector, spatial, and full-text indexes with practical examples.
Prerequisites
- Geode server running (port 3141)
- Basic GQL knowledge
- Sample dataset loaded (we’ll create one)
- Understanding of query patterns
Learning Objectives
By completing this tutorial, you will:
- Understand different index types and their use cases
- Use EXPLAIN to analyze query execution plans
- Create and optimize B-tree indexes
- Implement HNSW vector indexes for similarity search
- Configure full-text search with BM25 ranking
- Monitor index performance and effectiveness
Tutorial Dataset
Let’s create a realistic e-commerce dataset:
```gql – Create customers CREATE (c1:Customer { id: 1, name: ‘Alice Johnson’, email: ‘[email protected] ’, location: point({latitude: 37.7749, longitude: -122.4194}), joined: datetime(‘2023-01-15T10:30:00’) })
CREATE (c2:Customer { id: 2, name: ‘Bob Smith’, email: ‘[email protected] ’, location: point({latitude: 40.7128, longitude: -74.0060}), joined: datetime(‘2023-02-20T14:15:00’) })
CREATE (c3:Customer { id: 3, name: ‘Charlie Brown’, email: ‘[email protected] ’, location: point({latitude: 34.0522, longitude: -118.2437}), joined: datetime(‘2023-03-10T09:45:00’) })
– Create products CREATE (p1:Product { id: 101, name: ‘Wireless Headphones’, description: ‘Premium wireless headphones with noise cancellation and 30-hour battery life’, price: 199.99, category: ‘Electronics’, stock: 150, embedding: [0.1, 0.3, 0.5, 0.2, 0.8, 0.4, 0.6, 0.7] })
CREATE (p2:Product { id: 102, name: ‘Smartphone Case’, description: ‘Durable protective case for smartphones with shock absorption’, price: 29.99, category: ‘Accessories’, stock: 500, embedding: [0.2, 0.4, 0.1, 0.9, 0.3, 0.5, 0.2, 0.6] })
CREATE (p3:Product { id: 103, name: ‘USB-C Cable’, description: ‘Fast charging USB-C cable with data transfer support’, price: 15.99, category: ‘Accessories’, stock: 1000, embedding: [0.3, 0.2, 0.7, 0.1, 0.4, 0.8, 0.3, 0.5] })
– Create orders and relationships MATCH (c:Customer {id: 1}), (p:Product {id: 101}) CREATE (c)-[:PURCHASED {date: datetime(‘2023-04-01T10:00:00’), quantity: 1}]->(p)
MATCH (c:Customer {id: 1}), (p:Product {id: 102}) CREATE (c)-[:PURCHASED {date: datetime(‘2023-04-01T10:05:00’), quantity: 2}]->(p)
MATCH (c:Customer {id: 2}), (p:Product {id: 103}) CREATE (c)-[:PURCHASED {date: datetime(‘2023-04-05T15:30:00’), quantity: 5}]->(p)
– Create product views MATCH (c:Customer {id: 3}), (p:Product {id: 101}) CREATE (c)-[:VIEWED {timestamp: datetime(‘2023-04-10T08:20:00’)}]->(p) ```
Part 1: Understanding Query Plans with EXPLAIN
Step 1: Baseline Query Analysis
Let’s analyze a simple query without indexes:
```gql EXPLAIN MATCH (c:Customer {email: ‘[email protected] ’}) RETURN c ```
Expected Output: ```
| plan |
|---|
| EXPLAIN |
| MATCH |
| NodeScan |
| Filter |
| RETURN |
| Project |
| ``` |
Analysis:
- NodeScan: Scans all Customer nodes (O(n) complexity)
- Filter: Applies email predicate after scan
- Problem: Inefficient for large datasets
Step 2: Create B-tree Index
Create an index on the email field:
```gql CREATE INDEX ON :Customer(email) ```
Now re-run the EXPLAIN:
```gql EXPLAIN MATCH (c:Customer {email: ‘[email protected] ’}) RETURN c ```
Improved Output: ```
| plan |
|---|
| EXPLAIN |
| MATCH |
| IndexSeek |
| RETURN |
| Project |
| ``` |
Analysis:
- IndexSeek: Uses B-tree index for O(log n) lookup
- No Filter: Predicate handled by index
- Benefit: Significantly faster for large datasets
Step 3: Profile Query Performance
Use PROFILE to measure actual execution:
```gql PROFILE MATCH (c:Customer {email: ‘[email protected] ’}) RETURN c ```
Output: ```
| metric | value |
|---|---|
| rows_returned | 1 |
| columns | 1 |
| execution_time_ms | 1 |
| ``` |
Interpretation:
- Query completes in ~1ms with index
- Without index: ~50-100ms for 100k customers
Part 2: B-tree Index Optimization
Composite Indexes
Create a composite index for complex queries:
```gql – Create composite index CREATE INDEX ON :Product(category, price)
– Query using composite index EXPLAIN MATCH (p:Product) WHERE p.category = ‘Electronics’ AND p.price < 300 RETURN p.name, p.price ORDER BY p.price ```
Plan Analysis: ```
| plan |
|---|
| EXPLAIN |
| MATCH |
| IndexSeek |
| RETURN |
| Project |
| Sort |
| ``` |
Benefits:
- Index handles both WHERE predicates
- ORDER BY can use index ordering
- No separate sort operation needed
Index Selection Guidelines
When to Use B-tree Indexes:
Equality Lookups: ```gql – Excellent for B-tree MATCH (c:Customer {email: $email}) RETURN c MATCH (p:Product {id: $id}) RETURN p ```
Range Queries: ```gql – Excellent for B-tree MATCH (p:Product) WHERE p.price >= 50 AND p.price <= 200 RETURN p ```
Prefix Matching: ```gql – Good for B-tree MATCH (c:Customer) WHERE c.name STARTS WITH ‘Alice’ RETURN c ```
Sorting: ```gql – B-tree index provides sorted access MATCH (p:Product) WHERE p.category = ‘Electronics’ RETURN p.name ORDER BY p.price – Uses index ordering ```
When NOT to Use B-tree Indexes:
Low Cardinality Fields: ```gql – Don’t index boolean or low-cardinality fields CREATE INDEX ON :Product(in_stock) – Only true/false values ```
Frequently Updated Fields:
- Index maintenance overhead on writes
- Consider read/write ratio
- Large Text Fields:
- Use full-text indexes instead
Part 3: HNSW Vector Index for Similarity Search
Step 1: Create HNSW Index
Create a vector index for product embeddings:
```gql CREATE VECTOR INDEX product_embeddings ON :Product(embedding) OPTIONS { dimensions: 8, distance: ‘cosine’, ef_construction: 200, m: 16 } ```
Index Parameters:
- dimensions: Vector dimensionality (8 in our example)
- distance: Distance metric (cosine, euclidean, manhattan, dot_product)
- ef_construction: Build-time quality (higher = better quality, slower build)
- m: Max connections per node (higher = better recall, more memory)
Step 2: Similarity Search Query
Find products similar to a query vector:
```gql MATCH (p:Product) WHERE vector_similarity(p.embedding, [0.15, 0.35, 0.45, 0.25, 0.75, 0.45, 0.55, 0.65], ‘cosine’) > 0.8 RETURN p.name, p.description ORDER BY vector_similarity(p.embedding, [0.15, 0.35, 0.45, 0.25, 0.75, 0.45, 0.55, 0.65], ‘cosine’) DESC LIMIT 5 ```
Step 3: Optimize HNSW Parameters
For High Recall (Recommendation Systems): ```gql CREATE VECTOR INDEX product_embeddings_high_recall ON :Product(embedding) OPTIONS { dimensions: 8, distance: ‘cosine’, ef_construction: 400, – Higher for better quality m: 32, – More connections ef_search: 200 – Search-time parameter } ```
For Low Latency (Real-time Search): ```gql CREATE VECTOR INDEX product_embeddings_fast ON :Product(embedding) OPTIONS { dimensions: 8, distance: ’euclidean’, ef_construction: 100, – Lower for faster build m: 8, – Fewer connections ef_search: 50 – Faster search } ```
Performance Trade-offs:
| Parameter | Higher Value | Lower Value |
|---|---|---|
| ef_construction | Better recall, slower build | Faster build, lower recall |
| m | Better recall, more memory | Less memory, lower recall |
| ef_search | Better recall, slower query | Faster query, lower recall |
Step 4: Distance Metric Selection
Cosine Similarity - Recommended for normalized embeddings: ```gql – Good for text embeddings, ML features distance: ‘cosine’ ```
Euclidean Distance - For absolute magnitude: ```gql – Good for spatial data, image features distance: ’euclidean’ ```
Dot Product - For unnormalized vectors: ```gql – Good for recommendation scores distance: ‘dot_product’ ```
Manhattan Distance - For sparse vectors: ```gql – Good for high-dimensional sparse data distance: ‘manhattan’ ```
Part 4: Full-Text Search with BM25
Step 1: Create Full-Text Index
Create a full-text index on product descriptions:
```gql CREATE FULLTEXT INDEX product_search ON :Product(name, description) OPTIONS { analyzer: ‘standard’, k1: 1.2, b: 0.75 } ```
BM25 Parameters:
- k1: Term frequency saturation (typical: 1.2-2.0)
- b: Length normalization (0.0 = no normalization, 1.0 = full normalization)
- analyzer: Text tokenization (standard, english, whitespace)
Step 2: Full-Text Search Query
Search for products mentioning “wireless charging”:
```gql MATCH (p:Product) WHERE fulltext_search(p, ‘wireless charging’) RETURN p.name, p.description, fulltext_score(p, ‘wireless charging’) AS score ORDER BY score DESC LIMIT 10 ```
Step 3: Optimize BM25 Parameters
For Short Documents (product names): ```gql CREATE FULLTEXT INDEX product_names ON :Product(name) OPTIONS { k1: 2.0, – Higher for short documents b: 0.5 – Less length normalization } ```
For Long Documents (descriptions, reviews): ```gql CREATE FULLTEXT INDEX product_descriptions ON :Product(description) OPTIONS { k1: 1.2, – Standard value b: 0.75 – Standard length normalization } ```
For Exact Phrase Matching: ```gql MATCH (p:Product) WHERE fulltext_search(p, ‘“noise cancellation”’) – Exact phrase RETURN p.name, p.description ```
Step 4: Multi-field Search
Search across multiple fields with weights:
```gql MATCH (p:Product) WHERE fulltext_search(p.name, ‘headphones’, {boost: 2.0}) OR fulltext_search(p.description, ‘headphones’, {boost: 1.0}) RETURN p.name, fulltext_score(p.name, ‘headphones’) * 2.0 + fulltext_score(p.description, ‘headphones’) AS total_score ORDER BY total_score DESC LIMIT 10 ```
Part 5: Spatial Indexes for Geographic Queries
Step 1: Create Spatial Index
Create an R-tree spatial index for customer locations:
```gql CREATE SPATIAL INDEX customer_locations ON :Customer(location) OPTIONS { dimensions: 2, min_entries: 4, max_entries: 16 } ```
Step 2: Geographic Range Query
Find customers within 50km of a point:
```gql MATCH (c:Customer) WHERE distance(c.location, point({latitude: 37.7749, longitude: -122.4194})) < 50000 – meters RETURN c.name, c.location, distance(c.location, point({latitude: 37.7749, longitude: -122.4194})) / 1000 AS distance_km ORDER BY distance_km ```
Step 3: Bounding Box Query
Find customers in a geographic bounding box:
```gql MATCH (c:Customer) WHERE c.location.latitude >= 37.0 AND c.location.latitude <= 38.0 AND c.location.longitude >= -123.0 AND c.location.longitude <= -121.0 RETURN c.name, c.location ```
Part 6: Index Maintenance and Monitoring
Step 1: List All Indexes
```gql CALL db.indexes() YIELD name, type, properties, options RETURN name, type, properties, options ```
Output: ```
| name | type | properties | options |
|---|---|---|---|
| customer_email | btree | [“email”] | {} |
| product_category_price | btree | [“category”,“price”] | {} |
| product_embeddings | hnsw | [“embedding”] | {dimensions:8,…} |
| product_search | fulltext | [“name”,“description”] | {k1:1.2,b:0.75} |
| customer_locations | spatial | [“location”] | {dimensions:2} |
| ``` |
Step 2: Analyze Index Usage
Check which indexes are being used:
```gql CALL db.index.stats() YIELD name, hits, misses, hit_rate WHERE hit_rate < 0.5 – Find underutilized indexes RETURN name, hits, misses, hit_rate ORDER BY hit_rate ```
Step 3: Rebuild Index
Rebuild an index to improve performance:
```gql – Drop and recreate index DROP INDEX customer_email
CREATE INDEX ON :Customer(email) ```
Step 4: Monitor Index Size
Check index storage size:
```gql CALL db.index.size() YIELD name, size_bytes, entry_count RETURN name, size_bytes / (1024 * 1024) AS size_mb, entry_count ORDER BY size_mb DESC ```
Part 7: Advanced Optimization Techniques
Covering Indexes
Create indexes that include all queried fields:
```gql – Create covering index CREATE INDEX ON :Product(category, name, price)
– Query covered by index (no table lookup needed) MATCH (p:Product) WHERE p.category = ‘Electronics’ RETURN p.name, p.price – All fields in index ```
Index Hints
Force query planner to use specific index:
```gql – Use index hint MATCH (p:Product) USING INDEX product_category_price WHERE p.category = ‘Electronics’ AND p.price < 300 RETURN p ```
Partial Indexes
Create indexes with filter conditions (if supported):
```gql – Index only active products CREATE INDEX ON :Product(name) WHERE in_stock = true ```
Index-Only Scans
Optimize queries to use index-only scans:
```gql – Before: Requires table access MATCH (p:Product) WHERE p.category = ‘Electronics’ RETURN p – Returns all properties
– After: Index-only scan MATCH (p:Product) WHERE p.category = ‘Electronics’ RETURN p.name, p.category – Only indexed properties ```
Part 8: Performance Testing
Step 1: Benchmark Query Performance
Create a benchmark dataset:
```bash
Generate 100k customers
for i in {1..100000}; do echo “CREATE (:Customer { id: $i, email: ‘user${i}@example.com ’, name: ‘User $i’ })” done | geode query ```
Step 2: Measure Query Performance
Without Index: ```gql – Measure time PROFILE MATCH (c:Customer {email: ‘[email protected] ’}) RETURN c – execution_time_ms: ~150ms (full scan) ```
With Index: ```gql CREATE INDEX ON :Customer(email)
PROFILE MATCH (c:Customer {email: ‘[email protected] ’}) RETURN c – execution_time_ms: ~2ms (index seek) ```
Performance Improvement: 75x faster
Step 3: Load Testing
Use a load testing tool to measure throughput:
```python import asyncio from geode_client import open_database import time
async def benchmark_queries(db, num_queries=1000): start = time.time() tasks = []
async with db.connection() as conn:
for i in range(num_queries):
task = conn.query(
"MATCH (c:Customer {email: $email}) RETURN c",
{"email": f"user{i}@example.com"}
)
tasks.append(task)
await asyncio.gather(*tasks)
elapsed = time.time() - start
qps = num_queries / elapsed
print(f"Queries: {num_queries}, Time: {elapsed:.2f}s, QPS: {qps:.0f}")
Run benchmark
db = open_database(“quic://localhost:3141”, pool_size=50) asyncio.run(benchmark_queries(db, 1000))
Output: Queries: 1000, Time: 2.15s, QPS: 465
```
Part 9: Common Pitfalls and Solutions
Pitfall 1: Too Many Indexes
Problem: Every index adds write overhead
Solution: Only create indexes for frequent queries ```gql – Analyze query patterns first CALL db.query.stats() YIELD query, execution_count WHERE execution_count > 100 RETURN query, execution_count ORDER BY execution_count DESC LIMIT 10
– Create indexes for top queries only ```
Pitfall 2: Wrong Index Type
Problem: B-tree index for vector similarity
Solution: Use appropriate index type ```gql – Wrong: B-tree for embeddings CREATE INDEX ON :Product(embedding)
– Correct: HNSW for embeddings CREATE VECTOR INDEX ON :Product(embedding) OPTIONS {dimensions: 8, distance: ‘cosine’} ```
Pitfall 3: Unselective Indexes
Problem: Index on low-cardinality field
Solution: Check index selectivity ```gql – Check cardinality MATCH (p:Product) RETURN p.category, count(*) AS count ORDER BY count DESC
– If most values have high count, index is not useful ```
Pitfall 4: Stale Statistics
Problem: Query planner uses outdated statistics
Solution: Update statistics regularly ```gql CALL db.stats.update() ```
Practice Exercises
Exercise 1: E-commerce Search Optimization
Task: Optimize this query: ```gql MATCH (p:Product) WHERE p.category = ‘Electronics’ AND p.price >= 100 AND p.price <= 500 AND fulltext_search(p.description, ‘wireless bluetooth’) RETURN p.name, p.price, fulltext_score(p.description, ‘wireless bluetooth’) AS score ORDER BY score DESC, p.price LIMIT 20 ```
Steps:
- Create composite B-tree index: category + price
- Create full-text index on description
- Verify with EXPLAIN
- Measure with PROFILE
Solution: ```gql CREATE INDEX ON :Product(category, price) CREATE FULLTEXT INDEX ON :Product(description)
EXPLAIN MATCH (p:Product) WHERE p.category = ‘Electronics’ AND p.price >= 100 AND p.price <= 500 AND fulltext_search(p.description, ‘wireless bluetooth’) RETURN p.name, p.price, fulltext_score(p.description, ‘wireless bluetooth’) AS score ORDER BY score DESC, p.price LIMIT 20 ```
Exercise 2: Geospatial Query Optimization
Task: Find customers within 10km of stores
```gql MATCH (c:Customer), (s:Store) WHERE distance(c.location, s.location) < 10000 RETURN c.name, s.name, distance(c.location, s.location) AS distance_m ORDER BY distance_m ```
Steps:
- Create spatial indexes on both Customer and Store locations
- Use EXPLAIN to verify R-tree usage
- Compare performance with/without indexes
Solution: ```gql CREATE SPATIAL INDEX ON :Customer(location) CREATE SPATIAL INDEX ON :Store(location)
PROFILE MATCH (c:Customer), (s:Store) WHERE distance(c.location, s.location) < 10000 RETURN c.name, s.name, distance(c.location, s.location) AS distance_m ORDER BY distance_m ```
Exercise 3: Vector Similarity Optimization
Task: Implement product recommendation based on embedding similarity
```gql – Given a product, find similar products MATCH (p:Product {id: 101}) MATCH (similar:Product) WHERE similar.id <> p.id AND vector_similarity(p.embedding, similar.embedding, ‘cosine’) > 0.7 RETURN similar.name, vector_similarity(p.embedding, similar.embedding, ‘cosine’) AS similarity ORDER BY similarity DESC LIMIT 10 ```
Steps:
- Create HNSW index with optimal parameters
- Test different distance metrics
- Tune ef_search for recall/latency trade-off
Solution: ```gql CREATE VECTOR INDEX product_recommendations ON :Product(embedding) OPTIONS { dimensions: 8, distance: ‘cosine’, ef_construction: 200, m: 16, ef_search: 100 }
– Query uses index automatically MATCH (p:Product {id: 101}) MATCH (similar:Product) WHERE similar.id <> p.id AND vector_similarity(p.embedding, similar.embedding, ‘cosine’) > 0.7 RETURN similar.name, vector_similarity(p.embedding, similar.embedding, ‘cosine’) AS similarity ORDER BY similarity DESC LIMIT 10 ```
Quick Reference Card
Index Type Selection
| Query Pattern | Index Type | Example |
|---|---|---|
| Equality lookup | B-tree | `email = ‘[email protected] ’` |
| Range query | B-tree | `price BETWEEN 100 AND 200` |
| Text search | Full-text | `fulltext_search(description, ‘wireless’)` |
| Vector similarity | HNSW | `vector_similarity(embedding, query_vector)` |
| Geographic range | Spatial | `distance(location, point) < 10000` |
| Prefix matching | B-tree | `name STARTS WITH ‘Alice’` |
EXPLAIN Plan Operators
| Operator | Meaning | Performance |
|---|---|---|
| IndexSeek | Index lookup | O(log n) - Excellent |
| NodeScan | Full table scan | O(n) - Poor |
| Filter | Post-scan filtering | O(n) - Poor if large result set |
| VectorIndexScan | HNSW search | O(log n) avg - Good |
| SpatialIndexScan | R-tree search | O(log n) - Good |
Optimization Checklist
- Analyze query patterns with db.query.stats()
- Create indexes for frequent queries
- Use EXPLAIN to verify index usage
- Use PROFILE to measure performance
- Monitor index hit rates
- Update statistics regularly
- Remove unused indexes
- Consider covering indexes for common queries
- Tune HNSW parameters for vector indexes
- Benchmark before and after index creation
Next Steps
- Production Deployment - Apply index strategies to your application
- Monitoring - Set up index usage monitoring
- Query Optimization - Review slow queries and add indexes
- Advanced Topics - Explore query hints and optimizer settings
- Performance Testing - Benchmark with realistic workloads
Related Documentation:
Last Updated: January 2026 Difficulty: Intermediate Estimated Time: 90 minutes