Indexing and Query Optimization Tutorial
Learn to create indexes and optimize query performance using EXPLAIN and PROFILE in this hands-on 15-minute tutorial.
Prerequisites
- Completed MATCH Basics Tutorial
- Geode server running (
geode serve) - Access to Geode shell (
geode shell)
Tutorial Overview
Time: 15 minutes Difficulty: Intermediate Topics: Index creation, query plans, performance profiling, optimization strategies
By the end of this tutorial, you’ll be able to:
- Identify slow queries using PROFILE
- Create appropriate indexes for query patterns
- Use EXPLAIN to verify index usage
- Compare query performance before and after indexing
- Understand cost-based optimizer behavior
Step 1: Create Sample Dataset
Create a larger dataset to demonstrate performance differences:
CREATE GRAPH PerformanceTest;
USE PerformanceTest;
-- Create 100 person nodes
CREATE
(:Person {name: "Person_1", age: 25, city: "Seattle", salary: 50000}),
(:Person {name: "Person_2", age: 30, city: "Portland", salary: 60000}),
(:Person {name: "Person_3", age: 35, city: "Seattle", salary: 70000}),
(:Person {name: "Person_4", age: 28, city: "San Francisco", salary: 90000}),
(:Person {name: "Person_5", age: 42, city: "Seattle", salary: 85000});
-- In production, you'd create many more nodes
-- For this tutorial, 5 nodes demonstrate the concepts
Expected output:
Created 5 nodes
What You Learned
- Indexing becomes critical with large datasets (1000+ nodes)
- Small datasets may not show dramatic improvements
- Real benefits appear at scale (millions of nodes/relationships)
Step 2: Identify Slow Query with PROFILE
Run a query with performance profiling:
PROFILE
MATCH (p:Person)
WHERE p.age > 30
RETURN p.name, p.age, p.salary
ORDER BY p.salary DESC;
Expected output:
name | age | salary
----------|-----|--------
Person_4 | 28 | 90000
Person_5 | 42 | 85000
Person_3 | 35 | 70000
PROFILE:
Operation: SeqScan [Person]
Rows scanned: 5
Rows returned: 3
Execution time: 0.234ms
Filter: age > 30
Understanding PROFILE Output
- SeqScan: Sequential scan (reads every node)
- Rows scanned: Total nodes examined
- Rows returned: Nodes matching criteria
- Execution time: Query duration
- Filter: Condition applied during scan
Problem Identified
- SeqScan scans ALL nodes, even those not matching
- Inefficient for large datasets (scans millions to find thousands)
- Execution time grows linearly with data size
What You Learned
PROFILEshows query execution details- SeqScan indicates no index usage
- Performance degrades with dataset size
- Need indexes for efficient filtering
Step 3: Create B-tree Index
Create an index on the age property:
CREATE INDEX person_age_idx ON Person(age) USING btree;
Expected output:
Index 'person_age_idx' created successfully
Indexed 5 nodes
Build time: 0.12ms
B-tree Index Characteristics
- Best for: Range queries (
>,<,>=,<=,BETWEEN) - Supported operations: Equality (
=), ordering (ORDER BY) - Structure: Balanced tree for O(log n) lookups
- Space overhead: ~5-10% of data size
Index Naming Convention
<label>_<property>_idx
Examples:
person_age_idxproduct_price_idxuser_email_idx
What You Learned
CREATE INDEXbuilds an index structureUSING btreespecifies index type- Indexes are built immediately (synchronous)
- B-tree optimal for numeric ranges
Step 4: Verify Index Usage with EXPLAIN
Check if the optimizer uses your index:
EXPLAIN
MATCH (p:Person)
WHERE p.age > 30
RETURN p.name, p.age, p.salary
ORDER BY p.salary DESC;
Expected output:
QUERY PLAN:
┌─ Project [p.name, p.age, p.salary]
└─ Sort [p.salary DESC]
└─ Filter [p.age > 30]
└─ IndexScan [person_age_idx]
Index: person_age_idx (btree)
Condition: age > 30
Estimated rows: 3
Estimated cost: 2.5
Understanding EXPLAIN Output
- Project: Select specific columns
- Sort: ORDER BY implementation
- Filter: WHERE clause evaluation
- IndexScan: Uses index (not SeqScan!)
- Estimated rows: Optimizer’s row count prediction
- Estimated cost: Relative expense (lower = better)
Verify Index Used
Look for:
IndexScaninstead ofSeqScan- Index name (
person_age_idx) - Condition pushed down to index
What You Learned
EXPLAINshows query plan without executing- IndexScan confirms index usage
- Optimizer estimates rows and costs
- Index dramatically reduces work
Step 5: Compare Performance
Run PROFILE again with index:
PROFILE
MATCH (p:Person)
WHERE p.age > 30
RETURN p.name, p.age, p.salary
ORDER BY p.salary DESC;
Expected output:
name | age | salary
----------|-----|--------
Person_4 | 28 | 90000
Person_5 | 42 | 85000
Person_3 | 35 | 70000
PROFILE:
Operation: IndexScan [person_age_idx]
Index: person_age_idx (btree)
Rows scanned: 3
Rows returned: 3
Execution time: 0.08ms
Filter: age > 30 (index)
Performance Improvements
Before Index:
- Operation: SeqScan
- Rows scanned: 5
- Execution time: 0.234ms
After Index:
- Operation: IndexScan
- Rows scanned: 3 (only matching rows!)
- Execution time: 0.08ms
Speedup: ~3x faster (more dramatic with larger datasets)
What You Learned
- Index reduced rows scanned (5 → 3)
- Execution time improved by ~66%
- Improvement scales with dataset size
- Index pays off for repeated queries
Step 6: Composite Index
Create multi-column index for complex queries:
CREATE INDEX person_city_age_idx ON Person(city, age) USING btree;
Expected output:
Index 'person_city_age_idx' created successfully
Indexed 5 nodes
Build time: 0.15ms
When to Use Composite Indexes
Good for:
-- Uses city_age index efficiently
WHERE city = 'Seattle' AND age > 30
WHERE city = 'Seattle' -- Uses first column only
Not optimal:
-- Cannot use city_age index (skips first column)
WHERE age > 30
Index Column Order
Rule: Most selective column first, then range column
Example:
cityhas few distinct values (low selectivity)agehas many distinct values (high selectivity)- For equality + range: put equality first
Optimal order: (city, age) for WHERE city = X AND age > Y
What You Learned
- Composite indexes span multiple properties
- Column order matters for query matching
- Leftmost prefix rule applies
- Trade-off: storage vs query performance
Step 7: Index Types Comparison
Create indexes for different use cases:
B-tree Index (Range Queries)
CREATE INDEX product_price_idx ON Product(price) USING btree;
-- Optimal for:
WHERE price > 100
WHERE price BETWEEN 50 AND 200
ORDER BY price
Hash Index (Equality Only)
CREATE INDEX user_email_idx ON User(email) USING hash;
-- Optimal for:
WHERE email = 'user@example.com'
-- NOT optimal for:
WHERE email LIKE '%@example.com' -- Use btree instead
Full-Text Index (Text Search)
CREATE INDEX doc_content_idx ON Document(content) USING fulltext;
-- Optimal for:
WHERE fulltext_match(content, "graph database")
Vector Index (Similarity Search)
CREATE INDEX product_emb_idx ON Product(embedding) USING vector;
-- Optimal for:
WHERE vector_distance_cosine(embedding, $query_vec) < 0.5
ORDER BY vector_distance_cosine(embedding, $query_vec)
LIMIT 10
Bloom Filter Index (Membership Testing)
CREATE INDEX tags_idx ON Article(tags) USING bloom;
-- Optimal for:
WHERE 'database' IN tags
Index Type Summary
| Index Type | Best For | Operations | Space | Build Time |
|---|---|---|---|---|
| B-tree | Ranges, ordering | >, <, =, ORDER BY | Low | Fast |
| Hash | Exact matches | = | Very Low | Very Fast |
| Full-text | Text search | fulltext_match() | Medium | Medium |
| Vector | Similarity | vector_distance_*() | High | Slow |
| Bloom | Membership | IN, ANY | Very Low | Very Fast |
What You Learned
- Different index types for different query patterns
- B-tree is general-purpose (most common)
- Specialized indexes for specific operations
- Trade-offs between space, build time, and query speed
Step 8: Cost-Based Optimizer
Observe optimizer choosing between indexes:
-- Create overlapping indexes
CREATE INDEX person_age_idx ON Person(age) USING btree;
CREATE INDEX person_age_city_idx ON Person(age, city) USING btree;
-- Optimizer chooses best index
EXPLAIN
MATCH (p:Person)
WHERE p.age > 30 AND p.city = 'Seattle'
RETURN p.name;
Expected output:
QUERY PLAN:
└─ IndexScan [person_age_city_idx]
Index: person_age_city_idx (btree, composite)
Condition: age > 30 AND city = 'Seattle'
Estimated rows: 2
Estimated cost: 1.8
OPTIMIZER DECISION:
Considered indexes:
- person_age_idx: cost 3.2 (single column)
- person_age_city_idx: cost 1.8 (both columns match) ✓ CHOSEN
Reason: Composite index covers both predicates
Optimizer Selection Criteria
- Index coverage: More matched columns = better
- Selectivity: Fewer estimated rows = better
- Index type: Appropriate for operation
- Cost estimate: Logarithmic scaling for trees
What You Learned
- Optimizer automatically chooses best index
- Multiple indexes don’t hurt (optimizer picks one)
- Composite indexes preferred when applicable
- Cost model guides selection
Step 9: Index Statistics
Update statistics for accurate cost estimation:
-- Analyze table to collect statistics
ANALYZE Person;
Expected output:
Statistics updated for label 'Person'
Rows: 5
Distinct values (age): 5
Distinct values (city): 3
Min age: 25, Max age: 42
Null fraction (age): 0.0
Statistics Purpose
- Cardinality: Row counts per label
- Distinct values: Property selectivity
- Value distribution: Histograms for ranges
- Null fraction: Percentage of nulls
When to ANALYZE
- After bulk data loads
- After significant updates (>10% of data)
- Before major query optimization work
- Periodically (daily/weekly for active databases)
Auto-Update Configuration
# geode.yaml
optimizer:
statistics:
auto_update: true
update_interval: '1h'
sample_size: 10000
What You Learned
- Statistics improve optimizer accuracy
ANALYZEupdates statistics- Stale statistics lead to poor plans
- Auto-update recommended for production
Step 10: Index Maintenance
Monitor and maintain indexes:
List All Indexes
SHOW INDEXES ON Person;
Expected output:
name | label | properties | type | unique | status
----------------------|--------|--------------|--------|--------|--------
person_age_idx | Person | [age] | btree | false | ready
person_city_age_idx | Person | [city, age] | btree | false | ready
person_email_idx | Person | [email] | hash | true | ready
Drop Unused Index
DROP INDEX person_age_idx;
Expected output:
Index 'person_age_idx' dropped successfully
Rebuild Index
-- Drop and recreate
DROP INDEX person_age_idx;
CREATE INDEX person_age_idx ON Person(age) USING btree;
-- Or use REINDEX (if supported)
REINDEX person_age_idx;
Index Health Check
-- Check index fragmentation (future feature)
SHOW INDEX STATS person_age_idx;
What You Learned
SHOW INDEXESlists all indexes- Drop unused indexes to save space
- Rebuild for fragmentation or corruption
- Monitor index health regularly
Complete Example: E-Commerce Optimization
Optimize a product catalog query:
Initial Setup
CREATE GRAPH ECommerce;
USE ECommerce;
CREATE
(:Product {id: 1, name: "Laptop", category: "Electronics", price: 1200, rating: 4.5}),
(:Product {id: 2, name: "Mouse", category: "Electronics", price: 25, rating: 4.2}),
(:Product {id: 3, name: "Desk", category: "Furniture", price: 350, rating: 4.7}),
(:Product {id: 4, name: "Chair", category: "Furniture", price: 200, rating: 4.6}),
(:Product {id: 5, name: "Monitor", category: "Electronics", price: 400, rating: 4.4});
Slow Query
PROFILE
MATCH (p:Product)
WHERE p.category = 'Electronics' AND p.price < 500 AND p.rating > 4.0
RETURN p.name, p.price, p.rating
ORDER BY p.rating DESC;
Result: SeqScan, 5 rows scanned
Create Optimal Indexes
-- Composite index for category + price range
CREATE INDEX prod_cat_price_idx ON Product(category, price) USING btree;
-- Separate index for rating (used in ORDER BY)
CREATE INDEX prod_rating_idx ON Product(rating) USING btree;
-- Update statistics
ANALYZE Product;
Verify Optimization
EXPLAIN
MATCH (p:Product)
WHERE p.category = 'Electronics' AND p.price < 500 AND p.rating > 4.0
RETURN p.name, p.price, p.rating
ORDER BY p.rating DESC;
Expected plan:
└─ Project [p.name, p.price, p.rating]
└─ Sort [p.rating DESC] using prod_rating_idx
└─ Filter [p.rating > 4.0]
└─ IndexScan [prod_cat_price_idx]
Condition: category = 'Electronics' AND price < 500
Estimated rows: 2
Performance Comparison
PROFILE
MATCH (p:Product)
WHERE p.category = 'Electronics' AND p.price < 500 AND p.rating > 4.0
RETURN p.name, p.price, p.rating
ORDER BY p.rating DESC;
Before: 5 rows scanned, 0.25ms After: 2 rows scanned, 0.08ms Improvement: 68% faster, 60% fewer rows
What You Learned
- Combine indexes for complex queries
- Separate indexes for filter vs sort
- Composite indexes reduce scan size
- ORDER BY can use indexes too
Best Practices
Index Creation Strategy
- Profile first: Use PROFILE to find slow queries
- Identify patterns: Look for common WHERE/ORDER BY columns
- Create selectively: Don’t over-index (balance read vs write performance)
- Test impact: Compare EXPLAIN before and after
- Monitor usage: Drop unused indexes
Index Selection Guidelines
Always index:
- Primary keys and unique constraints
- Foreign key columns (join conditions)
- Columns in frequent WHERE clauses
- ORDER BY and GROUP BY columns
Consider indexing:
- Columns with high cardinality (many distinct values)
- Columns in complex queries
- Frequently accessed properties
Don’t index:
- Low cardinality columns (few distinct values, like boolean)
- Rarely queried properties
- Columns updated frequently (write overhead)
Performance Rules of Thumb
- Indexes speed up reads: Significant improvement for large datasets
- Indexes add write overhead: Consider trade-offs when creating indexes
- Composite indexes: 2-3 columns maximum recommended
- Index space: Varies based on data and index type
- Statistics: Update after significant data changes
Practice Exercises
Exercise 1: Optimize Range Query
-- Create test data
CREATE GRAPH RangeTest;
USE RangeTest;
CREATE
(:Event {name: "Event_1", timestamp: 1609459200, priority: 1}),
(:Event {name: "Event_2", timestamp: 1609545600, priority: 2}),
(:Event {name: "Event_3", timestamp: 1609632000, priority: 1}),
(:Event {name: "Event_4", timestamp: 1609718400, priority: 3});
-- Slow query
MATCH (e:Event)
WHERE e.timestamp > 1609500000 AND e.timestamp < 1609700000
RETURN e.name, e.timestamp
ORDER BY e.timestamp;
Task: Create an index to optimize this query. Verify with EXPLAIN.
Exercise 2: Composite Index Design
-- Query pattern
MATCH (u:User)
WHERE u.country = 'USA' AND u.age > 25 AND u.active = true
RETURN u.name, u.email;
Task: Design optimal composite index. Consider column order.
Exercise 3: Index Type Selection
For each query, choose the best index type:
-- Query A
WHERE user.email = 'alice@example.com'
-- Query B
WHERE product.price BETWEEN 50 AND 100
-- Query C
WHERE fulltext_match(article.content, "graph database")
-- Query D
WHERE vector_distance_l2(image.embedding, $query_vec) < 0.3
-- Query E
WHERE 'featured' IN product.tags
Solutions
Exercise 1 Solution
-- Create B-tree index for range query
CREATE INDEX event_timestamp_idx ON Event(timestamp) USING btree;
-- Verify usage
EXPLAIN
MATCH (e:Event)
WHERE e.timestamp > 1609500000 AND e.timestamp < 1609700000
RETURN e.name, e.timestamp
ORDER BY e.timestamp;
-- Should show IndexScan with event_timestamp_idx
Exercise 2 Solution
-- Column order: equality first, then range, then equality
-- country (equality, medium selectivity)
-- age (range, high selectivity)
-- active (equality, low selectivity)
CREATE INDEX user_country_age_active_idx ON User(country, age, active) USING btree;
-- Alternative (if age range is wide):
CREATE INDEX user_country_active_age_idx ON User(country, active, age) USING btree;
-- Verify
EXPLAIN
MATCH (u:User)
WHERE u.country = 'USA' AND u.age > 25 AND u.active = true
RETURN u.name, u.email;
Exercise 3 Solution
-- Query A: Hash index (exact match)
CREATE INDEX user_email_idx ON User(email) USING hash;
-- Query B: B-tree index (range query)
CREATE INDEX product_price_idx ON Product(price) USING btree;
-- Query C: Full-text index
CREATE INDEX article_content_idx ON Article(content) USING fulltext;
-- Query D: Vector index (similarity search)
CREATE INDEX image_embedding_idx ON Image(embedding) USING vector;
-- Query E: Bloom filter (membership testing)
CREATE INDEX product_tags_idx ON Product(tags) USING bloom;
Troubleshooting
Index Not Used by Optimizer
Problem: EXPLAIN shows SeqScan despite index existing
Solutions:
- Update statistics:
ANALYZE Person; - Check index covers query columns
- Verify WHERE clause uses indexed columns
- Try composite index for multiple predicates
- Check data distribution (very small tables may use SeqScan)
Slow Index Creation
Problem: CREATE INDEX takes too long
Solutions:
- Build indexes during maintenance windows
- Use
CONCURRENTLYoption (if available) - Increase memory for index build
- Consider smaller sample size for partial indexes
Query Slower After Adding Index
Problem: Performance degraded post-indexing
Solutions:
- Update statistics:
ANALYZE - Check if wrong index chosen
- Consider dropping conflicting indexes
- Verify index type matches query pattern
- Check write overhead on heavily updated tables
Next Steps
Continue your learning journey:
- Graph Algorithms Tutorial - PageRank, centrality, community detection
- Performance Tuning Guide - Advanced optimization techniques
- Query Language Reference - Complete GQL syntax
- Indexing Guide - Comprehensive indexing documentation
Quick Reference
Index Commands
-- Create index
CREATE INDEX idx_name ON Label(property) USING btree;
-- Composite index
CREATE INDEX idx_name ON Label(prop1, prop2) USING btree;
-- Drop index
DROP INDEX idx_name;
-- List indexes
SHOW INDEXES ON Label;
-- Update statistics
ANALYZE Label;
Index Types
USING btree -- Range queries, ordering
USING hash -- Exact matches only
USING fulltext -- Text search
USING vector -- Similarity search
USING bloom -- Membership testing
Query Analysis
EXPLAIN ... -- Show query plan (no execution)
PROFILE ... -- Execute and show performance
ANALYZE Label; -- Update statistics
Tutorial Complete! You now understand indexing and query optimization in Geode.