The Graph Algorithms and Analytics category provides comprehensive documentation for implementing advanced computational techniques on graph data using Geode. This collection covers everything from classical graph algorithms to modern machine learning approaches, enabling you to extract deep insights from connected data.
Introduction to Graph Algorithms
Graph algorithms are computational procedures designed to solve problems on graph-structured data. Unlike traditional database queries that retrieve data, graph algorithms analyze the structure and relationships within your data to discover patterns, measure importance, identify communities, and predict connections. These techniques are essential for modern applications in social networks, fraud detection, recommendation systems, knowledge graphs, and network analysis.
Geode provides native support for executing graph algorithms through its ISO-standard GQL query language, offering both built-in operations and the flexibility to implement custom algorithms using GQL’s powerful pattern matching and aggregation capabilities.
Core Algorithm Categories
Centrality Algorithms
Centrality algorithms identify the most important nodes in a graph based on various criteria. PageRank measures influence by analyzing the quality and quantity of incoming relationships, making it ideal for ranking web pages, identifying key influencers in social networks, or finding critical infrastructure nodes. Betweenness Centrality identifies nodes that act as bridges between different parts of the graph, useful for finding bottlenecks or key intermediaries. Degree Centrality simply counts connections, while Closeness Centrality measures how quickly information can spread from a node.
In Geode, you can compute centrality using aggregation queries:
// Find most connected users (degree centrality)
MATCH (u:User)-[r]-()
RETURN u.name, COUNT(r) AS connections
ORDER BY connections DESC
LIMIT 10
Community Detection
Community detection algorithms identify clusters of nodes that are more densely connected to each other than to the rest of the graph. Label Propagation is a fast, scalable approach where nodes adopt labels from their neighbors through iterative voting. Louvain Modularity optimizes for modularity, a measure of community structure quality. Weakly Connected Components finds isolated groups with no connections between them.
These algorithms are crucial for market segmentation, detecting fraud rings, organizing knowledge bases, and understanding social group dynamics.
# Python example: Finding communities through pattern analysis
import geode_client
client = geode_client.open_database("localhost:3141")
async with client.connection() as client:
# Find tightly connected user groups
result, _ = await client.query("""
MATCH (u1:User)-[:FOLLOWS]->(u2:User)
WHERE EXISTS {
MATCH (u2)-[:FOLLOWS]->(u1)
}
RETURN u1.community, COUNT(*) AS size
GROUP BY u1.community
ORDER BY size DESC
""")
Pathfinding Algorithms
Pathfinding algorithms discover routes through the graph. Shortest Path finds the minimum-hop route between nodes, essential for navigation, network routing, and supply chain optimization. Dijkstra’s Algorithm extends this to weighted graphs, considering relationship costs. All Shortest Paths finds all minimal routes, useful when multiple equivalent options exist.
Geode’s GQL implementation provides native pathfinding through path patterns:
// Find shortest path between people
MATCH path = SHORTEST (a:Person {name: 'Alice'})-[:KNOWS*]-(b:Person {name: 'Bob'})
RETURN path, LENGTH(path) AS hops
Graph Neural Networks and Embeddings
Graph Neural Networks (GNNs) learn vector representations of nodes that capture both their attributes and structural position in the graph. Node2Vec creates embeddings through random walks, enabling traditional machine learning on graph data. Graph Convolutional Networks aggregate information from neighborhoods to learn representations.
Geode supports storing and querying vector embeddings, enabling similarity search and machine learning workflows:
# Store node embeddings for similarity search
await client.execute("""
MATCH (p:Product {id: $product_id})
SET p.embedding = $embedding_vector
""", {"product_id": 123, "embedding_vector": [0.23, 0.45, -0.12, ...]})
# Find similar products using cosine similarity
result, _ = await client.query("""
MATCH (target:Product {id: $id})
MATCH (candidate:Product)
WHERE candidate.id <> $id
RETURN candidate.name,
vector.cosine_similarity(target.embedding, candidate.embedding) AS similarity
ORDER BY similarity DESC
LIMIT 5
""", {"id": 123})
Real-Time Analytics
Modern applications require analytics on live, changing data. Geode’s MVCC (Multi-Version Concurrency Control) architecture enables real-time queries without blocking writes, making it ideal for streaming analytics, live dashboards, and operational intelligence.
Real-time pattern detection identifies emerging trends as data arrives:
// Detect unusual activity patterns in real-time
MATCH (u:User)-[:TRANSACTION]->(m:Merchant)
WHERE m.timestamp > current_timestamp() - INTERVAL '5' MINUTE
GROUP BY u.id
HAVING COUNT(*) > u.avg_transactions_per_5min * 3
RETURN u.id, u.name, COUNT(*) AS suspicious_activity
Fraud Detection and Anomaly Detection
Graph algorithms excel at fraud detection because fraudsters create recognizable patterns in relationship data. Ring detection identifies circular money flows, velocity checks measure transaction speed, and network analysis reveals organized fraud networks that would be invisible in traditional databases.
// Detect potential fraud rings
MATCH (a:Account)-[:TRANSFER]->(b:Account)-[:TRANSFER]->(c:Account)
WHERE c.id = a.id
AND edge_timestamp(a, b) > current_timestamp() - INTERVAL '1' DAY
RETURN a.id, b.id, c.id, SUM(edge_amount) AS total_amount
Anomaly detection uses statistical measures and machine learning to identify outliers:
# Detect accounts with unusual transaction patterns
result, _ = await client.query("""
MATCH (a:Account)-[t:TRANSACTION]->()
WITH a,
AVG(t.amount) AS avg_amount,
STDDEV(t.amount) AS stddev_amount,
COUNT(t) AS transaction_count
WHERE stddev_amount > 0
MATCH (a)-[t:TRANSACTION]->()
WHERE ABS(t.amount - avg_amount) > 3 * stddev_amount
RETURN a.id, t.amount, avg_amount, stddev_amount
""")
Recommendation Systems
Graph-based recommendations leverage network structure to suggest relevant items. Collaborative filtering finds users with similar taste graphs and recommends items they’ve liked. Content-based filtering uses node properties and embeddings. Hybrid approaches combine both for superior results.
// Collaborative filtering recommendations
MATCH (me:User {id: $user_id})-[:LIKES]->(item)
<-[:LIKES]-(similar:User)
-[:LIKES]->(recommendation)
WHERE NOT EXISTS {
MATCH (me)-[:LIKES]->(recommendation)
}
RETURN recommendation.title, COUNT(similar) AS score
ORDER BY score DESC
LIMIT 10
Performance Considerations
When implementing graph algorithms in Geode:
- Use indexes strategically: Create indexes on node labels and frequently queried properties to accelerate pattern matching
- Leverage query profiling: Use
PROFILEto understand query execution and optimize bottlenecks - Consider materialization: For frequently computed metrics, store results as node properties and update incrementally
- Partition large computations: Break algorithms into smaller chunks using pagination or time windows
- Utilize prepared statements: Pre-compile complex analytical queries for better performance
-- Profile a centrality computation
PROFILE
MATCH (n:Node)-[r]-()
RETURN n.id, COUNT(r) AS degree
ORDER BY degree DESC
Best Practices
Start simple: Begin with basic aggregations and pattern matching before implementing complex algorithms. Many insights can be discovered through straightforward queries.
Validate results: Compare algorithm outputs against known ground truth when possible, especially for machine learning applications.
Monitor performance: Track query execution times and resource usage as your graph grows. What works at 1,000 nodes may need optimization at 1,000,000.
Document your models: Graph algorithms depend heavily on how you model relationships. Document which relationship types mean what and how they should be traversed.
Combine algorithms: The most powerful insights often come from combining multiple techniques - for example, using community detection to partition the graph, then running PageRank within each community.
Related Topics
- Machine Learning - Integration with ML frameworks
- PageRank - Importance ranking algorithm
- Community Detection - Finding clusters in graphs
- Fraud Detection - Detecting fraudulent patterns
- Recommendations - Building recommendation engines
- Real-Time Analytics - Streaming graph analytics
- Anomaly Detection - Identifying unusual patterns
- Performance Optimization - Optimizing analytical queries
Advanced Algorithm Implementation
Implementing Custom Algorithms
While Geode provides built-in implementations of common algorithms, many use cases require custom analytics tailored to specific domains. GQL’s pattern matching and aggregation capabilities enable implementing sophisticated algorithms directly in queries.
Iterative Computation with WITH Clauses
-- Simplified PageRank iteration
WITH 0.85 AS damping_factor
MATCH (n:Node)
SET n.rank = 1.0
// Iterate multiple times
WITH damping_factor
MATCH (n:Node)<-[r]-(m:Node)
WITH n, damping_factor, SUM(m.rank / degree(m)) AS incoming_rank
SET n.rank = (1 - damping_factor) + damping_factor * incoming_rank
Breadth-First Search Implementation
async def bfs_traversal(client, start_id, max_depth=5):
"""Implement BFS using iterative queries."""
visited = set([start_id])
current_level = [start_id]
for depth in range(max_depth):
if not current_level:
break
result, _ = await client.query("""
MATCH (n)-[:CONNECTED]-(neighbor)
WHERE n.id IN $current_ids
AND NOT neighbor.id IN $visited
RETURN DISTINCT neighbor.id AS id
""", {"current_ids": current_level, "visited": list(visited)})
next_level = []
for row in result.rows:
next_level.append(row['id'])
visited.add(row['id'])
current_level = next_level
return visited
Graph Sampling and Statistics
For very large graphs, computing exact metrics may be prohibitive. Sampling techniques provide approximate results with bounded error.
-- Random node sampling for statistics
MATCH (n:User)
WHERE random() < 0.01 // 1% sample
WITH COUNT(n) AS sample_size,
AVG(degree(n)) AS avg_degree,
STDDEV(degree(n)) AS stddev_degree
RETURN sample_size * 100 AS estimated_total,
avg_degree,
stddev_degree
Random Walk Sampling
-- Random walk for graph exploration
MATCH path = (start:Node {id: $start_id})-[:EDGE*10]->(end)
WHERE ALL (r IN relationships(path) WHERE random() > 0.5)
RETURN nodes(path), length(path)
Temporal Graph Analytics
Temporal graphs capture how relationships evolve over time, enabling analysis of dynamic networks.
Time-Windowed Analysis
-- Analyze community evolution over time windows
WITH DATE '2025-01-01' AS window_start,
DATE '2025-02-01' AS window_end
MATCH (u1:User)-[r:INTERACTION]->(u2:User)
WHERE r.timestamp >= window_start
AND r.timestamp < window_end
WITH u1.community AS community,
COUNT(DISTINCT r) AS interactions
RETURN community,
interactions,
interactions / 31.0 AS daily_avg
ORDER BY interactions DESC
Change Detection
-- Detect nodes with changing centrality
MATCH (n:Node)
WITH n,
n.centrality_last_week AS old_centrality,
calculate_centrality(n) AS new_centrality
WHERE ABS(new_centrality - old_centrality) / old_centrality > 0.5
RETURN n.id,
old_centrality,
new_centrality,
(new_centrality - old_centrality) / old_centrality AS pct_change
ORDER BY ABS(pct_change) DESC
Machine Learning Integration Patterns
Feature Engineering from Graphs
Graph structure provides rich features for machine learning models.
async def extract_node_features(client, node_id):
"""Extract graph-based features for ML."""
features, _ = await client.query("""
MATCH (n {id: $node_id})
OPTIONAL MATCH (n)-[r]-()
WITH n,
COUNT(DISTINCT r) AS degree,
COUNT{(n)-[:FRIEND]-()} AS friend_count,
COUNT{(n)-[:POSTED]->()} AS post_count
OPTIONAL MATCH (n)-[:FRIEND*2]-(foaf)
WITH n, degree, friend_count, post_count,
COUNT(DISTINCT foaf) AS network_size
RETURN degree,
friend_count,
post_count,
network_size,
friend_count::FLOAT / NULLIF(degree, 0) AS friend_ratio
""", {"node_id": node_id})
return features.rows[0] if features.rows else None
Graph Embedding Training
import numpy as np
from geode_client import Client
async def train_node2vec_embeddings(client, dimensions=128):
"""Generate Node2Vec embeddings."""
# Generate random walks
walks = []
result, _ = await client.query("""
MATCH (start:Node)
MATCH path = (start)-[:EDGE*10..20]-(end)
RETURN [n IN nodes(path) | n.id] AS walk
LIMIT 10000
""")
for row in result.rows:
walks.append(row['walk'])
# Train Word2Vec on walks (node sequences)
from gensim.models import Word2Vec
model = Word2Vec(walks, vector_size=dimensions, window=5, min_count=1)
# Store embeddings back in graph
for node_id in model.wv.index_to_key:
embedding = model.wv[node_id].tolist()
await client.execute("""
MATCH (n:Node {id: $node_id})
SET n.embedding = $embedding
""", {"node_id": int(node_id), "embedding": embedding})
Performance Optimization Strategies
Query Optimization for Algorithms
Materialized Views for Repeated Computations
-- Pre-compute and store degree centrality
MATCH (n:Node)-[r]-()
WITH n, COUNT(r) AS degree
SET n.degree = degree
-- Later queries use pre-computed values
MATCH (n:Node)
WHERE n.degree > 10
RETURN n.id, n.degree
Incremental Updates
Instead of recomputing metrics from scratch, update them incrementally:
async def update_centrality_incremental(client, new_edge_source, new_edge_target):
"""Update centrality scores incrementally when edge is added."""
await client.execute("""
MATCH (source {id: $source}), (target {id: $target})
SET source.degree = COALESCE(source.degree, 0) + 1,
target.degree = COALESCE(target.degree, 0) + 1
""", {"source": new_edge_source, "target": new_edge_target})
Parallel Algorithm Execution
import asyncio
async def parallel_community_detection(client, communities):
"""Run PageRank on multiple communities in parallel."""
tasks = []
for community_id in communities:
task = compute_pagerank_for_community(client, community_id)
tasks.append(task)
results = await asyncio.gather(*tasks)
return dict(zip(communities, results))
async def compute_pagerank_for_community(client, community_id):
"""Compute PageRank within a single community."""
result, _ = await client.query("""
MATCH (n:Node {community: $community_id})-[r]->()
WITH n, COUNT(r) AS out_degree
RETURN n.id, out_degree
""", {"community_id": community_id})
# PageRank computation logic here
return await process_pagerank(result)
Graph Algorithm Benchmarking
Performance Testing
import time
import statistics
async def benchmark_algorithm(client, algorithm_query, params, iterations=10):
"""Benchmark graph algorithm performance."""
timings = []
for i in range(iterations):
start = time.perf_counter()
await client.execute(algorithm_query, params)
end = time.perf_counter()
timings.append(end - start)
return {
'mean': statistics.mean(timings),
'median': statistics.median(timings),
'stddev': statistics.stdev(timings),
'min': min(timings),
'max': max(timings)
}
Scalability Analysis
-- Profile algorithm performance at different scales
PROFILE
MATCH (n:Node)-[r:EDGE*1..5]-(m:Node)
WHERE n.id = $start_id
RETURN COUNT(DISTINCT m) AS reachable_nodes,
COUNT(r) AS total_paths
Industry-Specific Applications
Bioinformatics: Protein Interaction Networks
-- Find proteins in same pathway
MATCH (p1:Protein {name: $protein_name})-[:INTERACTS_WITH*1..3]-(p2:Protein)
-[:PART_OF]->(pathway:Pathway)
RETURN pathway.name,
COUNT(DISTINCT p2) AS proteins_in_pathway
ORDER BY proteins_in_pathway DESC
Transportation: Route Optimization
-- Find optimal route with capacity constraints
MATCH path = SHORTEST (start:Location {name: $origin})
-[:ROUTE*]->
(end:Location {name: $destination})
WHERE ALL (r IN relationships(path) WHERE r.capacity >= $required_capacity)
RETURN path,
REDUCE(cost = 0, r IN relationships(path) | cost + r.distance) AS total_distance
Cybersecurity: Attack Graph Analysis
-- Identify critical vulnerabilities in attack paths
MATCH path = (attacker:ExternalNode)-[:EXPLOITS*]->(target:CriticalAsset)
WITH path,
[n IN nodes(path) WHERE n:Vulnerability] AS vulnerabilities
RETURN path,
vulnerabilities,
length(path) AS attack_steps
ORDER BY length(path)
LIMIT 10
Visualization and Exploration
While Geode focuses on data storage and querying, visualization tools help explore algorithm results.
async def export_for_visualization(client, community_id):
"""Export subgraph for visualization tools."""
result, _ = await client.query("""
MATCH (n:Node {community: $community_id})-[r]-(m:Node {community: $community_id})
RETURN COLLECT(DISTINCT n) AS nodes,
COLLECT(DISTINCT r) AS relationships
""", {"community_id": community_id})
data = result.rows[0] if result.rows else None
# Export to formats like GraphML, GEXF, or D3.js JSON
return convert_to_graphml(data['nodes'], data['relationships'])
Error Handling and Validation
import asyncio
from geode_client import QueryError
async def robust_algorithm_execution(client, query, params):
"""Execute algorithm with proper error handling."""
try:
result, _ = await client.query(query, params)
return result
except QueryError as e:
if "timeout" in str(e).lower():
# Retry with increased timeout
result, _ = await asyncio.wait_for(
client.query(query, params),
timeout=60
)
return result
elif "memory" in str(e).lower():
# Suggest reducing dataset
raise ValueError("Graph too large, consider sampling")
else:
raise
Further Reading
- GQL Pattern Matching - Foundation for graph algorithms
- Aggregation Functions - Statistical computations
- Vector Search - Similarity search and embeddings
- Query Profiling - Understanding algorithm performance
- Best Practices - Optimization techniques
- Performance Tuning - Query optimization guide
- Temporal Graphs - Time-aware graph analytics