Graph Modeling Guide
Designing an effective graph schema is crucial for query performance and application maintainability. This guide covers best practices, common patterns, and anti-patterns for modeling data in Geode.
Graph Thinking
Shift from Tables to Graphs
Relational Thinking:
-- Tables with foreign keys
Users (id, name, email)
Posts (id, user_id, content)
Likes (user_id, post_id, timestamp)
Graph Thinking:
// Entities as nodes, connections as relationships
(:User {id, name, email})
(:Post {id, content})
(User)-[:POSTED]->(Post)
(User)-[:LIKES {timestamp}]->(Post)
Key Differences:
- Relationships are first-class: Not just foreign keys
- Traversal is natural: Follow relationships directly
- Flexible schema: Add new relationships easily
Core Principles
1. Nodes for Entities
Rule: Use nodes for entities you’ll query independently.
Good:
(:Person {name, age, email})
(:Company {name, industry})
(:Product {name, price})
Bad:
// Don't use nodes for simple values
(:Email {address}) // Just use property
(:Age {value}) // Just use property
2. Relationships for Connections
Rule: Use relationships to connect entities with meaningful semantics.
Good:
(:Person)-[:WORKS_AT {since, role}]->(:Company)
(:Person)-[:KNOWS {since, context}]->(:Person)
(:Person)-[:PURCHASED {date, price}]->(:Product)
Bad:
// Don't use properties for relationships
(:Person {company_id}) // Use relationship instead
// Don't use generic relationships
(:Person)-[:RELATED_TO]->(:Company) // Be specific
3. Properties for Attributes
Rule: Use properties for intrinsic attributes of entities.
Good:
(:Person {
name: "Alice",
age: 30,
email: "[email protected]",
created_at: timestamp()
})
Bad:
// Don't create nodes for attributes
(:Person)-[:HAS_NAME]->(:Name {value: "Alice"})
// Don't duplicate data unnecessarily
(:Person {
name: "Alice",
uppercase_name: "ALICE", // Compute on demand
name_length: 5 // Compute on demand
})
Common Patterns
Pattern 1: Hierarchies
Use Case: Organizational structures, categories, taxonomies
Implementation:
// Tree structure
(:Category {name: "Electronics"})
-[:PARENT]->(:Category {name: "Computers"})
-[:PARENT]->(:Category {name: "Laptops"})
-[:PARENT]->(:Category {name: "Desktops"})
// Query: Find all descendants
MATCH (root:Category {name: "Electronics"})-[:PARENT*]->(child)
RETURN child.name
// Query: Find path to root
MATCH path = (leaf:Category {name: "Laptops"})-[:PARENT*]->(root)
WHERE NOT (root)-[:PARENT]->()
RETURN path
Best Practices:
- Consider max depth when querying (
[:PARENT*1..5]) - Add level property for quick filtering
- Cache common paths if frequently accessed
Pattern 2: Timeline/Activity
Use Case: Events, posts, messages, logs
Implementation:
// Timeline with linked list
(:Post {content, created_at})-[:NEXT]->(:Post)
// Or time-based relationships
(:User)-[:POSTED {timestamp}]->(:Post)
// Query: Recent activity
MATCH (u:User)-[p:POSTED]->(post:Post)
WHERE p.timestamp > timestamp() - duration('P7D')
RETURN post
ORDER BY p.timestamp DESC
LIMIT 20
Best Practices:
- Index timestamp properties
- Consider time-bucketing for very active users
- Use LIMIT to control result size
Pattern 3: Recommendations
Use Case: Friend suggestions, product recommendations
Implementation:
// Collaborative filtering
MATCH (me:User)-[:PURCHASED]->(p:Product)<-[:PURCHASED]-(other)
MATCH (other)-[:PURCHASED]->(recommendation)
WHERE NOT (me)-[:PURCHASED]->(recommendation)
RETURN recommendation, count(other) AS score
ORDER BY score DESC
LIMIT 10
// Content-based
MATCH (me:User)-[:LIKES]->(item)
WITH me, collect(item.category) AS my_categories
MATCH (recommendation)
WHERE recommendation.category IN my_categories
AND NOT (me)-[:LIKES]->(recommendation)
RETURN recommendation
Best Practices:
- Limit traversal depth
- Use aggregations for scoring
- Consider pre-computing for popular items
Pattern 4: Access Control
Use Case: Permissions, privacy, authorization
Implementation:
// Role-based access
(:User)-[:HAS_ROLE]->(:Role {name: "admin"})
(:Role)-[:CAN_ACCESS]->(:Resource)
// Permission check
MATCH (user:User {id: $user_id})-[:HAS_ROLE]->(:Role)
-[:CAN_ACCESS]->(resource:Resource {id: $resource_id})
RETURN count(resource) > 0 AS has_access
// Ownership
(:User)-[:OWNS]->(:Document)
// Sharing
(:User)-[:CAN_VIEW]->(:Document)
(:User)-[:CAN_EDIT]->(:Document)
Best Practices:
- Cache permission checks
- Use groups/roles instead of individual permissions
- Consider separate authorization service for complex cases
Pattern 5: Versioning
Use Case: Document history, audit trails
Implementation:
// Version chain
(:Document {current_version: 3})
-[:HAS_VERSION]->(:Version {number: 3, content, timestamp})
-[:PREVIOUS]->(:Version {number: 2})
-[:PREVIOUS]->(:Version {number: 1})
// Query: Get history
MATCH (doc:Document {id: $id})-[:HAS_VERSION]->(v:Version)
MATCH path = (v)-[:PREVIOUS*0..]->(older)
RETURN older
ORDER BY older.number DESC
// Query: Diff versions
MATCH (doc:Document {id: $id})-[:HAS_VERSION]->(current)
MATCH (current)-[:PREVIOUS*3]->(old)
RETURN current.content, old.content
Best Practices:
- Limit history depth or archive old versions
- Consider external storage for large content
- Add version metadata (author, timestamp, message)
Schema Design Process
Step 1: Identify Entities
List the main entities in your domain:
E-commerce example:
- Users (customers, sellers)
- Products
- Categories
- Orders
- Reviews
- Addresses
Step 2: Model Relationships
Determine how entities connect:
(:User)-[:SELLS]->(:Product)
(:User)-[:PURCHASED]->(:Product)
(:User)-[:REVIEWED]->(:Product)
(:Product)-[:IN_CATEGORY]->(:Category)
(:Order)-[:CONTAINS]->(:Product)
(:User)-[:PLACED]->(:Order)
(:User)-[:HAS_ADDRESS]->(:Address)
Step 3: Add Properties
Determine what data belongs on each entity:
(:User {
id: INTEGER,
email: STRING,
name: STRING,
created_at: TIMESTAMP
})
(:Product {
id: INTEGER,
name: STRING,
price: FLOAT,
description: STRING,
stock: INTEGER
})
(:Order {
id: INTEGER,
total: FLOAT,
status: STRING,
created_at: TIMESTAMP
})
Step 4: Add Relationship Properties
Capture context about connections:
(:User)-[:REVIEWED {
rating: INTEGER,
comment: STRING,
timestamp: TIMESTAMP
}]->(:Product)
(:User)-[:PURCHASED {
quantity: INTEGER,
price: FLOAT,
timestamp: TIMESTAMP
}]->(:Product)
Step 5: Plan for Queries
Design schema to support your queries efficiently:
// Query: Products in category
// Ensure relationship direction supports this
MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product)
RETURN product
// Query: User's purchase history
// Add timestamp for ordering
MATCH (user:User)-[p:PURCHASED]->(product:Product)
RETURN product
ORDER BY p.timestamp DESC
Advanced Patterns
Hypernodes
Problem: Some nodes have millions of relationships (celebrity accounts, popular products)
Solution: Introduce intermediate nodes
Bad:
// Popular product with 10M likes
(:Product {id: 1})<-[:LIKES]-(:User) // x10,000,000
Good:
// Bucket by time
(:Product)-[:LIKES_2024_01]->(:LikeBucket)
(:LikeBucket)<-[:LIKES]->(:User) // x1,000
// Query with bucket
MATCH (product:Product {id: 1})-[:LIKES_2024*]->(bucket)
MATCH (bucket)<-[:LIKES]-(user)
RETURN count(user)
Denormalization
When: Read performance critical, data rarely changes
Example:
// Normalized
(:User)-[:LIVES_IN]->(:City)-[:IN_STATE]->(:State)-[:IN_COUNTRY]->(:Country)
// Denormalized for fast queries
(:User {
city: "San Francisco",
state: "California",
country: "USA"
})
// Trade-off: Faster queries, but updates affect multiple nodes
Materialized Paths
When: Frequent queries on hierarchies
Example:
// Store full path
(:Category {
name: "Laptops",
path: ["Electronics", "Computers", "Laptops"],
level: 3
})
// Fast queries
MATCH (cat:Category)
WHERE "Electronics" IN cat.path
RETURN cat
// Fast level queries
MATCH (cat:Category {level: 2})
RETURN cat
Event Sourcing
When: Need complete audit trail
Example:
// Events as nodes
(:User)-[:TRIGGERED]->(:Event {
type: "user.created",
timestamp: timestamp(),
data: {...}
})
(:User)-[:TRIGGERED]->(:Event {
type: "user.updated",
timestamp: timestamp(),
data: {...}
})
// Rebuild state from events
MATCH (user:User)-[:TRIGGERED]->(event:Event)
WHERE event.type STARTS WITH "user."
RETURN event
ORDER BY event.timestamp
Anti-Patterns
1. Deep Nesting
Bad:
// Too many levels
(:Person)-[:KNOWS]->(:Person)-[:KNOWS]->(:Person)-[:KNOWS*10]->(:Person)
Solution: Limit depth or add shortcuts
// Add direct connections for common paths
(:Person)-[:KNOWS_INDIRECTLY {hops: 3}]->(:Person)
2. Property Explosions
Bad:
(:User {
friend_1_id: 123,
friend_2_id: 456,
friend_3_id: 789,
// ...
friend_1000_id: 999
})
Solution: Use relationships
(:User)-[:FRIENDS_WITH]->(:User)
3. Generic Relationships
Bad:
(:Person)-[:RELATED_TO]->(:Company)
(:Person)-[:RELATED_TO]->(:Product)
Solution: Specific relationship types
(:Person)-[:WORKS_AT]->(:Company)
(:Person)-[:PURCHASED]->(:Product)
4. Nodes for Properties
Bad:
(:Person)-[:HAS_EMAIL]->(:Email {address: "[email protected]"})
(:Person)-[:HAS_PHONE]->(:Phone {number: "555-0123"})
Solution: Use properties
(:Person {
email: "[email protected]",
phone: "555-0123"
})
5. Bi-directional Duplication
Bad:
// Creating both directions unnecessarily
(:Person)-[:KNOWS]->(:Person)
(:Person)<-[:KNOWN_BY]-(:Person)
Solution: Use one direction, query both ways
// Create one direction
(:Person)-[:KNOWS]->(:Person)
// Query either direction
MATCH (p1:Person)-[:KNOWS]-(p2:Person)
RETURN p2
Migration Strategy
From Relational
-- SQL Schema
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name VARCHAR,
email VARCHAR
);
CREATE TABLE friendships (
user_id INT REFERENCES users(id),
friend_id INT REFERENCES users(id),
since DATE
);
To Graph:
// Import users
LOAD CSV WITH HEADERS FROM 'file:///users.csv' AS row
CREATE (:User {
id: toInteger(row.id),
name: row.name,
email: row.email
})
// Import friendships
LOAD CSV WITH HEADERS FROM 'file:///friendships.csv' AS row
MATCH (u1:User {id: toInteger(row.user_id)})
MATCH (u2:User {id: toInteger(row.friend_id)})
CREATE (u1)-[:KNOWS {since: date(row.since)}]->(u2)
Incremental Migration
# Dual-write to both systems during migration
async def create_user(data):
# Write to PostgreSQL
await pg_conn.execute(
"INSERT INTO users (name, email) VALUES ($1, $2)",
data['name'], data['email']
)
# Write to Geode
await geode_client.query(
"CREATE (:User {name: $name, email: $email})",
params=data
)
Testing Your Schema
Unit Tests
import pytest
@pytest.mark.asyncio
async def test_user_creation():
await client.query(
"CREATE (:User {id: 1, name: 'Test'})"
)
result, _ = await client.query(
"MATCH (u:User {id: 1}) RETURN u.name"
)
user = result.rows[0] if result.rows else None
assert user['u.name'] == 'Test'
@pytest.mark.asyncio
async def test_friendship():
# Create users
await client.query(
"CREATE (:User {id: 1, name: 'Alice'})"
)
await client.query(
"CREATE (:User {id: 2, name: 'Bob'})"
)
# Create friendship
await client.query("""
MATCH (a:User {id: 1})
MATCH (b:User {id: 2})
CREATE (a)-[:KNOWS]->(b)
""")
# Verify
result, _ = await client.query("""
MATCH (a:User {id: 1})-[:KNOWS]->(b)
RETURN b.name
""")
friend = result.rows[0] if result.rows else None
assert friend['b.name'] == 'Bob'
Documentation
Document your schema:
# Schema Documentation
## Nodes
### User
Represents a user account.
Properties:
- `id` (INTEGER): Unique identifier
- `username` (STRING): Unique username
- `email` (STRING): Email address
- `created_at` (TIMESTAMP): Account creation time
Relationships:
- `[:KNOWS]->(:User)`: Friend connections
- `[:POSTED]->(:Post)`: Created posts
- `[:LIKES]->(:Post)`: Liked posts
Constraints:
- `user_id_unique`: Unique user ID
- `user_username_unique`: Unique username
Indexes:
- `user_username`: B-tree on username
- `user_email`: B-tree on email
Conclusion
Good graph modeling:
- Makes queries natural and efficient
- Evolves with requirements
- Balances normalization and performance
- Documents design decisions
Start simple, iterate based on usage patterns, and always profile your queries.
Resources
Questions? Discuss schema design in our forum .