Comprehensive telemetry and observability are essential for operating Geode in production. Geode provides built-in instrumentation, metrics export, structured logging, and distributed tracing support for understanding system behavior, identifying performance bottlenecks, and troubleshooting issues efficiently.
Metrics Collection
Geode exposes detailed metrics for monitoring database health and performance.
Core Metrics
Query Performance:
geode_query_duration_seconds{quantile="0.50|0.95|0.99"}
geode_query_count_total{status="success|error"}
geode_query_result_rows{quantile="0.50|0.95|0.99"}
Connection Pool:
geode_pool_connections_active
geode_pool_connections_idle
geode_pool_connections_total
geode_pool_wait_duration_seconds{quantile="0.95|0.99"}
Resource Utilization:
geode_memory_usage_bytes{type="heap|graph|index"}
geode_cpu_usage_percent
geode_disk_usage_bytes{type="data|wal|index"}
Graph Statistics:
geode_nodes_total{label="User|Product|..."}
geode_relationships_total{type="FRIENDS_WITH|PURCHASED|..."}
geode_properties_total
Prometheus Integration
Geode natively exports metrics in Prometheus format.
Configuration
Enable Prometheus endpoint:
# Start Geode with metrics endpoint
./geode serve --metrics-port 9090
Or configure via config file:
[telemetry]
enabled = true
metrics_port = 9090
metrics_path = "/metrics"
Prometheus Scrape Config
# prometheus.yml
scrape_configs:
  - job_name: 'geode'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
    scrape_timeout: 10s
Example Metrics Output
# HELP geode_query_duration_seconds Query execution time
# TYPE geode_query_duration_seconds summary
geode_query_duration_seconds{quantile="0.5"} 0.012
geode_query_duration_seconds{quantile="0.95"} 0.089
geode_query_duration_seconds{quantile="0.99"} 0.234
geode_query_duration_seconds_sum 1234.56
geode_query_duration_seconds_count 98765
# HELP geode_pool_connections_active Active database connections
# TYPE geode_pool_connections_active gauge
geode_pool_connections_active 42
# HELP geode_nodes_total Total number of nodes by label
# TYPE geode_nodes_total gauge
geode_nodes_total{label="User"} 1000000
geode_nodes_total{label="Product"} 500000
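The exposition format above is plain text and straightforward to consume directly. As an illustrative sketch (not part of any Geode client library), a few lines of Python can turn it into a lookup table; the metric names come from the sample output above:

```python
def parse_metrics(text):
    """Parse Prometheus exposition text into {(name, labels): value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = name_part, ""
        metrics[(name, labels)] = float(value)
    return metrics

sample = """\
# TYPE geode_pool_connections_active gauge
geode_pool_connections_active 42
geode_query_duration_seconds{quantile="0.99"} 0.234
geode_nodes_total{label="User"} 1000000
"""
parsed = parse_metrics(sample)
print(parsed[("geode_pool_connections_active", "")])  # 42.0
```

In practice you would feed this the body of an HTTP GET against the `/metrics` endpoint rather than a hard-coded string.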
Grafana Dashboards
Pre-built Grafana dashboards provide visual monitoring.
Key Dashboard Panels
Query Performance:
# P95 Query Latency (precomputed summary quantile)
geode_query_duration_seconds{quantile="0.95"}
# Query Throughput
rate(geode_query_count_total[1m])
# Error Rate
rate(geode_query_count_total{status="error"}[5m]) /
rate(geode_query_count_total[5m])
Resource Utilization:
# Memory Usage (summed across heap/graph/index types)
sum(geode_memory_usage_bytes) / geode_memory_limit_bytes * 100
# CPU Usage
geode_cpu_usage_percent
# Disk Usage (summed across data/wal/index types)
sum(geode_disk_usage_bytes) / geode_disk_capacity_bytes * 100
Connection Pool Health:
# Pool Utilization
geode_pool_connections_active / geode_pool_connections_total * 100
# Connection Wait Time (P99, precomputed summary quantile)
geode_pool_wait_duration_seconds{quantile="0.99"}
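The panel queries above can also be run ad hoc against Prometheus's standard HTTP API (`GET /api/v1/query`). A minimal Python sketch, assuming Prometheus is reachable at localhost:9090:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

def instant_query_url(expr):
    """Build a Prometheus /api/v1/query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": expr})
    return f"{PROMETHEUS}/api/v1/query?{params}"

def instant_query(expr):
    """Run an instant query and return the result list (network required)."""
    with urllib.request.urlopen(instant_query_url(expr)) as resp:
        return json.load(resp)["data"]["result"]

# Build (but do not execute) the error-rate panel query
url = instant_query_url('rate(geode_query_count_total{status="error"}[5m])')
print(url)
```

The same pattern works for any expression in this section; Grafana issues equivalent requests under the hood.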
Dashboard Import
# Download pre-built dashboard
curl -O https://geodedb.com/grafana/geode-dashboard.json
# Import via Grafana UI or API
curl -X POST http://localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @geode-dashboard.json
Distributed Tracing
Geode supports OpenTelemetry for distributed tracing across services.
OpenTelemetry Configuration
# geode.toml
[telemetry.tracing]
enabled = true
exporter = "jaeger" # or "zipkin", "otlp"
endpoint = "http://jaeger:14268/api/traces"
sample_rate = 0.1 # Trace 10% of requests
Trace Context Propagation
Geode automatically propagates trace context from client libraries:
Go Client:
import "go.opentelemetry.io/otel"
// Obtain a tracer from the globally registered provider
tracer := otel.Tracer("geode-example")
ctx, span := tracer.Start(ctx, "fetch_user_network")
defer span.End()
// Trace context propagated to Geode
rows, err := db.QueryContext(ctx, `
MATCH (u:User {id: ?})-[:FRIENDS_WITH]->(f)
RETURN f
`, userID)
Python Client:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("fetch_user_network"):
    # Trace context propagated automatically
    async with client.connection() as conn:
        results, _ = await conn.query(
            """
            MATCH (u:User {id: $id})-[:FRIENDS_WITH]->(f)
            RETURN f
            """,
            {"id": user_id},
        )
Trace Spans
Geode creates spans for internal operations:
Service Request
├─ Query Parsing (2ms)
├─ Query Planning (5ms)
├─ Query Execution (15ms)
│  ├─ Index Lookup (3ms)
│  ├─ Graph Traversal (10ms)
│  └─ Result Serialization (2ms)
└─ Response Transmission (1ms)
Structured Logging
Geode uses structured JSON logging for machine-readable logs.
Log Levels
# Set log level
export GEODE_LOG_LEVEL=info # debug|info|warn|error
# Start with specific level
./geode serve --log-level debug
Log Format
{
"timestamp": "2026-01-24T14:30:00.123Z",
"level": "info",
"message": "Query executed successfully",
"query_id": "q-12345",
"duration_ms": 45,
"result_rows": 100,
"user_id": "user-456",
"client_ip": "192.168.1.100"
}
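Because every field is machine-readable, ad-hoc filtering needs no log-parsing regexes. An illustrative sketch that pulls slow queries out of JSON log lines (the field names follow the format above; the helper itself is hypothetical):

```python
import json

def slow_queries(log_lines, threshold_ms=100):
    """Yield parsed log records whose duration exceeds threshold_ms."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if record.get("duration_ms", 0) > threshold_ms:
            yield record

lines = [
    '{"level": "info", "query_id": "q-1", "duration_ms": 45}',
    '{"level": "info", "query_id": "q-2", "duration_ms": 230}',
]
slow = list(slow_queries(lines))
print([r["query_id"] for r in slow])  # ['q-2']
```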
Log Aggregation
Forward logs to aggregation services:
Loki:
# promtail.yml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: geode
    static_configs:
      - targets:
          - localhost
        labels:
          job: geode
          __path__: /var/log/geode/*.log
ELK Stack:
# Forward to Elasticsearch
./geode serve --log-format json | \
filebeat -e -c filebeat.yml
Query Profiling
Detailed query profiling for performance optimization.
PROFILE Command
PROFILE
MATCH (u:User {email: $email})-[:PURCHASED]->(p:Product)
RETURN p.name, p.price
ORDER BY p.price DESC
LIMIT 10
Profile Output
Query Plan:
1. Index Lookup: User(email) [2ms, 1 row]
2. Expand Relationships: PURCHASED [8ms, 45 rows]
3. Project: p.name, p.price [1ms]
4. Sort: p.price DESC [3ms]
5. Limit: 10 [<1ms]
Total Execution Time: 15ms
Memory Allocated: 256KB
Rows Processed: 45
Rows Returned: 10
Index Usage:
- User.email: HIT (selectivity: 0.0001%)
Query Metrics per Execution
{
"query_id": "q-67890",
"execution_time_ms": 15,
"planning_time_ms": 2,
"rows_processed": 45,
"rows_returned": 10,
"memory_allocated_bytes": 262144,
"index_lookups": 1,
"index_hits": 1,
"index_misses": 0,
"traversal_depth": 1
}
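These per-execution records lend themselves to simple derived diagnostics. An illustrative sketch (the `summarize` helper is hypothetical) computing index hit rate and row selectivity from the fields shown above:

```python
def summarize(metrics):
    """Derive efficiency ratios from a per-execution metrics record."""
    lookups = metrics["index_lookups"]
    hit_rate = metrics["index_hits"] / lookups if lookups else None
    # Fraction of processed rows that made it into the result
    selectivity = metrics["rows_returned"] / metrics["rows_processed"]
    return {
        "index_hit_rate": hit_rate,
        "row_selectivity": selectivity,
        "planning_overhead_ms": metrics["planning_time_ms"],
    }

record = {
    "query_id": "q-67890",
    "execution_time_ms": 15,
    "planning_time_ms": 2,
    "rows_processed": 45,
    "rows_returned": 10,
    "index_lookups": 1,
    "index_hits": 1,
    "index_misses": 0,
}
print(summarize(record))
```

A low selectivity combined with many processed rows is usually the first hint that a query needs a more selective index.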
Performance Alerts
Configure alerts for performance degradation.
Prometheus AlertManager Rules
groups:
  - name: geode_performance
    rules:
      - alert: HighQueryLatency
        expr: geode_query_duration_seconds{quantile="0.99"} > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High query latency detected"
          description: "P99 query latency is {{ $value }}s"
      - alert: HighErrorRate
        expr: |
          rate(geode_query_count_total{status="error"}[5m]) /
          rate(geode_query_count_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High query error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: ConnectionPoolExhaustion
        expr: |
          geode_pool_connections_active /
          geode_pool_connections_total > 0.90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Connection pool near capacity"
      - alert: MemoryPressure
        expr: |
          geode_memory_usage_bytes /
          geode_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
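The `for:` clause in these rules means a condition must hold continuously before the alert fires, which suppresses one-sample spikes. The same idea can be sketched in a few lines of Python, here applied to the pool-exhaustion rule (the `SustainedAlert` class is purely illustrative):

```python
from collections import deque

class SustainedAlert:
    """Fire only when a condition holds for `required` consecutive
    samples, mimicking the `for:` clause in the rules above."""

    def __init__(self, predicate, required):
        self.predicate = predicate
        self.window = deque(maxlen=required)

    def observe(self, sample):
        self.window.append(self.predicate(sample))
        return len(self.window) == self.window.maxlen and all(self.window)

# Pool exhaustion: active / total > 0.90 for 4 consecutive samples
pool_alert = SustainedAlert(lambda s: s["active"] / s["total"] > 0.90,
                            required=4)

samples = [{"active": a, "total": 50} for a in (40, 46, 47, 48, 49)]
fired = [pool_alert.observe(s) for s in samples]
print(fired)  # [False, False, False, False, True]
```

The first sample (80% utilization) resets nothing here, but it keeps the window from being uniformly true until four breaching samples in a row arrive.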
Client-Side Telemetry
Client libraries emit telemetry for end-to-end visibility.
Go Client Metrics
import "geodedb.com/geode/telemetry"
// Enable Prometheus metrics
telemetry.EnablePrometheus(":9091")
// Metrics exposed:
// - geode_client_queries_total
// - geode_client_query_duration_seconds
// - geode_client_connection_errors_total
// - geode_client_pool_connections_active
Python Client Metrics
from geode_client import Client, enable_telemetry
# Enable telemetry export
enable_telemetry(
exporter="prometheus",
port=9091
)
# Metrics tracked:
# - Query count and duration
# - Connection pool statistics
# - Error rates by type
Real-Time Monitoring
Stream real-time metrics for operational dashboards.
WebSocket Metrics Stream
# Subscribe to real-time metrics
wscat -c ws://localhost:3141/metrics/stream
# Receives JSON messages:
{
"timestamp": "2026-01-24T14:30:00Z",
"queries_per_second": 1250,
"avg_latency_ms": 15,
"active_connections": 42,
"memory_usage_mb": 2048
}
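A consumer of this stream typically smooths the per-message values before acting on them. An illustrative message handler (the `MetricsStream` class is hypothetical, and a real subscription would additionally need a WebSocket client library):

```python
import json
from collections import deque

class MetricsStream:
    """Maintain a rolling average over streamed metrics messages
    shaped like the JSON sample above."""

    def __init__(self, window=60):
        self.latency = deque(maxlen=window)

    def on_message(self, raw):
        msg = json.loads(raw)
        self.latency.append(msg["avg_latency_ms"])
        return sum(self.latency) / len(self.latency)

stream = MetricsStream(window=3)
for raw in ('{"avg_latency_ms": 15}',
            '{"avg_latency_ms": 25}',
            '{"avg_latency_ms": 35}'):
    rolling = stream.on_message(raw)
print(rolling)  # 25.0
```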
Metrics API
# HTTP endpoint for current metrics
curl http://localhost:3141/metrics/current
{
"timestamp": "2026-01-24T14:30:00Z",
"queries": {
"total": 12345678,
"rate_per_second": 1250,
"avg_duration_ms": 15,
"p95_duration_ms": 45,
"p99_duration_ms": 120
},
"connections": {
"active": 42,
"idle": 8,
"total": 50
},
"resources": {
"memory_used_mb": 2048,
"memory_total_mb": 4096,
"cpu_percent": 45,
"disk_used_gb": 120
}
}
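The JSON snapshot maps directly onto the utilization ratios used in the dashboards and alerts above. A small sketch (the `utilization` helper is hypothetical) computing them from a fetched snapshot:

```python
def utilization(snapshot):
    """Compute headline utilization percentages from a
    /metrics/current snapshot."""
    conns = snapshot["connections"]
    res = snapshot["resources"]
    return {
        "pool_pct": 100.0 * conns["active"] / conns["total"],
        "memory_pct": 100.0 * res["memory_used_mb"] / res["memory_total_mb"],
    }

# Values taken from the sample response above
snapshot = {
    "connections": {"active": 42, "idle": 8, "total": 50},
    "resources": {"memory_used_mb": 2048, "memory_total_mb": 4096,
                  "cpu_percent": 45},
}
print(utilization(snapshot))  # {'pool_pct': 84.0, 'memory_pct': 50.0}
```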
Debugging Tools
Built-in tools for troubleshooting production issues.
Slow Query Log
[logging]
slow_query_threshold_ms = 1000
slow_query_log = "/var/log/geode/slow-queries.log"
Logs queries exceeding threshold:
{
"timestamp": "2026-01-24T14:30:00Z",
"duration_ms": 2340,
"query": "MATCH (u:User {id: $user_id})-[:FRIENDS_WITH*3..5]->(f) RETURN f",
"parameters": {"user_id": "user-123"},
"rows_returned": 50000,
"execution_plan": "..."
}
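Since the slow-query log is one JSON object per line, ranking offenders is a short script. An illustrative sketch that groups entries by query text and sorts by total time spent (`worst_queries` is a hypothetical helper):

```python
import json
from collections import Counter, defaultdict

def worst_queries(lines, top=3):
    """Rank slow-log entries by total duration per query string."""
    totals = defaultdict(int)
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        totals[entry["query"]] += entry["duration_ms"]
        counts[entry["query"]] += 1
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(q, total, counts[q]) for q, total in ranked[:top]]

log = [
    '{"query": "MATCH (u:User)-[:FRIENDS_WITH*3..5]->(f) RETURN f", "duration_ms": 2340}',
    '{"query": "MATCH (p:Product) RETURN p", "duration_ms": 1200}',
    '{"query": "MATCH (u:User)-[:FRIENDS_WITH*3..5]->(f) RETURN f", "duration_ms": 1800}',
]
print(worst_queries(log))
```

Ranking by total time rather than single worst duration surfaces queries that are only moderately slow but run constantly.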
Connection Debugging
# List active connections
./geode admin connections list
# Connection details
./geode admin connections show <connection-id>
# Kill stuck connection
./geode admin connections kill <connection-id>
Query Debugging
# Explain query execution plan
./geode shell --explain <<< "MATCH (u:User) WHERE u.email = '[email protected]' RETURN u"
# Profile query performance
./geode shell --profile <<< "MATCH (u:User)-[:PURCHASED]->(p:Product) RETURN p"
Best Practices
Set Appropriate Alert Thresholds
Tune thresholds based on baseline performance:
- Latency alerts: p99 > 2x baseline
- Error rate alerts: >1% for critical queries
- Resource alerts: >80% utilization
Sample Traces Intelligently
Balance observability with overhead:
- Production: 1-10% sampling
- Staging: 50-100% sampling
- Development: 100% sampling
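For the sampling decision itself, hashing the trace ID rather than rolling a fresh random number keeps the decision consistent across services, which is the approach OpenTelemetry's ratio-based samplers take. A minimal sketch of the idea:

```python
import zlib

def should_sample(trace_id, rate):
    """Consistent probability sampler: the decision depends only on
    the trace ID, so every service handling a request makes the same
    keep/drop choice for that trace."""
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < int(rate * 10_000)

# rate=1.0 keeps everything; rate=0.0 keeps nothing
print(should_sample("trace-abc123", 1.0),
      should_sample("trace-abc123", 0.0))  # True False
```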
Retain Metrics Appropriately
Configure retention policies:
- Real-time metrics: 1-7 days (high resolution)
- Historical metrics: 90 days (downsampled)
- Long-term trends: 1 year (aggregated)
Monitor Client and Server
Track metrics on both sides:
- Server metrics: Query performance, resource usage
- Client metrics: Connection pool, request latency
- End-to-end tracing: Full request lifecycle
Comprehensive telemetry enables proactive monitoring, rapid troubleshooting, and data-driven optimization of Geode deployments. Integration with standard observability tools like Prometheus, Grafana, and OpenTelemetry provides seamless operational visibility.