Prometheus is the industry-standard monitoring solution for cloud-native applications, and Geode provides first-class Prometheus integration. Geode exposes comprehensive metrics covering queries, transactions, connections, storage, memory, and system health through a Prometheus-compatible /metrics endpoint.
By integrating Geode with Prometheus, you gain real-time visibility into database performance, can alert on critical conditions, and can build dashboards for operational insight. Combined with Grafana for visualization, Prometheus and Geode form a powerful observability stack for production graph database deployments.
This guide covers Prometheus integration patterns, essential metrics, alerting rules, and visualization strategies for monitoring Geode effectively.
Prometheus Architecture with Geode
Prometheus uses a pull-based model where it periodically scrapes metrics from instrumented applications. Geode exposes metrics at http://geode-host:8080/metrics in Prometheus exposition format:
# HELP geode_queries_total Total number of queries executed
# TYPE geode_queries_total counter
geode_queries_total{status="success"} 125847
geode_queries_total{status="error"} 342
# HELP geode_query_duration_seconds Query execution duration
# TYPE geode_query_duration_seconds histogram
geode_query_duration_seconds_bucket{le="0.01"} 45234
geode_query_duration_seconds_bucket{le="0.05"} 89432
geode_query_duration_seconds_bucket{le="0.1"} 112847
geode_query_duration_seconds_bucket{le="+Inf"} 125847
geode_query_duration_seconds_sum 12847.3
geode_query_duration_seconds_count 125847
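These two metric families already support useful derived statistics. As a rough sketch (using the sample values above copied by hand; a real consumer would scrape /metrics or query Prometheus instead), the counter yields an error ratio and the histogram a mean latency:

```python
# Derived statistics from the sample exposition output above.
# The numbers are the example values, not live data.
success = 125847            # geode_queries_total{status="success"}
errors = 342                # geode_queries_total{status="error"}
duration_sum = 12847.3      # geode_query_duration_seconds_sum
duration_count = 125847     # geode_query_duration_seconds_count

error_ratio = errors / (success + errors)
mean_latency = duration_sum / duration_count  # seconds

print(f"error ratio:  {error_ratio:.4%}")
print(f"mean latency: {mean_latency * 1000:.1f} ms")
```

Note that the mean hides tail behavior; the histogram buckets exist precisely so you can compute percentiles instead.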
Configuring Prometheus Scraping
Basic Prometheus Configuration:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'geode-production'

scrape_configs:
  - job_name: 'geode'
    static_configs:
      - targets: ['localhost:8080']
        labels:
          instance: 'geode-primary'
          environment: 'production'
    scrape_interval: 10s
    scrape_timeout: 5s
    metrics_path: '/metrics'
Multi-Instance Configuration:
scrape_configs:
  - job_name: 'geode-cluster'
    static_configs:
      - targets:
          - 'geode-node-1:8080'
          - 'geode-node-2:8080'
          - 'geode-node-3:8080'
        labels:
          cluster: 'geode-prod'
          datacenter: 'us-east-1'
Kubernetes Service Discovery:
scrape_configs:
  - job_name: 'geode-k8s'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - geode
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: geode
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: instance
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
Essential Geode Metrics
Query Performance Metrics:
# Query rate by status
rate(geode_queries_total[5m])
# Average query duration (p50, p95, p99)
histogram_quantile(0.50, rate(geode_query_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(geode_query_duration_seconds_bucket[5m]))
# Slow query rate (queries > 1s)
rate(geode_slow_queries_total[5m])
# Query errors by type
sum by (error_type) (rate(geode_query_errors_total[5m]))
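The histogram_quantile calls above estimate percentiles by linear interpolation within the bucket that contains the target rank. A minimal Python sketch of that estimation, fed the cumulative bucket counts from the sample /metrics output earlier (the +Inf bucket equals the total count):

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    mirroring the linear interpolation Prometheus applies."""
    buckets = sorted(buckets)
    total = buckets[-1][1]          # count in the +Inf bucket
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: Prometheus returns the
                # highest finite bucket boundary.
                return prev_bound
            # Interpolate linearly within this bucket
            return prev_bound + (bound - prev_bound) * \
                   (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Cumulative buckets from the sample exposition output
buckets = [(0.01, 45234), (0.05, 89432), (0.1, 112847), (float("inf"), 125847)]
p50 = histogram_quantile(0.50, buckets)
print(f"p50 ≈ {p50 * 1000:.1f} ms")
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the rank lands in.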
Transaction Metrics:
# Transaction commit rate
rate(geode_transactions_total{status="committed"}[5m])
# Transaction rollback rate
rate(geode_transactions_total{status="rolled_back"}[5m])
# Transaction conflict rate
rate(geode_transaction_conflicts_total[5m])
# Active transactions
geode_active_transactions
# Transaction duration
histogram_quantile(0.95, rate(geode_transaction_duration_seconds_bucket[5m]))
Connection Metrics:
# Active connections
geode_active_connections
# Connection pool utilization
geode_active_connections / geode_max_connections * 100
# Connection error rate
rate(geode_connection_errors_total[5m])
# Connections by client type
sum by (client_type) (geode_active_connections)
Memory and Resource Metrics:
# Memory usage
geode_memory_used_bytes
# Memory usage percentage
geode_memory_used_bytes / geode_memory_total_bytes * 100
# Cache hit rate
rate(geode_cache_hits_total[5m]) /
(rate(geode_cache_hits_total[5m]) + rate(geode_cache_misses_total[5m]))
# MVCC version count
geode_mvcc_versions_total
# Garbage collection time
rate(geode_gc_duration_seconds_sum[5m])
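The cache hit-rate expression divides two per-second rates. The underlying arithmetic on raw counter samples looks like this (a sketch with made-up sample values, not a real scrape):

```python
# Two scrapes of the cache counters, 60 seconds apart (synthetic values).
hits_t0, hits_t1 = 1_000_000, 1_011_400      # geode_cache_hits_total
misses_t0, misses_t1 = 50_000, 50_600        # geode_cache_misses_total
interval = 60.0

# Per-second increase, the core of what rate(...) computes
hit_rate = (hits_t1 - hits_t0) / interval
miss_rate = (misses_t1 - misses_t0) / interval

cache_hit_ratio = hit_rate / (hit_rate + miss_rate)
print(f"cache hit ratio: {cache_hit_ratio:.2%}")
```

Dividing rates rather than raw counters is what makes the ratio reflect recent behavior instead of the all-time average; real rate() also handles counter resets, which this sketch omits.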
Storage Metrics:
# Disk space used
geode_disk_used_bytes
# Disk space available
geode_disk_free_bytes
# WAL size
geode_wal_size_bytes
# Disk I/O operations
rate(geode_disk_io_operations_total[5m])
# Checkpoint duration
geode_checkpoint_duration_seconds
Index Metrics:
# Index size
geode_index_size_bytes
# Index lookups
rate(geode_index_lookups_total[5m])
# Index hit rate
rate(geode_index_hits_total[5m]) / rate(geode_index_lookups_total[5m])
Alerting Rules
Critical Alerts:
groups:
  - name: geode_critical
    interval: 30s
    rules:
      - alert: GeodeDown
        expr: up{job="geode"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Geode instance {{ $labels.instance }} is down"
          description: "Geode has been unreachable for more than 1 minute"

      - alert: HighQueryErrorRate
        expr: |
          rate(geode_queries_total{status="error"}[5m]) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High query error rate on {{ $labels.instance }}"
          description: "Query error rate is {{ $value }} errors/sec"

      - alert: DiskSpaceCritical
        expr: |
          geode_disk_free_bytes / geode_disk_total_bytes < 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      - alert: MemoryExhaustion
        expr: |
          geode_memory_used_bytes / geode_memory_total_bytes > 0.95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Memory near exhaustion on {{ $labels.instance }}"
          description: "Memory usage at {{ $value | humanizePercentage }}"
Warning Alerts:
- name: geode_warnings
  interval: 1m
  rules:
    - alert: SlowQueryRateIncreasing
      expr: |
        rate(geode_slow_queries_total[5m]) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Slow query rate increasing on {{ $labels.instance }}"
        description: "{{ $value }} slow queries per second"

    - alert: HighTransactionConflicts
      expr: |
        rate(geode_transaction_conflicts_total[5m]) > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High transaction conflict rate"
        description: "{{ $value }} conflicts per second"

    - alert: ConnectionPoolPressure
      expr: |
        geode_active_connections / geode_max_connections > 0.8
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Connection pool pressure on {{ $labels.instance }}"
        description: "{{ $value | humanizePercentage }} of connections used"

    - alert: LongRunningQueries
      expr: |
        geode_active_queries{duration_seconds=">60"} > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Multiple long-running queries detected"
        description: "{{ $value }} queries running for >60 seconds"

    - alert: WalSizeGrowing
      expr: |
        deriv(geode_wal_size_bytes[30m]) > 1e6
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "WAL size growing rapidly on {{ $labels.instance }}"
        description: "WAL growing at {{ $value | humanize1024 }}B/sec"
Grafana Dashboard Queries
Query Performance Dashboard:
# Panel: Query Rate
sum(rate(geode_queries_total[5m])) by (status)
# Panel: Query Latency Percentiles
histogram_quantile(0.50, sum(rate(geode_query_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(geode_query_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(geode_query_duration_seconds_bucket[5m])) by (le))
# Panel: Slowest Query Types (p99, assuming buckets carry a query_type label)
topk(10, histogram_quantile(0.99, sum by (query_type, le) (rate(geode_query_duration_seconds_bucket[5m]))))
# Panel: Query Errors by Type
sum by (error_type) (rate(geode_query_errors_total[5m]))
System Health Dashboard:
# Panel: CPU Usage
rate(geode_cpu_seconds_total[5m]) * 100
# Panel: Memory Usage
geode_memory_used_bytes / geode_memory_total_bytes * 100
# Panel: Disk Usage
geode_disk_used_bytes / geode_disk_total_bytes * 100
# Panel: Active Connections
geode_active_connections
Transaction Dashboard:
# Panel: Transaction Throughput
rate(geode_transactions_total{status="committed"}[5m])
# Panel: Transaction Success Rate
sum(rate(geode_transactions_total{status="committed"}[5m])) /
sum(rate(geode_transactions_total[5m])) * 100
# Panel: Transaction Conflicts
rate(geode_transaction_conflicts_total[5m])
# Panel: Transaction Duration Heatmap
sum(rate(geode_transaction_duration_seconds_bucket[5m])) by (le)
Recording Rules for Performance
Pre-aggregate expensive queries using recording rules:
groups:
  - name: geode_recordings
    interval: 15s
    rules:
      - record: job:geode_query_rate:5m
        expr: sum(rate(geode_queries_total[5m])) by (job, instance)
      - record: job:geode_query_latency_p95:5m
        expr: histogram_quantile(0.95, sum(rate(geode_query_duration_seconds_bucket[5m])) by (job, instance, le))
      - record: job:geode_memory_usage_percent:current
        expr: |
          geode_memory_used_bytes / geode_memory_total_bytes * 100
      - record: job:geode_cache_hit_rate:5m
        expr: |
          rate(geode_cache_hits_total[5m]) /
          (rate(geode_cache_hits_total[5m]) + rate(geode_cache_misses_total[5m]))
Advanced Monitoring Patterns
Predict Disk Fullness:
predict_linear(geode_disk_free_bytes[1h], 4 * 3600) < 0
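predict_linear fits a least-squares line through the samples in the range and extrapolates it forward. A rough Python equivalent of the disk-fullness check, run against synthetic samples standing in for an hour of geode_disk_free_bytes:

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares extrapolation, analogous to PromQL predict_linear.
    samples: list of (timestamp_seconds, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept

# Disk free space shrinking by ~1 MB/s over the last hour (synthetic data)
samples = [(t, 10e9 - 1e6 * t) for t in range(0, 3600, 15)]
projected = predict_linear(samples, 4 * 3600)   # 4 hours ahead
print("alert!" if projected < 0 else "ok",
      f"projected free: {projected / 1e9:.1f} GB")
```

A projection below zero within the lookahead window is exactly the condition the PromQL alert fires on; Prometheus additionally anchors the prediction at the query's evaluation time, which this sketch approximates with the last sample.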
Detect Anomalies in Query Rate:
abs(rate(geode_queries_total[5m]) -
avg_over_time(rate(geode_queries_total[5m])[1h:5m])) >
2 * stddev_over_time(rate(geode_queries_total[5m])[1h:5m])
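The same idea in plain Python: flag the current rate when it deviates from the recent mean by more than two standard deviations (synthetic rate history, not real scrape data):

```python
import statistics

# One hour of 5-minute query-rate samples (synthetic), then the current rate
history = [420, 435, 428, 441, 418, 430, 425, 438, 422, 433, 429, 436]
current = 610

mean = statistics.fmean(history)
stddev = statistics.stdev(history)

# Two-sigma rule, matching the PromQL expression above
is_anomaly = abs(current - mean) > 2 * stddev
print(f"mean={mean:.0f} stddev={stddev:.1f} anomaly={is_anomaly}")
```

Two sigma is a starting point, not a law; tighten or loosen the multiplier based on how noisy your baseline window is.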
Correlate Query Performance with Memory Pressure:
rate(geode_query_duration_seconds_sum[5m]) /
rate(geode_query_duration_seconds_count[5m])
and
geode_memory_used_bytes / geode_memory_total_bytes > 0.8
Federation for Multi-Cluster Monitoring
Monitor multiple Geode clusters from a central Prometheus:
# Central Prometheus configuration
scrape_configs:
  - job_name: 'federate-us-east'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="geode"}'
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'
        labels:
          region: 'us-east-1'

  - job_name: 'federate-us-west'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="geode"}'
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
        labels:
          region: 'us-west-2'
Advanced Prometheus Patterns
Dynamic Service Discovery
For dynamic Geode clusters, use service discovery:
# Consul service discovery
scrape_configs:
  - job_name: 'geode-consul'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['geode']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_tags]
        regex: '.*,environment=([^,]+),.*'
        target_label: environment
Custom Exporters
Create custom exporters for Geode-specific metrics:
# geode_exporter.py
from prometheus_client import start_http_server, Gauge
import asyncio
from geode_client import Client

# Define custom metrics
active_queries_gauge = Gauge('geode_custom_active_queries',
                             'Currently executing queries',
                             ['query_type'])
cache_memory_gauge = Gauge('geode_custom_cache_memory_bytes',
                           'Memory used by query cache')

async def collect_custom_metrics(client):
    """Collect custom Geode metrics in a loop."""
    while True:
        # Active queries by type
        result, _ = await client.query("""
            CALL system.active_queries()
            RETURN query_type, count(*) AS count
        """)
        for row in result.rows:
            active_queries_gauge.labels(
                query_type=row['query_type']
            ).set(row['count'])

        # Cache memory usage
        cache_stats, _ = await client.query("""
            CALL system.cache_stats()
            RETURN memory_bytes
        """)
        cache_memory_gauge.set(cache_stats.rows[0]['memory_bytes'])

        await asyncio.sleep(15)  # Collection interval

if __name__ == '__main__':
    start_http_server(9090)  # Exporter port Prometheus scrapes
    client = Client("localhost:3141")
    asyncio.run(collect_custom_metrics(client))
Metric Relabeling
Transform metrics during scraping:
scrape_configs:
  - job_name: 'geode'
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # Drop high-cardinality debug metrics in production
      - source_labels: [__name__]
        regex: 'geode_debug_.*'
        action: drop
      # Rename legacy metrics
      - source_labels: [__name__]
        regex: 'geode_old_metric_name'
        target_label: __name__
        replacement: 'geode_new_metric_name'
      # Attach a cluster label to instance-level metrics
      # (relabeling adds labels; it does not aggregate series)
      - source_labels: [instance]
        regex: 'geode-node-.*'
        target_label: cluster
        replacement: 'geode-production'
Alert Management
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'team-database'
  routes:
    - match:
        severity: critical
      receiver: 'team-database-pager'
      continue: true
    - match:
        severity: warning
      receiver: 'team-database-slack'

receivers:
  - name: 'team-database'
    email_configs:
      - to: '[email protected]'
  - name: 'team-database-pager'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
  - name: 'team-database-slack'
    slack_configs:
      - channel: '#database-alerts'
        title: 'Geode Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Alert Inhibition Rules
Prevent alert storms with inhibition:
# alertmanager.yml (continued)
inhibit_rules:
  # Suppress warning alerts on an instance that is already down
  - source_match:
      severity: 'critical'
      alertname: 'GeodeDown'
    target_match:
      severity: 'warning'
    equal: ['instance']
  # Don't alert on slow queries while the error-rate alert is already firing
  - source_match:
      alertname: 'HighQueryErrorRate'
    target_match:
      alertname: 'SlowQueryRateIncreasing'
    equal: ['instance']
Alert Templates
Create informative alert messages:
groups:
  - name: geode_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(geode_queries_total{status="error"}[5m]) > 10
        annotations:
          summary: "High query error rate on {{ $labels.instance }}"
          description: |
            Query error rate is {{ $value | humanize }} errors/sec.
            Current stats:
            - Instance: {{ $labels.instance }}
            - Error rate: {{ $value | humanize }}/sec
          dashboard: "https://grafana.example.com/d/geode-errors"
          runbook_url: "https://docs.geodedb.com/runbooks/high-error-rate"
Grafana Integration
Provisioning Dashboards
# grafana/provisioning/dashboards/geode.yml
apiVersion: 1

providers:
  - name: 'Geode'
    orgId: 1
    folder: 'Database'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards/geode
Dashboard Variables
Create dynamic dashboards with variables:
{
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(up{job=\"geode\"}, instance)",
        "multi": true,
        "includeAll": true
      },
      {
        "name": "time_range",
        "type": "interval",
        "query": "1m,5m,15m,30m,1h,6h,12h,1d",
        "current": {
          "text": "5m",
          "value": "5m"
        }
      }
    ]
  },
  "panels": [
    {
      "title": "Query Rate",
      "targets": [
        {
          "expr": "sum(rate(geode_queries_total{instance=~\"$instance\"}[$time_range]))"
        }
      ]
    }
  ]
}
Panel Examples
Heatmap Panel for query duration distribution:
{
  "type": "heatmap",
  "title": "Query Duration Heatmap",
  "targets": [
    {
      "expr": "sum(rate(geode_query_duration_seconds_bucket{instance=~\"$instance\"}[$time_range])) by (le)",
      "format": "heatmap",
      "legendFormat": "{{le}}"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
Status Panel for system health:
{
  "type": "stat",
  "title": "System Status",
  "targets": [
    {
      "expr": "up{job=\"geode\"}",
      "instant": true
    }
  ],
  "mappings": [
    {
      "value": 1,
      "text": "UP",
      "color": "green"
    },
    {
      "value": 0,
      "text": "DOWN",
      "color": "red"
    }
  ]
}
Advanced Querying Techniques
Quantile Aggregation
Calculate percentiles across multiple instances:
# P99 query latency across all instances
histogram_quantile(0.99,
sum(rate(geode_query_duration_seconds_bucket[5m])) by (le)
)
# P99 per instance
histogram_quantile(0.99,
sum(rate(geode_query_duration_seconds_bucket[5m])) by (le, instance)
)
Rate Calculation Windows
Choose appropriate time windows:
# Very short window (1m) - noisy but responsive
rate(geode_queries_total[1m])
# Medium window (5m) - balanced
rate(geode_queries_total[5m])
# Long window (1h) - smooth but slow to react
rate(geode_queries_total[1h])
# Use irate for instant rate (last 2 samples)
irate(geode_queries_total[5m])
Subquery Patterns
Calculate rate of rate (acceleration):
# Query acceleration (change in query rate)
deriv(
rate(geode_queries_total[5m])[10m:1m]
)
# Predict future value (use a gauge; e.g. memory usage one hour from now)
predict_linear(geode_memory_used_bytes[1h], 3600)
Monitoring at Scale
Hierarchical Federation
For large deployments with multiple regions:
# Regional Prometheus federates from local instances
- job_name: 'federate-local'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="geode"}'
  static_configs:
    - targets:
        - 'prometheus-az1:9090'
        - 'prometheus-az2:9090'
        - 'prometheus-az3:9090'

# Global Prometheus federates from regional instances
- job_name: 'federate-regional'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="geode"}'
  static_configs:
    - targets:
        - 'prometheus-us-east:9090'
        - 'prometheus-us-west:9090'
        - 'prometheus-eu-west:9090'
Remote Write for Long-Term Storage
Send metrics to long-term storage:
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
    basic_auth:
      username: 'geode-metrics'
      password: 'SECRET'
    write_relabel_configs:
      # Only send aggregate metrics for long-term storage
      - source_labels: [__name__]
        regex: 'geode_(queries_total|query_duration_seconds_bucket|memory_used_bytes)'
        action: keep
    queue_config:
      capacity: 10000
      max_shards: 10
      min_shards: 1
      max_samples_per_send: 5000
Thanos for Global View
Deploy Thanos for unlimited retention and global querying:
# Prometheus global config (external labels identify this instance to Thanos)
global:
  external_labels:
    cluster: 'geode-production'
    region: 'us-east-1'

# Thanos sidecar flags
--tsdb.path=/prometheus
--objstore.config-file=/etc/thanos/bucket.yml
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
Best Practices
Appropriate Scrape Intervals: Balance between data resolution and storage overhead. 15-30 seconds works well for most deployments.
Use Recording Rules: Pre-compute expensive queries to reduce dashboard load times.
Set Alert Thresholds Based on Baselines: Monitor normal behavior before setting alert thresholds to minimize false positives.
Implement Alert Routing: Use Alertmanager to route alerts to appropriate teams and communication channels.
Retain Historical Data: Configure appropriate retention periods to support trend analysis and capacity planning.
Monitor Prometheus Itself: Ensure your monitoring system is healthy and performant.
Label Consistently: Use consistent label naming across all metrics for easier querying and aggregation.
Dashboard Organization: Create focused dashboards for different personas (developers, operators, executives).
Metric Naming: Follow Prometheus naming conventions (unit suffix, descriptive names).
Cardinality Management: Keep label cardinality bounded to prevent memory issues.
Troubleshooting Common Issues
Missing Metrics: Verify Geode metrics endpoint is accessible and Prometheus can reach it.
High Cardinality: Avoid labels with unbounded values (e.g., user IDs, session IDs).
Storage Issues: Monitor Prometheus disk usage and adjust retention policies as needed.
Slow Queries: Use recording rules to pre-aggregate complex queries used in dashboards.
Scrape Failures: Check network connectivity, TLS certificates, and authentication.
Out of Memory: Reduce retention period, use recording rules, or add more RAM.
Slow Dashboards: Optimize queries, use recording rules, reduce time ranges.
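For the high-cardinality case, it helps to measure which metric families carry the most series. A quick sketch that counts series per metric name in an exposition-format payload (an inline sample here; in practice you would feed it the body of a GET /metrics response):

```python
from collections import Counter

def series_per_metric(exposition_text):
    """Count time series per metric family in Prometheus exposition text."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Metric name ends at '{' (labeled series) or at the first space
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts

sample = """\
geode_queries_total{status="success"} 125847
geode_queries_total{status="error"} 342
geode_active_connections 42
geode_query_duration_seconds_bucket{le="0.01"} 45234
geode_query_duration_seconds_bucket{le="+Inf"} 125847
"""
for name, n in series_per_metric(sample).most_common():
    print(name, n)
```

A metric family whose series count grows with users, sessions, or request IDs is the usual cardinality culprit; fix it with relabeling drops or by removing the offending label at the source.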
Production Deployment Checklist
- Scrape intervals configured appropriately
- Recording rules created for expensive queries
- Alert rules defined and tested
- Alertmanager configured with proper routing
- Dashboards created for key metrics
- Retention period set based on requirements
- Backup strategy for Prometheus data
- Monitoring of Prometheus itself
- High availability setup (if required)
- Documentation of metrics and alerts
- Runbooks created for common alerts
- Load testing of monitoring stack
Related Topics
- Monitoring and Observability
- Metrics and Performance
- Observability Best Practices
- Grafana Dashboards
- Alerting Strategies
- Performance Tuning
Further Reading
- Prometheus Configuration Guide
- Grafana Dashboard Templates for Geode
- Alert Runbook Examples
- Performance Tuning with Metrics
- Multi-Cluster Monitoring Patterns
- Prometheus Best Practices
- Recording Rules Guide
- Federation Patterns