Overview
Geode provides comprehensive telemetry capabilities for production observability, including streaming paging events, Prometheus metrics integration, QUIC transport metrics, and customizable monitoring dashboards.
Version: v0.1.1+ (includes QUIC+TLS metrics)
What You’ll Learn
- How to enable and configure streaming telemetry
- Prometheus metrics integration and scraping
- Grafana dashboard setup and customization
- QUIC transport metrics monitoring
- Custom metric instrumentation
- Alerting and incident response patterns
Prerequisites
- Geode v0.1.1+ installed
- Basic understanding of metrics and monitoring
- Prometheus and Grafana (optional, for integration sections)
Streaming Telemetry
Paging Events
Geode can emit optional paging telemetry as JSON Lines on stderr. This feature is disabled by default to avoid noisy logs.
Status: Optional, single-node, development/ops aid
Environment Variables
Core Toggles:
# Enable pagination telemetry
export GEODE_TELEMETRY_PAGING=1
# Client-side: Capture server stderr for e2e tests
export GEODE_CAPTURE_SERVER_STDERR=1
CI/Testing Variables:
# Enable telemetry smoke test
export GEODE_CI_TELEMETRY_SMOKE=1
# Strict mode: Fail if telemetry missing
export GEODE_CI_TELEMETRY_STRICT=1
# Safety: Avoid vendor teardown crashes in short-lived tests
export GEODE_QUIC_PASSIVE_TEARDOWN=1
Event Shape
Each page emission produces a single JSON line on stderr:
{
  "ts": "1758732000",
  "level": "INFO",
  "component": "server",
  "type": "TELEMETRY",
  "event": "PULL_PAGE",
  "page_size": "1000",
  "rows_emitted": "1000",
  "final": "false",
  "page": { "index": 0, "size": 1000 },
  "ordered": true,
  "order_keys": ["timestamp"],
  "request_id": "550e8400-e29b-41d4-a716-446655440000"
}
Field Descriptions:
- type: Always TELEMETRY for these events
- event: Always PULL_PAGE for paging emissions
- page_size: Requested page size (stringified)
- rows_emitted: Actual rows in this page (stringified)
- final: true if this is the last page, else false
- ordered: Whether the result set is ordered
- order_keys: Keys used for ordering
- request_id: Unique identifier for the request
Emission Location: src/server/main.zig (guarded by state.telemetry_paging)
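Because each record is a single JSON object per line, a captured server.log can be post-processed with ordinary tooling. A minimal sketch (an illustrative helper, not part of Geode) that tallies PULL_PAGE events:

```python
import json

def summarize_paging(log_lines):
    """Tally PULL_PAGE telemetry events from JSON Lines log output."""
    pages = 0
    rows = 0
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON log lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("type") == "TELEMETRY" and event.get("event") == "PULL_PAGE":
            pages += 1
            rows += int(event.get("rows_emitted", "0"))  # values are stringified
    return {"pages": pages, "rows_emitted": rows}
```

For example, `summarize_paging(open("server.log"))` after a telemetry-enabled run.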
Usage Examples
Server with Telemetry:
# Start server with paging telemetry enabled
GEODE_TELEMETRY_PAGING=1 \
geode serve \
--listen 127.0.0.1:7567 \
--log-json \
--result-format json \
2>server.log &
PID=$!
# Run queries
geode query "RETURN 1 AS x ORDER BY x LIMIT 1" --format json
# Check telemetry
kill $PID
grep '"event":"PULL_PAGE"' server.log
Client with Server Stderr Capture:
GEODE_CAPTURE_SERVER_STDERR=1 \
geode query "RETURN 1 AS x ORDER BY x LIMIT 1" --format json
Helper Script:
# Quick local validation
make build && bash scripts/telemetry-smoke.sh 1
The script:
- Builds zig-out/bin/geode
- Starts the server with GEODE_TELEMETRY_PAGING=1
- Runs an ordered+limited query
- Prints telemetry lines containing "event":"PULL_PAGE"
- Cleans up the server process
Testing Telemetry
CANARY Integration:
- Test: TestCANARY_REQ_GQL_090_TelemetryPagingPull
- Requirement: REQ-GQL-090 - Streaming Telemetry, Paging Events
- Status: TESTED
Smoke Tests:
# Basic smoke test (non-fatal)
GEODE_TELEMETRY_PAGING=1 \
GEODE_CI_TELEMETRY_SMOKE=1 \
zig test test_telemetry_smoke.zig
# Strict mode (fails if telemetry missing)
make quic-smoke-strict
# Convenience (non-fatal)
make quic-smoke
QUIC Transport Metrics
New in v0.1.1: QUIC+TLS transport telemetry replaces TCP metrics.
QUIC Metrics
Handshake Metrics:
geode_quic_handshake_duration_seconds{quantile="0.5"} 0.015
geode_quic_handshake_duration_seconds{quantile="0.95"} 0.045
geode_quic_handshake_duration_seconds{quantile="0.99"} 0.120
geode_quic_handshake_success_total{} 1234
geode_quic_handshake_failure_total{} 5
Stream Multiplexing:
geode_quic_active_streams{} 42
geode_quic_stream_create_total{} 5678
geode_quic_stream_close_total{} 5636
geode_quic_max_concurrent_streams{} 100
Loss Recovery:
geode_quic_packet_loss_rate{} 0.001
geode_quic_rtt_milliseconds{quantile="0.5"} 12.5
geode_quic_rtt_milliseconds{quantile="0.95"} 45.0
geode_quic_retransmit_total{} 23
geode_quic_congestion_events_total{} 3
Data Transfer:
geode_quic_bytes_sent_total{} 12345678
geode_quic_bytes_received_total{} 98765432
geode_quic_throughput_mbps{} 125.5
TLS Metrics
geode_tls_version{version="1.3"} 1
geode_tls_cipher_suite{suite="TLS_AES_256_GCM_SHA384"} 1
geode_tls_handshake_duration_seconds{quantile="0.95"} 0.035
geode_tls_session_reuse_total{} 456
geode_tls_cert_verification_errors_total{} 0
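Metrics in this form follow the Prometheus text exposition format, so a quick health check doesn't require a full Prometheus deployment. A rough sketch (hypothetical helpers, not a Geode API; it handles only the simple `name{labels} value` line shape shown above):

```python
def parse_metrics(text):
    """Parse simple 'name{labels} value' Prometheus exposition lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metrics[series] = float(value)
    return metrics

def handshake_success_rate(metrics):
    """Fraction of QUIC handshakes that succeeded, or None if no data."""
    ok = metrics.get("geode_quic_handshake_success_total{}", 0.0)
    fail = metrics.get("geode_quic_handshake_failure_total{}", 0.0)
    total = ok + fail
    return ok / total if total else None
```

Point it at the body of a scrape (e.g. `parse_metrics(urllib.request.urlopen(url).read().decode())`) and alert on a rate that drops below your threshold.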
Prometheus Integration
Server Configuration
Enable Prometheus metrics endpoint:
# geode.yaml
monitoring:
  prometheus:
    enable: true
    port: 9090
    path: /metrics
    interval: 15s
Command-Line:
geode serve \
--prometheus-enable \
--prometheus-port 9090 \
--prometheus-path /metrics
Environment Variables:
export GEODE_PROMETHEUS_ENABLE=1
export GEODE_PROMETHEUS_PORT=9090
export GEODE_PROMETHEUS_PATH=/metrics
Prometheus Configuration
Add Geode as a scrape target in prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'geode'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'geode-primary'
          environment: 'production'
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: '/metrics'

  - job_name: 'geode-federation'
    static_configs:
      - targets:
          - 'geode-shard-1:9090'
          - 'geode-shard-2:9090'
          - 'geode-shard-3:9090'
    scrape_interval: 30s
Available Metrics
Query Metrics:
geode_queries_total{status="success"} 12345
geode_queries_total{status="error"} 23
geode_query_duration_seconds{quantile="0.5"} 0.015
geode_query_duration_seconds{quantile="0.95"} 0.250
geode_query_duration_seconds{quantile="0.99"} 1.500
geode_slow_queries_total{threshold="1s"} 45
Connection Metrics:
geode_active_connections{} 42
geode_connection_create_total{} 1234
geode_connection_close_total{} 1192
geode_connection_errors_total{} 5
geode_connection_pool_size{type="max"} 5000
geode_connection_pool_size{type="current"} 42
Transaction Metrics:
geode_transactions_active{} 8
geode_transactions_total{status="committed"} 5678
geode_transactions_total{status="rolled_back"} 12
geode_transaction_duration_seconds{quantile="0.95"} 2.5
geode_transaction_conflicts_total{} 3
geode_deadlocks_detected_total{} 1
Storage Metrics:
geode_storage_size_bytes{type="data"} 10737418240
geode_storage_size_bytes{type="wal"} 536870912
geode_storage_size_bytes{type="index"} 2147483648
geode_wal_sync_duration_seconds{quantile="0.95"} 0.005
geode_checkpoint_duration_seconds{quantile="0.95"} 5.0
geode_page_faults_total{} 1234
Memory Metrics:
geode_memory_used_bytes{} 4294967296
geode_memory_limit_bytes{} 17179869184
geode_memory_usage_percent{} 25.0
geode_buffer_pool_size_bytes{} 8589934592
geode_buffer_pool_hit_rate{} 0.98
Index Metrics:
geode_index_operations_total{type="insert"} 12345
geode_index_operations_total{type="delete"} 234
geode_index_operations_total{type="search"} 56789
geode_index_size_bytes{name="hnsw_embeddings"} 1073741824
geode_hnsw_search_duration_seconds{quantile="0.95"} 0.012
Graph Metrics:
geode_node_count{label="Person"} 1000000
geode_node_count{label="Product"} 50000
geode_relationship_count{type="KNOWS"} 5000000
geode_relationship_count{type="PURCHASED"} 250000
geode_graph_size_nodes_total{} 1050000
geode_graph_size_relationships_total{} 5250000
Session Metrics:
geode_active_sessions{} 42
geode_session_create_total{} 1234
geode_session_timeout_total{} 5
geode_avg_session_duration_seconds{} 180.5
geode_session_parameters_total{} 156
Grafana Dashboards
Installation
- Install Grafana:
# Docker
docker run -d \
--name=grafana \
-p 3000:3000 \
grafana/grafana
# Or use package manager
sudo apt-get install -y grafana
sudo systemctl start grafana-server
Access Grafana: http://localhost:3000 (default: admin/admin)
Add Prometheus Data Source:
- Configuration → Data Sources → Add data source
- Select Prometheus
- URL: http://localhost:9091 (Prometheus server)
- Save & Test
Pre-Built Dashboards
Geode includes pre-built Grafana dashboards:
Location: monitoring/grafana/dashboards/
Available Dashboards:
- geode-overview.json - System overview
- geode-queries.json - Query performance
- geode-transactions.json - Transaction monitoring
- geode-storage.json - Storage and I/O
- geode-quic.json - QUIC transport metrics (v0.1.1+)
Import Dashboard:
- Dashboards → Import
- Upload JSON file or paste JSON
- Select Prometheus data source
- Import
Custom Dashboard Examples
System Overview Dashboard
Key Panels:
{
  "dashboard": {
    "title": "Geode System Overview",
    "panels": [
      {
        "title": "Query Rate",
        "type": "graph",
        "targets": [{
          "expr": "rate(geode_queries_total[5m])",
          "legendFormat": "{{status}}"
        }]
      },
      {
        "title": "Active Connections",
        "type": "stat",
        "targets": [{ "expr": "geode_active_connections" }]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [{ "expr": "(geode_memory_used_bytes / geode_memory_limit_bytes) * 100" }]
      },
      {
        "title": "Query Duration (p95)",
        "type": "graph",
        "targets": [{ "expr": "geode_query_duration_seconds{quantile=\"0.95\"}" }]
      }
    ]
  }
}
QUIC Transport Dashboard (v0.1.1+)
{
  "panels": [
    {
      "title": "QUIC Handshake Duration",
      "targets": [{ "expr": "geode_quic_handshake_duration_seconds" }]
    },
    {
      "title": "Active QUIC Streams",
      "targets": [{ "expr": "geode_quic_active_streams" }]
    },
    {
      "title": "Packet Loss Rate",
      "targets": [{ "expr": "geode_quic_packet_loss_rate" }]
    },
    {
      "title": "RTT Distribution",
      "targets": [{ "expr": "geode_quic_rtt_milliseconds" }]
    }
  ]
}
Dashboard Variables
Create dynamic dashboards with variables:
{
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(geode_queries_total, instance)",
        "multi": true
      },
      {
        "name": "label",
        "type": "query",
        "query": "label_values(geode_node_count, label)"
      }
    ]
  }
}
Use in queries:
geode_node_count{instance=~"$instance", label="$label"}
Custom Metrics
Application-Level Metrics
Instrument your application code:
Go Example:
import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    userLoginCounter = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "app_user_logins_total",
            Help: "Total number of user logins",
        },
        []string{"status"},
    )
    queryLatency = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "app_query_duration_seconds",
            Help:    "Query execution duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"query_type"},
    )
)

func handleLogin(username, password string) error {
    start := time.Now()
    defer func() {
        queryLatency.WithLabelValues("user_auth").Observe(time.Since(start).Seconds())
    }()

    err := authenticateUser(username, password)
    if err != nil {
        userLoginCounter.WithLabelValues("failure").Inc()
        return err
    }
    userLoginCounter.WithLabelValues("success").Inc()
    return nil
}
Python Example:
from prometheus_client import Counter, Histogram

user_login_counter = Counter(
    'app_user_logins_total',
    'Total number of user logins',
    ['status']
)

query_latency = Histogram(
    'app_query_duration_seconds',
    'Query execution duration',
    ['query_type']
)

async def handle_login(username: str, password: str) -> bool:
    with query_latency.labels('user_auth').time():
        try:
            authenticated = await authenticate_user(username, password)
            if authenticated:
                user_login_counter.labels('success').inc()
                return True
            else:
                user_login_counter.labels('failure').inc()
                return False
        except Exception:
            user_login_counter.labels('error').inc()
            raise
Business Metrics
Track domain-specific metrics:
# Fraud detection rate
sum(rate(fraud_detected_total[5m])) / sum(rate(transactions_total[5m]))
# Recommendation click-through rate
sum(rate(recommendation_clicks_total[5m])) / sum(rate(recommendations_shown_total[5m]))
# Knowledge graph coverage
geode_node_count{label="Entity"} / geode_target_entities
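Each of these ratios divides one counter rate by another. The underlying arithmetic, sketched outside PromQL with two samples of each counter (illustrative helper names, assuming monotonic counters):

```python
def counter_rate(prev, curr, window_seconds):
    """Per-second increase of a monotonic counter, guarding against resets."""
    if window_seconds <= 0 or curr < prev:
        return 0.0
    return (curr - prev) / window_seconds

def ratio_of_rates(num_prev, num_curr, den_prev, den_curr, window_seconds=300):
    """E.g. click-through rate: clicks rate divided by impressions rate."""
    denominator = counter_rate(den_prev, den_curr, window_seconds)
    if denominator == 0.0:
        return None  # no denominator activity in the window
    return counter_rate(num_prev, num_curr, window_seconds) / denominator
```

With 60 clicks and 600 impressions over a 5-minute window, this yields a click-through rate of 0.1, matching what the PromQL ratio reports.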
Alerting
Prometheus Alerting Rules
Create alerts.yml:
groups:
  - name: geode
    interval: 30s
    rules:
      - alert: GeodeDown
        expr: up{job="geode"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Geode instance is down"
          description: "Geode instance {{ $labels.instance }} has been down for more than 5 minutes."

      - alert: HighQueryLatency
        expr: geode_query_duration_seconds{quantile="0.95"} > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High query latency detected"
          description: "P95 query latency is {{ $value }}s on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (geode_memory_used_bytes / geode_memory_limit_bytes) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      - alert: HighConnectionCount
        expr: geode_active_connections > geode_connection_pool_size{type="max"} * 0.9
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High connection count"
          description: "{{ $value }} active connections on {{ $labels.instance }}"

      - alert: TransactionDeadlock
        expr: rate(geode_deadlocks_detected_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Transaction deadlocks detected"
          description: "{{ $value }} deadlocks/sec on {{ $labels.instance }}"

      - alert: SlowQueries
        expr: rate(geode_slow_queries_total[5m]) > 10
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "High rate of slow queries"
          description: "{{ $value }} slow queries/sec on {{ $labels.instance }}"
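The for: clause in each rule means the expression must hold continuously for that duration before the alert fires; a single bad scrape doesn't page anyone. A simplified model of that pending-state logic (illustrative, not Prometheus's actual implementation):

```python
class PendingAlert:
    """Fire only after the condition holds for `for_evals` consecutive evaluations."""

    def __init__(self, for_evals):
        self.for_evals = for_evals
        self.streak = 0  # consecutive evaluations where the condition held

    def evaluate(self, condition):
        # Any false evaluation resets the pending state back to inactive.
        self.streak = self.streak + 1 if condition else 0
        return self.streak >= self.for_evals
```

With the group's 30s evaluation interval, `for: 2m` corresponds to four consecutive true evaluations.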
Alertmanager Configuration
Configure alert routing in alertmanager.yml:
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: critical
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: warnings

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: '[email protected]'
        subject: '🚨 Critical Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Geode Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'warnings'
    email_configs:
      - to: '[email protected]'
Advanced Monitoring Patterns
SLI/SLO Monitoring
Track Service Level Indicators and Objectives:
# Availability SLI (target: 99.9%)
(sum(rate(geode_queries_total{status="success"}[30d])) /
sum(rate(geode_queries_total[30d]))) * 100
# Latency SLI (target: p95 < 100ms)
geode_query_duration_seconds{quantile="0.95"} < 0.1
# Error rate (the quantity that consumes the 0.1% error budget)
1 - (sum(rate(geode_queries_total{status="success"}[30d])) /
sum(rate(geode_queries_total[30d])))
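The arithmetic behind these queries: a 99.9% target leaves a 0.1% error budget, and the budget remaining is one minus the share of it already spent. A small sketch (illustrative helpers, SLO target as a parameter):

```python
def availability(success, total):
    """Fraction of successful requests; vacuously 1.0 with no traffic."""
    return success / total if total else 1.0

def error_budget_remaining(success, total, slo=0.999):
    """1.0 = budget untouched, 0.0 = budget exhausted (clamped at zero)."""
    budget = 1.0 - slo  # allowed error fraction, e.g. 0.001
    spent = 1.0 - availability(success, total)
    return max(0.0, 1.0 - spent / budget)
```

For example, 99,950 successes out of 100,000 requests is 99.95% availability, which spends exactly half of a 99.9% SLO's budget.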
Grafana SLO Dashboard:
{
  "panels": [
    {
      "title": "Availability (30d rolling)",
      "targets": [{
        "expr": "(sum(rate(geode_queries_total{status=\"success\"}[30d])) / sum(rate(geode_queries_total[30d]))) * 100"
      }],
      "thresholds": [
        { "value": 99.9, "color": "green" },
        { "value": 99.0, "color": "yellow" },
        { "value": 0, "color": "red" }
      ]
    }
  ]
}
Capacity Planning
Monitor resource trends:
# Predict storage exhaustion (linear regression)
predict_linear(geode_storage_size_bytes{type="data"}[7d], 30 * 24 * 3600)
# Connection pool utilization trend
avg_over_time(geode_active_connections[7d]) / geode_connection_pool_size{type="max"}
# Memory growth rate
deriv(geode_memory_used_bytes[1h])
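predict_linear fits a least-squares line over the selected range and extrapolates past its end. The same computation over (timestamp_seconds, value) samples, as a stdlib-only approximation of the PromQL function:

```python
def predict_linear(samples, horizon_seconds):
    """Least-squares linear fit over (timestamp, value) samples, projected
    horizon_seconds past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var if var else 0.0   # units per second
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_seconds) + intercept
```

Given hourly storage samples of 100, 200, and 300 GB, projecting one hour ahead yields 400 GB, as the linear trend suggests.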
Anomaly Detection
Use recording rules for anomaly detection:
groups:
  - name: anomaly_detection
    interval: 1m
    rules:
      - record: query_latency_baseline
        expr: avg_over_time(geode_query_duration_seconds{quantile="0.95"}[7d])

      - alert: LatencyAnomaly
        expr: geode_query_duration_seconds{quantile="0.95"} > query_latency_baseline * 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency anomaly detected"
Troubleshooting
Missing Metrics
Symptom: Prometheus shows no targets or metrics unavailable
Solutions:
- Verify the server is running with metrics enabled:
  curl http://localhost:9090/metrics
- Check the Prometheus configuration:
  promtool check config prometheus.yml
- Verify the firewall allows port 9090
- Check the Prometheus logs:
  journalctl -u prometheus -f
High Cardinality Metrics
Symptom: Prometheus consuming excessive memory
Solutions:
- Limit label cardinality:
  # ❌ Bad: Unique label per user
  geode_user_queries_total{user_id="12345"}
  # ✅ Good: Aggregated labels
  geode_user_queries_total{user_type="premium"}
- Use recording rules to pre-aggregate:
  - record: query_rate_by_type
    expr: sum(rate(geode_queries_total[5m])) by (type)
- Adjust Prometheus retention:
  prometheus --storage.tsdb.retention.time=15d
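To find which metric family is driving cardinality, count distinct series per family in one scrape. A rough sketch (hypothetical helper; assumes the simple one-series-per-line exposition shape used throughout this page):

```python
from collections import Counter

def series_per_metric(exposition_text):
    """Count distinct time series per metric family to spot cardinality explosions."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, _ = line.rpartition(" ")
        family = series.split("{", 1)[0]  # strip the label set
        counts[family] += 1
    return counts
```

A family whose count tracks the number of users (or request IDs) is the usual culprit.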
Telemetry Performance Impact
Symptom: Telemetry causing performance degradation
Solutions:
- Disable paging telemetry in production:
  unset GEODE_TELEMETRY_PAGING
- Lengthen the Prometheus scrape interval:
  scrape_interval: 60s  # instead of 15s
- Use sampling for high-frequency events
Next Steps
Explore More:
- Observability Guide - Logging and tracing
- Deployment - Production deployment
- Performance Tuning - Query optimization
Related Topics:
- Alerting Best Practices - Alert design
- Backup and Recovery - Data protection
- Security Monitoring - Security events
Tools:
- Prometheus - Metrics collection
- Grafana - Visualization
- Alertmanager - Alert routing
Version: v0.1.1+ (QUIC+TLS metrics)
Telemetry: Optional streaming events (stderr JSON Lines)
Prometheus: Full integration with 50+ metrics
Status: Production-ready