Monitoring and Telemetry

Comprehensive observability for Geode: health checks, Prometheus metrics, optional telemetry, and audit logging.

Health and Readiness Endpoints

From USAGE.md:

Health Check

Endpoint: GET /health

Purpose: Verify server is running and responsive

curl -f http://localhost:8080/health || echo "Health check failed"

Response (healthy):

{
  "status": "healthy",
  "timestamp": "2024-01-15T14:30:00Z",
  "version": "0.1.3"
}

Response (unhealthy):

{
  "status": "unhealthy",
  "error": "database connection failed",
  "timestamp": "2024-01-15T14:30:00Z"
}

HTTP status codes:

  • 200 OK - Healthy
  • 503 Service Unavailable - Unhealthy
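A monitoring script should combine the HTTP status code with the JSON body when deciding whether to alert. A minimal Python sketch (interpret_health is a hypothetical helper, not part of Geode; the payloads follow the examples above):

```python
import json

def interpret_health(status_code: int, body: str) -> bool:
    """Return True if the server reports healthy.

    `status_code` and `body` are assumed to come from a GET /health
    call; the JSON fields mirror the example responses above.
    """
    payload = json.loads(body)
    healthy = status_code == 200 and payload.get("status") == "healthy"
    if not healthy:
        # Surface the reported error, if any, for diagnostics
        print(f"unhealthy: {payload.get('error', 'unknown error')}")
    return healthy

print(interpret_health(200, '{"status": "healthy", "version": "0.1.3"}'))
```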

Readiness Check

Endpoint: GET /ready

Purpose: Verify server is ready to accept traffic (useful for Kubernetes readiness probes)

curl -f http://localhost:8080/ready

Response (ready):

{
  "status": "ready",
  "timestamp": "2024-01-15T14:30:00Z",
  "connections": 42,
  "active_transactions": 3
}

Response (not ready):

{
  "status": "not_ready",
  "reason": "initialization in progress",
  "timestamp": "2024-01-15T14:30:00Z"
}

Kubernetes liveness probe:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

Kubernetes readiness probe:

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Prometheus Metrics

From USAGE.md:

Metrics Endpoint

Endpoint: GET /metrics

Format: Prometheus text exposition format

curl http://localhost:8080/metrics

Key Metrics

The key metrics, as listed in USAGE.md:

Query Metrics

# HELP geode_queries_total Total number of queries executed
# TYPE geode_queries_total counter
geode_queries_total{graph="SocialNetwork",status="success"} 12345

# HELP geode_query_duration_seconds Query execution time
# TYPE geode_query_duration_seconds histogram
geode_query_duration_seconds_bucket{graph="SocialNetwork",le="0.001"} 100
geode_query_duration_seconds_bucket{graph="SocialNetwork",le="0.01"} 500
geode_query_duration_seconds_bucket{graph="SocialNetwork",le="0.1"} 1200
geode_query_duration_seconds_bucket{graph="SocialNetwork",le="1.0"} 1250
geode_query_duration_seconds_sum{graph="SocialNetwork"} 45.67
geode_query_duration_seconds_count{graph="SocialNetwork"} 1250

# HELP geode_query_errors_total Total number of query errors
# TYPE geode_query_errors_total counter
geode_query_errors_total{graph="SocialNetwork",error_type="syntax"} 12
geode_query_errors_total{graph="SocialNetwork",error_type="constraint_violation"} 5

Transaction Metrics

# HELP geode_transactions_active Currently active transactions
# TYPE geode_transactions_active gauge
geode_transactions_active 3

# HELP geode_transactions_committed Total committed transactions
# TYPE geode_transactions_committed counter
geode_transactions_committed 5678

# HELP geode_transactions_aborted Total aborted transactions
# TYPE geode_transactions_aborted counter
geode_transactions_aborted{reason="serialization_error"} 23
geode_transactions_aborted{reason="constraint_violation"} 15

# HELP geode_transaction_duration_seconds Transaction execution time
# TYPE geode_transaction_duration_seconds histogram
geode_transaction_duration_seconds_bucket{le="0.1"} 4500
geode_transaction_duration_seconds_bucket{le="1.0"} 5600
geode_transaction_duration_seconds_bucket{le="10.0"} 5670

Storage Metrics

# HELP geode_storage_pages_total Total number of data pages
# TYPE geode_storage_pages_total gauge
geode_storage_pages_total 1234567

# HELP geode_storage_cache_hits Cache hit count
# TYPE geode_storage_cache_hits counter
geode_storage_cache_hits 98765432

# HELP geode_storage_cache_misses Cache miss count
# TYPE geode_storage_cache_misses counter
geode_storage_cache_misses 1234567

# HELP geode_storage_cache_hit_ratio Cache hit ratio
# TYPE geode_storage_cache_hit_ratio gauge
geode_storage_cache_hit_ratio 0.987

# HELP geode_wal_writes_bytes Write-Ahead Log bytes written
# TYPE geode_wal_writes_bytes counter
geode_wal_writes_bytes 123456789012

Connection Metrics

# HELP geode_connections_active Currently active client connections
# TYPE geode_connections_active gauge
geode_connections_active 42

# HELP geode_connections_total Total connections since start
# TYPE geode_connections_total counter
geode_connections_total 12345

Index Metrics

# HELP geode_index_lookups_total Index lookup count
# TYPE geode_index_lookups_total counter
geode_index_lookups_total{index="person_age_idx",type="btree"} 45678

# HELP geode_index_size_bytes Index size in bytes
# TYPE geode_index_size_bytes gauge
geode_index_size_bytes{index="person_age_idx",type="btree"} 12345678
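The geode_storage_cache_hit_ratio gauge can be cross-checked against the two cache counters. A quick sanity check in Python, using the sample counter values above (the exported gauge may lag the counters slightly):

```python
# Derive the cache hit ratio from the sample counter values above
cache_hits = 98765432
cache_misses = 1234567

hit_ratio = cache_hits / (cache_hits + cache_misses)
print(f"cache hit ratio: {hit_ratio:.3f}")
```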

Prometheus Configuration

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'geode'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'

Start Prometheus:

prometheus --config.file=prometheus.yml

Query examples:

# Query rate (QPS)
rate(geode_queries_total[5m])

# 95th percentile query latency
histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[5m]))

# Cache hit ratio
geode_storage_cache_hit_ratio

# Transaction abort rate
rate(geode_transactions_aborted[5m])
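histogram_quantile estimates a quantile by linear interpolation within the cumulative bucket that contains the target rank. A simplified Python sketch (ignoring rate() windows and the +Inf bucket), applied to the geode_query_duration_seconds buckets shown earlier:

```python
# Simplified histogram_quantile: linearly interpolate inside the
# cumulative bucket that contains the target rank.
def histogram_quantile(q, buckets, total):
    """buckets: list of (upper_bound, cumulative_count), ascending."""
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cum_count in buckets:
        if rank <= cum_count:
            fraction = (rank - lower_count) / (cum_count - lower_count)
            return lower_bound + fraction * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, cum_count
    return lower_bound  # rank beyond the largest finite bucket

# Bucket data from the geode_query_duration_seconds example above
buckets = [(0.001, 100), (0.01, 500), (0.1, 1200), (1.0, 1250)]
p95 = histogram_quantile(0.95, buckets, 1250)
print(f"p95 falls in the (0.01, 0.1] bucket: {p95:.4f}s")
```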

Optional Telemetry

From TELEMETRY.md:

Paging telemetry outputs detailed system events to stderr (JSONL format).

Purpose: Debugging, CI/testing, performance analysis

Enable:

# Enable paging telemetry
export GEODE_TELEMETRY_PAGING=1

./geode serve

Output (stderr, JSONL):

{"event":"page_read","page_id":12345,"timestamp":"2024-01-15T14:30:00.123Z","duration_us":42}
{"event":"page_write","page_id":12346,"timestamp":"2024-01-15T14:30:00.456Z","duration_us":156}
{"event":"index_lookup","index":"person_age_idx","key":30,"timestamp":"2024-01-15T14:30:00.789Z","duration_us":23}
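Because the stream is JSONL, it is straightforward to post-process. A sketch that averages duration_us per event type (summarize_telemetry and the sample lines are illustrative, not part of Geode):

```python
import json
from collections import defaultdict

def summarize_telemetry(lines):
    """Average duration_us per event type from paging-telemetry JSONL."""
    totals = defaultdict(lambda: [0, 0])  # event -> [sum_us, count]
    for line in lines:
        event = json.loads(line)
        totals[event["event"]][0] += event["duration_us"]
        totals[event["event"]][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}

sample = [
    '{"event":"page_read","page_id":12345,"duration_us":42}',
    '{"event":"page_read","page_id":12346,"duration_us":58}',
    '{"event":"page_write","page_id":12346,"duration_us":156}',
]
print(summarize_telemetry(sample))
```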

CI/Testing toggles (from TELEMETRY.md):

# Disable telemetry in tests (reduce noise)
export GEODE_TELEMETRY_PAGING=0

# Enable verbose telemetry for debugging
export GEODE_TELEMETRY_VERBOSE=1

Note: Paging telemetry has performance overhead (~5-10%). Use only for debugging/testing.

Audit Logs and Tracing

From AUDIT_LOGGING.md:

Audit Log Configuration

Enable audit logging (geode.yaml):

security:
  audit:
    enabled: true
    log_path: "/var/log/geode/audit.jsonl"

    # Syslog/CEF forwarding
    syslog:
      enabled: true
      address: "syslog.example.com:514"
      format: "CEF"  # Common Event Format

    # Retention
    retention_days: 365
    max_size: "10GB"

What’s Logged

Events logged:

  • ✅ Authentication (login, logout, failed attempts)
  • ✅ Authorization decisions (policy evaluations)
  • ✅ Schema changes (CREATE/ALTER/DROP)
  • ✅ Administrative actions (user/role management)
  • ✅ Query metadata (timestamp, user, graph, execution time)

Events NOT logged (privacy/security):

  • ❌ Query text (avoid logging sensitive data)
  • ❌ Query parameters
  • ❌ Result sets

Audit Log Entry Format

Example (JSONL):

{
  "timestamp": "2024-01-15T14:30:00.123Z",
  "event_type": "query_executed",
  "user": "alice",
  "graph": "SocialNetwork",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "trace_id": "7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f",
  "execution_time_ms": 23.5,
  "rows_returned": 150,
  "prev_log_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  "signature": "3045022100..."
}

Tamper-evident properties:

  • Hash chain: Each entry includes prev_log_hash (SHA-256 of previous entry)
  • Signatures: Entries signed with server private key for non-repudiation
  • Tracing IDs: trace_id correlates events across distributed systems
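The writer side of the hash chain can be sketched as follows (illustrative only: field ordering and signing are simplified, and append_entry is not a Geode API):

```python
import hashlib
import json

def append_entry(log_lines, entry):
    """Append an audit entry, chaining it to the previous raw line.

    Sketch only: real entries are also signed with the server
    private key, which is omitted here.
    """
    if log_lines:
        prev = log_lines[-1]
        entry["prev_log_hash"] = hashlib.sha256(prev.encode()).hexdigest()
    log_lines.append(json.dumps(entry, sort_keys=True))

log = []
append_entry(log, {"event_type": "auth_success", "user": "alice"})
append_entry(log, {"event_type": "query_executed", "user": "alice"})
# The second entry now commits to the exact bytes of the first
print(log[1])
```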

Tracing IDs

Purpose: Correlate events across:

  • Multiple queries in a transaction
  • Distributed query execution (federated queries)
  • CDC webhooks and downstream processing

Example of a distributed trace:

// Client query
{"trace_id": "7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f", "event": "query_started", ...}

// Federated execution on shard 1
{"trace_id": "7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f", "event": "shard_query", "shard": 1, ...}

// Federated execution on shard 2
{"trace_id": "7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f", "event": "shard_query", "shard": 2, ...}

// Result merge
{"trace_id": "7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f", "event": "result_merged", ...}

// Query completed
{"trace_id": "7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f", "event": "query_completed", ...}
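Reassembling a distributed trace from such a stream amounts to grouping events by trace_id. A minimal Python sketch (group_by_trace and the sample lines are illustrative):

```python
import json
from collections import defaultdict

def group_by_trace(lines):
    """Group trace events by trace_id to reassemble one request."""
    traces = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        traces[event["trace_id"]].append(event["event"])
    return dict(traces)

sample = [
    '{"trace_id":"7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f","event":"query_started"}',
    '{"trace_id":"7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f","event":"shard_query"}',
    '{"trace_id":"7c9e8d6f-5b4a-3c2d-1e0f-9a8b7c6d5e4f","event":"query_completed"}',
]
print(group_by_trace(sample))
```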

Audit Log Analysis

Verify hash chain integrity:

# Python script to verify audit log integrity
import hashlib
import json
import sys

prev_hash = None
with open('/var/log/geode/audit.jsonl') as f:
    for line_num, line in enumerate(f, 1):
        entry = json.loads(line)

        # Each entry must reference the SHA-256 of the previous raw line
        if prev_hash is not None and entry['prev_log_hash'] != prev_hash:
            print(f"Hash chain broken at line {line_num}!")
            sys.exit(1)

        # Hash this raw line (as written) for the next iteration
        prev_hash = hashlib.sha256(line.encode()).hexdigest()

print("Audit log integrity verified")

Query audit logs:

# Find all queries by user 'alice'
cat /var/log/geode/audit.jsonl | jq 'select(.user == "alice")'

# Find failed authentication attempts
cat /var/log/geode/audit.jsonl | jq 'select(.event_type == "auth_failed")'

# Find slow queries (>1 second)
cat /var/log/geode/audit.jsonl | jq 'select(.execution_time_ms > 1000)'

Logging Configuration

From DOCKER_LOG_LEVEL_CONFIG.md:

Log Levels

# Set log level (debug/info/warn/error)
export LOG_LEVEL=info

./geode serve

Docker (docker-compose.yml):

services:
  geode:
    image: geodedb/geode:latest
    environment:
      LOG_LEVEL: info  # or debug, warn, error

Defaults:

  • Development: debug
  • Production: info

Log Formats

Text format (human-readable):

2024-01-15T14:30:00.123Z INFO  [geode::server] Server started on 0.0.0.0:3141
2024-01-15T14:30:01.456Z DEBUG [geode::query] Executing query for user 'alice'
2024-01-15T14:30:01.789Z WARN  [geode::storage] Cache miss for page 12345

JSON format (machine-readable):

{"timestamp":"2024-01-15T14:30:00.123Z","level":"INFO","module":"geode::server","message":"Server started on 0.0.0.0:3141"}
{"timestamp":"2024-01-15T14:30:01.456Z","level":"DEBUG","module":"geode::query","message":"Executing query for user 'alice'"}

Configure (geode.yaml):

logging:
  level: 'info'
  format: 'json'  # or 'text'
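The JSON format makes level-based filtering trivial downstream. A sketch, assuming the severity ordering debug < info < warn < error (filter_logs and the mapping are illustrative):

```python
import json

# Numeric severity order used for filtering (assumed mapping)
SEVERITY = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}

def filter_logs(lines, min_level="WARN"):
    """Keep JSON-format log lines at or above min_level."""
    threshold = SEVERITY[min_level]
    return [
        entry for entry in map(json.loads, lines)
        if SEVERITY[entry["level"]] >= threshold
    ]

sample = [
    '{"level":"INFO","module":"geode::server","message":"Server started"}',
    '{"level":"WARN","module":"geode::storage","message":"Cache miss"}',
]
print(filter_logs(sample))
```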

Grafana Dashboards

From deployment/DEPLOYMENT.md:

Access Grafana: http://localhost:3000 (admin/admin)

Import Geode Dashboard

  1. Download dashboard JSON from docs/deployment/grafana-dashboard.json
  2. Import in Grafana: Home → Dashboards → Import
  3. Configure data source: Select Prometheus

Dashboard Panels

Query Performance:

  • Query rate (QPS)
  • 50th/95th/99th percentile latency
  • Error rate

Transaction Metrics:

  • Active transactions
  • Commit/abort rate
  • Serialization error rate

Storage:

  • Cache hit ratio
  • Page read/write rate
  • WAL write throughput

Connections:

  • Active connections
  • Connection rate

Alerts:

  • High error rate (>1%)
  • High latency (p95 >1s)
  • Cache hit ratio low (<80%)
  • Transaction abort rate high (>10%)
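The alert thresholds above can be expressed as Prometheus alerting rules. A sketch (rule names and for: durations are illustrative, not shipped with Geode):

```yaml
groups:
  - name: geode-alerts
    rules:
      - alert: GeodeHighErrorRate
        expr: >
          rate(geode_query_errors_total[5m])
            / rate(geode_queries_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
      - alert: GeodeHighLatency
        expr: >
          histogram_quantile(0.95,
            rate(geode_query_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
      - alert: GeodeLowCacheHitRatio
        expr: geode_storage_cache_hit_ratio < 0.8
        for: 15m
        labels:
          severity: warning
```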

Loki Log Aggregation

From deployment/DEPLOYMENT.md:

Loki + Promtail aggregate logs from all Geode instances.

Promtail configuration (promtail-config.yaml):

server:
  http_listen_port: 9080
  grpc_listen_port: 0

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: geode
    static_configs:
      - targets:
          - localhost
        labels:
          job: geode
          __path__: /var/log/geode/*.log

Query logs in Grafana:

# All logs from geode
{job="geode"}

# Error logs only
{job="geode"} |= "ERROR"

# Query execution logs
{job="geode"} |= "query_executed"

# Slow queries
{job="geode"} | json | execution_time_ms > 1000

Next Steps