Observability is the ability to understand a system’s internal state by examining its outputs. Unlike traditional monitoring that focuses on predefined metrics, observability enables you to ask arbitrary questions about your system’s behavior, making it essential for debugging complex distributed systems and understanding emergent behaviors.

Geode implements comprehensive observability through the three pillars: metrics for quantitative measurements, logs for detailed event records, and traces for request flow visualization. This multi-dimensional approach provides complete visibility into query execution, transaction behavior, resource utilization, and system health.

This guide covers observability architecture, implementation patterns, debugging strategies, and best practices for maintaining observable Geode deployments.

The Three Pillars of Observability

Metrics: What is Happening

Metrics provide aggregated, time-series data about system behavior:

# Query throughput over time
rate(geode_queries_total[5m])

# p95 query latency trend
histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[5m]))

# Memory usage pattern
geode_memory_used_bytes
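The `histogram_quantile` estimate works by locating the bucket containing the target rank and interpolating linearly inside it. A simplified Python sketch of that calculation (ignoring the `+Inf` bucket and other edge cases Prometheus handles):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    """
    total = buckets[-1][1]
    target = q * total  # rank of the desired observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # Linear interpolation within the bucket, as Prometheus does
            return prev_bound + (bound - prev_bound) * (target - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

With buckets `(0.1s: 50, 0.5s: 90, 1.0s: 100)`, the p95 falls halfway through the last bucket, giving an estimate of 0.75s; this is why bucket boundaries matter more than raw precision for latency SLOs.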

Use Cases:

  • Trend analysis and anomaly detection
  • Performance baseline establishment
  • Capacity planning
  • SLO monitoring
  • Alerting on threshold violations

Logs: Why is it Happening

Logs capture discrete events with contextual details:

{
  "timestamp": "2026-01-24T10:15:30.123Z",
  "level": "ERROR",
  "message": "Query execution failed",
  "query_id": "q-12847",
  "user": "analyst",
  "error": "Index out of bounds",
  "query_text": "MATCH (n:User) WHERE n.id > $limit RETURN n",
  "stack_trace": "...",
  "duration_ms": 234.5
}

Use Cases:

  • Root cause analysis
  • Error investigation
  • Audit trails
  • Security forensics
  • Understanding specific execution paths

Traces: How is it Flowing

Traces show request flows through distributed systems:

Trace ID: abc123-def456-ghi789
Span: http_handler [250ms]
  └─ Span: authenticate_user [15ms]
  └─ Span: execute_gql_query [220ms]
      ├─ Span: parse_query [5ms]
      ├─ Span: optimize_plan [10ms]
      ├─ Span: execute_plan [200ms]
      │   ├─ Span: index_lookup [30ms]
      │   ├─ Span: expand_relationships [150ms]
      │   └─ Span: aggregate_results [20ms]
      └─ Span: serialize_response [5ms]
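Latency attribution over a tree like this is mostly subtraction: a span's self time is its duration minus the time spent in its children. A small sketch over a nested-dict span representation (the field names are illustrative, not a Geode API):

```python
def self_time_ms(span):
    """Time spent in a span itself, excluding its child spans."""
    return span["duration_ms"] - sum(
        child["duration_ms"] for child in span.get("children", ())
    )

def hotspots(span, threshold_ms=20):
    """Yield (name, self_time) for spans whose own work exceeds the threshold."""
    own = self_time_ms(span)
    if own >= threshold_ms:
        yield span["name"], own
    for child in span.get("children", ()):
        yield from hotspots(child, threshold_ms)
```

Applied to the trace above, `expand_relationships` (150ms, no children) surfaces as the dominant hotspot, while `execute_plan` contributes almost no self time despite its 200ms duration.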

Use Cases:

  • Performance bottleneck identification
  • Understanding service dependencies
  • Latency attribution
  • Distributed debugging
  • Optimization targeting

Structured Logging

Geode uses structured JSON logging for machine-readable, queryable log data:

Log Configuration

# geode.toml
[logging]
level = "INFO"           # DEBUG, INFO, WARN, ERROR
format = "json"          # json or text
output = "stdout"        # stdout, stderr, file
file = "/var/log/geode/geode.log"
rotate_size = "100MB"
rotate_count = 10
include_caller = true    # Include file:line information

Log Levels

DEBUG: Detailed diagnostic information

{
  "level": "DEBUG",
  "message": "Query plan generated",
  "query_id": "q-12847",
  "plan_type": "indexed_lookup",
  "estimated_rows": 1250,
  "index_used": "User.email"
}

INFO: General operational events

{
  "level": "INFO",
  "message": "Query executed successfully",
  "query_id": "q-12847",
  "duration_ms": 45.3,
  "rows_returned": 1250
}

WARN: Warning conditions

{
  "level": "WARN",
  "message": "Slow query detected",
  "query_id": "q-12847",
  "duration_ms": 1234.5,
  "threshold_ms": 1000,
  "query_text": "MATCH (n) RETURN n"
}

ERROR: Error events requiring attention

{
  "level": "ERROR",
  "message": "Transaction conflict",
  "transaction_id": "tx-456",
  "conflict_type": "write_write",
  "retry_count": 3,
  "error_code": "40001"
}
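Application code can emit logs in the same shape using only the standard library. A minimal sketch of a JSON formatter, with field names chosen to match the examples above (extra fields are passed through logging's `extra` mechanism under a `fields` key, an illustrative convention):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields attached via the `extra` kwarg
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Query executed successfully",
            extra={"fields": {"query_id": "q-12847", "duration_ms": 45.3}})
```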

Contextual Logging

Add context to log entries for correlation:

# Python client with logging context (loguru-style API; `client` and
# `generate_id` are assumed to be defined elsewhere)
from loguru import logger

async def process_user_query(user_id, query):
    # Attach context to all logs emitted within this block
    with logger.contextualize(user_id=user_id, request_id=generate_id()):
        logger.bind(query_type="recommendation").info("Processing user query")

        try:
            result, _ = await client.query(query)
            logger.bind(
                rows_returned=len(result.rows),
                duration_ms=result.duration,
            ).info("Query completed")
            return result
        except Exception as e:
            logger.bind(error=str(e), query_text=query).error("Query failed")
            raise

Log Aggregation

Centralize logs with popular aggregation tools:

Elasticsearch Integration:

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/geode/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "geode-logs-%{+yyyy.MM.dd}"

Loki Integration:

# promtail.yml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: geode
    static_configs:
      - targets:
          - localhost
        labels:
          job: geode
          __path__: /var/log/geode/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            timestamp: timestamp
            message: message
      - labels:
          level:

Distributed Tracing

OpenTelemetry Integration

Geode supports OpenTelemetry for standardized distributed tracing:

# geode.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "http://localhost:4317"
sample_rate = 0.1            # Sample 10% of traces
service_name = "geode"
environment = "production"

# Trace specific operations
trace_queries = true
trace_transactions = true
trace_index_operations = true

Trace Instrumentation

Automatic instrumentation captures spans for:

Query Execution:

Span: execute_gql_query
  Attributes:
    - query.text: "MATCH (u:User) WHERE u.age > 25 RETURN u"
    - query.id: "q-12847"
    - query.status: "success"
    - query.rows: 1250
  Duration: 145ms

  Child Spans:
    - parse_query (5ms)
    - optimize_plan (10ms)
    - execute_plan (125ms)
    - serialize_response (5ms)

Transaction Lifecycle:

Span: transaction
  Attributes:
    - tx.id: "tx-456"
    - tx.isolation_level: "SERIALIZABLE"
    - tx.status: "committed"
  Duration: 2340ms

  Child Spans:
    - begin (2ms)
    - execute_query_1 (145ms)
    - execute_query_2 (234ms)
    - commit (15ms)

Index Operations:

Span: index_build
  Attributes:
    - index.name: "User.email"
    - index.type: "btree"
    - index.rows: 1000000
  Duration: 45000ms

Custom Spans

Add application-specific tracing:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def recommend_products(user_id):
    with tracer.start_as_current_span("recommend_products") as span:
        span.set_attribute("user_id", user_id)

        # Fetch user preferences
        with tracer.start_as_current_span("fetch_preferences"):
            prefs, _ = await client.query(
                "MATCH (u:User {id: $id})-[:LIKES]->(p:Product) RETURN p",
                {"id": user_id}
            )

        # Generate recommendations
        with tracer.start_as_current_span("generate_recommendations") as rec_span:
            rec_span.set_attribute("input_products", len(prefs))
            recommendations = await compute_recommendations(prefs)
            rec_span.set_attribute("recommendations_count", len(recommendations))

        return recommendations

Trace Sampling

Control trace volume with intelligent sampling:

Tail-Based Sampling: Sample based on trace characteristics

[tracing.sampling]
strategy = "tail_based"

# Always sample errors
sample_on_error = true

# Always sample slow requests
slow_threshold_ms = 1000
sample_slow = true

# Sample 10% of normal requests
default_rate = 0.1

Probabilistic Sampling: Random sampling percentage

[tracing.sampling]
strategy = "probabilistic"
rate = 0.05  # Sample 5% of traces
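The tail-based policy above amounts to a per-trace decision function evaluated after the trace completes. A sketch with assumed field names (real collectors such as the OpenTelemetry Collector implement this as a configurable processor):

```python
def should_keep(trace, default_rate=0.1, slow_threshold_ms=1000):
    """Tail-based sampling decision for a completed trace."""
    # Always keep traces containing an error
    if trace.get("error"):
        return True
    # Always keep slow traces
    if trace["duration_ms"] >= slow_threshold_ms:
        return True
    # Otherwise keep a deterministic fraction, keyed on the (hex) trace ID
    # so every collector makes the same decision for the same trace
    bucket = int(trace["trace_id"][-4:], 16) % 100
    return bucket < default_rate * 100
```

Keying the default-rate decision on the trace ID rather than a random draw keeps sampling consistent across collector instances.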

Correlation Across Pillars

Link metrics, logs, and traces for comprehensive debugging:

Request ID Propagation

import uuid

# Generate request ID
request_id = str(uuid.uuid4())

# Include in query metadata
result, _ = await client.query(
    query,
    params,
    metadata={"request_id": request_id}
)

# Request ID appears in:
# - Trace (span attribute)
# - Logs (log field)
# - Metrics (optional label for custom metrics)
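On the application side, a context variable plus a logging filter is enough to stamp the request ID onto every log record without threading it through each call. A stdlib-only sketch:

```python
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        # Copy the current request ID onto every record
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("app")
logger.addFilter(RequestIdFilter())

def handle_request():
    request_id_var.set(str(uuid.uuid4()))
    logger.info("handling request")  # record now carries request_id
```

A formatter can then include `%(request_id)s` so the ID appears in every line.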

Trace ID in Logs

{
  "timestamp": "2026-01-24T10:15:30.123Z",
  "level": "ERROR",
  "message": "Query execution failed",
  "trace_id": "abc123def456ghi789",
  "span_id": "xyz789",
  "query_id": "q-12847",
  "error": "Index out of bounds"
}

Pivot from a log entry to its trace, or find all logs for a trace, using the shared IDs:

# Find logs for specific trace
jq 'select(.trace_id == "abc123def456ghi789")' /var/log/geode/geode.log

# Find trace from log entry
curl "http://jaeger:16686/api/traces/abc123def456ghi789"

Observability-Driven Debugging

Performance Regression Investigation

  1. Detect anomaly in metrics:

# p95 latency spiked
histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[5m])) > 0.5

  2. Identify affected queries from logs:

jq 'select(.duration_ms > 500) | {query_id, query_text, duration_ms}' \
  /var/log/geode/geode.log

  3. Analyze traces for bottlenecks: Look for spans with unexpectedly high duration in the trace viewer.

  4. Profile the slow query:

PROFILE
MATCH (u:User)-[:FOLLOWS]->(other)
RETURN u.name, count(other);
Error Rate Spike Investigation

  1. Detect in metrics:

rate(geode_queries_total{status="error"}[5m]) > 10

  2. Analyze error logs:

jq 'select(.level == "ERROR") | {timestamp, message, error, query_text}' \
  /var/log/geode/geode.log | tail -100

  3. Group errors by type:

jq -r 'select(.level == "ERROR") | .error' /var/log/geode/geode.log | \
  sort | uniq -c | sort -rn

  4. Examine failed traces: Filter traces by error status to see failure patterns.
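The same grouping can be done in Python when logs are post-processed programmatically rather than piped through jq; a stdlib-only sketch over JSON-lines input:

```python
import json
from collections import Counter

def top_errors(log_lines, n=10):
    """Count ERROR log entries by their `error` field."""
    counts = Counter(
        entry["error"]
        for entry in map(json.loads, log_lines)
        if entry.get("level") == "ERROR"
    )
    return counts.most_common(n)
```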

Observability Best Practices

High Cardinality Awareness: Avoid unbounded label/field values (user IDs, session IDs) that explode storage requirements.

Consistent Naming: Use consistent naming conventions across metrics, logs, and traces for easy correlation.

Contextual Enrichment: Include relevant context (user, query type, client) in all observability signals.

Sampling Strategy: Sample traces appropriately to balance coverage and overhead (1-10% for high-volume systems).

Retention Policies: Define appropriate retention periods for each pillar based on use case (metrics: 30d, logs: 7d, traces: 3d).

Alert on Symptoms, Not Causes: Alert on user-impacting symptoms (high latency, errors) rather than internal metrics (CPU, memory).

Documentation: Maintain runbooks linking alerts to investigation procedures using observability tools.

Cost Management: Monitor observability pipeline costs and optimize sampling, retention, and cardinality.

Further Reading

  • Observability Engineering Handbook
  • OpenTelemetry Integration Guide
  • Distributed Tracing Patterns
  • Log Analysis Best Practices
  • Production Debugging Strategies
