Operations
Operational guides for deploying, managing, and maintaining Geode in production environments, from single-node deployments to globally distributed clusters.
Overview
Running Geode in production requires careful attention to:
- Backup and Recovery: Protecting data with automated backups and tested restore procedures
- Monitoring: Observability with metrics, logging, and alerting
- Disaster Recovery: Business continuity planning and tested failover procedures
- Upgrades: Safe version upgrades with minimal downtime
- Migration: Moving data between environments and versions
Whether you’re a DevOps engineer, SRE, or database administrator, these guides cover the operational aspects of Geode that ensure reliability, performance, and data protection.
Topics in This Section
- Backup Procedures - Automated backup strategies including S3 cloud storage, incremental backups, and point-in-time recovery
- Monitoring - Metrics, dashboards, and alerting with Prometheus and Grafana
- Disaster Recovery - DR planning, RTO/RPO objectives, and failover procedures
- Upgrade Procedures - Safe upgrade strategies including rolling upgrades and blue-green deployments
- Migration Guide - Data migration between versions, between environments, and from other databases
Operational Checklist
Pre-Production Checklist
Before deploying to production, verify:
- Backup configured: Automated backups to S3-compatible storage
- Backup tested: Restore procedure verified in staging
- Monitoring deployed: Prometheus metrics scraping, Grafana dashboards
- Alerts configured: Critical alerts for availability and performance
- TLS certificates: Valid certificates from a trusted CA (not self-signed)
- Authentication enabled: Strong authentication with MFA for admins
- Authorization configured: RBAC/RLS policies for data access
- Audit logging enabled: Compliance-ready audit trail
- DR plan documented: Recovery procedures tested and documented
- Runbooks created: Operational procedures for common tasks
Day 2 Operations
Ongoing operational tasks:
| Task | Frequency | Description |
|---|---|---|
| Backup verification | Weekly | Test restore from latest backup |
| Log review | Daily | Check for errors and anomalies |
| Metrics review | Daily | Monitor resource utilization trends |
| Certificate renewal | 30 days before expiry | Renew TLS certificates |
| Password rotation | Every 90 days | Rotate service account passwords |
| Security patches | As released | Apply security updates |
| Capacity review | Monthly | Plan for growth |
| DR drill | Quarterly | Test failover procedures |
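
The certificate renewal row above can be backed by a scheduled check. The following is a minimal sketch, assuming OpenSSL is installed and the certificate is a PEM file; the path in the example comment is hypothetical, not a Geode default.

```shell
#!/bin/sh
# Sketch: warn when a TLS certificate expires within the renewal window.
# The certificate path passed in is a hypothetical example, not a Geode default.

# Convert a number of days to seconds (the unit `openssl x509 -checkend` expects).
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# cert_valid_beyond CERT_FILE WINDOW_DAYS
# Returns 0 if the certificate stays valid beyond the window, non-zero otherwise.
cert_valid_beyond() {
    cert_file=$1
    window_days=$2
    openssl x509 -checkend "$(days_to_seconds "$window_days")" -noout -in "$cert_file" >/dev/null
}

# Print a renewal reminder for certificates inside the 30-day window.
check_cert() {
    if cert_valid_beyond "$1" 30; then
        echo "OK: $1 is valid for more than 30 days"
    else
        echo "RENEW: $1 expires within 30 days (or could not be read)"
        return 1
    fi
}

# Example (hypothetical path):
#   check_cert /etc/geode/tls/server.crt
```

Run from cron or a monitoring agent, a non-zero exit status can drive the renewal alert.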
Quick Reference
Health Checks
# Server health
curl http://localhost:8080/health
# Readiness check (for load balancers)
curl http://localhost:8080/ready
# Liveness check (for orchestrators)
curl http://localhost:8080/live
# Detailed status
geode admin status
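During deployments it is often useful to wait for the readiness endpoint rather than probe it once. The following is a minimal sketch built on the `/ready` endpoint shown above; the attempt count, delay, and function names are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: poll the readiness endpoint until it answers or attempts run out.
# The attempt count and delay are illustrative assumptions.

# Return 0 if the request succeeds without an HTTP error status (4xx/5xx).
probe() {
    curl -fsS -o /dev/null --max-time 2 "$1"
}

# wait_ready URL ATTEMPTS DELAY_SECONDS
wait_ready() {
    url=$1
    attempts=$2
    delay=$3
    i=1
    while [ "$i" -le "$attempts" ]; do
        if probe "$url"; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        # Sleep between attempts, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$(( i + 1 ))
    done
    echo "not ready after $attempts attempt(s)" >&2
    return 1
}

# Example: wait up to about a minute for a local node.
#   wait_ready http://localhost:8080/ready 30 2
```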
Backup Operations
# Create full backup
geode backup --dest s3://bucket/backups --mode full
# Create incremental backup
geode backup --dest s3://bucket/backups --mode incremental
# List backups
geode backup --dest s3://bucket/backups --list
# Restore from backup
geode restore --source s3://bucket/backups --backup-id <id>
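The full and incremental commands above can be combined into one scheduled job. The following is a minimal sketch, assuming a weekly full backup on Sunday and incremental backups on other days; the choice of Sunday and the function names are assumptions, and the bucket URL is the placeholder used above.

```shell
#!/bin/sh
# Sketch: pick the backup mode from the day of the week, then run the backup.
# Assumes a weekly full backup on Sunday and incrementals on other days.

# backup_mode_for_day DAY  (DAY is 1-7 with Monday=1, as printed by `date +%u`)
backup_mode_for_day() {
    if [ "$1" -eq 7 ]; then
        echo full
    else
        echo incremental
    fi
}

# run_scheduled_backup DEST  (DEST is a backup destination such as s3://bucket/backups)
run_scheduled_backup() {
    dest=$1
    mode=$(backup_mode_for_day "$(date +%u)")
    echo "starting $mode backup to $dest"
    geode backup --dest "$dest" --mode "$mode"
}

# Example, run once a day from cron:
#   run_scheduled_backup s3://bucket/backups
```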
Monitoring
# View metrics
curl http://localhost:8080/metrics
# View logs
journalctl -u geode -f
# Check resource usage
geode admin stats
Common Operations
# Graceful shutdown
systemctl stop geode
# Restart (stop, then start)
systemctl restart geode
# Check configuration
geode config validate
# Database maintenance
geode admin maintenance --compact
Architecture Overview
Standalone Deployment
┌─────────────────────────────────────────────────┐
│                  Geode Server                   │
│                   (QUIC:3141)                   │
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │               Query Engine               │   │
│  │  ┌──────┐  ┌──────────┐  ┌───────────┐   │   │
│  │  │Parser│─>│ Optimizer│─>│ Executor  │   │   │
│  │  └──────┘  └──────────┘  └───────────┘   │   │
│  └──────────────────────────────────────────┘   │
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │              Storage Engine              │   │
│  │  ┌──────┐  ┌──────────┐  ┌───────────┐   │   │
│  │  │ WAL  │  │ Indexes  │  │   Pages   │   │   │
│  │  └──────┘  └──────────┘  └───────────┘   │   │
│  └──────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘
           │                          │
           ▼                          ▼
      ┌─────────┐              ┌────────────┐
      │ Backups │              │ Monitoring │
      │  (S3)   │              │(Prometheus)│
      └─────────┘              └────────────┘
Distributed Deployment
                   ┌─────────────────┐
                   │  Load Balancer  │
                   │ (Nginx/HAProxy) │
                   └────────┬────────┘
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
  ┌────▼────┐          ┌────▼────┐          ┌────▼────┐
  │ Geode 1 │          │ Geode 2 │          │ Geode 3 │
  │(Primary)│          │(Replica)│          │(Replica)│
  └────┬────┘          └────┬────┘          └────┬────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
  ┌────▼────┐          ┌────▼────┐          ┌────▼─────┐
  │  Vault  │          │  MinIO  │          │Prometheus│
  │  (KMS)  │          │(Backups)│          │ Grafana  │
  └─────────┘          └─────────┘          └──────────┘
Operational Metrics
Key Performance Indicators (KPIs)
| Metric | Target | Critical Threshold |
|---|---|---|
| Availability | 99.9% | < 99% |
| Query latency (p50) | < 10ms | > 100ms |
| Query latency (p99) | < 100ms | > 1s |
| Error rate | < 0.1% | > 1% |
| Backup success rate | 100% | < 95% |
| RTO (Recovery Time) | < 5 min | > 15 min |
| RPO (Recovery Point) | < 15 min | > 1 hour |
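
For the "lower is better" rows in this table (latency, error rate, RTO, RPO), a reading can be classified against its target and critical threshold. The following is a minimal sketch; the table does not name the band between the two thresholds, so the `warning` label is an assumption, as is the function name.

```shell
#!/bin/sh
# Sketch: classify a "lower is better" metric against the table's thresholds.
# Values are integers in the metric's own unit (for example milliseconds).
# The "warning" band between target and critical is an assumed label.

# classify VALUE TARGET CRITICAL -> prints ok, warning, or critical
classify() {
    value=$1
    target=$2
    critical=$3
    if [ "$value" -gt "$critical" ]; then
        echo critical
    elif [ "$value" -lt "$target" ]; then
        echo ok
    else
        echo warning
    fi
}

# p99 query latency: target < 100 ms, critical > 1000 ms
classify 40 100 1000     # ok
classify 250 100 1000    # warning
classify 1500 100 1000   # critical
```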
SLO Examples
# Service Level Objectives
slos:
  availability:
    target: 99.9%
    window: 30d
  latency:
    p50_target: 10ms
    p99_target: 100ms
    window: 24h
  error_rate:
    target: 0.1%
    window: 1h
  backup:
    success_rate: 100%
    max_age: 26h
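An availability target is easier to reason about as a downtime budget. The following is a short worked example of the arithmetic for the 99.9% over 30 days objective; the function name is hypothetical, and the target is passed in basis points (99.9% = 9990) so the shell can stay in integer arithmetic.

```shell
#!/bin/sh
# Sketch: turn an availability SLO into an allowed-downtime budget.
# The target is given in basis points (99.9% = 9990) to stay in integer arithmetic.

# allowed_downtime_seconds WINDOW_DAYS TARGET_BP
allowed_downtime_seconds() {
    window_seconds=$(( $1 * 86400 ))
    echo $(( window_seconds * (10000 - $2) / 10000 ))
}

budget=$(allowed_downtime_seconds 30 9990)
echo "99.9% over 30d allows ${budget}s of downtime ($(( budget / 60 )) min $(( budget % 60 )) s)"
# prints: 99.9% over 30d allows 2592s of downtime (43 min 12 s)
```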
Best Practices
Availability
- Deploy multiple replicas: At least 3 nodes for high availability
- Use load balancing: Distribute traffic across healthy nodes
- Implement health checks: Automatic removal of unhealthy nodes
- Test failover regularly: Quarterly DR drills
- Monitor continuously: Real-time alerting for issues
Data Protection
- Automate backups: Daily incremental, weekly full
- Test restores: Monthly restore verification
- Offsite storage: Backups in different region/provider
- Encrypt backups: Server-side encryption for compliance
- Monitor backup age: Alert if backup older than 26 hours
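The backup-age alert can be expressed as a small check. The following is a minimal sketch, assuming the time of the last successful backup is already available as Unix epoch seconds; how that timestamp is obtained is left open, and the function names are hypothetical. It relies on `date +%s`, which GNU and BSD `date` support.

```shell
#!/bin/sh
# Sketch: flag a backup as stale when it is older than a maximum age in hours.
# Timestamps are Unix epoch seconds; obtaining the last backup time is left open.

# backup_is_stale LAST_BACKUP_EPOCH NOW_EPOCH MAX_AGE_HOURS
# Returns 0 (true) when the backup is older than the limit.
backup_is_stale() {
    age_seconds=$(( $2 - $1 ))
    [ "$age_seconds" -gt $(( $3 * 3600 )) ]
}

# check_backup_age LAST_BACKUP_EPOCH  (uses the current time and the 26-hour limit)
check_backup_age() {
    if backup_is_stale "$1" "$(date +%s)" 26; then
        echo "ALERT: last backup is older than 26 hours"
        return 1
    fi
    echo "OK: last backup is within 26 hours"
}
```

With daily backups, a 26-hour limit leaves roughly two hours of slack before a late run raises an alert.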
Security
- Enable TLS: TLS 1.3 for all connections
- Require authentication: No anonymous access
- Implement RBAC: Role-based permissions
- Enable audit logging: Compliance-ready audit trail
- Rotate credentials: Regular password and key rotation
Performance
- Monitor resource usage: CPU, memory, disk, network
- Set up alerts: Proactive notification of issues
- Capacity planning: Regular growth projections
- Index optimization: Regular index analysis
- Query tuning: Identify and optimize slow queries
Related Documentation
- Deployment Patterns - Deployment architectures
- Configuration Reference - Server configuration
- Security Overview - Security architecture
- Observability - Monitoring and telemetry
- Troubleshooting - Common issues and solutions
Getting Help
For operational issues:
- Check Troubleshooting Guide
- Review Error Codes
- Check logs: journalctl -u geode -f
- Review metrics: /metrics endpoint
- Contact support with diagnostics
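Diagnostics for a support request can be gathered in one step. The following is a minimal sketch that reuses the log, metrics, and status commands from the Quick Reference; the archive name, the one-hour log window, and the function names are assumptions, and it relies on `mktemp -d` being available.

```shell
#!/bin/sh
# Sketch: gather logs, metrics, and status into one archive for a support request.
# The archive name and the one-hour log window are illustrative assumptions.

# bundle_name HOST TIMESTAMP -> archive base name
bundle_name() {
    echo "geode-diagnostics-$1-$2"
}

collect_diagnostics() {
    name=$(bundle_name "$(hostname)" "$(date +%Y%m%dT%H%M%S)")
    workdir=$(mktemp -d) || return 1
    mkdir "$workdir/$name"
    # Each step tolerates failure so a partly broken node still yields a bundle.
    journalctl -u geode --since "1 hour ago" --no-pager > "$workdir/$name/logs.txt" 2>&1 || true
    curl -fsS http://localhost:8080/metrics > "$workdir/$name/metrics.txt" 2>&1 || true
    geode admin status > "$workdir/$name/status.txt" 2>&1 || true
    # Pack the directory into the current working directory and clean up.
    tar -czf "$name.tar.gz" -C "$workdir" "$name"
    rm -rf "$workdir"
    echo "$name.tar.gz"
}

# Example:
#   collect_diagnostics
```

Review the archive before sending it, since logs and metrics can contain sensitive details.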