# Disaster Recovery

This guide covers disaster recovery (DR) planning and procedures for Geode, including RTO/RPO objectives, failover strategies, and business continuity planning.

## Overview

Disaster recovery ensures business continuity when failures occur:

| Scenario | Impact | Recovery Strategy |
|----------|--------|-------------------|
| Server crash | Single node unavailable | Automatic restart, replica failover |
| Data center outage | Full DC unavailable | Cross-DC failover |
| Data corruption | Data integrity compromised | Point-in-time recovery |
| Ransomware | Data encrypted/lost | Offline backup restore |
| Region failure | Cloud region unavailable | Multi-region failover |

## Recovery Objectives

### RTO (Recovery Time Objective)

Maximum acceptable downtime:

| Tier | RTO | Use Case |
|------|-----|----------|
| Tier 1 | < 1 minute | Real-time, financial |
| Tier 2 | < 15 minutes | Production critical |
| Tier 3 | < 4 hours | Standard production |
| Tier 4 | < 24 hours | Non-critical |
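
The tiers above can be applied mechanically when reviewing an incident. A small helper (hypothetical, not part of the Geode CLI) that maps a measured outage duration to the strictest tier it still satisfies:

```shell
# Map an outage duration (seconds) to the strictest RTO tier it satisfies.
# Thresholds mirror the table above: 1 min, 15 min, 4 h, 24 h.
rto_tier() {
  local seconds=$1
  if   [ "$seconds" -lt 60 ];    then echo "Tier 1"
  elif [ "$seconds" -lt 900 ];   then echo "Tier 2"
  elif [ "$seconds" -lt 14400 ]; then echo "Tier 3"
  elif [ "$seconds" -lt 86400 ]; then echo "Tier 4"
  else                                echo "Out of policy"
  fi
}

rto_tier 45     # → Tier 1
rto_tier 3600   # → Tier 3
```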

### RPO (Recovery Point Objective)

Maximum acceptable data loss:

| Tier | RPO | Method |
|------|-----|--------|
| Zero | 0 | Synchronous replication |
| Near-zero | < 1 minute | Async replication + WAL |
| Standard | < 15 minutes | Incremental backups |
| Extended | < 24 hours | Daily backups |

### Geode Capabilities

| Feature | RTO | RPO |
|---------|-----|-----|
| Automatic restart | < 30s | 0 |
| Replica failover | < 1 min | < 1s |
| PITR (Point-in-Time) | < 5 min | < 5 min |
| Backup restore | < 30 min | < 24h |

## DR Architecture Patterns

### Single-Region HA

High availability within a single region:

```
                    ┌─────────────────┐
                    │  Load Balancer  │
                    │    (Active)     │
                    └────────┬────────┘
        ┌────────────────────┼────────────────────┐
        │                    │                    │
   ┌────▼────┐          ┌────▼────┐          ┌────▼────┐
   │ Geode 1 │◄────────►│ Geode 2 │◄────────►│ Geode 3 │
   │(Primary)│  Sync    │(Replica)│  Sync    │(Replica)│
   └────┬────┘  Repl    └────┬────┘  Repl    └────┬────┘
        │                    │                    │
   ┌────▼────┐          ┌────▼────┐          ┌────▼────┐
   │ Zone A  │          │ Zone B  │          │ Zone C  │
   └─────────┘          └─────────┘          └─────────┘
```

Characteristics:

- **RTO**: < 1 minute
- **RPO**: < 1 second (sync replication)
- **Protects against**: Server failure, zone failure

### Multi-Region Active-Passive

Cross-region disaster recovery:

```
┌─────────────────────────────────────┐
│           Primary Region            │
│                                     │
│   ┌─────────┐        ┌─────────┐    │
│   │ Geode 1 │───────►│ Geode 2 │    │
│   │(Primary)│  Sync  │(Replica)│    │
│   └────┬────┘        └─────────┘    │
│        │                            │
└────────┼────────────────────────────┘
         │
         │ Async Replication
         ▼
┌─────────────────────────────────────┐
│              DR Region              │
│                                     │
│   ┌─────────┐        ┌─────────┐    │
│   │ Geode 1 │        │ Geode 2 │    │
│   │(Standby)│        │(Standby)│    │
│   └─────────┘        └─────────┘    │
│                                     │
└─────────────────────────────────────┘
```

Characteristics:

- **RTO**: 15-60 minutes (manual failover)
- **RPO**: < 5 minutes (async replication)
- **Protects against**: Region failure, DC failure

### Multi-Region Active-Active

Global deployment with bidirectional replication:

```
┌─────────────────────────┐                 ┌─────────────────────────┐
│      Region US-East     │                 │     Region EU-West      │
│                         │                 │                         │
│   ┌─────────┐           │                 │           ┌─────────┐   │
│   │ Geode   │◄──────────┼─────────────────┼──────────►│ Geode   │   │
│   │ Cluster │           │  Bidirectional  │           │ Cluster │   │
│   └────┬────┘           │   Replication   │           └────┬────┘   │
│        │                │                 │                │        │
│   ┌────▼────┐           │                 │           ┌────▼────┐   │
│   │  Users  │           │                 │           │  Users  │   │
│   │ US/LATAM│           │                 │           │  EMEA   │   │
│   └─────────┘           │                 │           └─────────┘   │
└─────────────────────────┘                 └─────────────────────────┘
```

Characteristics:

- **RTO**: 0 (automatic)
- **RPO**: Depends on conflict resolution
- **Protects against**: Regional failures
- **Note**: Requires a conflict resolution strategy
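
Active-active replication needs a deterministic way to resolve concurrent writes to the same record. Geode's actual strategy is not specified here; as one common illustration, a last-writer-wins (LWW) resolver keeps the version with the newer timestamp. The record format (`epoch region value`) is made up for this sketch:

```shell
# Last-writer-wins: given two versions of one record ("epoch region value"),
# keep the one with the larger timestamp. Ties go to the first argument.
lww_resolve() {
  local a_ts a_rest b_ts b_rest
  read -r a_ts a_rest <<< "$1"
  read -r b_ts b_rest <<< "$2"
  if [ "$a_ts" -ge "$b_ts" ]; then
    echo "$a_ts $a_rest"
  else
    echo "$b_ts $b_rest"
  fi
}

lww_resolve "1706400000 us-east v1" "1706400005 eu-west v2"   # → 1706400005 eu-west v2
```

Note that LWW silently discards the losing write and is only as good as the clocks involved; clock skew between regions directly widens the window in which updates can be lost.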

## Configuration

### Replication Setup

```yaml
# geode.yaml - Primary
replication:
  mode: primary
  sync_replicas:
    - host: geode-replica-1.example.com
      port: 3141
    - host: geode-replica-2.example.com
      port: 3141

  async_replicas:
    - host: geode-dr.us-west.example.com
      port: 3141
      lag_threshold: 5m    # Alert if lag > 5 minutes

  settings:
    sync_commit: true      # Wait for sync replicas
    max_lag_bytes: 100MB   # Max replication lag
```

```yaml
# geode.yaml - DR Site
replication:
  mode: standby
  upstream:
    host: geode-primary.us-east.example.com
    port: 3141
    restore_command: 'geode wal-restore %f %p'

  recovery:
    target_timeline: latest
    recovery_target_action: pause  # Pause on recovery
```
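
The `lag_threshold: 5m` value is a human-readable duration, while commands that report lag typically return plain seconds. A small converter for comparing the two (this helper is illustrative, not part of Geode, and assumes the `s`/`m`/`h` suffix convention used above):

```shell
# Convert a duration string (e.g. "30s", "5m", "2h", or a bare number
# of seconds) to an integer number of seconds.
to_seconds() {
  local v=$1
  case "$v" in
    *h) echo $(( ${v%h} * 3600 )) ;;
    *m) echo $(( ${v%m} * 60 )) ;;
    *s) echo "${v%s}" ;;
    *)  echo "$v" ;;
  esac
}

to_seconds 5m    # → 300
to_seconds 30s   # → 30
```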

### Backup Configuration for DR

```yaml
# geode.yaml
backup:
  # Local backup (primary site)
  local:
    enabled: true
    path: /backups/local
    retention_days: 7

  # S3 backup (same region)
  s3_primary:
    enabled: true
    bucket: geode-backups-us-east
    region: us-east-1
    retention_days: 30

  # S3 backup (DR region)
  s3_dr:
    enabled: true
    bucket: geode-backups-us-west
    region: us-west-2
    retention_days: 90
    storage_class: STANDARD_IA

  # WAL archiving
  wal_archive:
    enabled: true
    destination: s3://geode-wal-archive-us-east
    interval: 1m
    dr_copy: s3://geode-wal-archive-us-west
```

## Failover Procedures

### Automatic Failover (Single Region)

For replica failover within a region:

```yaml
# geode.yaml
high_availability:
  enabled: true
  auto_failover: true
  failover_timeout: 30s
  min_replicas: 2

  health_check:
    interval: 5s
    timeout: 10s
    unhealthy_threshold: 3
```

The system automatically:

  1. Detects primary failure (missed health checks)
  2. Elects new primary from replicas
  3. Updates routing configuration
  4. Notifies connected clients
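
The detection step above can be sketched as a counter over consecutive failed probes, tripping once `unhealthy_threshold` is reached. Here `check_primary` is a stub standing in for a real health probe (for example an admin-port ping):

```shell
# Trip failover after N consecutive failed health checks, mirroring the
# health_check settings above (unhealthy_threshold: 3).
UNHEALTHY_THRESHOLD=3

check_primary() { return 1; }   # stub: simulate a primary that is down

detect_failure() {
  local failures=0 probes=$1 i
  for (( i = 0; i < probes; i++ )); do
    if check_primary; then
      failures=0                 # a healthy probe resets the counter
    else
      failures=$(( failures + 1 ))
    fi
    if [ "$failures" -ge "$UNHEALTHY_THRESHOLD" ]; then
      return 0                   # failure confirmed: start election
    fi
  done
  return 1
}

detect_failure 5 && echo "initiate failover"   # → initiate failover
```

Resetting the counter on any success is what keeps a single dropped packet from triggering an unnecessary election.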

### Manual Failover (Cross-Region)

For planned DR failover:

```bash
#!/bin/bash
# failover-to-dr.sh

set -euo pipefail

PRIMARY_REGION="us-east"
DR_REGION="us-west"

echo "=== Starting DR Failover ==="
echo "From: $PRIMARY_REGION"
echo "To: $DR_REGION"

# 1. Verify DR site is ready
echo "Checking DR site status..."
DR_STATUS=$(geode admin status --host geode-dr.us-west.example.com)
echo "$DR_STATUS"

# 2. Check replication lag
LAG=$(geode admin replication-lag --host geode-dr.us-west.example.com)
echo "Replication lag: $LAG"

if [ "$LAG" -gt 300 ]; then  # > 5 minutes
    echo "WARNING: High replication lag. Potential data loss."
    read -p "Continue? (yes/no): " CONFIRM
    [ "$CONFIRM" != "yes" ] && exit 1
fi

# 3. Stop writes to primary (if accessible)
echo "Stopping writes to primary..."
geode admin read-only --host geode-primary.us-east.example.com 2>/dev/null || true

# 4. Wait for replication to catch up
echo "Waiting for replication to synchronize..."
sleep 30

# 5. Promote DR to primary
echo "Promoting DR site to primary..."
geode admin promote --host geode-dr.us-west.example.com

# 6. Verify promotion
echo "Verifying promotion..."
geode admin status --host geode-dr.us-west.example.com

# 7. Update DNS/load balancer
echo "Updating DNS..."
# aws route53 change-resource-record-sets ...

# 8. Notify monitoring
echo "Sending notification..."
curl -X POST https://hooks.slack.com/... \
  -d '{"text": "DR Failover completed. Active region: '"$DR_REGION"'"}'

echo "=== Failover Complete ==="
```
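
Step 4 above uses a fixed `sleep 30`, which either wastes time or may not be long enough. A tighter variant polls lag until it reaches zero or a deadline expires. `current_lag` is a stub here, standing in for the real lag query (`geode admin replication-lag --host ...`):

```shell
# Poll replication lag until it hits zero or the timeout (seconds) expires.
current_lag() { echo 0; }   # stub: replace with the real lag query

wait_for_catchup() {
  local timeout=${1:-120}
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(current_lag)" -gt 0 ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1                # timed out: lag never reached zero
    fi
    sleep 2
  done
  return 0
}

wait_for_catchup 60 && echo "replica caught up"   # → replica caught up
```

Aborting the failover when the deadline passes (rather than promoting anyway) keeps the data-loss decision explicit instead of accidental.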

### Emergency Failover

For unplanned primary failure:

```bash
#!/bin/bash
# emergency-failover.sh

set -euo pipefail

DR_HOST="geode-dr.us-west.example.com"

echo "=== EMERGENCY FAILOVER ==="
echo "WARNING: Primary is unavailable. Data loss may occur."

# 1. Check DR status
echo "Checking DR site..."
geode admin status --host "$DR_HOST"

# 2. Get last known state
LAST_WAL=$(geode admin last-wal --host "$DR_HOST")
echo "Last WAL received: $LAST_WAL"

# 3. Force promote (no wait for sync)
echo "Force promoting DR site..."
geode admin promote --force --host "$DR_HOST"

# 4. Verify
geode admin status --host "$DR_HOST"

# 5. Update routing
echo "Updating DNS/routing..."
# Implement DNS update

# 6. Alert
echo "Sending critical alert..."
# Implement alerting

echo "=== Emergency Failover Complete ==="
echo "IMPORTANT: Document data loss window and investigate primary failure"
```
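
After a forced promotion, the final step is quantifying the data-loss window. Given the timestamp of the last WAL record the DR site applied and the estimated failure time (both as epoch seconds; in practice the former would be derived from the `geode admin last-wal` output), the window is a simple difference:

```shell
# Estimate the data-loss window in seconds after an emergency failover.
loss_window() {
  local last_wal_epoch=$1 failure_epoch=$2
  echo $(( failure_epoch - last_wal_epoch ))
}

loss_window 1706435400 1706435700   # → 300 (up to 5 minutes of writes lost)
```

Record this number in the incident ticket: it is the actual RPO achieved, and comparing it against the target RPO is what tells you whether the replication setup is adequate.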

## Recovery Procedures

### Point-in-Time Recovery

Recover to a specific timestamp:

```bash
#!/bin/bash
# pitr-recovery.sh
# Usage: ./pitr-recovery.sh "YYYY-MM-DD HH:MM:SS"

set -euo pipefail

BACKUP_SOURCE="s3://geode-backups/production"
RECOVERY_TARGET="${1:?Usage: $0 \"YYYY-MM-DD HH:MM:SS\"}"
DATA_DIR="/var/lib/geode/data"

echo "=== Point-in-Time Recovery ==="
echo "Target: $RECOVERY_TARGET"

# 1. Stop server
sudo systemctl stop geode

# 2. Preserve current state for forensics
sudo mv "$DATA_DIR" "${DATA_DIR}.before-recovery-$(date +%Y%m%d-%H%M%S)"

# 3. Find the most recent full backup taken before the target time
BASE_BACKUP=$(geode backup --list --dest "$BACKUP_SOURCE" \
  --before "$RECOVERY_TARGET" \
  --type full \
  --format json | jq -r '.backups[0].id')

echo "Base backup: $BASE_BACKUP"

# 4. Restore the base backup, then replay WAL up to the target time
geode restore \
  --source "$BACKUP_SOURCE" \
  --backup-id "$BASE_BACKUP" \
  --target "$DATA_DIR" \
  --pitr-timestamp "$RECOVERY_TARGET"

# 5. Verify integrity
geode verify --data-dir "$DATA_DIR"

# 6. Start server
sudo systemctl start geode

# 7. Verify recovery
geode query "MATCH (n) RETURN count(n) as count"

echo "=== Recovery Complete ==="
```

### Full Backup Restore

Restore from backup after complete loss:

```bash
#!/bin/bash
# full-restore.sh

set -euo pipefail

BACKUP_SOURCE="s3://geode-backups/production"
BACKUP_ID="${1:-}"
DATA_DIR="/var/lib/geode/data"

if [ -z "$BACKUP_ID" ]; then
    echo "Usage: $0 <backup-id>"
    echo "Available backups:"
    geode backup --list --dest "$BACKUP_SOURCE"
    exit 1
fi

echo "=== Full Restore from Backup ==="
echo "Backup ID: $BACKUP_ID"

# 1. Verify backup exists and is valid
geode backup --verify --dest "$BACKUP_SOURCE" --backup-id "$BACKUP_ID"

# 2. Stop server
sudo systemctl stop geode

# 3. Clear existing data (:? guards against an unset or empty DATA_DIR)
sudo rm -rf "${DATA_DIR:?}"/*

# 4. Restore, applying all incrementals on top of the base backup
geode restore \
  --source "$BACKUP_SOURCE" \
  --backup-id "$BACKUP_ID" \
  --target "$DATA_DIR" \
  --include-incrementals

# 5. Verify
geode verify --data-dir "$DATA_DIR"

# 6. Start server
sudo systemctl start geode

# 7. Health check
sleep 10
geode admin status

echo "=== Restore Complete ==="
```

## DR Testing

### Monthly DR Test

```bash
#!/bin/bash
# dr-test-monthly.sh

TEST_DIR="/tmp/geode-dr-test-$(date +%Y%m%d)"
REPORT_FILE="/var/log/geode/dr-test-$(date +%Y%m%d).log"
BACKUP_SOURCE="s3://geode-backups/production"

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*" | tee -a "$REPORT_FILE"
}

log "=== Monthly DR Test Started ==="

# Get latest backup
LATEST_BACKUP=$(geode backup --list --dest "$BACKUP_SOURCE" \
  --type full --format json | jq -r '.backups[0].id')

log "Testing backup: $LATEST_BACKUP"

# Create test directory
mkdir -p "$TEST_DIR"

# Measure restore time (RTO test)
START_TIME=$(date +%s)

log "Starting restore..."
geode restore \
  --source "$BACKUP_SOURCE" \
  --backup-id "$LATEST_BACKUP" \
  --target "$TEST_DIR" >> "$REPORT_FILE" 2>&1

END_TIME=$(date +%s)
RTO_SECONDS=$((END_TIME - START_TIME))
RTO_MINUTES=$((RTO_SECONDS / 60))

log "Restore completed in ${RTO_SECONDS}s (${RTO_MINUTES}m)"

# Verify data integrity
log "Verifying data integrity..."
geode verify --data-dir "$TEST_DIR" >> "$REPORT_FILE" 2>&1

# Start test server
log "Starting test server..."
geode serve \
  --data-dir "$TEST_DIR" \
  --listen 127.0.0.1:3142 \
  --config-only &
SERVER_PID=$!

sleep 10

# Run validation queries
log "Running validation queries..."
NODE_COUNT=$(geode query "MATCH (n) RETURN count(n) as count" \
  --server 127.0.0.1:3142 --format json | jq -r '.rows[0].count')

log "Node count: $NODE_COUNT"

# Stop test server
kill "$SERVER_PID" 2>/dev/null

# Cleanup
rm -rf "$TEST_DIR"

# Generate report
log "=== DR Test Summary ==="
log "Backup ID: $LATEST_BACKUP"
log "RTO: ${RTO_MINUTES} minutes (target: 5 minutes)"
log "RTO Status: $([ "$RTO_MINUTES" -le 5 ] && echo 'PASS' || echo 'FAIL')"
log "Data Integrity: VERIFIED"
log "Node Count: $NODE_COUNT"
log "Test Status: SUCCESS"

# Send report
mail -s "Geode DR Test Report - $(date +%Y-%m-%d)" [email protected] < "$REPORT_FILE"
```
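
The monthly test measures RTO; RPO can be spot-checked in the same run by comparing the latest backup's age against the target. The helper below takes epoch seconds for the backup time and "now" (in the real script these would come from the backup listing and `date +%s`):

```shell
# Return success if the newest backup is recent enough to meet the RPO.
rpo_ok() {
  local backup_epoch=$1 now_epoch=$2 rpo_seconds=$3
  [ $(( now_epoch - backup_epoch )) -le "$rpo_seconds" ]
}

# A 600-second-old backup against a 15-minute (900 s) RPO target:
rpo_ok 1706435000 1706435600 900 && echo "RPO met"   # → RPO met
```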

### Quarterly Full DR Drill

```bash
#!/bin/bash
# dr-drill-quarterly.sh

# This script performs a full DR drill including:
# 1. Simulated primary failure
# 2. DR site promotion
# 3. Application failover
# 4. Data validation
# 5. Failback

DRILL_ID="drill-$(date +%Y%m%d-%H%M%S)"
LOG_FILE="/var/log/geode/dr-drill-$DRILL_ID.log"

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

log "=== Quarterly DR Drill: $DRILL_ID ==="
log "This drill will:"
log "1. Put primary in read-only mode"
log "2. Promote DR site"
log "3. Run validation tests"
log "4. Fail back to primary"

read -p "Proceed with DR drill? (yes/no): " CONFIRM
[ "$CONFIRM" != "yes" ] && exit 1

# Phase 1: Simulate failure
log "Phase 1: Simulating primary failure..."
# ... implementation

# Phase 2: Promote DR
log "Phase 2: Promoting DR site..."
# ... implementation

# Phase 3: Validate
log "Phase 3: Running validation..."
# ... implementation

# Phase 4: Failback
log "Phase 4: Failing back to primary..."
# ... implementation

log "=== DR Drill Complete ==="
```

## Runbooks

### Runbook: Primary Server Failure

````markdown
# Runbook: Primary Server Failure

## Symptoms
- Primary server unreachable
- Health checks failing
- Client connection errors

## Immediate Actions

1. **Verify failure**
   ```bash
   ping geode-primary.example.com
   geode admin status --host geode-primary.example.com
   ```

2. **Check replica status**
   ```bash
   geode admin status --host geode-replica-1.example.com
   geode admin status --host geode-replica-2.example.com
   ```

3. **Confirm automatic failover**
   - If auto-failover is enabled, a new primary is elected within 30s
   - Verify with: `geode admin cluster-status`

4. **If auto-failover fails, promote manually**
   ```bash
   geode admin promote --host geode-replica-1.example.com
   ```

5. **Update monitoring**
   - Acknowledge the alert
   - Create an incident ticket

## Recovery

1. Investigate the root cause
2. Repair or replace the failed server
3. Rejoin it to the cluster as a replica
4. Conduct a post-incident review
````

### Runbook: Data Corruption

````markdown
# Runbook: Data Corruption Detected

## Symptoms
- Query errors: "checksum mismatch"
- Unexpected query results
- Verification failures

## Immediate Actions

1. **Stop writes**
   ```bash
   geode admin read-only
   ```

2. **Identify corruption scope**
   ```bash
   geode verify --data-dir /var/lib/geode/data --verbose
   ```

3. **Check backup status**
   ```bash
   geode backup --list --dest s3://geode-backups
   ```

4. **Determine the recovery point**
   - Last known good backup
   - Or PITR to just before the corruption

5. **Perform recovery**
   ```bash
   ./pitr-recovery.sh "2026-01-28 09:00:00"
   ```

## Post-Recovery

1. Validate data integrity
2. Resume writes
3. Investigate the root cause
4. Review and enhance monitoring
````

## Best Practices

### DR Planning

1. **Define RTO/RPO**: Match business requirements
2. **Document procedures**: Detailed runbooks
3. **Automate where possible**: Reduce human error
4. **Regular testing**: Monthly tests, quarterly drills
5. **Update procedures**: After every change

### Replication

1. **Use sync replication for zero RPO**: Within region
2. **Use async for cross-region**: Accept lag tradeoff
3. **Monitor replication lag**: Alert on threshold
4. **Test failover regularly**: Validate automation
5. **Consider network latency**: For cross-region

### Backup Strategy

1. **3-2-1 rule**: 3 copies, 2 media, 1 offsite
2. **Automate backups**: No manual intervention
3. **Verify backups**: Regular integrity checks
4. **Test restores**: Monthly at minimum
5. **Encrypt backups**: At rest and in transit
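
The 3-2-1 rule can be checked mechanically against a backup configuration. The sketch below takes the primary region plus a list of `name:region` destinations (a made-up format mirroring the earlier backup config) and verifies there are at least three copies with at least one offsite; it does not check the "2 media" part, which needs knowledge of the underlying storage:

```shell
# Verify part of the 3-2-1 rule: >= 3 backup destinations, >= 1 outside
# the primary region. Destinations are "name:region" strings.
check_321() {
  local primary_region=$1; shift
  local count=0 offsite=0 dest
  for dest in "$@"; do
    count=$(( count + 1 ))
    if [ "${dest#*:}" != "$primary_region" ]; then
      offsite=1
    fi
  done
  [ "$count" -ge 3 ] && [ "$offsite" -eq 1 ]
}

check_321 us-east-1 local:us-east-1 s3_primary:us-east-1 s3_dr:us-west-2 \
  && echo "3-2-1 satisfied"   # → 3-2-1 satisfied
```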

### Documentation

1. **Maintain runbooks**: Step-by-step procedures
2. **Include contact info**: Escalation paths
3. **Version control**: Track changes
4. **Regular review**: Update quarterly
5. **Accessible offline**: DR docs available during outage

## Related Documentation

- **[Backup Procedures](/docs/operations/backup/)** - Backup configuration and procedures
- **[Monitoring](/docs/operations/monitoring/)** - DR-related monitoring
- **[Multi-Datacenter Guide](/docs/guides/multi-datacenter/)** - Multi-DC deployment
- **[High Availability](/docs/architecture/distributed-architecture/)** - HA architecture