High Availability in Geode

High availability (HA) ensures that Geode remains operational despite hardware failures, network issues, or planned maintenance. For mission-critical applications, downtime translates directly to lost revenue, damaged reputation, and user frustration. Geode provides comprehensive HA capabilities including automatic failover, data replication, and self-healing clusters.

This guide covers HA architecture, configuration, monitoring, and best practices for achieving enterprise-grade availability with Geode deployments.

Understanding High Availability

Availability Metrics

Uptime Percentage: The percentage of time the system is operational

Availability         Downtime/Year    Downtime/Month   Downtime/Week
99% (two 9s)         3.65 days        7.3 hours        1.68 hours
99.9% (three 9s)     8.76 hours       43.8 minutes     10.1 minutes
99.99% (four 9s)     52.6 minutes     4.38 minutes     1.01 minutes
99.999% (five 9s)    5.26 minutes     26.3 seconds     6.05 seconds

Recovery Objectives:

  • RTO (Recovery Time Objective): Maximum acceptable downtime
  • RPO (Recovery Point Objective): Maximum acceptable data loss

Geode’s HA features target 99.99%+ availability with near-zero RPO for synchronous replication.
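The downtime figures in the table above follow directly from the availability percentage. A quick calculation (this helper is illustrative, not part of Geode's tooling):

```python
# Convert an availability percentage into a downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_seconds(availability_percent: float,
                     period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Maximum downtime allowed per period at the given availability."""
    return period_seconds * (1 - availability_percent / 100)

# 99.99% over a year allows roughly 52.6 minutes of downtime
print(round(downtime_seconds(99.99) / 60, 1))
```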

HA Components

A highly available Geode deployment requires:

  1. Redundant Nodes: Multiple instances to survive failures
  2. Data Replication: Copies of data across nodes
  3. Automatic Failover: Seamless transition when nodes fail
  4. Health Monitoring: Detection of failures and degradation
  5. Load Balancing: Distribution of traffic across healthy nodes

HA Architecture Patterns

Active-Passive (Primary-Standby)

One primary node handles all traffic; standby nodes remain synchronized for failover:

                    ┌─────────────┐
    Clients ───────>│   Primary   │
                    │   (Active)  │
                    └──────┬──────┘
                           │ Replication
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │ Standby1 │ │ Standby2 │ │ Standby3 │
        │(Passive) │ │(Passive) │ │(Passive) │
        └──────────┘ └──────────┘ └──────────┘

Configuration:

# geode.toml - Primary node
[cluster]
mode = "replicated"
role = "primary"

[replication]
mode = "sync"
factor = 3
standby_nodes = [
  "standby1.geode.internal:7687",
  "standby2.geode.internal:7687",
  "standby3.geode.internal:7687"
]

[failover]
enabled = true
promotion_strategy = "automatic"
min_sync_replicas = 1

Advantages: Simple design, strong consistency
Disadvantages: Standby resources sit idle until failover

Active-Active (Multi-Primary)

Multiple nodes handle traffic simultaneously with synchronization:

                    ┌──────────────────────────────┐
                    │        Load Balancer         │
                    └──────────────┬───────────────┘
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
        ┌──────────┐         ┌──────────┐         ┌──────────┐
        │  Node 1  │◄───────►│  Node 2  │◄───────►│  Node 3  │
        │ (Active) │  Sync   │ (Active) │  Sync   │ (Active) │
        └──────────┘         └──────────┘         └──────────┘

Configuration:

# geode.toml - Active node
[cluster]
mode = "distributed"
role = "data"

[cluster.nodes]
seeds = [
  "node1.geode.internal:7687",
  "node2.geode.internal:7687",
  "node3.geode.internal:7687"
]

[replication]
mode = "sync"
factor = 3
read_preference = "nearest"
write_concern = "majority"

Advantages: Better resource utilization, horizontal scaling
Disadvantages: More operational complexity, potential for write conflicts

Consensus-Based (Leader-Follower)

Geode uses Raft consensus for leader election and strong consistency:

                    ┌──────────────────────────────┐
                    │        Load Balancer         │
                    └──────────────┬───────────────┘
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
        ┌──────────┐         ┌──────────┐         ┌──────────┐
        │  Node 1  │         │  Node 2  │         │  Node 3  │
        │ (Leader) │────────►│(Follower)│         │(Follower)│
        │  Writes  │         │  Reads   │         │  Reads   │
        └──────────┘         └──────────┘         └──────────┘
              │                    ▲                    ▲
              └────────────────────┴────────────────────┘
                          Log Replication

Configuration:

[cluster]
mode = "distributed"
consensus = "raft"

[cluster.raft]
election_timeout_ms = 1500
heartbeat_interval_ms = 150
snapshot_threshold = 10000

[cluster.nodes]
# Odd number for majority consensus
count = 3  # or 5 for higher fault tolerance
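The "odd number of nodes" guidance follows from Raft's majority requirement: an n-node cluster needs a quorum of floor(n/2) + 1 votes, so it tolerates floor((n - 1) / 2) failures. A quick illustration of the arithmetic:

```python
# Raft majority math: why 3-, 5-, and 7-node clusters are the sweet spots.
def quorum(n: int) -> int:
    """Votes needed for a majority of n nodes."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Node failures the cluster survives while keeping a quorum."""
    return (n - 1) // 2

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failures")
```

Note that a 4th node raises the quorum to 3 without improving fault tolerance over 3 nodes, which is why even-sized clusters are discouraged.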

Configuring High Availability

Minimum HA Cluster (3 Nodes)

A three-node cluster tolerates one node failure:

# geode.toml - Node 1
[server]
node_id = "node1"
listen = "0.0.0.0:3141"

[cluster]
mode = "distributed"
name = "production"

[cluster.nodes]
seeds = [
  "node1.geode.internal:7687",
  "node2.geode.internal:7687",
  "node3.geode.internal:7687"
]

[replication]
factor = 3
mode = "sync"
ack_timeout_ms = 5000

[failover]
enabled = true
detection_interval_ms = 1000
failure_threshold = 3
promotion_delay_ms = 2000

Enhanced HA Cluster (5 Nodes)

A five-node cluster tolerates two simultaneous node failures:

# geode.toml - 5-node cluster
[cluster]
mode = "distributed"
name = "production-ha"

[cluster.nodes]
seeds = [
  "node1.geode.internal:7687",
  "node2.geode.internal:7687",
  "node3.geode.internal:7687",
  "node4.geode.internal:7687",
  "node5.geode.internal:7687"
]

# With 5 nodes, can lose 2 and maintain majority (3)
[replication]
factor = 3
mode = "sync"

[cluster.placement]
# Spread across availability zones
strategy = "zone-aware"
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
min_zones_for_write = 2

Geographic Distribution

Deploy across data centers for disaster resilience:

# Multi-region configuration
[cluster]
mode = "geo-distributed"

[cluster.regions]
primary = "us-east"
secondary = ["us-west", "eu-west"]

[cluster.region.us-east]
nodes = ["node1-east", "node2-east", "node3-east"]
priority = 1

[cluster.region.us-west]
nodes = ["node1-west", "node2-west", "node3-west"]
priority = 2
replication_mode = "async"
max_lag_ms = 1000

[cluster.region.eu-west]
nodes = ["node1-eu", "node2-eu", "node3-eu"]
priority = 3
replication_mode = "async"
max_lag_ms = 5000
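With asynchronous replication, the worst-case data loss on a region failover is bounded by the replication lag at the moment of failure. A back-of-the-envelope RPO estimate (the write rate is an assumed workload figure, not a Geode metric):

```python
# Estimate writes at risk if a region fails at maximum replication lag.
def worst_case_lost_writes(max_lag_ms: int, write_rate_per_sec: float) -> float:
    return (max_lag_ms / 1000) * write_rate_per_sec

# us-west at 1000 ms lag, assuming 500 writes/sec: up to ~500 writes at risk
print(worst_case_lost_writes(1000, 500))
# eu-west at 5000 ms lag: up to ~2500 writes at risk
print(worst_case_lost_writes(5000, 500))
```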

Automatic Failover

Failure Detection

Geode detects failures through multiple mechanisms:

Heartbeat Monitoring:

[health.heartbeat]
interval_ms = 100
timeout_ms = 500
failure_count = 3  # 3 missed = failure
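These settings bound how quickly a dead node is noticed. A rough worst-case estimate, assuming detection fires after failure_count missed heartbeats plus one timeout (the exact accounting inside Geode may differ):

```python
# Rough worst-case failure-detection latency for the heartbeat settings above.
def detection_latency_ms(interval_ms: int, timeout_ms: int,
                         failure_count: int) -> int:
    # failure_count heartbeats must be missed, each interval_ms apart,
    # and the last one is only declared missing after timeout_ms.
    return failure_count * interval_ms + timeout_ms

# interval=100ms, timeout=500ms, 3 misses: ~800 ms to declare a failure
print(detection_latency_ms(100, 500, 3))
```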

TCP Connection Health:

[health.connection]
keepalive_interval_ms = 10000
keepalive_probes = 3
keepalive_timeout_ms = 5000

Application-Level Health Checks:

[health.checks]
enabled = true
interval_ms = 5000

[health.checks.storage]
type = "write_test"
timeout_ms = 1000

[health.checks.memory]
type = "threshold"
max_used_percent = 90

[health.checks.disk]
type = "threshold"
min_free_percent = 10

Failover Process

When a node failure is detected:

  1. Detection: Health check fails or heartbeat timeout
  2. Verification: Confirm failure from multiple observers
  3. Leader Election: Raft elects new leader if needed
  4. Promotion: Replicas promoted to primary for affected shards
  5. Client Redirect: Clients automatically reconnect
  6. Recovery: System rebalances when node returns

You can review recent failover events from the system catalog:

-- Monitor failover events
SELECT
    timestamp,
    event_type,
    source_node,
    target_node,
    duration_ms,
    data_loss_bytes
FROM system.failover_log
ORDER BY timestamp DESC
LIMIT 20;

Failover Configuration

[failover]
enabled = true

# Detection settings
detection_method = "consensus"  # heartbeat, consensus, or both
min_observers = 2

# Timing
detection_timeout_ms = 3000
promotion_delay_ms = 1000
client_redirect_timeout_ms = 5000

# Behavior
auto_promote = true
prefer_sync_replica = true
block_writes_during_failover = false

# Recovery
auto_rejoin = true
rejoin_as = "replica"  # replica or standby
catch_up_mode = "streaming"

Client Failover Handling

Python Client:

import asyncio

from geode_client import Client, FailoverConfig, FailoverInProgressError

# Configure client for HA
client = Client(
    hosts=[
        "node1.geode.internal:3141",
        "node2.geode.internal:3141",
        "node3.geode.internal:3141"
    ],
    failover=FailoverConfig(
        enabled=True,
        retry_attempts=3,
        retry_delay_ms=100,
        circuit_breaker_threshold=5
    )
)

async def resilient_query(max_waits: int = 5):
    # Bounded retry loop: unbounded recursion would never terminate
    # if a failover stalls.
    for _ in range(max_waits):
        async with client.connection() as conn:
            try:
                # Transient errors are retried automatically by the client
                result, _ = await conn.query(
                    "MATCH (u:User {id: $id}) RETURN u",
                    {"id": "user-123"}
                )
                return result.rows
            except FailoverInProgressError:
                # Wait for failover to complete, then retry
                await asyncio.sleep(1)
    raise TimeoutError("failover did not complete in time")

Go Client:

import (
    "database/sql"
    "log"
    "time"

    _ "geodedb.com/geode" // register the driver
)

func main() {
    // Connection string with multiple hosts
    dsn := "quic://node1:3141,node2:3141,node3:3141?failover=true"

    db, err := sql.Open("geode", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Configure connection pool for HA
    db.SetMaxOpenConns(50)
    db.SetMaxIdleConns(10)
    db.SetConnMaxLifetime(5 * time.Minute)

    // Queries automatically retry on failover
    rows, err := db.Query("MATCH (u:User) RETURN u.name")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()
}

Load Balancing

Internal Load Balancing

Geode’s query coordinators distribute load across data nodes:

[load_balancing]
enabled = true
algorithm = "least_connections"  # round_robin, least_connections, weighted

[load_balancing.weights]
# Higher weight = more traffic
node1 = 100
node2 = 100
node3 = 50  # Smaller instance

[load_balancing.health]
# Remove unhealthy nodes from rotation
check_interval_ms = 5000
unhealthy_threshold = 3
healthy_threshold = 2
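The weighted least-connections algorithm can be sketched as picking the healthy node with the lowest connections-to-weight ratio. This is a simplification for illustration; the field names below are not Geode internals:

```python
# Simplified weighted least-connections selection.
def pick_node(nodes):
    """nodes: list of dicts with 'name', 'weight', 'connections', 'healthy'."""
    candidates = [n for n in nodes if n["healthy"] and n["weight"] > 0]
    # Lower connections-per-weight means more spare capacity.
    return min(candidates, key=lambda n: n["connections"] / n["weight"])["name"]

nodes = [
    {"name": "node1", "weight": 100, "connections": 40, "healthy": True},
    {"name": "node2", "weight": 100, "connections": 30, "healthy": True},
    {"name": "node3", "weight": 50,  "connections": 20, "healthy": True},
]
print(pick_node(nodes))  # node2 (ratio 0.30 vs 0.40 and 0.40)
```

Note how node3's lower weight makes its 20 connections count as heavily as node1's 40.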

External Load Balancer Configuration

HAProxy Example:

frontend geode_frontend
    bind *:3141
    mode tcp
    default_backend geode_backend

backend geode_backend
    mode tcp
    balance leastconn
    option tcp-check

    server node1 node1.geode.internal:3141 check inter 1s fall 3 rise 2
    server node2 node2.geode.internal:3141 check inter 1s fall 3 rise 2
    server node3 node3.geode.internal:3141 check inter 1s fall 3 rise 2

Kubernetes Service:

apiVersion: v1
kind: Service
metadata:
  name: geode-lb
spec:
  type: LoadBalancer
  ports:
    - port: 3141
      targetPort: 3141
      protocol: TCP
  selector:
    app: geode
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

Read/Write Splitting

Route reads to replicas, writes to primary:

[routing]
write_to = "primary"
read_from = "nearest"  # primary, replica, or nearest

[routing.read_preference]
# Prefer local replica, fall back to primary
strategy = "nearest"
max_staleness_ms = 100

Client-Side Routing:

from geode_client import Client, ReadPreference

client = Client(hosts=["node1:3141", "node2:3141", "node3:3141"])

async def read_user(user_id):
    async with client.connection(read_preference=ReadPreference.NEAREST) as conn:
        result, _ = await conn.query(
            "MATCH (u:User {id: $id}) RETURN u",
            {"id": user_id}
        )
        return result.rows

async def update_user(user_id, name):
    async with client.connection(read_preference=ReadPreference.PRIMARY) as conn:
        await conn.execute(
            "MATCH (u:User {id: $id}) SET u.name = $name",
            {"id": user_id, "name": name}
        )

Monitoring High Availability

Key HA Metrics

# Prometheus metrics for HA monitoring
curl http://node1:3141/metrics | grep -E "geode_cluster|geode_replication|geode_failover"

# Example output
geode_cluster_nodes_total{status="healthy"} 3
geode_cluster_nodes_total{status="unhealthy"} 0
geode_cluster_leader_node{node="node1"} 1
geode_replication_lag_seconds{shard="1",replica="node2"} 0.005
geode_replication_lag_seconds{shard="1",replica="node3"} 0.008
geode_failover_events_total{type="automatic"} 2
geode_failover_duration_seconds_sum 3.45

Health Check Endpoints

# Liveness probe - is the process running?
curl http://node1:3141/health/live
# Response: {"status": "ok", "uptime_seconds": 86400}

# Readiness probe - can it serve traffic?
curl http://node1:3141/health/ready
# Response: {"status": "ready", "role": "leader", "replicas_synced": 2}

# Cluster health - overall cluster status
curl http://node1:3141/health/cluster
# Response: {
#   "status": "healthy",
#   "nodes": {"total": 3, "healthy": 3, "unhealthy": 0},
#   "replication": {"in_sync": true, "max_lag_ms": 12}
# }

Alerting Rules

# Prometheus alerting rules for HA
groups:
  - name: geode_ha_alerts
    rules:
      - alert: GeodeNodeDown
        expr: geode_cluster_nodes_total{status="unhealthy"} > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Geode cluster has unhealthy nodes"
          description: "{{ $value }} nodes are unhealthy"

      - alert: GeodeReplicationLagHigh
        expr: geode_replication_lag_seconds > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag detected"
          description: "Replication lag is {{ $value }}s"

      - alert: GeodeNoQuorum
        expr: geode_cluster_nodes_total{status="healthy"} < 2
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Geode cluster lost quorum"
          description: "Only {{ $value }} healthy nodes remain"

      - alert: GeodeFailoverFrequent
        expr: rate(geode_failover_events_total[1h]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent failovers detected"
          description: "{{ $value }} failovers in the last hour"

      - alert: GeodeLeaderElectionStuck
        expr: geode_cluster_leader_election_in_progress == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Leader election taking too long"

Grafana Dashboard

{
  "dashboard": {
    "title": "Geode High Availability",
    "panels": [
      {
        "title": "Cluster Health",
        "type": "stat",
        "targets": [{
          "expr": "geode_cluster_nodes_total{status='healthy'}"
        }]
      },
      {
        "title": "Replication Lag",
        "type": "graph",
        "targets": [{
          "expr": "geode_replication_lag_seconds",
          "legendFormat": "{{shard}} -> {{replica}}"
        }]
      },
      {
        "title": "Failover Events",
        "type": "graph",
        "targets": [{
          "expr": "rate(geode_failover_events_total[5m])",
          "legendFormat": "Failovers/min"
        }]
      },
      {
        "title": "Node Roles",
        "type": "table",
        "targets": [{
          "expr": "geode_cluster_node_role"
        }]
      }
    ]
  }
}

Testing High Availability

Chaos Engineering

Verify HA behavior by intentionally causing failures:

# Kill a node and verify automatic failover
docker stop geode-node2

# Verify cluster continues operating
curl http://node1:3141/health/cluster

# Verify queries still work
./geode shell --host node1:3141 -c "MATCH (n) RETURN count(n)"

# Restart node and verify rejoin
docker start geode-node2

# Verify node rejoined and synced
curl http://node1:3141/health/cluster

Automated HA Test Script:

import asyncio
import subprocess
from geode_client import Client

async def test_failover():
    client = Client(hosts=["node1:3141", "node2:3141", "node3:3141"])

    # Verify initial state
    async with client.connection() as conn:
        result, _ = await conn.query("MATCH (n) RETURN count(n) as cnt")
        initial_count = result.rows[0]['cnt']
        print(f"Initial graph node count: {initial_count}")

    # Kill a node
    print("Stopping node2...")
    subprocess.run(["docker", "stop", "geode-node2"])

    # Wait for failover
    await asyncio.sleep(5)

    # Verify cluster still works
    async with client.connection() as conn:
        result, _ = await conn.query("MATCH (n) RETURN count(n) as cnt")
        assert result.rows[0]['cnt'] == initial_count
        print("Cluster operational after failover")

    # Restart node
    print("Restarting node2...")
    subprocess.run(["docker", "start", "geode-node2"])

    # Wait for rejoin
    await asyncio.sleep(10)

    # Verify full recovery
    async with client.connection() as conn:
        result, _ = await conn.query(
            "SELECT * FROM system.cluster_nodes WHERE status = 'healthy'"
        )
        assert len(result.rows) == 3
        print("Full cluster recovered")

asyncio.run(test_failover())

Disaster Recovery Drills

Regularly test full recovery procedures:

  1. Simulate complete cluster failure
  2. Restore from backup
  3. Verify data integrity
  4. Measure RTO and RPO
  5. Document and improve procedures
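Step 4 can be automated with simple timestamps. A skeleton for timing a drill; trigger_failure() and cluster_is_ready() are placeholders for your own tooling, not Geode APIs:

```python
import time

def measure_rto(trigger_failure, cluster_is_ready,
                poll_s: float = 1.0, limit_s: float = 600):
    """Trigger a failure, then poll until the cluster serves traffic again."""
    trigger_failure()
    start = time.monotonic()
    while time.monotonic() - start < limit_s:
        if cluster_is_ready():
            return time.monotonic() - start  # measured RTO in seconds
        time.sleep(poll_s)
    raise TimeoutError("cluster did not recover within the limit")
```

Compare the measured value against your stated RTO after every drill, and track the trend over time.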

Best Practices

Deployment

  1. Use odd number of nodes: 3, 5, or 7 for clean majority
  2. Spread across failure domains: Different racks, AZs, or regions
  3. Size for N+1 capacity: After one node fails, each survivor must absorb total load / (N-1)
  4. Use dedicated networks: Separate client and replication traffic

Configuration

  1. Enable synchronous replication: For zero RPO
  2. Configure appropriate timeouts: Balance detection speed vs false positives
  3. Set conservative health thresholds: Avoid unnecessary failovers
  4. Test failover regularly: Verify HA actually works

Operations

  1. Monitor replication lag: Alert before it becomes critical
  2. Perform rolling upgrades: One node at a time
  3. Maintain runbooks: Document recovery procedures
  4. Practice disaster recovery: Regular drills

Client Applications

  1. Configure connection pools: Multiple connections for resilience
  2. Implement retry logic: Handle transient failures
  3. Use circuit breakers: Prevent cascade failures
  4. Handle failover gracefully: Inform users of temporary issues

Further Reading

  • High Availability Architecture Guide
  • Failover Testing Procedures
  • Disaster Recovery Planning
  • SLA Management Guide
  • Chaos Engineering Handbook
  • Production Operations Checklist
