<!-- CANARY: REQ=REQ-DOCS-001; FEATURE="Docs"; ASPECT=Documentation; STATUS=TESTED; OWNER=docs; UPDATED=2026-01-15 --> <p>Alert management is the practice of defining, configuring, and responding to automated notifications when system conditions deviate from expected behavior. Effective alerting enables rapid incident response, reduces downtime, and ensures service level objectives (SLOs) are met while minimizing alert fatigue and false positives.</p> <p>Geode integrates with Prometheus Alertmanager for flexible, powerful alerting on metrics, supporting multi-channel notifications, alert grouping, silencing, and escalation policies. Combined with comprehensive metrics coverage, Geode enables proactive monitoring and rapid problem detection.</p> <p>This guide covers alert design principles, critical alert rules, notification configuration, alert fatigue prevention, and incident response workflows.</p> <h3 id="alerting-philosophy" class="position-relative d-flex align-items-center group"> <span>Alerting Philosophy</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alerting-philosophy" aria-haspopup="dialog" aria-label="Share link: Alerting Philosophy"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3><div id="headingShareModal" class="heading-share-modal" role="dialog" aria-modal="true" aria-labelledby="headingShareTitle" hidden> <div class="hsm-dialog" role="document"> <div class="hsm-header"> <h2 id="headingShareTitle" class="h6 mb-0 fw-bold">Share this section</h2> <button type="button" class="hsm-close" aria-label="Close"> <i class="fa-solid fa-xmark"></i> </button> </div> <div class="hsm-body"> <label for="headingShareInput" class="form-label small text-muted mb-1 text-uppercase fw-bold" style="font-size: 0.7rem; letter-spacing: 0.5px;">Permalink</label> <div class="input-group mb-4 hsm-url-group"> <input id="headingShareInput" type="text" class="form-control font-monospace" readonly aria-readonly="true" style="font-size: 0.85rem;" /> <button class="btn btn-primary hsm-copy" type="button" aria-label="Copy" title="Copy"> <i class="fa-duotone fa-clipboard" aria-hidden="true"></i> </button> </div> <div class="small fw-bold mb-2 text-muted text-uppercase" style="font-size: 0.7rem; letter-spacing: 0.5px;">Share via</div> <div class="hsm-share-grid"> <a id="share-twitter" class="btn btn-outline-secondary w-100" target="_blank" rel="noopener noreferrer"> <i class="fa-brands fa-twitter me-2"></i>Twitter </a> <a id="share-linkedin" class="btn btn-outline-secondary w-100" target="_blank" rel="noopener noreferrer"> <i class="fa-brands fa-linkedin me-2"></i>LinkedIn </a> <a id="share-facebook" class="btn btn-outline-secondary w-100" target="_blank" rel="noopener noreferrer"> <i class="fa-brands fa-facebook me-2"></i>Facebook </a> </div> </div> </div> </div> <style> .heading-share-modal { position: fixed; inset: 0; display: flex; justify-content: center; align-items: center; background: rgba(0, 0, 0, 0.6); z-index: 1050; padding: 1rem; backdrop-filter: blur(4px); -webkit-backdrop-filter: blur(4px); } .heading-share-modal[hidden] { display: none !important; } .hsm-dialog { max-width: 420px; width: 100%; background: var(--bs-body-bg, #fff); color: var(--bs-body-color, #212529); border: 1px solid var(--bs-border-color, rgba(0,0,0,0.1)); border-radius: 1rem; box-shadow: 0 25px 50px -12px rgba(0, 0, 0, 0.25); overflow: hidden; animation: hsm-fade-in 0.2s ease-out; } @keyframes hsm-fade-in { from { opacity: 0; transform: scale(0.95); } to { opacity: 1; transform: scale(1); } } [data-bs-theme="dark"] .hsm-dialog { background: #1e293b; border-color: rgba(255,255,255,0.1); color: #f8f9fa; } .hsm-header { display: flex; justify-content: space-between; align-items: center; padding: 1rem 1.5rem; border-bottom: 1px solid var(--bs-border-color, rgba(0,0,0,0.1)); background: rgba(0,0,0,0.02); } [data-bs-theme="dark"] .hsm-header { background: rgba(255,255,255,0.02); border-color: rgba(255,255,255,0.1); } .hsm-close { background: transparent; border: none; color: inherit; opacity: 0.5; padding: 0.25rem 0.5rem; border-radius: 0.25rem; font-size: 1.2rem; line-height: 1; transition: opacity 0.2s; } .hsm-close:hover { opacity: 1; } .hsm-body { padding: 1.5rem; } .hsm-url-group { display: flex !important; align-items: stretch; } .hsm-url-group .form-control { flex: 1; min-width: 0; margin: 0; background: var(--bs-secondary-bg, #f8f9fa); border-color: var(--bs-border-color, #dee2e6); border-top-right-radius: 0; border-bottom-right-radius: 0; height: 42px; } .hsm-url-group .btn { flex: 0 0 auto; margin: 0; margin-left: -1px; border-top-left-radius: 0; border-bottom-left-radius: 0; height: 42px; display: flex; align-items: center; justify-content: center; padding: 0 1.25rem; z-index: 2; } [data-bs-theme="dark"] .hsm-url-group .form-control { background: #0f172a; border-color: #334155; color: #e2e8f0; } .hsm-share-grid { display: flex; flex-direction: column; gap: 0.5rem; } .hsm-share-grid .btn { display: flex; align-items: center; justify-content: center; font-size: 0.9rem; padding: 0.6rem; border-color: var(--bs-border-color); width: 100%; } [data-bs-theme="dark"] .hsm-share-grid .btn { color: #e2e8f0; border-color: #475569; } [data-bs-theme="dark"] .hsm-share-grid .btn:hover { background: #334155; border-color: #cbd5e1; } </style> <script> (function(){ const modal = document.getElementById('headingShareModal'); if(!modal) return; const input = modal.querySelector('#headingShareInput'); const copyBtn = modal.querySelector('.hsm-copy'); const twitter = modal.querySelector('#share-twitter'); const linkedin = modal.querySelector('#share-linkedin'); const facebook = modal.querySelector('#share-facebook'); const closeBtn = modal.querySelector('.hsm-close'); let lastFocus=null; let trapBound=false; function buildUrl(id){ return window.location.origin + window.location.pathname + '#' + id; } function isOpen(){ return !modal.hasAttribute('hidden'); } function hydrate(id){ const url=buildUrl(id); input.value=url; const enc=encodeURIComponent(url); const text=encodeURIComponent(document.title); if(twitter) twitter.href=`https://twitter.com/intent/tweet?url=${enc}&text=${text}`; if(linkedin) linkedin.href=`https://www.linkedin.com/sharing/share-offsite/?url=${enc}`; if(facebook) facebook.href=`https://www.facebook.com/sharer/sharer.php?u=${enc}`; } function openModal(id){ lastFocus=document.activeElement; hydrate(id); if(!isOpen()){ modal.removeAttribute('hidden'); } requestAnimationFrame(()=>{ input.focus(); }); trapFocus(); } function closeModal(){ if(!isOpen()) return; modal.setAttribute('hidden',''); if(lastFocus && typeof lastFocus.focus==='function') lastFocus.focus(); } function copyCurrent(){ try{ navigator.clipboard.writeText(input.value).then(()=>feedback(true),()=>fallback()); } catch(e){ fallback(); } } function fallback(){ input.select(); try{ document.execCommand('copy'); feedback(true);}catch(e){ feedback(false);} } function feedback(ok){ if(!copyBtn) return; const icon=copyBtn.querySelector('i'); if(!icon) return; const prev=copyBtn.getAttribute('data-prev')||icon.className; if(!copyBtn.getAttribute('data-prev')) copyBtn.setAttribute('data-prev',prev); icon.className= ok ? 'fa-duotone fa-clipboard-check':'fa-duotone fa-circle-exclamation'; setTimeout(()=>{ icon.className=prev; },1800); } function handleShareClick(e){ e.preventDefault(); const btn=e.currentTarget; const id=btn.getAttribute('data-share-target'); if(id) openModal(id); } function bindShareButtons(){ document.querySelectorAll('.h-share').forEach(btn=>{ if(!btn.dataset.hShareBound){ btn.addEventListener('click', handleShareClick); btn.dataset.hShareBound='1'; } }); } bindShareButtons(); if(document.readyState==='loading'){ document.addEventListener('DOMContentLoaded', bindShareButtons); } else { requestAnimationFrame(bindShareButtons); } document.addEventListener('click', function(e){ const shareBtn=e.target.closest && e.target.closest('.h-share'); if(shareBtn && !shareBtn.dataset.hShareBound){ handleShareClick.call(shareBtn, e); } }, true); document.addEventListener('click', e=>{ if(e.target===modal) closeModal(); if(e.target.closest && e.target.closest('.hsm-close')){ e.preventDefault(); closeModal(); } if(copyBtn && (e.target===copyBtn || (e.target.closest && e.target.closest('.hsm-copy')))) { e.preventDefault(); copyCurrent(); } }); document.addEventListener('keydown', e=>{ if(e.key==='Escape' && isOpen()) closeModal(); }); function trapFocus(){ if(trapBound) return; trapBound=true; modal.addEventListener('keydown', f=>{ if(f.key==='Tab' && isOpen()){ const focusable=[...modal.querySelectorAll('a[href],button,input,textarea,select,[tabindex]:not([tabindex="-1"])')].filter(el=>!el.hasAttribute('disabled')); if(!focusable.length) return; const first=focusable[0]; const last=focusable[focusable.length-1]; if(f.shiftKey && document.activeElement===first){ f.preventDefault(); last.focus(); } else if(!f.shiftKey && document.activeElement===last){ f.preventDefault(); first.focus(); } } }); } if(closeBtn) closeBtn.addEventListener('click', e=>{ e.preventDefault(); closeModal(); }); })(); </script> <h4 id="alert-on-symptoms-not-causes" class="position-relative d-flex align-items-center group"> <span>Alert on Symptoms, Not Causes</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alert-on-symptoms-not-causes" aria-haspopup="dialog" aria-label="Share link: Alert on Symptoms, Not Causes"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Focus alerts on user-impacting symptoms rather than internal metrics:</p> <p><strong>Good (Symptom-Based)</strong>:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># High query error rate (users experiencing failures)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighQueryErrorRate</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">rate(geode_queries_total{status=&#34;error&#34;}[5m]) &gt; 10</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># Slow query latency (users experiencing delays)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighQueryLatency</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[5m])) &gt; 1.0</span><span class="w"> </span></span></span></code></pre></div><p><strong>Bad (Cause-Based)</strong>:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># High CPU (may not impact users)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighCPU</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">cpu_usage &gt; 0.8</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># High memory (may be normal)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighMemory</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">memory_usage &gt; 0.7</span><span class="w"> </span></span></span></code></pre></div> <h4 id="actionable-alerts-only" class="position-relative d-flex align-items-center group"> <span>Actionable Alerts Only</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="actionable-alerts-only" aria-haspopup="dialog" aria-label="Share link: Actionable Alerts Only"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Every alert should require human action. If no action is needed, it&rsquo;s not an alert—it&rsquo;s a metric to track in dashboards.</p> <p><strong>Actionable</strong>: &ldquo;Database is down&rdquo; → Immediate response required <strong>Not Actionable</strong>: &ldquo;CPU at 60%&rdquo; → Normal operation, monitor in dashboard</p> <h3 id="alert-severity-levels" class="position-relative d-flex align-items-center group"> <span>Alert Severity Levels</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alert-severity-levels" aria-haspopup="dialog" aria-label="Share link: Alert Severity Levels"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3> <h4 id="critical-alerts" class="position-relative d-flex align-items-center group"> <span>Critical Alerts</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="critical-alerts" aria-haspopup="dialog" aria-label="Share link: Critical Alerts"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>System down or severe degradation requiring immediate response:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">groups</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">geode_critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l">30s</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">rules</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">GeodeDown</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">up{job=&#34;geode&#34;} == 0</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">1m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">team</span><span class="p">:</span><span class="w"> </span><span class="l">database</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Geode instance {{ $labels.instance }} is down&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Geode has been unreachable for more than 1 minute&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/geode-down&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighQueryErrorRate</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> rate(geode_queries_total{status=&#34;error&#34;}[5m]) &gt; 10</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;High query error rate on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Error rate is {{ $value }} errors/sec (threshold: 10/sec)&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/high-error-rate&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">DiskSpaceCritical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> geode_disk_free_bytes / geode_disk_total_bytes &lt; 0.05</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Critical disk space on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Only {{ $value | humanizePercentage }} disk space remaining&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/disk-space&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">MemoryExhaustion</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> geode_memory_used_bytes / geode_memory_total_bytes &gt; 0.95</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">10m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Memory near exhaustion on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Memory usage at {{ $value | humanizePercentage }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/memory-pressure&#34;</span><span class="w"> </span></span></span></code></pre></div> <h4 id="warning-alerts" class="position-relative d-flex align-items-center group"> <span>Warning Alerts</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="warning-alerts" aria-haspopup="dialog" aria-label="Share link: Warning Alerts"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Degraded performance or conditions that may lead to critical issues:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">geode_warnings</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l">1m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">rules</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">SlowQueryRateIncreasing</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> rate(geode_slow_queries_total[5m]) &gt; 5</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">10m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Slow query rate increasing on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;{{ $value }} slow queries per second (threshold: 5/sec)&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/slow-queries&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighTransactionConflicts</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> rate(geode_transaction_conflicts_total[5m]) &gt; 100</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">10m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;High transaction conflict rate&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;{{ $value }} conflicts per second (threshold: 100/sec)&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/transaction-conflicts&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">ConnectionPoolPressure</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> geode_active_connections / geode_max_connections &gt; 0.8</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">15m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Connection pool pressure on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;{{ $value | humanizePercentage }} of connection pool used&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/connection-pool&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">CacheHitRateLow</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> rate(geode_cache_hits_total[5m]) / </span></span></span><span class="line"><span class="cl"><span class="sd"> (rate(geode_cache_hits_total[5m]) + rate(geode_cache_misses_total[5m])) &lt; 0.7</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">30m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Low cache hit rate on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Cache hit rate is {{ $value | humanizePercentage }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/cache-tuning&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">LongRunningQueries</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> geode_active_queries{duration_seconds=&#34;&gt;60&#34;} &gt; 5</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Multiple long-running queries detected&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;{{ $value }} queries running for &gt;60 seconds&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/long-queries&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">WalSizeGrowing</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> deriv(geode_wal_size_bytes[30m]) &gt; 1e9</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">30m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;WAL size growing rapidly on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;WAL growing at {{ $value | humanize1024 }}/sec&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/wal-growth&#34;</span><span class="w"> </span></span></span></code></pre></div> <h4 id="info-alerts" class="position-relative d-flex align-items-center group"> <span>Info Alerts</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="info-alerts" aria-haspopup="dialog" aria-label="Share link: Info Alerts"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Informational notifications that don&rsquo;t require immediate action:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">geode_info</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">rules</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">GeodeInstanceRestarted</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> time() - geode_start_time_seconds &lt; 300</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">info</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Geode instance {{ $labels.instance }} recently restarted&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Instance started {{ $value | humanizeDuration }} ago&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">IndexBuildCompleted</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> increase(geode_index_builds_completed_total[5m]) &gt; 0</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">info</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Index build completed on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;{{ $value }} index(es) built in the last 5 minutes&#34;</span><span class="w"> </span></span></span></code></pre></div> <h3 id="alert-thresholds-and-tuning" class="position-relative d-flex align-items-center group"> <span>Alert Thresholds and Tuning</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alert-thresholds-and-tuning" aria-haspopup="dialog" aria-label="Share link: Alert Thresholds and Tuning"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3> <h4 id="establishing-baselines" class="position-relative d-flex align-items-center group"> <span>Establishing Baselines</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="establishing-baselines" aria-haspopup="dialog" aria-label="Share link: Establishing Baselines"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Measure normal behavior before setting thresholds:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Calculate p95 query latency over last 7 days</span> </span></span><span class="line"><span class="cl">promtool query instant <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> <span class="s1">&#39;histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[7d]))&#39;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># Average query rate during business hours</span> </span></span><span class="line"><span class="cl">promtool query instant <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> <span class="s1">&#39;avg_over_time(rate(geode_queries_total[7d])[1h:5m])&#39;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># Peak connection count</span> </span></span><span class="line"><span class="cl">promtool query instant <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> <span class="s1">&#39;max_over_time(geode_active_connections[7d])&#39;</span> </span></span></code></pre></div><p>Set thresholds based on observed values:</p> <ul> <li><strong>Warning</strong>: 1.5x baseline</li> <li><strong>Critical</strong>: 2x baseline or SLO violation</li> </ul> <h4 id="dynamic-thresholds" class="position-relative d-flex align-items-center group"> <span>Dynamic Thresholds</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="dynamic-thresholds" aria-haspopup="dialog" aria-label="Share link: Dynamic Thresholds"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Use statistical methods for dynamic thresholds:</p> <p><strong>Detect anomalies using standard deviation</strong>:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">QueryRateAnomaly</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> abs(rate(geode_queries_total[5m]) - </span></span></span><span class="line"><span class="cl"><span class="sd"> avg_over_time(rate(geode_queries_total[5m])[1h:5m])) &gt; </span></span></span><span class="line"><span class="cl"><span class="sd"> 2 * stddev_over_time(rate(geode_queries_total[5m])[1h:5m])</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">10m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Anomalous query rate detected&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Current rate deviates &gt;2 standard deviations from 1-hour average&#34;</span><span class="w"> </span></span></span></code></pre></div><p><strong>Predict resource exhaustion</strong>:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">DiskWillFillIn4Hours</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> predict_linear(geode_disk_free_bytes[1h], 4 * 3600) &lt; 0</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Disk will be full in approximately 4 hours&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Based on current growth rate&#34;</span><span class="w"> </span></span></span></code></pre></div> <h3 id="alertmanager-configuration" class="position-relative d-flex align-items-center group"> <span>Alertmanager Configuration</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alertmanager-configuration" aria-haspopup="dialog" aria-label="Share link: Alertmanager Configuration"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3> <h4 id="basic-setup" class="position-relative d-flex align-items-center group"> <span>Basic Setup</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="basic-setup" aria-haspopup="dialog" aria-label="Share link: Basic Setup"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># alertmanager.yml</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">global</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">resolve_timeout</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">smtp_smarthost</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;smtp.example.com:587&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">smtp_from</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;[email protected]&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">smtp_auth_username</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;[email protected]&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">smtp_auth_password</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;${SMTP_PASSWORD}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># Alert routing tree</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">route</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;default&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_by</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">&#39;alertname&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;cluster&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;instance&#39;</span><span class="p">]</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_wait</span><span class="p">:</span><span class="w"> </span><span class="l">10s</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_interval</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">repeat_interval</span><span class="p">:</span><span class="w"> </span><span class="l">4h</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">routes</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Critical alerts to on-call</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="l">oncall</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_wait</span><span class="p">:</span><span class="w"> </span><span class="l">0s</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">repeat_interval</span><span class="p">:</span><span class="w"> </span><span class="l">1h</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Warnings to team channel</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">warning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="l">team-slack</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">repeat_interval</span><span class="p">:</span><span class="w"> </span><span class="l">4h</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Info alerts to low-priority channel</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">info</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="l">info-channel</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">repeat_interval</span><span class="p">:</span><span class="w"> </span><span class="l">24h</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">receivers</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;default&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">email_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">to</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;[email protected]&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;oncall&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">pagerduty_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">service_key</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;${PAGERDUTY_SERVICE_KEY}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;{{ .GroupLabels.alertname }} on {{ .GroupLabels.instance }}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;team-slack&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">slack_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">api_url</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;${SLACK_WEBHOOK_URL}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">channel</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;#database-alerts&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">title</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;{{ .GroupLabels.alertname }}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">text</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;{{ range .Alerts }}{{ .Annotations.description }}{{ end }}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;info-channel&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">slack_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">api_url</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;${SLACK_WEBHOOK_URL}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">channel</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;#database-info&#39;</span><span class="w"> </span></span></span></code></pre></div> <h4 id="multi-channel-notifications" class="position-relative d-flex align-items-center group"> <span>Multi-Channel Notifications</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="multi-channel-notifications" aria-haspopup="dialog" aria-label="Share link: Multi-Channel Notifications"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Route alerts to appropriate channels:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">routes</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Database team for database-specific issues</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">team</span><span class="p">:</span><span class="w"> </span><span class="l">database</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="l">database-team</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">routes</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Critical to PagerDuty + Slack</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="l">database-oncall</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">continue</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># All to Slack</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="l">database-slack</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Infrastructure team for resource issues</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">match_re</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">alertname</span><span class="p">:</span><span class="w"> </span><span class="l">(DiskSpaceCritical|MemoryExhaustion)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">receiver</span><span class="p">:</span><span class="w"> </span><span class="l">infrastructure-team</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">receivers</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">database-oncall</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">pagerduty_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">service_key</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;${DB_PAGERDUTY_KEY}&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">webhook_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;https://hooks.slack.com/services/...&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">database-slack</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">slack_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">channel</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;#database-alerts&#39;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">infrastructure-team</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">pagerduty_configs</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">service_key</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;${INFRA_PAGERDUTY_KEY}&#39;</span><span class="w"> </span></span></span></code></pre></div> <h4 id="alert-grouping" class="position-relative d-flex align-items-center group"> <span>Alert Grouping</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alert-grouping" aria-haspopup="dialog" aria-label="Share link: Alert Grouping"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Group related alerts to reduce noise:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">route</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Group by instance to see all issues with single node</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_by</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">&#39;instance&#39;</span><span class="p">]</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_wait</span><span class="p">:</span><span class="w"> </span><span class="l">30s</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_interval</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># Don&#39;t group critical alerts (send immediately)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">routes</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">severity</span><span class="p">:</span><span class="w"> </span><span class="l">critical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_by</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">group_wait</span><span class="p">:</span><span class="w"> </span><span class="l">0s</span><span class="w"> </span></span></span></code></pre></div> <h4 id="alert-silencing" class="position-relative d-flex align-items-center group"> <span>Alert Silencing</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alert-silencing" aria-haspopup="dialog" aria-label="Share link: Alert Silencing"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Temporarily suppress alerts during maintenance:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Silence alerts for instance during maintenance</span> </span></span><span class="line"><span class="cl">amtool silence add <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> <span class="nv">instance</span><span class="o">=</span>geode-node-1 <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> --duration<span class="o">=</span>2h <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> --author<span class="o">=</span><span class="s2">&#34;[email protected]&#34;</span> <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> --comment<span class="o">=</span><span class="s2">&#34;Scheduled maintenance&#34;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># Silence specific alert</span> </span></span><span class="line"><span class="cl">amtool silence add <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> <span class="nv">alertname</span><span class="o">=</span>SlowQueryRateIncreasing <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> --duration<span class="o">=</span>30m <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> --comment<span class="o">=</span><span class="s2">&#34;Known issue, fix in progress&#34;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># List active silences</span> </span></span><span class="line"><span class="cl">amtool silence query </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># Expire silence early</span> </span></span><span class="line"><span class="cl">amtool silence expire &lt;silence-id&gt; </span></span></code></pre></div> <h3 id="alert-fatigue-prevention" class="position-relative d-flex align-items-center group"> <span>Alert Fatigue Prevention</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alert-fatigue-prevention" aria-haspopup="dialog" aria-label="Share link: Alert Fatigue Prevention"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3> <h4 id="reduce-false-positives" class="position-relative d-flex align-items-center group"> <span>Reduce False Positives</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="reduce-false-positives" aria-haspopup="dialog" aria-label="Share link: Reduce False Positives"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p><strong>Appropriate for duration</strong>: Require sustained condition before alerting</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># Bad: Alert on momentary spike</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighCPU</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">cpu_usage &gt; 0.8</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># Good: Alert on sustained high CPU</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighCPU</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">cpu_usage &gt; 0.8</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">for</span><span class="p">:</span><span class="w"> </span><span class="l">10m </span><span class="w"> </span><span class="c"># Must persist for 10 minutes</span><span class="w"> </span></span></span></code></pre></div><p><strong>Set realistic thresholds</strong>: Base on actual SLOs, not arbitrary numbers</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># Bad: Arbitrary threshold</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">LatencyHigh</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="l">latency_seconds &gt; 0.1</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># Good: SLO-based threshold</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">LatencyViolatesSLO</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> histogram_quantile(0.99, rate(geode_query_duration_seconds_bucket[5m])) &gt; </span></span></span><span class="line"><span class="cl"><span class="sd"> 1.0 # SLO: p99 &lt; 1s</span><span class="w"> </span></span></span></code></pre></div> <h4 id="alert-on-rate-of-change" class="position-relative d-flex align-items-center group"> <span>Alert on Rate of Change</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="alert-on-rate-of-change" aria-haspopup="dialog" aria-label="Share link: Alert on Rate of Change"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Detect rapid changes that indicate problems:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">ErrorRateSpiking</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">expr</span><span class="p">:</span><span class="w"> </span><span class="p">|</span><span class="sd"> </span></span></span><span class="line"><span class="cl"><span class="sd"> deriv(rate(geode_queries_total{status=&#34;error&#34;}[5m])[5m:1m]) &gt; 10</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Error rate increasing rapidly&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Error rate increased by {{ $value }}/sec in last 5 minutes&#34;</span><span class="w"> </span></span></span></code></pre></div> <h4 id="deduplicate-related-alerts" class="position-relative d-flex align-items-center group"> <span>Deduplicate Related Alerts</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="deduplicate-related-alerts" aria-haspopup="dialog" aria-label="Share link: Deduplicate Related Alerts"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h4><p>Use alert dependencies and inhibition rules:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># alertmanager.yml</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">inhibit_rules</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># If Geode is down, inhibit all other alerts for that instance</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">source_match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">alertname</span><span class="p">:</span><span class="w"> </span><span class="l">GeodeDown</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">target_match_re</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">alertname</span><span class="p">:</span><span class="w"> </span><span class="l">(HighQueryErrorRate|HighQueryLatency|ConnectionPoolPressure)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">equal</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">&#39;instance&#39;</span><span class="p">]</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="c"># If disk is critical, inhibit disk warnings</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span>- <span class="nt">source_match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">alertname</span><span class="p">:</span><span class="w"> </span><span class="l">DiskSpaceCritical</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">target_match</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">alertname</span><span class="p">:</span><span class="w"> </span><span class="l">DiskSpaceWarning</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">equal</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">&#39;instance&#39;</span><span class="p">]</span><span class="w"> </span></span></span></code></pre></div> <h3 id="runbooks-and-documentation" class="position-relative d-flex align-items-center group"> <span>Runbooks and Documentation</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="runbooks-and-documentation" aria-haspopup="dialog" aria-label="Share link: Runbooks and Documentation"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3><p>Link alerts to detailed runbooks:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl">- <span class="nt">alert</span><span class="p">:</span><span class="w"> </span><span class="l">HighQueryErrorRate</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">annotations</span><span class="p">:</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;High query error rate on {{ $labels.instance }}&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Error rate is {{ $value }} errors/sec&#34;</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="nt">runbook</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;https://docs.geodedb.com/runbooks/high-error-rate&#34;</span><span class="w"> </span></span></span></code></pre></div><p><strong>Example Runbook Structure</strong>:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-markdown" data-lang="markdown"><span class="line"><span class="cl"><span class="gh"># Runbook: High Query Error Rate </span></span></span><span class="line"><span class="cl"><span class="gh"></span> </span></span><span class="line"><span class="cl"><span class="gu">## Severity </span></span></span><span class="line"><span class="cl"><span class="gu"></span>Critical </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="gu">## Impact </span></span></span><span class="line"><span class="cl"><span class="gu"></span>Users experiencing query failures, application functionality degraded </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="gu">## Diagnosis </span></span></span><span class="line"><span class="cl"><span class="gu"></span><span class="k">1.</span> Check error breakdown: </span></span></code></pre></div><p>sum by (error_type) (rate(geode_query_errors_total[5m]))</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl">2. View recent error logs: </span></span></code></pre></div><p>jq &lsquo;select(.level == &ldquo;ERROR&rdquo; and .logger == &ldquo;geode.query&rdquo;)&rsquo; <br> /var/log/geode/geode.log | tail -50</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-zed" data-lang="zed"><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">3</span><span class="p">.</span><span class="w"> </span><span class="n">Check</span><span class="w"> </span><span class="n">for</span><span class="w"> </span><span class="n">common</span><span class="w"> </span><span class="n">error</span><span class="w"> </span><span class="n">patterns</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">##</span><span class="w"> </span><span class="n">Resolution</span><span class="w"> </span><span class="n">Steps</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">1</span><span class="p">.</span><span class="w"> </span><span class="n">If</span><span class="w"> </span><span class="n">syntax</span><span class="w"> </span><span class="n">errors</span><span class="o">:</span><span class="w"> </span><span class="n">Check</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="n">code</span><span class="w"> </span><span class="n">for</span><span class="w"> </span><span class="n">malformed</span><span class="w"> </span><span class="n">queries</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">2</span><span class="p">.</span><span class="w"> </span><span class="n">If</span><span class="w"> </span><span class="n">connection</span><span class="w"> </span><span class="n">errors</span><span class="o">:</span><span class="w"> </span><span class="n">Check</span><span class="w"> </span><span class="n">network</span><span class="w"> </span><span class="n">connectivity</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="n">connection</span><span class="w"> </span><span class="n">pool</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">3</span><span class="p">.</span><span class="w"> </span><span class="n">If</span><span class="w"> </span><span class="kd">permission</span><span class="w"> </span><span class="n">errors</span><span class="o">:</span><span class="w"> </span><span class="n">Review</span><span class="w"> </span><span class="n">access</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="n">policies</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">4</span><span class="p">.</span><span class="w"> </span><span class="n">If</span><span class="w"> </span><span class="n">resource</span><span class="w"> </span><span class="n">errors</span><span class="o">:</span><span class="w"> </span><span class="n">Check</span><span class="w"> </span><span class="nn">memory/</span><span class="n">disk</span><span class="w"> </span><span class="n">availability</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="err">##</span><span class="w"> </span><span class="n">Escalation</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="n">If</span><span class="w"> </span><span class="n">unable</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">resolve</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="err">30</span><span class="w"> </span><span class="n">minutes</span><span class="p">,</span><span class="w"> </span><span class="n">escalate</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">database</span><span class="w"> </span><span class="n">team</span><span class="w"> </span><span class="n">lead</span><span class="w"> </span></span></span></code></pre></div> <h3 id="testing-alerts" class="position-relative d-flex align-items-center group"> <span>Testing Alerts</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="testing-alerts" aria-haspopup="dialog" aria-label="Share link: Testing Alerts"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3><p>Verify alert rules before deploying:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Test alert rule syntax</span> </span></span><span class="line"><span class="cl">promtool check rules alert-rules.yml </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># Query current alert state</span> </span></span><span class="line"><span class="cl">promtool query instant <span class="se">\ </span></span></span><span class="line"><span class="cl"><span class="se"></span> <span class="s1">&#39;ALERTS{alertname=&#34;HighQueryErrorRate&#34;}&#39;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># Simulate alert condition (if safe)</span> </span></span><span class="line"><span class="cl"><span class="c1"># Trigger slow queries to test alert</span> </span></span><span class="line"><span class="cl"><span class="k">for</span> i in <span class="o">{</span>1..100<span class="o">}</span><span class="p">;</span> <span class="k">do</span> </span></span><span class="line"><span class="cl"> geode query <span class="s2">&#34;MATCH (n) RETURN n&#34;</span> <span class="p">&amp;</span> </span></span><span class="line"><span class="cl"><span class="k">done</span> </span></span></code></pre></div> <h3 id="best-practices" class="position-relative d-flex align-items-center group"> <span>Best Practices</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="best-practices" aria-haspopup="dialog" aria-label="Share link: Best Practices"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3><p><strong>Alert on SLOs</strong>: Tie critical alerts to service level objectives that matter to users.</p> <p><strong>Include Context</strong>: Provide runbook links, affected instances, and resolution hints in annotations.</p> <p><strong>Test Regularly</strong>: Verify alert delivery and escalation paths work correctly.</p> <p><strong>Review and Refine</strong>: Periodically review alerts that fire frequently or never fire.</p> <p><strong>Document Everything</strong>: Maintain up-to-date runbooks for all critical alerts.</p> <p><strong>Appropriate Severity</strong>: Reserve critical alerts for situations requiring immediate human response.</p> <p><strong>Avoid Alert Storms</strong>: Use grouping and inhibition to prevent overwhelming responders.</p> <p><strong>Monitor Alertmanager</strong>: Ensure alerting system itself is reliable and monitored.</p> <h3 id="related-topics" class="position-relative d-flex align-items-center group"> <span>Related Topics</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="related-topics" aria-haspopup="dialog" aria-label="Share link: Related Topics"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3><ul> <li><a href="/tags/monitoring/" >System Monitoring</a> - Monitoring strategies</li> <li><a href="/tags/metrics/" >Performance Metrics</a> - Metrics collection</li> <li><a href="/tags/prometheus/" >Prometheus Integration</a> - Prometheus configuration</li> <li><a href="/tags/observability/" >System Observability</a> - Observability best practices</li> <li><a href="/tags/operations/" >Operations</a> - Operations guide</li> <li><a href="/tags/troubleshooting/" >Troubleshooting</a> - Debugging techniques</li> </ul> <h3 id="further-reading" class="position-relative d-flex align-items-center group"> <span>Further Reading</span> <button type="button" class="h-share btn btn-link p-0 text-decoration-none link-secondary opacity-50 hover-opacity-100 transition-all ms-1" data-share-target="further-reading" aria-haspopup="dialog" aria-label="Share link: Further Reading"> <i class="fa-sharp-duotone fa-solid fa-share-nodes" aria-hidden="true" style="font-size: 0.8em;"></i> <span class="visually-hidden">Share link</span> </button> </h3><ul> <li>Alert Design Best Practices</li> <li>Prometheus Alerting Rules Reference</li> <li>Alertmanager Configuration Guide</li> <li>Runbook Templates</li> <li>On-Call Incident Response Guide</li> </ul>

Related Articles