# Alert Management

Alert management is the practice of defining, configuring, and responding to automated notifications when system conditions deviate from expected behavior. Effective alerting enables rapid incident response, reduces downtime, and ensures service level objectives (SLOs) are met while minimizing alert fatigue and false positives.

Geode integrates with Prometheus Alertmanager for flexible, powerful alerting on metrics, supporting multi-channel notifications, alert grouping, silencing, and escalation policies. Combined with comprehensive metrics coverage, Geode enables proactive monitoring and rapid problem detection.

This guide covers alert design principles, critical alert rules, notification configuration, alert fatigue prevention, and incident response workflows.
## Alerting Philosophy

### Alert on Symptoms, Not Causes

Focus alerts on user-impacting symptoms rather than internal metrics:

**Good (symptom-based):**

```yaml
# High query error rate (users experiencing failures)
- alert: HighQueryErrorRate
  expr: rate(geode_queries_total{status="error"}[5m]) > 10

# Slow query latency (users experiencing delays)
- alert: HighQueryLatency
  expr: histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[5m])) > 1.0
```

**Bad (cause-based):**

```yaml
# High CPU (may not impact users)
- alert: HighCPU
  expr: cpu_usage > 0.8

# High memory (may be normal)
- alert: HighMemory
  expr: memory_usage > 0.7
```

### Actionable Alerts Only

Every alert should require human action. If no action is needed, it is not an alert; it is a metric to track in dashboards.

- **Actionable:** "Database is down" → immediate response required
- **Not actionable:** "CPU at 60%" → normal operation; monitor in a dashboard
## Alert Severity Levels

### Critical Alerts

System down or severe degradation requiring immediate response:

```yaml
groups:
  - name: geode_critical
    interval: 30s
    rules:
      - alert: GeodeDown
        expr: up{job="geode"} == 0
        for: 1m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "Geode instance {{ $labels.instance }} is down"
          description: "Geode has been unreachable for more than 1 minute"
          runbook: "https://docs.geodedb.com/runbooks/geode-down"

      - alert: HighQueryErrorRate
        expr: |
          rate(geode_queries_total{status="error"}[5m]) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High query error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value }} errors/sec (threshold: 10/sec)"
          runbook: "https://docs.geodedb.com/runbooks/high-error-rate"

      - alert: DiskSpaceCritical
        expr: |
          geode_disk_free_bytes / geode_disk_total_bytes < 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          runbook: "https://docs.geodedb.com/runbooks/disk-space"

      - alert: MemoryExhaustion
        expr: |
          geode_memory_used_bytes / geode_memory_total_bytes > 0.95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Memory near exhaustion on {{ $labels.instance }}"
          description: "Memory usage at {{ $value | humanizePercentage }}"
          runbook: "https://docs.geodedb.com/runbooks/memory-pressure"
```
### Warning Alerts

Degraded performance or conditions that may lead to critical issues:

```yaml
- name: geode_warnings
  interval: 1m
  rules:
    - alert: SlowQueryRateIncreasing
      expr: |
        rate(geode_slow_queries_total[5m]) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Slow query rate increasing on {{ $labels.instance }}"
        description: "{{ $value }} slow queries per second (threshold: 5/sec)"
        runbook: "https://docs.geodedb.com/runbooks/slow-queries"

    - alert: HighTransactionConflicts
      expr: |
        rate(geode_transaction_conflicts_total[5m]) > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High transaction conflict rate"
        description: "{{ $value }} conflicts per second (threshold: 100/sec)"
        runbook: "https://docs.geodedb.com/runbooks/transaction-conflicts"

    - alert: ConnectionPoolPressure
      expr: |
        geode_active_connections / geode_max_connections > 0.8
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Connection pool pressure on {{ $labels.instance }}"
        description: "{{ $value | humanizePercentage }} of connection pool used"
        runbook: "https://docs.geodedb.com/runbooks/connection-pool"

    - alert: CacheHitRateLow
      expr: |
        rate(geode_cache_hits_total[5m]) /
        (rate(geode_cache_hits_total[5m]) + rate(geode_cache_misses_total[5m])) < 0.7
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Low cache hit rate on {{ $labels.instance }}"
        description: "Cache hit rate is {{ $value | humanizePercentage }}"
        runbook: "https://docs.geodedb.com/runbooks/cache-tuning"

    - alert: LongRunningQueries
      expr: |
        geode_active_queries{duration_seconds=">60"} > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Multiple long-running queries detected"
        description: "{{ $value }} queries running for >60 seconds"
        runbook: "https://docs.geodedb.com/runbooks/long-queries"

    - alert: WalSizeGrowing
      expr: |
        deriv(geode_wal_size_bytes[30m]) > 1e9
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "WAL size growing rapidly on {{ $labels.instance }}"
        description: "WAL growing at {{ $value | humanize1024 }}/sec"
        runbook: "https://docs.geodedb.com/runbooks/wal-growth"
```
### Info Alerts

Informational notifications that don't require immediate action:

```yaml
- name: geode_info
  interval: 5m
  rules:
    - alert: GeodeInstanceRestarted
      expr: |
        time() - geode_start_time_seconds < 300
      labels:
        severity: info
      annotations:
        summary: "Geode instance {{ $labels.instance }} recently restarted"
        description: "Instance started {{ $value | humanizeDuration }} ago"

    - alert: IndexBuildCompleted
      expr: |
        increase(geode_index_builds_completed_total[5m]) > 0
      labels:
        severity: info
      annotations:
        summary: "Index build completed on {{ $labels.instance }}"
        description: "{{ $value }} index(es) built in the last 5 minutes"
```
## Alert Thresholds and Tuning

### Establishing Baselines

Measure normal behavior before setting thresholds (the examples assume Prometheus is reachable at `http://localhost:9090`):

```bash
# Calculate p95 query latency over the last 7 days
promtool query instant http://localhost:9090 \
  'histogram_quantile(0.95, rate(geode_query_duration_seconds_bucket[7d]))'

# Average query rate over the last 7 days (1h resolution)
promtool query instant http://localhost:9090 \
  'avg_over_time(rate(geode_queries_total[5m])[7d:1h])'

# Peak connection count
promtool query instant http://localhost:9090 \
  'max_over_time(geode_active_connections[7d])'
```

Set thresholds based on observed values:

- Warning: 1.5x baseline
- Critical: 2x baseline or SLO violation
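This rule of thumb is easy to automate when generating rule files. A minimal sketch (the helper name and the SLO-capping behavior are illustrative, not part of Geode or Prometheus):

```python
from typing import Optional


def thresholds_from_baseline(baseline: float, slo_limit: Optional[float] = None) -> dict:
    """Derive warning/critical thresholds from an observed baseline.

    warning  = 1.5x baseline
    critical = 2x baseline, tightened to the SLO limit if that is lower.
    """
    warning = 1.5 * baseline
    critical = 2.0 * baseline
    if slo_limit is not None:
        critical = min(critical, slo_limit)
    return {"warning": warning, "critical": critical}


# Example: observed p95 latency baseline of 0.4s with a 1s SLO
print(thresholds_from_baseline(0.4, slo_limit=1.0))
```

The SLO cap matters: if 2x baseline would already violate the SLO, the critical threshold should fire before the SLO does, not after.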
### Dynamic Thresholds

Use statistical methods for dynamic thresholds.

Detect anomalies using standard deviation:

```yaml
- alert: QueryRateAnomaly
  expr: |
    abs(rate(geode_queries_total[5m]) -
        avg_over_time(rate(geode_queries_total[5m])[1h:5m])) >
    2 * stddev_over_time(rate(geode_queries_total[5m])[1h:5m])
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Anomalous query rate detected"
    description: "Current rate deviates >2 standard deviations from 1-hour average"
```
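The z-score idea behind this rule can be prototyped offline before committing to PromQL. A small sketch, assuming you have a list of recent per-second rates exported from Prometheus:

```python
from statistics import mean, stdev


def is_anomalous(history, current, n_sigma=2.0):
    """Flag `current` if it deviates more than `n_sigma` standard
    deviations from the mean of `history` (same test as the PromQL rule)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > n_sigma * sigma


rates = [100, 102, 98, 101, 99, 100]  # queries/sec samples from the last hour
print(is_anomalous(rates, 250))  # large spike → True
print(is_anomalous(rates, 101))  # within normal variation → False
```

Replaying historical data through a check like this shows how often the threshold would have fired, which helps choose `n_sigma` before deploying the rule.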
Predict resource exhaustion:

```yaml
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(geode_disk_free_bytes[1h], 4 * 3600) < 0
  labels:
    severity: warning
  annotations:
    summary: "Disk will be full in approximately 4 hours"
    description: "Based on current growth rate"
```
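To build intuition for what `predict_linear()` computes, here is a rough Python equivalent: a least-squares line fit over the samples, extrapolated into the future (simplified compared to Prometheus internals, which anchor the fit differently):

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a series `seconds_ahead` past its last sample using a
    least-squares linear fit, roughly like PromQL's predict_linear().

    `samples` is a list of (timestamp_seconds, value) pairs."""
    n = len(samples)
    sum_t = sum(t for t, _ in samples)
    sum_v = sum(v for _, v in samples)
    sum_tt = sum(t * t for t, _ in samples)
    sum_tv = sum(t * v for t, v in samples)
    slope = (n * sum_tv - sum_t * sum_v) / (n * sum_tt - sum_t ** 2)
    intercept = (sum_v - slope * sum_t) / n
    t_last = samples[-1][0]
    return slope * (t_last + seconds_ahead) + intercept


# Free disk sampled once a minute for an hour: 3 GB free, shrinking 1 GB/hour
samples = [(i * 60, 3e9 - i * 60 * (1e9 / 3600)) for i in range(60)]

# Projected free bytes 4 hours out is negative, so the alert above would fire
print(predict_linear(samples, 4 * 3600) < 0)  # → True
```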
## Alertmanager Configuration

### Basic Setup

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: '${SMTP_PASSWORD}'

# Alert routing tree
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'instance']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts to on-call
    - match:
        severity: critical
      receiver: oncall
      group_wait: 0s
      repeat_interval: 1h

    # Warnings to team channel
    - match:
        severity: warning
      receiver: team-slack
      repeat_interval: 4h

    # Info alerts to low-priority channel
    - match:
        severity: info
      receiver: info-channel
      repeat_interval: 24h

receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'

  - name: 'oncall'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        description: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.instance }}'

  - name: 'team-slack'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#database-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'info-channel'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#database-info'
```
### Multi-Channel Notifications

Route alerts to appropriate channels:

```yaml
routes:
  # Database team for database-specific issues
  - match:
      team: database
    receiver: database-team
    routes:
      # Critical to PagerDuty + Slack
      - match:
          severity: critical
        receiver: database-oncall
        continue: true
      # All to Slack
      - receiver: database-slack

  # Infrastructure team for resource issues
  - match_re:
      alertname: (DiskSpaceCritical|MemoryExhaustion)
    receiver: infrastructure-team

receivers:
  - name: database-oncall
    pagerduty_configs:
      - service_key: '${DB_PAGERDUTY_KEY}'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/...'

  - name: database-slack
    slack_configs:
      - channel: '#database-alerts'

  - name: infrastructure-team
    pagerduty_configs:
      - service_key: '${INFRA_PAGERDUTY_KEY}'
```
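Custom `webhook_configs` receivers are sent a JSON body in Alertmanager's webhook format (top-level `version`, `status`, `alerts`, and so on). A minimal sketch of parsing that body into one line per alert, using a hand-written sample payload (the field names follow Alertmanager's documented webhook schema; the sample values are made up):

```python
import json


def summarize_webhook(payload):
    """Turn an Alertmanager webhook payload into one readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("[{status}] {name} severity={sev} instance={inst}".format(
            status=alert.get("status", "unknown"),
            name=labels.get("alertname", "?"),
            sev=labels.get("severity", "?"),
            inst=labels.get("instance", "?"),
        ))
    return lines


# Minimal example shaped like Alertmanager's webhook body (format version 4)
payload = json.loads("""
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "GeodeDown", "severity": "critical",
                "instance": "geode-node-1:9090"}}
  ]
}
""")

for line in summarize_webhook(payload):
    print(line)
```

A real receiver would sit behind an HTTP endpoint and forward these lines to whatever channel the webhook serves.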
### Alert Grouping

Group related alerts to reduce noise:

```yaml
route:
  # Group by instance to see all issues with a single node together
  group_by: ['instance']
  group_wait: 30s
  group_interval: 5m

  # Don't group critical alerts (send each immediately)
  routes:
    - match:
        severity: critical
      group_by: ['...']  # special value: disables aggregation entirely
      group_wait: 0s
```
### Alert Silencing

Temporarily suppress alerts during maintenance:

```bash
# Silence alerts for an instance during maintenance
amtool silence add \
  instance=geode-node-1 \
  --duration=2h \
  --author="[email protected]" \
  --comment="Scheduled maintenance"

# Silence a specific alert
amtool silence add \
  alertname=SlowQueryRateIncreasing \
  --duration=30m \
  --comment="Known issue, fix in progress"

# List active silences
amtool silence query

# Expire a silence early
amtool silence expire <silence-id>
```
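Silences can also be created programmatically through Alertmanager's v2 API (`POST /api/v2/silences`). A sketch that builds a request body equivalent to the first `amtool` command above (actually sending the HTTP request is left out):

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, duration_hours, author, comment):
    """Build a JSON body for Alertmanager's v2 silence API.

    `matchers` maps label names to exact values (no regex matching here)."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }


body = build_silence({"instance": "geode-node-1"}, 2,
                     "[email protected]", "Scheduled maintenance")
print(json.dumps(body, indent=2))
```

This is useful for automating silences from deployment pipelines, so maintenance windows never start without one.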
## Alert Fatigue Prevention

### Reduce False Positives

**Use an appropriate `for` duration:** require a sustained condition before alerting.

```yaml
# Bad: alert on a momentary spike
- alert: HighCPU
  expr: cpu_usage > 0.8

# Good: alert on sustained high CPU
- alert: HighCPU
  expr: cpu_usage > 0.8
  for: 10m  # Must persist for 10 minutes
```

**Set realistic thresholds:** base them on actual SLOs, not arbitrary numbers.

```yaml
# Bad: arbitrary threshold
- alert: LatencyHigh
  expr: latency_seconds > 0.1

# Good: SLO-based threshold
- alert: LatencyViolatesSLO
  expr: |
    histogram_quantile(0.99, rate(geode_query_duration_seconds_bucket[5m])) >
    1.0  # SLO: p99 < 1s
```
### Alert on Rate of Change

Detect rapid changes that indicate problems:

```yaml
- alert: ErrorRateSpiking
  expr: |
    deriv(rate(geode_queries_total{status="error"}[5m])[5m:1m]) > 10
  annotations:
    summary: "Error rate increasing rapidly"
    description: "Error rate increased by {{ $value }}/sec in last 5 minutes"
```
### Deduplicate Related Alerts

Use alert dependencies and inhibition rules:

```yaml
# alertmanager.yml
inhibit_rules:
  # If Geode is down, inhibit all other alerts for that instance
  - source_match:
      alertname: GeodeDown
    target_match_re:
      alertname: (HighQueryErrorRate|HighQueryLatency|ConnectionPoolPressure)
    equal: ['instance']

  # If disk is critical, inhibit disk warnings
  - source_match:
      alertname: DiskSpaceCritical
    target_match:
      alertname: DiskSpaceWarning
    equal: ['instance']
```
## Runbooks and Documentation

Link alerts to detailed runbooks:

```yaml
- alert: HighQueryErrorRate
  annotations:
    summary: "High query error rate on {{ $labels.instance }}"
    description: "Error rate is {{ $value }} errors/sec"
    runbook: "https://docs.geodedb.com/runbooks/high-error-rate"
```

**Example runbook structure:**

```markdown
# Runbook: High Query Error Rate

## Severity
Critical

## Impact
Users experiencing query failures, application functionality degraded

## Diagnosis
1. Check the error breakdown:
   `sum by (error_type) (rate(geode_query_errors_total[5m]))`
2. View recent error logs:
   `jq 'select(.level == "ERROR" and .logger == "geode.query")' /var/log/geode/geode.log | tail -50`
3. Check for common error patterns

## Resolution Steps
1. If syntax errors: check application code for malformed queries
2. If connection errors: check network connectivity and the connection pool
3. If permission errors: review access control policies
4. If resource errors: check memory/disk availability

## Escalation
If unable to resolve within 30 minutes, escalate to the database team lead
```
## Testing Alerts

Verify alert rules before deploying:

```bash
# Check alert rule syntax
promtool check rules alert-rules.yml

# Query current alert state
promtool query instant http://localhost:9090 \
  'ALERTS{alertname="HighQueryErrorRate"}'

# Simulate the alert condition (only if safe):
# fire off concurrent queries to exercise the slow-query alert
for i in {1..100}; do
  geode query "MATCH (n) RETURN n" &
done
wait
```
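Alert rules can also be unit-tested offline with `promtool test rules`, which replays synthetic series against a rule file without touching a live system. A sketch of a test for the `HighQueryErrorRate` rule above (file names, label sets, and series values are assumptions for illustration):

```yaml
# alert-rules.test.yml — run with: promtool test rules alert-rules.test.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 900 new errors per minute = 15 errors/sec, above the 10/sec threshold
      - series: 'geode_queries_total{status="error", instance="geode-node-1"}'
        values: '0+900x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighQueryErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              status: error
              instance: geode-node-1
```

Running these tests in CI catches threshold and label mistakes before a bad rule ever reaches Alertmanager.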
## Best Practices

- **Alert on SLOs:** tie critical alerts to service level objectives that matter to users.
- **Include context:** provide runbook links, affected instances, and resolution hints in annotations.
- **Test regularly:** verify that alert delivery and escalation paths work correctly.
- **Review and refine:** periodically review alerts that fire frequently or never fire.
- **Document everything:** maintain up-to-date runbooks for all critical alerts.
- **Use appropriate severity:** reserve critical alerts for situations requiring immediate human response.
- **Avoid alert storms:** use grouping and inhibition to prevent overwhelming responders.
- **Monitor Alertmanager:** ensure the alerting system itself is reliable and monitored.
## Related Topics
- System Monitoring - Monitoring strategies
- Performance Metrics - Metrics collection
- Prometheus Integration - Prometheus configuration
- System Observability - Observability best practices
- Operations - Operations guide
- Troubleshooting - Debugging techniques
## Further Reading
- Alert Design Best Practices
- Prometheus Alerting Rules Reference
- Alertmanager Configuration Guide
- Runbook Templates
- On-Call Incident Response Guide