Documentation for the BM25 ranking algorithm in the Geode graph database. BM25 (Best Matching 25) is a probabilistic ranking function used for text search and information retrieval, providing relevance scoring for keyword-based document searches.

Introduction to BM25

BM25 (Best Matching 25) is the gold standard ranking function for text search. Developed by Stephen Robertson and Karen Spärck Jones in the 1990s as part of the Okapi information retrieval system, BM25 has become the default ranking algorithm in search engines like Elasticsearch, Apache Solr, and Apache Lucene.

BM25 solves a fundamental question: given a search query and a collection of documents, which documents are most relevant? The algorithm computes a relevance score based on:

  • Term frequency: How often query terms appear in each document
  • Inverse document frequency: How rare or common terms are across all documents
  • Document length normalization: Adjusting scores so long documents are not favored merely for containing more terms
  • Saturation: Diminishing returns for repeated terms

Unlike simple keyword matching (which is binary: match or no match), BM25 provides nuanced relevance scores that enable ranking search results by quality. This makes it invaluable for full-text search applications.

Geode implements BM25 for property text search, enabling powerful keyword-based search that complements semantic vector search (HNSW). You can combine BM25 text search with graph traversal for queries like “find documents about databases written by friends, ranked by relevance.”

Core BM25 Concepts

The BM25 Formula

BM25 computes a relevance score for document D given query Q:

score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Where:
- qi: Each term in query Q
- f(qi, D): Frequency of qi in document D
- |D|: Length of document D (in tokens)
- avgdl: Average document length in collection
- k1: Term frequency saturation parameter (typically 1.2-2.0)
- b: Length normalization parameter (typically 0.75)
- IDF(qi): Inverse document frequency of qi
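The formula above translates directly into code. The sketch below is an illustrative implementation, not Geode's internal one; it assumes pre-tokenized documents and a precomputed document-frequency table:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one tokenized document against a list of query terms."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        f = doc_terms.count(term)                       # f(qi, D)
        if f == 0:
            continue                                    # absent terms contribute nothing
        n = doc_freqs.get(term, 0)                      # n(term): docs containing term
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))  # IDF(qi)
        norm = 1 - b + b * doc_len / avgdl              # length normalization
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score

# Toy collection: "graph" is rarer than "database", so it contributes more.
doc = "graph database systems store graph structured data".split()
freqs = {"graph": 10, "database": 400}
score = bm25_score(["graph", "database"], doc, freqs, n_docs=1000, avgdl=7)
```

Each query term contributes its IDF times a saturating term-frequency factor; terms missing from the document contribute zero.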

Term Frequency (TF)

Term frequency measures how often a query term appears in a document. BM25 uses a saturating function—the first few occurrences of a term matter much more than later ones:

TF Impact:
1 occurrence: High impact
2 occurrences: Medium impact
10 occurrences: Marginal additional impact
100 occurrences: Almost no additional impact

This saturation prevents keyword stuffing from artificially inflating relevance.
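The saturation is easy to verify numerically. With |D| = avgdl (so the length factor is 1), the TF component reduces to f·(k1 + 1)/(f + k1), which approaches a ceiling of k1 + 1 as f grows:

```python
def tf_component(f, k1=1.2):
    # TF part of BM25 when |D| = avgdl, so the length factor is 1.
    return f * (k1 + 1) / (f + k1)

# Diminishing returns: the jump from 10 to 100 occurrences adds far less
# than the jump from 1 to 2.
values = {f: round(tf_component(f), 3) for f in (1, 2, 10, 100)}
# values → {1: 1.0, 2: 1.375, 10: 1.964, 100: 2.174}; the ceiling is k1+1 = 2.2
```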

Inverse Document Frequency (IDF)

IDF measures how rare or common a term is across the entire document collection:

IDF(term) = log((N - n(term) + 0.5) / (n(term) + 0.5))

Where:
- N: Total number of documents
- n(term): Number of documents containing term

Common terms (like “the”, “and”) have low IDF and contribute little to relevance. Rare terms have high IDF and strongly indicate relevance.

Examples:

  • “the” appears in 1M of 1M docs → IDF ≈ 0 (the raw formula actually goes negative here; implementations clamp it at zero)
  • “database” appears in 10K of 1M docs → IDF ≈ 4.6
  • “geode” appears in 100 of 1M docs → IDF ≈ 9.2
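These figures follow from the formula above using the natural log. Note that for a term present in every document the raw formula goes negative, so practical implementations clamp IDF at zero (or use a shifted variant); a sketch:

```python
import math

def idf(n_docs, docs_with_term):
    raw = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    return max(raw, 0.0)  # clamp: ubiquitous terms contribute nothing, not negatively

N = 1_000_000
scores = {
    "the": idf(N, 1_000_000),    # raw value is negative; clamped to 0
    "database": idf(N, 10_000),  # ≈ 4.6
    "geode": idf(N, 100),        # ≈ 9.2
}
```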

Length Normalization

Longer documents tend to contain more terms by chance. BM25 penalizes long documents to avoid bias:

Length penalty = 1 - b + b * |D| / avgdl

Where:
- b = 0: No length normalization
- b = 1: Full length normalization
- b = 0.75: Balanced (typical)

A document twice as long as average receives a moderate penalty.
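A quick numeric check of that penalty with the default b = 0.75:

```python
def length_factor(doc_len, avgdl, b=0.75):
    # BM25 denominator multiplier: > 1 penalizes long documents,
    # < 1 boosts short ones, exactly 1 at the average length.
    return 1 - b + b * doc_len / avgdl

double = length_factor(200, 100)  # twice the average length → 1.75
half = length_factor(50, 100)     # half the average length → 0.625
```

The factor multiplies k1 in the score's denominator, so a factor of 1.75 shrinks, but does not zero out, a long document's term contributions.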

Parameter Tuning

BM25 has two main parameters:

k1 (term frequency saturation):

  • Low (0.5-1.0): Aggressive saturation, repeated terms matter less
  • Medium (1.2-1.5): Balanced (typical default: 1.2)
  • High (2.0-3.0): Weak saturation, repeated terms matter more

b (length normalization):

  • Low (0.0-0.5): Weak length penalty
  • Medium (0.75): Balanced (typical default)
  • High (0.9-1.0): Strong length penalty

How BM25 Works in Geode

Creating Full-Text Indexes

Enable BM25 ranking by creating full-text indexes:

-- Create full-text index on document content
CREATE TEXT INDEX document_content
FOR (d:Document)
ON (d.content, d.title)
OPTIONS {
  analyzer: 'standard',    -- Tokenization and stemming
  k1: 1.2,                 -- Term frequency saturation
  b: 0.75                  -- Length normalization
};

Options:

  • analyzer: Text processing (standard, english, multilingual, custom)
  • k1: Term frequency saturation parameter
  • b: Length normalization parameter
  • stopwords: Words to ignore (the, and, or, etc.)
  • stemming: Reduce words to roots (running → run)

Full-Text Search Queries

Search using the text index:

-- BM25-ranked full-text search
MATCH (d:Document)
WHERE text_search(d.content, 'graph database performance')
RETURN d.title, d.author, text_score(d) AS relevance
ORDER BY relevance DESC
LIMIT 20;

-- Or using CALL syntax
CALL fulltext.search({
  index: 'document_content',
  query: 'graph database performance',
  limit: 20
})
YIELD node, score
RETURN node.title, score
ORDER BY score DESC;
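From application code, the same search can be issued through a driver. The `client.query` call below mirrors the Python caching snippet used later on this page; the exact driver API is an assumption, illustrated here against a stub:

```python
import asyncio

# Hypothetical Geode driver usage: `client.query` mirrors the Python
# snippet later on this page and is an assumption, not a documented API.
async def search_documents(client, query_text: str, limit: int = 20):
    result, _ = await client.query("""
        MATCH (d:Document)
        WHERE text_search(d.content, $query)
        RETURN d.title AS title, text_score(d) AS score
        ORDER BY score DESC
        LIMIT $limit
    """, {"query": query_text, "limit": limit})
    return result

class StubClient:
    """Stand-in so the call shape can be exercised without a server."""
    async def query(self, text, params):
        return ([{"title": "stub", "score": 1.0}], None)

rows = asyncio.run(search_documents(StubClient(), "graph database"))
```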

Boolean Queries

Combine terms with Boolean operators:

-- Must contain "graph" and "database"
MATCH (d:Document)
WHERE text_search(d.content, '+graph +database')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;

-- Must contain "graph", should contain "database" (boosts score)
MATCH (d:Document)
WHERE text_search(d.content, '+graph database')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;

-- Contains "graph" but not "neo4j"
MATCH (d:Document)
WHERE text_search(d.content, 'graph -neo4j')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;

Phrase Queries

Search for exact phrases:

-- Exact phrase match
MATCH (d:Document)
WHERE text_search(d.content, '"graph database"')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;

-- Proximity search (words within 5 tokens)
MATCH (d:Document)
WHERE text_search(d.content, '"graph database"~5')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;

Combining with Graph Traversal

The power of BM25 in a graph database:

-- Find relevant documents written by friends
MATCH (me:User {id: $userId})-[:FRIEND]->(friend:User)
      -[:AUTHORED]->(doc:Document)
WHERE text_search(doc.content, $query)
WITH doc,
     COLLECT(DISTINCT friend.name) AS friend_authors,
     COUNT(DISTINCT friend) AS friend_author_count
RETURN doc.title,
       friend_authors,
       text_score(doc) AS relevance,
       friend_author_count
ORDER BY relevance DESC, friend_author_count DESC
LIMIT 10;

-- Search within a specific graph context
MATCH (category:Category {name: 'Technology'})<-[:IN_CATEGORY]-(doc:Document)
WHERE text_search(doc.content, 'machine learning')
  AND doc.publish_date > date('2024-01-01')
RETURN doc.title, text_score(doc) AS score
ORDER BY score DESC
LIMIT 20;

Use Cases

Classic full-text search:

-- Search knowledge base
MATCH (doc:Document)
WHERE text_search(doc.content, $user_query)
RETURN doc.title, doc.summary, text_score(doc) AS relevance
ORDER BY relevance DESC
LIMIT 50;

Find relevant products:

-- Product search with metadata filtering
MATCH (product:Product)
WHERE (text_search(product.name, $query) OR text_search(product.description, $query))
  AND product.price BETWEEN $min_price AND $max_price
  AND product.in_stock = true
RETURN product.name,
       product.price,
       text_score(product) AS relevance
ORDER BY relevance DESC
LIMIT 20;

Search through logs:

-- Find relevant log entries
MATCH (log:LogEntry)
WHERE text_search(log.message, 'error timeout connection')
  AND log.timestamp > datetime() - duration('P1D')
  AND log.severity IN ['ERROR', 'FATAL']
RETURN log.timestamp, log.message, log.service, text_score(log) AS score
ORDER BY score DESC, log.timestamp DESC
LIMIT 100;

Combine keyword and semantic search:

-- Hybrid search: BM25 + HNSW
MATCH (doc:Document)
WHERE text_search(doc.content, $keyword_query)
  AND vector_similarity(doc.embedding, $query_embedding) > 0.7
WITH doc,
     text_score(doc) AS bm25_score,
     vector_similarity(doc.embedding, $query_embedding) AS vector_score
RETURN doc.title,
       bm25_score,
       vector_score,
       (0.6 * bm25_score + 0.4 * vector_score) AS combined_score
ORDER BY combined_score DESC
LIMIT 20;

This hybrid approach leverages both keyword matching (BM25) and semantic understanding (vectors).
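One practical caveat with the weighted sum above: BM25 scores are unbounded while vector similarities live in a fixed range, so normalizing the BM25 scores per result set before mixing keeps the 0.6/0.4 weights meaningful. A min-max sketch (illustrative, computed application-side):

```python
def hybrid_scores(bm25_scores, vector_scores, w_text=0.6, w_vec=0.4):
    # Min-max normalize the BM25 scores to [0, 1] per result set so the
    # weights are comparable with similarity scores; raw BM25 is unbounded.
    lo, hi = min(bm25_scores), max(bm25_scores)
    span = (hi - lo) or 1.0  # guard against a constant result set
    return [
        w_text * (b - lo) / span + w_vec * v
        for b, v in zip(bm25_scores, vector_scores)
    ]

combined = hybrid_scores([12.4, 7.1, 3.0], [0.81, 0.92, 0.65])
```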

Best Practices

Index Configuration

Choose the right analyzer:

-- English text with stemming
CREATE TEXT INDEX docs_en FOR (d:Document) ON (d.content)
OPTIONS {analyzer: 'english'};  -- running → run, databases → database

-- Multilingual support
CREATE TEXT INDEX docs_multi FOR (d:Document) ON (d.content)
OPTIONS {analyzer: 'multilingual'};  -- Detects language automatically

-- Code/technical content
CREATE TEXT INDEX code FOR (d:Code) ON (d.content)
OPTIONS {analyzer: 'keyword'};  -- No stemming, preserve exact terms

Configure stopwords:

CREATE TEXT INDEX docs FOR (d:Document) ON (d.content)
OPTIONS {
  stopwords: ['the', 'a', 'an', 'and', 'or', 'but']  -- Custom stopword list
};

Query Optimization

Use specific terms:

-- Poor: Too generic
WHERE text_search(d.content, 'data')

-- Better: Specific terms
WHERE text_search(d.content, 'graph database ACID transactions')

Combine with filters:

-- Efficient: Filter before expensive text search
MATCH (d:Document)
WHERE d.category = 'technical'
  AND d.publish_date > date('2024-01-01')
  AND text_search(d.content, $query)
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;

Tune parameters for your data:

-- Short documents (tweets, titles): Reduce length penalty
CREATE TEXT INDEX tweets FOR (t:Tweet) ON (t.content)
OPTIONS {k1: 1.2, b: 0.5};  -- Weak length normalization

-- Long documents (articles, books): Increase length penalty
CREATE TEXT INDEX articles FOR (a:Article) ON (a.content)
OPTIONS {k1: 1.2, b: 0.9};  -- Strong length normalization

Relevance Tuning

Field weighting:

-- Weight title matches higher than content matches
MATCH (d:Document)
WITH d,
     CASE WHEN text_search(d.title, $query) THEN 3.0 ELSE 0.0 END AS title_score,
     CASE WHEN text_search(d.content, $query) THEN 1.0 ELSE 0.0 END AS content_score
WHERE title_score > 0 OR content_score > 0
RETURN d.title, (title_score + content_score) AS score
ORDER BY score DESC;

Query-time boosting:

-- Boost recent documents
MATCH (d:Document)
WHERE text_search(d.content, $query)
WITH d,
     text_score(d) AS base_score,
     (datetime().epochSeconds - d.publish_date.epochSeconds) / (86400 * 365) AS age_years
RETURN d.title, base_score, (base_score / (1 + 0.1 * age_years)) AS adjusted_score
ORDER BY adjusted_score DESC;
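The decay above divides the base score by (1 + 0.1 · age_years); a quick numeric check:

```python
def recency_adjusted(base_score, age_years, decay=0.1):
    # Same decay as the query above: with decay = 0.1,
    # a ten-year-old document's score is halved.
    return base_score / (1 + decay * age_years)

fresh = recency_adjusted(5.0, 0)    # 5.0: no penalty
decade = recency_adjusted(5.0, 10)  # 2.5: halved
```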

Performance Considerations

Index Size

Full-text indexes require additional storage:

Index size ≈ 30-50% of original text size

Example:
- 1M documents, 5KB average
- Total text: 5GB
- Index size: 1.5-2.5GB

Query Performance

Typical performance characteristics:

  • Simple queries: 1-10ms for millions of documents
  • Complex Boolean queries: 10-50ms
  • Combined graph + text: 50-500ms depending on graph complexity

Optimization Tips

  1. Limit result set: Always use LIMIT to cap results
  2. Pre-filter: Use property filters before text search
  3. Cache common queries: Cache frequent query results
  4. Partition large collections: Split by category, date, etc.

Monitoring

Index Statistics

-- Check index statistics
CALL fulltext.index.stats('document_content')
YIELD documents, terms, size_mb, avg_doc_length
RETURN documents, terms, size_mb, avg_doc_length;

Query Performance

-- Profile text search query
PROFILE MATCH (d:Document)
WHERE text_search(d.content, $query)
RETURN d.title, text_score(d) AS score
ORDER BY score DESC
LIMIT 20;

Summary

Geode’s BM25 implementation provides powerful keyword-based search that integrates seamlessly with graph traversal, enabling rich text search applications combined with relationship-based filtering and ranking.

Advanced BM25 Techniques

BM25+ (Improved Variant)

BM25+ adds a small constant δ that lower-bounds the term-frequency component, so a document containing a query term can never score worse than one that omits it, regardless of document length:

BM25+(D, Q) = Σ IDF(qi) × (((k1 + 1) × f(qi, D)) / (k1 × (1 - b + b × |D| / avgdl) + f(qi, D)) + δ)

Where δ is typically 1.0, applied only to terms present in the document.

Advantages:

  • Never penalizes term presence
  • Better performance on verbose queries
  • More robust to long documents
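The lower bound is visible in a per-term sketch: with δ = 1, even a single occurrence in a very long document keeps a floor of IDF · δ, whereas plain BM25's term-frequency part would shrink toward zero (an illustrative implementation, not Geode's):

```python
def bm25_plus_term(f, idf_val, doc_len, avgdl, k1=1.2, b=0.75, delta=1.0):
    # Per-term BM25+ score; the delta floor only applies when the term occurs.
    if f == 0:
        return 0.0
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf_val * (f * (k1 + 1) / (norm + f) + delta)

idf_val = 4.6
short = bm25_plus_term(1, idf_val, doc_len=100, avgdl=100)
very_long = bm25_plus_term(1, idf_val, doc_len=5000, avgdl=100)
# Plain BM25's TF part tends to 0 for the very long document;
# BM25+ keeps its score above idf_val * delta = 4.6.
```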

BM25F (Field-Weighted)

Weight different document fields separately:

-- BM25F: weighted fields
MATCH (d:Document)
WHERE text_search(d.title, $query)
   OR text_search(d.content, $query)
   OR text_search(d.abstract, $query)
WITH d,
     text_score(d.title, $query) AS title_score,
     text_score(d.content, $query) AS content_score,
     text_score(d.abstract, $query) AS abstract_score
WITH d,
     0.5 * title_score +    // Title: highest weight
     0.3 * abstract_score + // Abstract: middle weight
     0.2 * content_score    // Content: baseline
     AS weighted_score
WHERE weighted_score > 0
RETURN d.doc_id, d.title, weighted_score
ORDER BY weighted_score DESC
LIMIT 20;

Query Expansion and Relevance Feedback

Pseudo-Relevance Feedback

Expand query using top results:

-- Stage 1: Initial retrieval
CALL fulltext.search({
    index: 'documents',
    query: $original_query,
    limit: 10
})
YIELD node AS top_doc, score

-- Stage 2: Extract expansion terms
WITH top_doc
MATCH (top_doc)-[:HAS_TERM]->(term:Term)
WITH term, SUM(term.tfidf_score) AS term_importance
ORDER BY term_importance DESC
LIMIT 5
WITH COLLECT(term.text) AS expansion_terms

-- Stage 3: Expanded query
WITH $original_query + ' ' + join(expansion_terms, ' ') AS expanded_query
CALL fulltext.search({
    index: 'documents',
    query: expanded_query,
    limit: 50
})
YIELD node, score
RETURN node.title, score
ORDER BY score DESC;

Query Relaxation

Progressively relax boolean constraints:

-- Strict: All terms required
MATCH (d:Document)
WHERE text_search(d.content, '+term1 +term2 +term3')
WITH COUNT(d) AS strict_count

-- Fallback: if the strict query matched nothing, accept any of the terms
MATCH (d:Document)
WHERE strict_count = 0
  AND text_search(d.content, 'term1 term2 term3')
WITH d, text_score(d) AS score
WHERE score > 0.5  // Higher threshold for relaxed query
RETURN d.title, score
ORDER BY score DESC;

Text Processing and Analyzers

Language-Specific Analyzers

-- English with stemming and stopword removal
CREATE TEXT INDEX docs_en FOR (d:Document) ON (d.content)
OPTIONS {
    analyzer: 'english',
    stopwords: ['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at'],
    stemmer: 'porter',
    k1: 1.2,
    b: 0.75
};

-- German with compound word handling
CREATE TEXT INDEX docs_de FOR (d:Document) ON (d.content)
OPTIONS {
    analyzer: 'german',
    stemmer: 'snowball_german',
    compound_splitting: true
};

-- Multi-language with automatic detection
CREATE TEXT INDEX docs_multi FOR (d:Document) ON (d.content)
OPTIONS {
    analyzer: 'icu',  // Unicode-aware tokenization
    language_detection: true,
    k1: 1.5,
    b: 0.75
};

Custom Analyzers

-- Code search analyzer (no stemming, preserve case)
CREATE TEXT INDEX code_search FOR (c:Code) ON (c.source)
OPTIONS {
    analyzer: 'code',
    tokenizer: 'whitespace',
    filters: ['lowercase'],
    preserve_original: true,
    k1: 1.2,
    b: 0.0  // No length normalization for code
};

-- Product SKU search (exact matching)
CREATE TEXT INDEX product_sku FOR (p:Product) ON (p.sku)
OPTIONS {
    analyzer: 'keyword',  // No tokenization
    case_sensitive: true,
    k1: 2.0  // Higher term frequency boost
};

Performance Optimization

Index Sharding

Partition large indexes:

-- Create date-partitioned indexes
CREATE TEXT INDEX docs_2024 FOR (d:Document) ON (d.content)
WHERE d.publish_date >= date('2024-01-01')
OPTIONS {analyzer: 'english'};

CREATE TEXT INDEX docs_2023 FOR (d:Document) ON (d.content)
WHERE d.publish_date >= date('2023-01-01')
  AND d.publish_date < date('2024-01-01')
OPTIONS {analyzer: 'english'};

-- Query specific partition
MATCH (d:Document)
WHERE d.publish_date >= date('2024-01-01')
  AND text_search(d.content, $query)
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;

Caching Strategies

# Cache frequent query results.
# Note: functools.lru_cache does not work on async functions -- it would
# cache the coroutine object, which can only be awaited once -- so cache
# the results explicitly (or use a library such as async-lru).
_search_cache: dict = {}

async def cached_search(query: str, limit: int = 20):
    key = (query, limit)
    if key not in _search_cache:
        result, _ = await client.query("""
            MATCH (d:Document)
            WHERE text_search(d.content, $query)
            RETURN d.doc_id, d.title, text_score(d) AS score
            ORDER BY score DESC
            LIMIT $limit
        """, {"query": query, "limit": limit})
        _search_cache[key] = result
    return _search_cache[key]

Evaluation and Tuning

Precision/Recall Analysis

-- Compute precision@k and recall@k
WITH ['doc1', 'doc2', 'doc3', 'doc4', 'doc5'] AS relevant_docs
MATCH (d:Document)
WHERE text_search(d.content, $query)
WITH d, text_score(d) AS score, relevant_docs
ORDER BY score DESC
LIMIT 10
WITH COLLECT(d.doc_id) AS retrieved_docs, relevant_docs
WITH SIZE([id IN retrieved_docs WHERE id IN relevant_docs]) AS hits,
     SIZE(retrieved_docs) AS k,
     SIZE(relevant_docs) AS total_relevant
WITH hits * 1.0 / k AS precision_at_10,
     hits * 1.0 / total_relevant AS recall_at_10
RETURN precision_at_10, recall_at_10,
       2 * (precision_at_10 * recall_at_10) / (precision_at_10 + recall_at_10) AS f1_score;
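The same evaluation is often easier to run application-side. A sketch, assuming you already have the ranked doc IDs and a set of judged-relevant IDs:

```python
def precision_recall_f1(retrieved, relevant, k=10):
    # Compare the top-k retrieved doc IDs against the judged-relevant set.
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

retrieved = ["doc2", "doc7", "doc1", "doc9", "doc3"]
relevant = ["doc1", "doc2", "doc3", "doc4", "doc5"]
p, r, f1 = precision_recall_f1(retrieved, relevant, k=5)
```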

Parameter Tuning

-- Grid search for optimal k1 and b (illustrative sketch; assumes an
-- evaluate_queries() procedure that reports NDCG over a judged query set)
WITH [0.5, 1.0, 1.2, 1.5, 2.0] AS k1_values,
     [0.0, 0.25, 0.5, 0.75, 1.0] AS b_values
UNWIND k1_values AS k1
UNWIND b_values AS b

CALL {
    WITH k1, b
    // Recreate index with parameters
    DROP INDEX IF EXISTS test_index;
    CREATE TEXT INDEX test_index FOR (d:Document) ON (d.content)
    OPTIONS {k1: k1, b: b};
    
    // Run evaluation queries
    CALL evaluate_queries() YIELD avg_ndcg
    RETURN k1, b, avg_ndcg
}
RETURN k1, b, avg_ndcg
ORDER BY avg_ndcg DESC
LIMIT 1;

Further Reading

  • BM25 Algorithm: Theory, Variants (BM25+, BM25F), and Applications
  • Text Analysis: Tokenization, Stemming, and Language Processing
  • Query Expansion: Pseudo-Relevance Feedback and Synonym Handling
  • Hybrid Search: Combining BM25 with Vector Search (HNSW)
  • Index Optimization: Sharding, Compression, and Caching
