BM25 (Best Matching 25) is a probabilistic ranking function used for text search and information retrieval, providing relevance scoring for keyword-based document searches in the Geode graph database.
Introduction to BM25
BM25 (Best Matching 25) is the de facto standard ranking function for text search. Developed in the 1990s by Stephen Robertson and colleagues for the Okapi information retrieval system, building on the probabilistic relevance framework of Robertson and Karen Spärck Jones, BM25 has become the default ranking algorithm in search engines such as Elasticsearch, Apache Solr, and Apache Lucene.
BM25 solves a fundamental question: given a search query and a collection of documents, which documents are most relevant? The algorithm computes a relevance score based on:
- Term frequency: How often query terms appear in each document
- Inverse document frequency: How rare or common terms are across all documents
- Document length normalization: Discounting long documents, which accumulate more term matches by chance
- Saturation: Diminishing returns for repeated terms
Unlike simple keyword matching (which is binary: match or no match), BM25 provides nuanced relevance scores that enable ranking search results by quality. This makes it invaluable for full-text search applications.
Geode implements BM25 for property text search, enabling powerful keyword-based search that complements semantic vector search (HNSW). You can combine BM25 text search with graph traversal for queries like “find documents about databases written by friends, ranked by relevance.”
Core BM25 Concepts
The BM25 Formula
BM25 computes a relevance score for document D given query Q:
score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))
Where:
- qi: Each term in query Q
- f(qi, D): Frequency of qi in document D
- |D|: Length of document D (in tokens)
- avgdl: Average document length in collection
- k1: Term frequency saturation parameter (typically 1.2-2.0)
- b: Length normalization parameter (typically 0.75)
- IDF(qi): Inverse document frequency of qi
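The formula maps directly onto code. A minimal reference sketch in Python (illustrative only, not Geode's internal implementation; all names are hypothetical):

```python
import math

def idf(N, df):
    """Robertson/Sparck Jones IDF: log((N - df + 0.5) / (df + 0.5))."""
    return math.log((N - df + 0.5) / (df + 0.5))

def bm25_score(query_terms, doc_tf, doc_len, avgdl, N, df, k1=1.2, b=0.75):
    """Score one document against a query.

    doc_tf: term -> frequency in this document
    df:     term -> number of documents containing the term
    """
    norm = 1 - b + b * doc_len / avgdl  # length normalization factor
    score = 0.0
    for term in query_terms:
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # absent terms contribute nothing
        score += idf(N, df[term]) * (f * (k1 + 1)) / (f + k1 * norm)
    return score

# Hypothetical example: a 120-token document in a 1,000-document collection
score = bm25_score(["graph", "database"], {"graph": 2, "database": 1},
                   doc_len=120, avgdl=100.0, N=1000, df={"graph": 10, "database": 50})
```

Note that the score is a sum of independent per-term contributions, which is what makes BM25 cheap to evaluate against an inverted index.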
Term Frequency (TF)
Term frequency measures how often a query term appears in a document. BM25 uses a saturating function—the first few occurrences of a term matter much more than later ones:
TF Impact:
1 occurrence: High impact
2 occurrences: Medium impact
10 occurrences: Marginal additional impact
100 occurrences: Almost no additional impact
This saturation prevents keyword stuffing from artificially inflating relevance.
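The saturation is easy to see numerically. With the length factor held at 1, the term-frequency component approaches a hard ceiling of k1 + 1 (a standalone sketch of the per-term formula):

```python
def tf_component(f, k1=1.2, norm=1.0):
    """BM25 term-frequency component: f * (k1 + 1) / (f + k1 * norm)."""
    return f * (k1 + 1) / (f + k1 * norm)

# With the default k1 = 1.2, the component can never exceed k1 + 1 = 2.2,
# no matter how many times the term is repeated.
for f in [1, 2, 10, 100]:
    print(f"{f:>3} occurrences -> {tf_component(f):.2f}")
```

A single occurrence already earns score 1.0; a hundred occurrences earn barely more than ten do.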
Inverse Document Frequency (IDF)
IDF measures how rare or common a term is across the entire document collection:
IDF(term) = log((N - n(term) + 0.5) / (n(term) + 0.5))
Where:
- N: Total number of documents
- n(term): Number of documents containing term
Common terms (like “the”, “and”) have low IDF and contribute little to relevance. Rare terms have high IDF and strongly indicate relevance.
Examples:
- “the” appears in 1M of 1M docs → raw IDF is negative; implementations clamp it to 0
- “database” appears in 10K of 1M docs → IDF ≈ 4.6
- “geode” appears in 100 of 1M docs → IDF ≈ 9.2
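These numbers follow from the IDF formula (natural log); the sketch below clamps negative values, as practical implementations do for terms appearing in more than half the collection:

```python
import math

def idf(N, df):
    """Robertson/Sparck Jones IDF, clamped at zero."""
    return max(math.log((N - df + 0.5) / (df + 0.5)), 0.0)

N = 1_000_000
assert idf(N, N) == 0.0                  # "the": ubiquitous, contributes nothing
assert round(idf(N, 10_000), 1) == 4.6   # "database": moderately rare
assert round(idf(N, 100), 1) == 9.2      # "geode": very rare, strong signal
```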
Length Normalization
Longer documents tend to contain more terms by chance. BM25 penalizes long documents to avoid bias:
Length penalty = 1 - b + b * |D| / avgdl
Where:
- b = 0: No length normalization
- b = 1: Full length normalization
- b = 0.75: Balanced (typical)
With b = 0.75, a document twice the average length gets a normalization factor of 1 - 0.75 + 0.75 * 2 = 1.75, a moderate penalty on each term's contribution.
Parameter Tuning
BM25 has two main parameters:
k1 (term frequency saturation):
- Low (0.5-1.0): Aggressive saturation, repeated terms matter less
- Medium (1.2-1.5): Balanced (typical default: 1.2)
- High (2.0-3.0): Weak saturation, repeated terms matter more
b (length normalization):
- Low (0.0-0.5): Weak length penalty
- Medium (0.75): Balanced (typical default)
- High (0.9-1.0): Strong length penalty
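Both effects can be checked numerically with a standalone sketch of the per-term formula (the frequencies and lengths are arbitrary examples):

```python
def bm25_term(f, doc_len, avgdl, k1=1.2, b=0.75):
    """Per-term BM25 TF component (IDF factored out)."""
    norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * norm)

# When doc_len == avgdl, one occurrence scores exactly 1.0 for any k1,
# so the f=10 value reads as "how much 10 occurrences beat 1".
for k1 in (0.5, 1.2, 2.0):
    print(f"k1={k1}: 10 occurrences worth x{bm25_term(10, 100, 100, k1=k1):.2f}")

# Higher b punishes a document twice the average length more heavily.
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: {bm25_term(3, 200, 100, b=b):.3f}")
```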
How BM25 Works in Geode
Creating Full-Text Indexes
Enable BM25 ranking by creating full-text indexes:
-- Create full-text index on document content
CREATE TEXT INDEX document_content
FOR (d:Document)
ON (d.content, d.title)
OPTIONS {
analyzer: 'standard', -- Tokenization and stemming
k1: 1.2, -- Term frequency saturation
b: 0.75 -- Length normalization
};
Options:
- analyzer: Text processing (standard, english, multilingual, custom)
- k1: Term frequency saturation parameter
- b: Length normalization parameter
- stopwords: Words to ignore (the, and, or, etc.)
- stemming: Reduce words to roots (running → run)
Full-Text Search Queries
Search using the text index:
-- BM25-ranked full-text search
MATCH (d:Document)
WHERE text_search(d.content, 'graph database performance')
RETURN d.title, d.author, text_score(d) AS relevance
ORDER BY relevance DESC
LIMIT 20;
-- Or using CALL syntax
CALL fulltext.search({
index: 'document_content',
query: 'graph database performance',
limit: 20
})
YIELD node, score
RETURN node.title, score
ORDER BY score DESC;
Boolean Queries
Combine terms with Boolean operators:
-- Must contain "graph" and "database"
MATCH (d:Document)
WHERE text_search(d.content, '+graph +database')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;
-- Must contain "graph", should contain "database" (boosts score)
MATCH (d:Document)
WHERE text_search(d.content, '+graph database')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;
-- Contains "graph" but not "neo4j"
MATCH (d:Document)
WHERE text_search(d.content, 'graph -neo4j')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;
Phrase Queries
Search for exact phrases:
-- Exact phrase match
MATCH (d:Document)
WHERE text_search(d.content, '"graph database"')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;
-- Proximity search (words within 5 tokens)
MATCH (d:Document)
WHERE text_search(d.content, '"graph database"~5')
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;
Combining with Graph Traversal
The power of BM25 in a graph database:
-- Find relevant documents written by friends
MATCH (me:User {id: $userId})-[:FRIEND]->(friend:User)
-[:AUTHORED]->(doc:Document)
WHERE text_search(doc.content, $query)
RETURN doc.title,
friend.name AS author,
text_score(doc) AS relevance,
COUNT(DISTINCT friend) AS friend_author_count
ORDER BY relevance DESC, friend_author_count DESC
LIMIT 10;
-- Search within a specific graph context
MATCH (category:Category {name: 'Technology'})<-[:IN_CATEGORY]-(doc:Document)
WHERE text_search(doc.content, 'machine learning')
AND doc.publish_date > date('2024-01-01')
RETURN doc.title, text_score(doc) AS score
ORDER BY score DESC
LIMIT 20;
Use Cases
Document Search
Classic full-text search:
-- Search knowledge base
MATCH (doc:Document)
WHERE text_search(doc.content, $user_query)
RETURN doc.title, doc.summary, text_score(doc) AS relevance
ORDER BY relevance DESC
LIMIT 50;
E-Commerce Product Search
Find relevant products:
-- Product search with metadata filtering
MATCH (product:Product)
WHERE (text_search(product.name, $query) OR text_search(product.description, $query))
AND product.price BETWEEN $min_price AND $max_price
AND product.in_stock = true
RETURN product.name,
product.price,
text_score(product) AS relevance
ORDER BY relevance DESC
LIMIT 20;
Log and Event Search
Search through logs:
-- Find relevant log entries
MATCH (log:LogEntry)
WHERE text_search(log.message, 'error timeout connection')
AND log.timestamp > datetime() - duration('P1D')
AND log.severity IN ['ERROR', 'FATAL']
RETURN log.timestamp, log.message, log.service, text_score(log) AS score
ORDER BY score DESC, log.timestamp DESC
LIMIT 100;
Hybrid Search (BM25 + Vector Search)
Combine keyword and semantic search:
-- Hybrid search: BM25 + HNSW
MATCH (doc:Document)
WHERE text_search(doc.content, $keyword_query)
AND vector_similarity(doc.embedding, $query_embedding) > 0.7
WITH doc,
text_score(doc) AS bm25_score,
vector_similarity(doc.embedding, $query_embedding) AS vector_score
RETURN doc.title,
bm25_score,
vector_score,
(0.6 * bm25_score + 0.4 * vector_score) AS combined_score
ORDER BY combined_score DESC
LIMIT 20;
This hybrid approach leverages both keyword matching (BM25) and semantic understanding (vectors).
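Because raw BM25 scores are unbounded while vector similarities live in a fixed range, it is usually safer to normalize both score lists before taking the weighted sum. A client-side sketch (the min-max scheme and weights are illustrative choices, not a Geode API):

```python
def min_max(scores):
    """Rescale raw scores to [0, 1] so BM25 and cosine values are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(bm25, vector, w_bm25=0.6, w_vec=0.4):
    """bm25, vector: dicts doc_id -> raw score. Returns ids by fused score."""
    b, v = min_max(bm25), min_max(vector)
    ids = set(b) | set(v)  # union: a doc may appear in only one result list
    fused = {d: w_bm25 * b.get(d, 0.0) + w_vec * v.get(d, 0.0) for d in ids}
    return sorted(fused, key=fused.get, reverse=True)

# "a" wins on keywords, "b" on semantics, "c" on neither
ranking = hybrid_rank({"a": 12.0, "b": 3.0}, {"b": 0.95, "c": 0.80})
```

Reciprocal rank fusion is a common alternative that sidesteps score normalization entirely by combining ranks instead of scores.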
Best Practices
Index Configuration
Choose the right analyzer:
-- English text with stemming
CREATE TEXT INDEX docs_en FOR (d:Document) ON (d.content)
OPTIONS {analyzer: 'english'}; -- running → run, databases → database
-- Multilingual support
CREATE TEXT INDEX docs_multi FOR (d:Document) ON (d.content)
OPTIONS {analyzer: 'multilingual'}; -- Detects language automatically
-- Code/technical content
CREATE TEXT INDEX code FOR (d:Code) ON (d.content)
OPTIONS {analyzer: 'keyword'}; -- No stemming, preserve exact terms
Configure stopwords:
CREATE TEXT INDEX docs FOR (d:Document) ON (d.content)
OPTIONS {
stopwords: ['the', 'a', 'an', 'and', 'or', 'but'] -- Custom stopword list
};
Query Optimization
Use specific terms:
-- Poor: Too generic
WHERE text_search(d.content, 'data')
-- Better: Specific terms
WHERE text_search(d.content, 'graph database ACID transactions')
Combine with filters:
-- Efficient: Filter before expensive text search
MATCH (d:Document)
WHERE d.category = 'technical'
AND d.publish_date > date('2024-01-01')
AND text_search(d.content, $query)
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;
Tune parameters for your data:
-- Short documents (tweets, titles): Reduce length penalty
CREATE TEXT INDEX tweets FOR (t:Tweet) ON (t.content)
OPTIONS {k1: 1.2, b: 0.5}; -- Weak length normalization
-- Long documents (articles, books): Increase length penalty
CREATE TEXT INDEX articles FOR (a:Article) ON (a.content)
OPTIONS {k1: 1.2, b: 0.9}; -- Strong length normalization
Relevance Tuning
Field weighting:
-- Weight title matches higher than content matches
MATCH (d:Document)
WITH d,
CASE WHEN text_search(d.title, $query) THEN 3.0 ELSE 0.0 END AS title_score,
CASE WHEN text_search(d.content, $query) THEN 1.0 ELSE 0.0 END AS content_score
WHERE title_score > 0 OR content_score > 0
RETURN d.title, (title_score + content_score) AS score
ORDER BY score DESC;
Query-time boosting:
-- Boost recent documents
MATCH (d:Document)
WHERE text_search(d.content, $query)
WITH d,
text_score(d) AS base_score,
(datetime().epochSeconds - d.publish_date.epochSeconds) / (86400 * 365) AS age_years
RETURN d.title, base_score, (base_score / (1 + 0.1 * age_years)) AS adjusted_score
ORDER BY adjusted_score DESC;
Performance Considerations
Index Size
Full-text indexes require additional storage:
Index size ≈ 30-50% of original text size
Example:
- 1M documents, 5KB average
- Total text: 5GB
- Index size: 1.5-2.5GB
Query Performance
Typical performance characteristics:
- Simple queries: 1-10ms for millions of documents
- Complex Boolean queries: 10-50ms
- Combined graph + text: 50-500ms depending on graph complexity
Optimization Tips
- Limit result set: Always use LIMIT to cap results
- Pre-filter: Use property filters before text search
- Cache common queries: Cache frequent query results
- Partition large collections: Split by category, date, etc.
Monitoring
Index Statistics
-- Check index statistics
CALL fulltext.index.stats('document_content')
YIELD documents, terms, size_mb, avg_doc_length
RETURN documents, terms, size_mb, avg_doc_length;
Query Performance
-- Profile text search query
PROFILE MATCH (d:Document)
WHERE text_search(d.content, $query)
RETURN d.title, text_score(d) AS score
ORDER BY score DESC
LIMIT 20;
Related Topics
- Search - General search capabilities
- Text - Text processing features
- Indexing - Index management
- Query Optimization - Performance tuning
Further Reading
- Full-Text Search Guide - Complete BM25 documentation
- Vector Search Tutorial - Combining BM25 and vectors
- Performance Tuning - Performance best practices
Geode’s BM25 implementation provides powerful keyword-based search that integrates seamlessly with graph traversal, enabling rich text search applications combined with relationship-based filtering and ranking.
Advanced BM25 Techniques
BM25+ (Improved Variant)
BM25+ adds a delta parameter to prevent negative IDF values:
BM25+(D, Q) = Σ IDF(qi) × [(f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D| / avgdl)) + δ]
Where δ (typically 1.0) is a constant lower bound added to each term's saturation component
Advantages:
- Never penalizes term presence
- Better performance on verbose queries
- More robust to long documents
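The per-term BM25+ contribution differs from plain BM25 only by the δ floor, which guarantees that any document containing a term outscores one that lacks it, no matter how long it is (an illustrative sketch, not Geode's implementation):

```python
def bm25_plus_term(f, doc_len, avgdl, idf, k1=1.2, b=0.75, delta=1.0):
    """One term's BM25+ contribution; delta floors the TF component."""
    if f == 0:
        return 0.0  # absent terms still contribute nothing
    norm = 1 - b + b * doc_len / avgdl
    return idf * (f * (k1 + 1) / (f + k1 * norm) + delta)

# A single occurrence in a document 10x the average length still earns at
# least delta * idf; plain BM25's contribution here would be only ~0.43.
score = bm25_plus_term(f=1, doc_len=1000, avgdl=100, idf=2.0)
assert score > 2.0
```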
BM25F (Field-Weighted)
Weight different document fields separately:
-- BM25F: weighted fields
MATCH (d:Document)
WHERE text_search(d.title, $query) OR text_search(d.content, $query)
WITH d,
text_score(d.title, $query) AS title_score,
text_score(d.content, $query) AS content_score,
text_score(d.abstract, $query) AS abstract_score
WITH d,
0.5 * title_score + -- Title: highest weight
0.3 * abstract_score + -- Abstract: medium weight
0.2 * content_score -- Content: baseline weight
AS weighted_score
WHERE weighted_score > 0
RETURN d.doc_id, d.title, weighted_score
ORDER BY weighted_score DESC
LIMIT 20;
Query Expansion and Relevance Feedback
Pseudo-Relevance Feedback
Expand query using top results:
-- Stage 1: Initial retrieval
CALL fulltext.search({
index: 'documents',
query: $original_query,
limit: 10
})
YIELD node AS top_doc, score
-- Stage 2: Extract expansion terms
WITH top_doc
MATCH (top_doc)-[:HAS_TERM]->(term:Term)
WITH term, SUM(term.tfidf_score) AS term_importance
ORDER BY term_importance DESC
LIMIT 5
WITH COLLECT(term.text) AS expansion_terms
-- Stage 3: Expanded query
WITH $original_query + ' ' + join(expansion_terms, ' ') AS expanded_query
CALL fulltext.search({
index: 'documents',
query: expanded_query,
limit: 50
})
YIELD node, score
RETURN node.title, score
ORDER BY score DESC;
Query Relaxation
Progressively relax boolean constraints:
-- Strict: All terms required
MATCH (d:Document)
WHERE text_search(d.content, '+term1 +term2 +term3')
WITH COUNT(d) AS strict_count
-- Relaxed: At least 2 of 3 terms
MATCH (d:Document)
WHERE strict_count = 0
AND text_search(d.content, 'term1 term2 term3')
WITH d, text_score(d) AS score
WHERE score > 0.5 -- Higher threshold for relaxed query
RETURN d.title, score
ORDER BY score DESC;
Text Processing and Analyzers
Language-Specific Analyzers
-- English with stemming and stopword removal
CREATE TEXT INDEX docs_en FOR (d:Document) ON (d.content)
OPTIONS {
analyzer: 'english',
stopwords: ['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at'],
stemmer: 'porter',
k1: 1.2,
b: 0.75
};
-- German with compound word handling
CREATE TEXT INDEX docs_de FOR (d:Document) ON (d.content)
OPTIONS {
analyzer: 'german',
stemmer: 'snowball_german',
compound_splitting: true
};
-- Multi-language with automatic detection
CREATE TEXT INDEX docs_multi FOR (d:Document) ON (d.content)
OPTIONS {
analyzer: 'icu', -- Unicode-aware tokenization
language_detection: true,
k1: 1.5,
b: 0.75
};
Custom Analyzers
-- Code search analyzer (no stemming, preserve case)
CREATE TEXT INDEX code_search FOR (c:Code) ON (c.source)
OPTIONS {
analyzer: 'code',
tokenizer: 'whitespace',
filters: ['lowercase'],
preserve_original: true,
k1: 1.2,
b: 0.0 -- No length normalization for code
};
-- Product SKU search (exact matching)
CREATE TEXT INDEX product_sku FOR (p:Product) ON (p.sku)
OPTIONS {
analyzer: 'keyword', -- No tokenization
case_sensitive: true,
k1: 2.0 -- Higher term frequency boost
};
Performance Optimization
Index Sharding
Partition large indexes:
-- Create date-partitioned indexes
CREATE TEXT INDEX docs_2024 FOR (d:Document) ON (d.content)
WHERE d.publish_date >= date('2024-01-01')
OPTIONS {analyzer: 'english'};
CREATE TEXT INDEX docs_2023 FOR (d:Document) ON (d.content)
WHERE d.publish_date >= date('2023-01-01')
AND d.publish_date < date('2024-01-01')
OPTIONS {analyzer: 'english'};
-- Query specific partition
MATCH (d:Document)
WHERE d.publish_date >= date('2024-01-01')
AND text_search(d.content, $query)
RETURN d.title, text_score(d) AS score
ORDER BY score DESC;
Caching Strategies
# Cache frequent query results. Note: functools.lru_cache does not work on
# async functions (it would cache the coroutine object, which can only be
# awaited once), so use an explicit dictionary keyed on the arguments.
_search_cache: dict = {}

async def cached_search(query: str, limit: int = 20):
    key = (query, limit)
    if key not in _search_cache:
        result, _ = await client.query("""
            MATCH (d:Document)
            WHERE text_search(d.content, $query)
            RETURN d.doc_id, d.title, text_score(d) AS score
            ORDER BY score DESC
            LIMIT $limit
        """, {"query": query, "limit": limit})
        _search_cache[key] = result
    return _search_cache[key]
Evaluation and Tuning
Precision/Recall Analysis
-- Compute precision@k and recall@k
WITH ['doc1', 'doc2', 'doc3', 'doc4', 'doc5'] AS relevant_docs
MATCH (d:Document)
WHERE text_search(d.content, $query)
WITH d, text_score(d) AS score, relevant_docs
ORDER BY score DESC
LIMIT 10
WITH COLLECT(d.doc_id) AS retrieved_docs, relevant_docs
WITH SIZE([id IN retrieved_docs WHERE id IN relevant_docs]) AS hits,
SIZE(retrieved_docs) AS k,
SIZE(relevant_docs) AS total_relevant
WITH hits * 1.0 / k AS precision_at_10,
     hits * 1.0 / total_relevant AS recall_at_10
RETURN precision_at_10,
       recall_at_10,
       2 * (precision_at_10 * recall_at_10) / (precision_at_10 + recall_at_10) AS f1_score;
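The same metrics are often simpler to compute client-side over the returned ids. A small helper (the function name is hypothetical):

```python
def precision_recall_f1(retrieved, relevant, k=10):
    """retrieved: ranked list of doc ids; relevant: set of ground-truth ids."""
    top_k = retrieved[:k]
    hits = sum(1 for d in top_k if d in relevant)
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    f1 = 0.0 if precision + recall == 0 else \
        2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 2 of the top 5 results are relevant, out of 3 relevant docs overall
p, r, f1 = precision_recall_f1(["d1", "x", "d2", "y", "z"],
                               {"d1", "d2", "d3"}, k=5)
```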
Parameter Tuning
-- Grid search for optimal k1 and b (illustrative sketch: index DDL generally
-- cannot run inside a query, so drive this loop from client code in practice)
WITH [0.5, 1.0, 1.2, 1.5, 2.0] AS k1_values,
[0.0, 0.25, 0.5, 0.75, 1.0] AS b_values
UNWIND k1_values AS k1
UNWIND b_values AS b
CALL {
WITH k1, b
-- Recreate index with parameters
DROP INDEX IF EXISTS test_index;
CREATE TEXT INDEX test_index FOR (d:Document) ON (d.content)
OPTIONS {k1: k1, b: b};
-- Run evaluation queries
CALL evaluate_queries() YIELD avg_ndcg
RETURN k1, b, avg_ndcg
}
RETURN k1, b, avg_ndcg
ORDER BY avg_ndcg DESC
LIMIT 1;
Further Reading
- BM25 Algorithm: Theory, Variants (BM25+, BM25F), and Applications
- Text Analysis: Tokenization, Stemming, and Language Processing
- Query Expansion: Pseudo-Relevance Feedback and Synonym Handling
- Hybrid Search: Combining BM25 with Vector Search (HNSW)
- Index Optimization: Sharding, Compression, and Caching