Text Processing & String Operations

Text processing is fundamental to database applications, from simple string manipulation to complex pattern matching and natural language operations. Geode provides comprehensive text processing capabilities through its GQL implementation, offering a rich set of string functions, pattern matching operators, Unicode support, and text normalization features that enable sophisticated text manipulation in graph queries.

As an ISO/IEC 39075:2024 GQL-compliant graph database, Geode treats text as a first-class data type with full Unicode support, locale-aware collation, and efficient string operations. Whether you’re building search interfaces, data cleaning pipelines, or natural language processing workflows, Geode’s text processing capabilities provide the tools you need.

String Data Type

Geode stores text using UTF-8 encoding, supporting the full Unicode character set:

-- Create nodes with text properties
CREATE (u:User {
  name: 'Alice Johnson',
  email: 'alice@example.com',
  bio: 'Software engineer passionate about graph databases',
  nickname: 'AJ'
});

-- Unicode characters are fully supported
CREATE (u:User {
  name: 'José García',
  greeting: 'こんにちは',  -- Japanese
  emoji: '👋🌍'
});

String Literals:

-- Single quotes
'Hello, World!'

-- Double quotes (GQL standard)
"Hello, World!"

-- Escape sequences
'It\'s a string with a quote'
"Line 1\nLine 2\nLine 3"
"Tab\tseparated\tvalues"

-- Unicode escapes
"\u0048\u0065\u006C\u006C\u006F"  -- "Hello"

Basic String Functions

Length and Size:

-- String length (character count)
MATCH (u:User)
RETURN u.name, LENGTH(u.name) AS name_length;

-- Byte size
MATCH (u:User)
RETURN u.bio, OCTET_LENGTH(u.bio) AS byte_size;

-- Character count (Unicode-aware)
MATCH (u:User)
RETURN u.greeting, CHAR_LENGTH(u.greeting) AS char_count;

Case Conversion:

-- Convert to uppercase
MATCH (u:User)
RETURN UPPER(u.name) AS uppercase_name;

-- Convert to lowercase
MATCH (u:User)
RETURN LOWER(u.email) AS lowercase_email;

-- Title case (capitalize first letter of each word)
MATCH (u:User)
RETURN INITCAP(u.name) AS title_case;

Trimming and Padding:

-- Remove leading and trailing whitespace
MATCH (u:User)
RETURN TRIM(u.name) AS trimmed_name;

-- Remove leading whitespace
MATCH (u:User)
RETURN LTRIM(u.name) AS left_trimmed;

-- Remove trailing whitespace
MATCH (u:User)
RETURN RTRIM(u.name) AS right_trimmed;

-- Pad string to specified length
MATCH (p:Product)
RETURN LPAD(p.sku, 10, '0') AS padded_sku;  -- Left pad with zeros

MATCH (p:Product)
RETURN RPAD(p.name, 20, '.') AS padded_name;  -- Right pad with dots

Substring and Extraction

Substring Operations:

-- Extract substring by position and length
MATCH (u:User)
RETURN SUBSTRING(u.email, 1, 5) AS first_five;

-- Extract from position to end
MATCH (u:User)
RETURN SUBSTRING(u.name FROM 6) AS last_name_approx;

-- Extract using position and length
MATCH (u:User)
RETURN SUBSTRING(u.bio, 1, 50) AS preview;

String Splitting:

-- Split string into array
MATCH (u:User)
RETURN u.name, SPLIT(u.name, ' ') AS name_parts;

-- Extract first element
MATCH (u:User)
WITH SPLIT(u.name, ' ') AS parts
RETURN parts[0] AS first_name;

-- Process split results
MATCH (t:Tag)
UNWIND SPLIT(t.keywords, ',') AS keyword
RETURN TRIM(keyword) AS cleaned_keyword;

Pattern Extraction:

-- Extract using regular expressions
MATCH (u:User)
RETURN REGEXP_EXTRACT(u.email, '^([^@]+)') AS username;

-- Extract all matches
MATCH (p:Post)
RETURN REGEXP_EXTRACT_ALL(p.content, '#(\w+)') AS hashtags;

String Concatenation and Formatting

Concatenation:

-- Concatenate strings
MATCH (u:User)
RETURN u.first_name + ' ' + u.last_name AS full_name;

-- Concatenate with separator
MATCH (u:User)
RETURN CONCAT(u.first_name, ' ', u.last_name) AS full_name;

-- Concatenate with null handling
MATCH (u:User)
RETURN CONCAT_WS(' ', u.first_name, u.middle_name, u.last_name) AS full_name;

String Formatting:

-- Format string with placeholders
MATCH (u:User)
RETURN FORMAT('User: %s (ID: %d)', u.name, u.user_id) AS formatted;

-- Formatted output
MATCH (p:Product)
RETURN FORMAT('$%.2f', p.price) AS formatted_price;

String Aggregation:

-- Aggregate strings with separator
MATCH (u:User)-[:MEMBER_OF]->(team:Team)
RETURN team.name, STRING_AGG(u.name, ', ') AS members
GROUP BY team.name;

-- Aggregate with ordering
MATCH (u:User)-[:PURCHASED]->(p:Product)
RETURN u.name,
       STRING_AGG(p.name ORDER BY p.purchased_at, ', ') AS purchase_history
GROUP BY u.name;

Pattern Matching

LIKE Operator:

-- Wildcard matching
MATCH (u:User)
WHERE u.email LIKE '%@example.com'
RETURN u.name, u.email;

-- Pattern matching with wildcards
MATCH (p:Product)
WHERE p.sku LIKE 'PROD-2024-%'
RETURN p.name, p.sku;

-- Case-insensitive matching
MATCH (u:User)
WHERE LOWER(u.name) LIKE '%smith%'
RETURN u.name;

Regular Expressions:

-- Regex matching
MATCH (u:User)
WHERE u.email =~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
RETURN u.email;

-- Case-insensitive regex
MATCH (p:Product)
WHERE p.name =~ '(?i)laptop'
RETURN p.name;

-- Extract with regex
MATCH (p:Post)
WHERE p.content =~ '#(\w+)'
RETURN REGEXP_EXTRACT_ALL(p.content, '#(\w+)') AS hashtags;

STARTS WITH, ENDS WITH, CONTAINS:

-- Prefix matching
MATCH (p:Product)
WHERE p.sku STARTS WITH 'PROD-'
RETURN p.name, p.sku;

-- Suffix matching
MATCH (u:User)
WHERE u.email ENDS WITH '@company.com'
RETURN u.name, u.email;

-- Substring matching
MATCH (p:Post)
WHERE p.content CONTAINS 'graph database'
RETURN p.title, p.content;

Text Replacement and Transformation

Replace Operations:

-- Simple replacement
MATCH (u:User)
RETURN REPLACE(u.phone, '-', '') AS phone_no_dashes;

-- Regular expression replacement
MATCH (p:Post)
RETURN REGEXP_REPLACE(p.content, '#(\w+)', '[$1]') AS processed_content;

-- Multiple replacements
MATCH (t:Text)
RETURN REPLACE(REPLACE(t.content, '\n', ' '), '\t', ' ') AS cleaned;

Character Translation:

-- Translate characters
MATCH (u:User)
RETURN TRANSLATE(u.name, 'aeiou', 'AEIOU') AS vowels_upper;

-- Remove specific characters
MATCH (p:Product)
RETURN TRANSLATE(p.sku, '-_', '') AS sku_alphanumeric;

String Comparison and Sorting

Collation-Aware Comparison:

-- Case-insensitive comparison
MATCH (u:User)
WHERE LOWER(u.name) = LOWER('alice johnson')
RETURN u;

-- Locale-aware sorting
MATCH (u:User)
RETURN u.name
ORDER BY u.name COLLATE 'en_US';

-- Natural sorting (numbers sorted numerically)
MATCH (p:Product)
RETURN p.name
ORDER BY NATURAL_SORT(p.name);

Similarity Functions:

-- Levenshtein distance
MATCH (u:User)
RETURN u.name, LEVENSHTEIN(u.name, 'Alice Johnson') AS distance
ORDER BY distance
LIMIT 5;

-- Soundex (phonetic matching)
MATCH (u:User)
WHERE SOUNDEX(u.last_name) = SOUNDEX('Smith')
RETURN u.name;

-- Similarity ratio
MATCH (p:Product)
RETURN p.name, SIMILARITY(p.name, 'laptop') AS similarity
WHERE similarity > 0.5
ORDER BY similarity DESC;

Unicode and Normalization

Unicode Operations:

-- Get Unicode code point
MATCH (u:User)
RETURN ASCII(SUBSTRING(u.name, 1, 1)) AS first_char_code;

-- Character from code point
RETURN CHR(65) AS character;  -- Returns 'A'

-- Unicode normalization
MATCH (u:User)
RETURN NORMALIZE(u.name, 'NFC') AS normalized_name;

Locale-Specific Operations:

-- Locale-aware uppercase
MATCH (u:User)
RETURN UPPER(u.name COLLATE 'tr_TR') AS turkish_upper;

-- Locale-specific comparison
MATCH (u:User)
WHERE u.name = 'İstanbul' COLLATE 'tr_TR'
RETURN u;

Text Encoding and Hashing

Encoding Functions:

-- Base64 encoding
MATCH (u:User)
RETURN BASE64_ENCODE(u.password_hash) AS encoded;

-- Base64 decoding
MATCH (c:Config)
RETURN BASE64_DECODE(c.encoded_value) AS decoded;

-- URL encoding
MATCH (p:Product)
RETURN URL_ENCODE(p.name) AS url_safe_name;

Hashing Functions:

-- MD5 hash
MATCH (u:User)
RETURN MD5(u.email) AS email_hash;

-- SHA256 hash
MATCH (u:User)
RETURN SHA256(u.email) AS email_sha;

-- Password hashing (use proper password hashing in production)
CREATE (u:User {
  email: 'user@example.com',
  password_hash: SHA256('password' + 'salt')
});

Performance Optimization

Index Usage for Text:

-- Create index for prefix searches
CREATE INDEX user_email_prefix ON :User(email);

-- Efficient prefix query
MATCH (u:User)
WHERE u.email STARTS WITH 'alice'  -- Uses index
RETURN u;

-- Full-text index for content search
CREATE FULLTEXT INDEX post_content ON :Post(content);

MATCH (p:Post)
WHERE p.content MATCHES 'graph database'  -- Uses full-text index
RETURN p;

Avoid Inefficient Patterns:

-- Inefficient: requires full scan
MATCH (u:User)
WHERE u.email LIKE '%@example.com'  -- Cannot use index
RETURN u;

-- Efficient: uses index
MATCH (u:User)
WHERE u.email ENDS WITH '@example.com'  -- Can use suffix index
RETURN u;

-- Inefficient: case conversion on every row
MATCH (u:User)
WHERE LOWER(u.name) = 'alice'
RETURN u;

-- Efficient: store normalized version
CREATE INDEX user_name_lower ON :User(LOWER(name));
MATCH (u:User)
WHERE LOWER(u.name) = 'alice'  -- Uses expression index
RETURN u;

Best Practices

Use Appropriate Data Types: Store structured data (emails, URLs, IDs) with validation rather than treating as plain text.
Normalize Text Early: Perform case conversion, trimming, and normalization during data ingestion rather than at query time.
Index Strategically: Create indexes on frequently filtered string properties, especially for prefix and exact match queries.
Leverage Full-Text Search: Use full-text indexes for keyword search rather than LIKE or CONTAINS on large text fields.
Validate Input: Use regular expressions and constraints to ensure text data quality.
Consider Locale: Use locale-aware collation and normalization for international applications.

Common Use Cases

Email Validation:

CREATE (u:User {
  email: 'alice@example.com'
})
WHERE u.email =~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$';

Name Normalization:

MATCH (u:User)
SET u.name_normalized = TRIM(INITCAP(u.name));

Tag Extraction:

MATCH (p:Post)
WITH p, REGEXP_EXTRACT_ALL(p.content, '#(\w+)') AS tags
UNWIND tags AS tag
MERGE (t:Tag {name: LOWER(tag)})
MERGE (p)-[:TAGGED_WITH]->(t);

Search Autocomplete:

MATCH (p:Product)
WHERE p.name STARTS WITH 'lapt'
RETURN p.name
ORDER BY LENGTH(p.name)
LIMIT 10;

Full-Text Search and Indexing
Unicode Support and Internationalization
JSON Data Type and Functions
Regular Expressions and Pattern Matching
Data Validation and Constraints
Query Optimization
Collation and Sorting

Popular

String Data Type

Basic String Functions

Substring and Extraction

String Concatenation and Formatting

Pattern Matching

Text Replacement and Transformation

String Comparison and Sorting

Unicode and Normalization

Text Encoding and Hashing

Performance Optimization

Best Practices

Common Use Cases

Further Reading

Related Articles

Unicode Support and Text Processing

String Data Type Share link

Basic String Functions Share link

Substring and Extraction Share link

String Concatenation and Formatting Share link

Pattern Matching Share link

Text Replacement and Transformation Share link

String Comparison and Sorting Share link

Unicode and Normalization Share link

Text Encoding and Hashing Share link

Performance Optimization Share link

Best Practices Share link

Common Use Cases Share link

Related Topics Share link

Further Reading Share link

Related Articles

Unicode Support and Text Processing

String Data Type

Basic String Functions

Substring and Extraction

String Concatenation and Formatting

Pattern Matching

Text Replacement and Transformation

String Comparison and Sorting

Unicode and Normalization

Text Encoding and Hashing

Performance Optimization

Best Practices

Common Use Cases

Related Topics

Further Reading