Data Governance

Data governance in Geode covers four areas: managing data quality, tracking lineage, enforcing policies, and maintaining compliance across your graph database. This guide describes the tools and practices for implementing data governance in enterprise environments.

Data Governance Overview

Data governance ensures that data is:

Accurate: Data quality meets business requirements
Secure: Access is controlled and audited
Compliant: Regulatory requirements are met
Discoverable: Users can find and understand data
Traceable: Data lineage is documented
Consistent: Standards are enforced across the organization

The sections below describe the built-in Geode features that support each of these goals.

Data Quality Management

Schema Constraints

Define data quality rules using schema constraints:

-- Ensure email addresses are valid
CREATE CONSTRAINT valid_email
  ON (u:User)
  ASSERT u.email MATCHES '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$';

-- Ensure phone numbers follow E.164 format
CREATE CONSTRAINT valid_phone
  ON (c:Contact)
  ASSERT c.phone MATCHES '^\+[1-9]\d{1,14}$';

-- Ensure dates are in the future for events
CREATE CONSTRAINT future_event
  ON (e:Event)
  ASSERT e.event_date > current_date();

-- Ensure numeric ranges
CREATE CONSTRAINT valid_age
  ON (p:Person)
  ASSERT p.age >= 0 AND p.age <= 150;

-- Ensure required fields are present
CREATE CONSTRAINT required_fields
  ON (p:Product)
  ASSERT p.name IS NOT NULL
    AND p.sku IS NOT NULL
    AND p.price IS NOT NULL;
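
Constraints like these reject bad values at write time, so it can be useful to pre-check values in the client before sending them. The following is a minimal sketch that reuses the two regular expressions from the constraints above in plain Python. The function names are hypothetical and are not part of any Geode client library.

```python
import re

# Same patterns as the valid_email and valid_phone constraints above.
# re.ASCII keeps \d limited to the digits 0-9.
EMAIL_RE = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
E164_RE = re.compile(r'^\+[1-9]\d{1,14}$', re.ASCII)


def is_valid_email(value: str) -> bool:
    """Return True if the whole string matches the valid_email pattern."""
    return EMAIL_RE.fullmatch(value) is not None


def is_valid_phone(value: str) -> bool:
    """Return True if the whole string matches the E.164 pattern."""
    return E164_RE.fullmatch(value) is not None
```

A client-side check of this kind complements the constraint rather than replacing it: the database constraint remains the authoritative rule.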

Data Quality Checks

Implement automated data quality checks:

-- Find records with missing required data
MATCH (p:Person)
WHERE p.email IS NULL
   OR p.name IS NULL
   OR p.created_at IS NULL
RETURN count(p) AS incomplete_records;

-- Find duplicate records
MATCH (p1:Person), (p2:Person)
WHERE p1.email = p2.email
  AND id(p1) < id(p2)
RETURN p1.email, count(*) AS duplicates;

-- Find orphaned relationships
MATCH ()-[r:BELONGS_TO]->()
WHERE NOT EXISTS {
  MATCH (n)-[r]->(m)
  WHERE n:Entity AND m:Group
}
RETURN count(r) AS orphaned_relationships;

-- Validate referential integrity: find orders with no linked customer
MATCH (o:Order)
WHERE NOT EXISTS {
  MATCH (o)-[:ORDERED_BY]->(:Customer)
}
RETURN count(o) AS orders_without_customers;

Data Quality Metrics

Track data quality over time:

-- Create data quality metrics
CREATE (:DataQualityMetric {
  name: 'email_completeness',
  timestamp: current_timestamp(),
  total_records: count {MATCH (p:Person) RETURN p},
  complete_records: count {MATCH (p:Person) WHERE p.email IS NOT NULL RETURN p},
  completeness_pct: 100.0 * count {MATCH (p:Person) WHERE p.email IS NOT NULL RETURN p} /
                           count {MATCH (p:Person) RETURN p}
});

-- Query quality trends
MATCH (m:DataQualityMetric)
WHERE m.name = 'email_completeness'
  AND m.timestamp > current_timestamp() - duration('P30D')
RETURN m.timestamp, m.completeness_pct
ORDER BY m.timestamp;
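
The completeness_pct expression above divides by the total record count, which is zero when no Person nodes exist. How the engine reports that case is not covered here, so the sketch below shows the same calculation in Python with an explicit guard. It assumes records have already been fetched as dictionaries; the function name is hypothetical.

```python
from typing import Iterable, Mapping, Optional


def completeness_pct(records: Iterable[Mapping[str, object]],
                     field: str) -> Optional[float]:
    """Percentage of records whose `field` is present and not None.

    Returns None for an empty input instead of dividing by zero.
    """
    total = 0
    complete = 0
    for record in records:
        total += 1
        if record.get(field) is not None:
            complete += 1
    if total == 0:
        return None
    return 100.0 * complete / total
```

Returning None for an empty data set keeps "no data" distinct from "0% complete" when the metric is plotted over time.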

Data Lineage Tracking

Lineage tracking records how data flows from source systems through transformations into data assets.

Lineage Model

Model data lineage in the graph:

-- Source systems
CREATE (:DataSource {
  id: 'crm_system',
  name: 'Customer CRM',
  type: 'database',
  connection: 'postgresql://crm.example.com'
});

-- Data transformations
CREATE (:DataTransformation {
  id: 'etl_customer_enrichment',
  name: 'Customer Data Enrichment',
  type: 'ETL',
  script: 'customer_enrichment.py',
  version: '2.1.0',
  last_run: current_timestamp()
});

-- Data assets
CREATE (:DataAsset {
  id: 'customer_360',
  name: 'Customer 360 View',
  type: 'graph',
  schema: 'Person, Company, Product'
});

-- Create lineage relationships
MATCH (source:DataSource {id: 'crm_system'})
MATCH (transform:DataTransformation {id: 'etl_customer_enrichment'})
MATCH (asset:DataAsset {id: 'customer_360'})
CREATE (source)-[:FEEDS]->(transform)
CREATE (transform)-[:PRODUCES]->(asset);

Automatic Lineage Tracking

Enable automatic lineage tracking:

# Enable lineage tracking for all queries
geode serve --lineage-tracking=enabled \
  --lineage-detail=full \
  --lineage-storage=graph

Query lineage information:

-- Find all data sources for a specific asset
-- (lineage paths combine FEEDS and PRODUCES edges, so follow both)
MATCH (source:DataSource)-[:FEEDS|PRODUCES*]->(asset:DataAsset {id: 'customer_360'})
RETURN source.name, source.type;

-- Trace downstream impact of a data source
MATCH (source:DataSource {id: 'crm_system'})-[:FEEDS|PRODUCES*]->(downstream)
RETURN downstream.name, labels(downstream);

-- Find all transformations applied to data
MATCH path = (source)-[:FEEDS|PRODUCES*]->(asset:DataAsset {id: 'customer_360'})
WHERE ANY(n IN nodes(path) WHERE 'DataTransformation' IN labels(n))
RETURN [n IN nodes(path) WHERE 'DataTransformation' IN labels(n) | n.name] AS transformations;

Impact Analysis

Analyze the impact of changes:

-- Find all assets affected by changing a source
MATCH (source:DataSource {id: 'crm_system'})-[:FEEDS|PRODUCES*]->(affected)
RETURN DISTINCT labels(affected), affected.name
ORDER BY labels(affected);

-- Find all users affected by data source change
MATCH (source:DataSource {id: 'crm_system'})-[:FEEDS|PRODUCES*]->(asset:DataAsset)
MATCH (user:User)-[:USES]->(asset)
RETURN DISTINCT user.name, user.email;
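
Impact analysis is a reachability question: starting from one node, which nodes can be reached by following lineage edges? The sketch below shows that traversal in Python over an in-memory copy of the lineage edges. The dictionary layout and function name are assumptions for illustration, and 'churn_report' is a made-up downstream asset that does not appear elsewhere in this guide.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical in-memory copy of the lineage edges, keyed by upstream node id.
LINEAGE: Dict[str, List[str]] = {
    "crm_system": ["etl_customer_enrichment"],
    "etl_customer_enrichment": ["customer_360"],
    "customer_360": ["churn_report"],  # illustrative only
}


def downstream(edges: Dict[str, List[str]], start: str) -> List[str]:
    """Return every node reachable from start, in breadth-first order."""
    seen: Set[str] = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:  # guards against cycles in the lineage graph
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

The visited set matters in practice because lineage graphs can contain cycles, for example when an asset feeds back into the pipeline that produced it.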

Metadata Management

Data Catalog

Build a searchable data catalog:

-- Create catalog entries
CREATE (:CatalogEntry {
  id: 'customers_table',
  name: 'Customer Data',
  type: 'dataset',
  description: 'Primary customer information including contact details and preferences',
  owner: 'data-team@example.com',
  created_at: current_timestamp(),
  updated_at: current_timestamp(),
  tags: ['pii', 'customer', 'core'],
  classification: 'confidential',
  retention_period: duration('P7Y')
});

-- Add schema information
CREATE (:SchemaField {
  name: 'email',
  type: 'string',
  description: 'Customer email address',
  required: true,
  pii: true,
  example: 'customer@example.com'
});

-- Link catalog entries to schema
MATCH (catalog:CatalogEntry {id: 'customers_table'})
MATCH (field:SchemaField {name: 'email'})
CREATE (catalog)-[:HAS_FIELD]->(field);

Searchable Metadata

Search the data catalog:

-- Search by tag
MATCH (entry:CatalogEntry)
WHERE 'pii' IN entry.tags
RETURN entry.name, entry.description;

-- Search by classification
MATCH (entry:CatalogEntry)
WHERE entry.classification = 'confidential'
RETURN entry.name, entry.owner;

-- Keyword search (substring match on name and description)
MATCH (entry:CatalogEntry)
WHERE entry.name CONTAINS 'customer'
   OR entry.description CONTAINS 'customer'
RETURN entry.name, entry.description;

Business Glossary

Define business terms and link to technical assets:

-- Create business terms
CREATE (:BusinessTerm {
  id: 'customer_lifetime_value',
  name: 'Customer Lifetime Value',
  abbreviation: 'CLV',
  definition: 'Predicted net profit attributed to the entire future relationship with a customer',
  owner: 'finance-team@example.com',
  approved_by: 'CFO',
  approved_at: current_timestamp()
});

-- Link terms to data assets
MATCH (term:BusinessTerm {id: 'customer_lifetime_value'})
MATCH (field:SchemaField {name: 'lifetime_value'})
CREATE (term)-[:DEFINED_BY]->(field);

Access Policies

Policy-Based Access Control

Define and enforce access policies:

-- Create data access policy
CREATE (:DataAccessPolicy {
  id: 'pii_access_policy',
  name: 'PII Access Control',
  description: 'Restrict access to personally identifiable information',
  effective_date: current_timestamp(),
  created_by: 'security-team@example.com'
});

-- Define policy rules
CREATE (:PolicyRule {
  policy_id: 'pii_access_policy',
  rule_type: 'row_level_security',
  condition: 'user.has_role("pii_viewer") OR data.owner = current_user()',
  action: 'allow'
});

-- Apply policies to data
MATCH (policy:DataAccessPolicy {id: 'pii_access_policy'})
MATCH (entry:CatalogEntry)
WHERE 'pii' IN entry.tags
CREATE (entry)-[:GOVERNED_BY]->(policy);

Data Classification

Classify data by sensitivity:

-- Define classification levels
CREATE (:ClassificationLevel {name: 'public', level: 1});
CREATE (:ClassificationLevel {name: 'internal', level: 2});
CREATE (:ClassificationLevel {name: 'confidential', level: 3});
CREATE (:ClassificationLevel {name: 'restricted', level: 4});

-- Classify data assets
MATCH (entry:CatalogEntry {id: 'customers_table'})
MATCH (level:ClassificationLevel {name: 'confidential'})
CREATE (entry)-[:CLASSIFIED_AS]->(level);

-- Enforce classification-based access
CREATE POLICY classification_access
  ON CatalogEntry
  FOR SELECT
  USING {
    MATCH (entry)-[:CLASSIFIED_AS]->(level:ClassificationLevel)
    MATCH (user:User {id: current_user()})
    WHERE user.clearance_level >= level.level
    RETURN true
  };
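
The rule the policy encodes is a numeric comparison: a user may read an entry when the user's clearance level is at least the entry's classification level. A minimal Python sketch of that comparison follows, using the four levels defined above. The function name and the deny-by-default handling of unknown classifications are assumptions, not documented Geode behavior.

```python
# Levels copied from the ClassificationLevel nodes defined above.
LEVELS = {"public": 1, "internal": 2, "confidential": 3, "restricted": 4}


def can_read(user_clearance: int, classification: str) -> bool:
    """Return True if the clearance meets or exceeds the required level."""
    required = LEVELS.get(classification)
    if required is None:
        return False  # unknown classification: deny by default (assumption)
    return user_clearance >= required
```

For example, a user with clearance 3 can read 'confidential' entries but not 'restricted' ones.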

Data Retention and Lifecycle

Retention Policies

Define and enforce data retention:

-- Create retention policy
CREATE (:RetentionPolicy {
  id: 'gdpr_customer_retention',
  name: 'GDPR Customer Data Retention',
  retention_period: duration('P7Y'),
  deletion_method: 'secure_delete',
  legal_basis: 'GDPR Article 5(1)(e)',
  approved_by: 'legal-team@example.com'
});

-- Apply to data
MATCH (policy:RetentionPolicy {id: 'gdpr_customer_retention'})
MATCH (entry:CatalogEntry)
WHERE 'customer' IN entry.tags
CREATE (entry)-[:GOVERNED_BY]->(policy);

-- Find data eligible for deletion
MATCH (data)-[:GOVERNED_BY]->(policy:RetentionPolicy)
WHERE data.created_at + policy.retention_period < current_timestamp()
RETURN data.id, data.name, data.created_at;
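
The eligibility test adds the retention period to the creation time and compares the result with the current time. The sketch below reproduces that check in Python for a period expressed in whole years, such as P7Y. Python's standard library has no calendar-year duration, so the add_years helper is a hypothetical stand-in, and its handling of February 29 is one possible convention rather than Geode's documented behavior.

```python
from datetime import datetime, timezone


def add_years(moment: datetime, years: int) -> datetime:
    """Add calendar years, mapping Feb 29 to Feb 28 in non-leap years."""
    try:
        return moment.replace(year=moment.year + years)
    except ValueError:
        # Only raised when moment is Feb 29 and the target year is not a leap year.
        return moment.replace(year=moment.year + years, day=28)


def eligible_for_deletion(created_at: datetime, retention_years: int,
                          now: datetime) -> bool:
    """True once created_at plus the retention period lies strictly before now."""
    return add_years(created_at, retention_years) < now
```

Passing the current time in as a parameter, rather than reading the clock inside the function, makes the check easy to test and to run in dry-run mode against a chosen date.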

Automated Lifecycle Management

# Enable automated data lifecycle management
geode serve --lifecycle-management=enabled \
  --lifecycle-check-interval=daily \
  --lifecycle-enforcement=true

# Run manual lifecycle check
geode lifecycle-check --policy=gdpr_customer_retention \
  --dry-run=true \
  --output=lifecycle-report.json

Data Stewardship

Assign Data Stewards

-- Create stewardship assignments
CREATE (:DataSteward {
  id: 'alice@example.com',
  name: 'Alice Johnson',
  title: 'Senior Data Steward',
  department: 'Data Governance',
  responsibilities: ['Customer Data', 'Product Data']
});

-- Assign stewards to data
MATCH (steward:DataSteward {id: 'alice@example.com'})
MATCH (entry:CatalogEntry)
WHERE 'customer' IN entry.tags
CREATE (steward)-[:RESPONSIBLE_FOR]->(entry);

-- Find steward for specific data
MATCH (steward:DataSteward)-[:RESPONSIBLE_FOR]->(entry:CatalogEntry {id: 'customers_table'})
RETURN steward.name, steward.id;

Compliance Reporting

Generate Compliance Reports

# Generate GDPR compliance report
geode governance-report --framework=gdpr \
  --include=data-inventory,lineage,access-log \
  --start-date=2025-01-01 \
  --end-date=2025-12-31 \
  --output=gdpr-compliance-2025.pdf

# Generate data quality report
geode governance-report --type=data-quality \
  --metrics=completeness,accuracy,consistency \
  --output=data-quality-report.json

# Generate access report
geode governance-report --type=access-audit \
  --user=[email protected] \
  --include-lineage=true \
  --output=access-audit.json

Compliance Dashboards

-- Data quality dashboard metrics
MATCH (m:DataQualityMetric)
WHERE m.timestamp > current_timestamp() - duration('P1D')
RETURN m.name, avg(m.completeness_pct) AS avg_completeness
ORDER BY m.name;

-- Policy compliance metrics
-- OPTIONAL MATCH keeps policies with no violations; DISTINCT avoids
-- counting each violation once per governed asset
MATCH (entry:CatalogEntry)-[:GOVERNED_BY]->(policy)
OPTIONAL MATCH (violation:PolicyViolation)-[:VIOLATED]->(policy)
RETURN policy.name,
       count(DISTINCT entry) AS governed_assets,
       count(DISTINCT violation) AS violations,
       100.0 * (1 - count(DISTINCT violation)::float / count(DISTINCT entry)) AS compliance_pct;

Data Discovery

Self-Service Discovery

Enable users to discover data:

-- Search catalog by keyword
MATCH (entry:CatalogEntry)
WHERE entry.name CONTAINS $keyword
   OR entry.description CONTAINS $keyword
   OR ANY(tag IN entry.tags WHERE tag CONTAINS $keyword)
RETURN entry.name, entry.description, entry.tags, entry.owner;

-- Browse by classification
MATCH (entry:CatalogEntry)-[:CLASSIFIED_AS]->(level:ClassificationLevel)
WHERE level.name = $classification
RETURN entry.name, entry.description;

-- Find related datasets
MATCH (entry:CatalogEntry {id: $dataset_id})-[:RELATED_TO*1..2]-(related:CatalogEntry)
RETURN DISTINCT related.name, related.description;
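
The keyword search above matches the keyword against the name, the description, and each tag. Whether CONTAINS ignores case in Geode is not stated in this guide, so a client that needs case-insensitive discovery may have to normalize text itself. The sketch below shows that matching logic in Python over catalog entries already loaded as dictionaries; the function name and the dictionary shape are assumptions.

```python
from typing import Dict, List


def search_catalog(entries: List[Dict], keyword: str) -> List[Dict]:
    """Return entries whose name, description, or any tag contains keyword,
    ignoring case."""
    needle = keyword.casefold()
    hits = []
    for entry in entries:
        haystacks = [entry.get("name", ""), entry.get("description", "")]
        haystacks.extend(entry.get("tags", []))
        if any(needle in text.casefold() for text in haystacks):
            hits.append(entry)
    return hits
```

With this normalization, a search for "CUSTOMER" finds an entry named "Customer Data" as well as one tagged "customer".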

Best Practices

Establish Clear Ownership: Assign data stewards for all critical data assets
Document Everything: Maintain comprehensive metadata for all data
Automate Quality Checks: Run automated data quality checks regularly
Track Lineage: Enable automatic lineage tracking for all data flows
Classify Appropriately: Classify all data by sensitivity level
Enforce Policies: Use automated policy enforcement, not just documentation
Regular Audits: Conduct periodic governance audits
User Education: Train users on governance policies and tools
Measure Effectiveness: Track governance metrics and KPIs
Continuous Improvement: Regularly review and update governance policies

Related Topics

Compliance - Regulatory compliance frameworks
Audit Logging - Comprehensive audit trails
Row-Level Security - Fine-grained access control
Data Integrity - Data consistency and validation
Encryption - Data protection with encryption
Configuration - Governance configuration settings
Schema - Schema design and constraints
Authorization - Permission management
