Event Correlation Deep Dive

This guide provides an in-depth look at KillIT v3's event correlation engine, its strategies, and configuration options.

Architecture Overview

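Events flow through three pipeline components (ingestion service, correlation queue, and correlation worker) before the correlation service dispatches them to the individual strategies: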

┌─────────────────┐     ┌───────────────────┐     ┌─────────────────┐
│ Event Ingestion │────▶│ Correlation Queue │────▶│   Correlation   │
│     Service     │     │   (Bull/Redis)    │     │     Worker      │
└─────────────────┘     └───────────────────┘     └────────┬────────┘
                                                           │
                                                           ▼
                                                  ┌─────────────────┐
                                                  │   Correlation   │
                                                  │     Service     │
                                                  └────────┬────────┘
                                                  ┌────────┴────────┐
                                                  ▼                 ▼
                                             ┌──────────┐      ┌──────────┐
                                             │ Temporal │      │ Topology │
                                             │ Strategy │      │ Strategy │
                                             └──────────┘      └──────────┘
                                                  ▼                 ▼
                                             ┌──────────┐      ┌──────────┐
                                             │ Pattern  │      │ Service  │
                                             │ Strategy │      │ Strategy │
                                             └──────────┘      └──────────┘

Correlation Strategies

1. Dependency Correlation (NEW)

Purpose: Identifies root causes and impacts based on CI dependencies

Features:

  • Analyzes CI relationship chains to find root causes
  • Identifies cascading failures and impact propagation
  • Uses relationship types to understand dependency direction
  • Considers both upstream (root cause) and downstream (impact) events

Relationship Analysis:

HANADB01 (Database - CRITICAL)
└─depends_on→ SAPPRD01 (Application - CRITICAL)
   └─impacts→ Order Processing Service

Scoring:

  • Direct dependency: 0.9
  • One hop dependency: 0.7
  • Two hop dependency: 0.5
  • Time proximity boost: up to +0.2
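The scoring rules above can be sketched as a small function. This is a minimal sketch rather than the engine's code: the name dependencyScore, the linear decay of the boost, the 30-minute boost window, and the cap at 1.0 are all assumptions. Only the base scores and the +0.2 ceiling come from the list above. A hop count of 0 stands for a direct dependency.

```javascript
// Base scores by hop distance, taken from the list above.
// Index 0 = direct dependency, 1 = one hop, 2 = two hops.
const BASE_SCORES = [0.9, 0.7, 0.5];
const MAX_TIME_BOOST = 0.2;

// Assumption: the boost decays linearly from +0.2 (simultaneous events)
// to 0 at the edge of `windowMs`. The real decay curve may differ.
function dependencyScore(hops, timeDiffMs, windowMs = 30 * 60 * 1000) {
  if (!Number.isInteger(hops) || hops < 0 || hops >= BASE_SCORES.length) {
    return 0; // beyond two hops: not scored in this sketch
  }
  const closeness = Math.max(0, 1 - Math.abs(timeDiffMs) / windowMs);
  const score = BASE_SCORES[hops] + MAX_TIME_BOOST * closeness;
  return Math.min(1, score); // assumption: scores are capped at 1.0
}

// One hop, events 15 minutes apart: 0.7 base + half of the 0.2 boost.
console.log(dependencyScore(1, 15 * 60 * 1000).toFixed(2)); // "0.80"
```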

Key Relationship Types:

  • depends_on - Direct operational dependency
  • runs_on - Application/service runs on infrastructure
  • hosted_on - Virtual/containerized workloads
  • database_connection - Database dependencies
  • network_connection - Network dependencies
  • uses - Service dependencies

2. Temporal Correlation

Purpose: Groups events occurring close together in time

Configuration:

  • Time Window: 30 minutes (extended from 5)
  • Score Calculation: Exponential decay based on time difference
  • Threshold: Events with score > 0.5 are considered related

Algorithm:

score = Math.exp(-(timeDifference / timeWindow) * 2)

Example:

  • Event A at 10:00:00
  • Event B at 10:05:00 (5 minutes later)
  • Score: e^(-0.33) ≈ 0.72 (strongly correlated)
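The formula and the worked example can be reproduced with a few lines of JavaScript. This is a minimal sketch: the function name temporalScore and the choice to return 0 for events outside the window are assumptions, while the decay formula and the 30-minute window come from the text above.

```javascript
const DEFAULT_WINDOW_MS = 30 * 60 * 1000; // 30-minute window from the configuration above

// Exponential decay: identical timestamps score 1.0, and the score falls
// to e^-2 (about 0.14) at the edge of the window.
function temporalScore(timestampA, timestampB, windowMs = DEFAULT_WINDOW_MS) {
  const diffMs = Math.abs(Date.parse(timestampA) - Date.parse(timestampB));
  if (diffMs > windowMs) return 0; // assumption: events outside the window are not scored
  return Math.exp(-(diffMs / windowMs) * 2);
}

const score = temporalScore("2025-06-17T10:00:00Z", "2025-06-17T10:05:00Z");
console.log(score.toFixed(2)); // "0.72"
console.log(score > 0.5);      // true
```

With a 30-minute window, the score crosses the 0.5 threshold when the gap reaches (ln 2 / 2) × 30, or roughly 10.4 minutes, so events further apart than that are not related by this strategy alone.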

3. Topology Correlation

Purpose: Correlates events from CIs with discovered relationships

Features:

  • Uses CMDB CI relationships
  • Considers relationship types and importance
  • Searches up to 3 hops in the relationship graph
  • Integrates with dependency correlation for enhanced accuracy

Scoring:

  • Base score: 0.7 for related CIs
  • Critical relationships: 0.9
  • Time proximity adjustment: up to 30% reduction
  • AI-enhanced importance weighting

Example Relationships:

Database Cluster
├── Member Of → HANADB01
├── Member Of → HANADB02
└── Provides Service To → SAP Production
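The hop-limited search can be illustrated with a bounded breadth-first search over the example relationships above. This is a sketch under stated assumptions: the function name hopsBetween and the adjacency-map representation are hypothetical, and relationships are treated as undirected for simplicity, which the real engine may not do.

```javascript
// Returns the number of hops between two CIs, or -1 if they are not
// connected within `maxHops`. `graph` is a hypothetical adjacency map:
// Map<ciName, string[]>.
function hopsBetween(graph, source, target, maxHops = 3) {
  if (source === target) return 0;
  const visited = new Set([source]);
  let frontier = [source];
  for (let hops = 1; hops <= maxHops; hops++) {
    const next = [];
    for (const ci of frontier) {
      for (const neighbor of graph.get(ci) || []) {
        if (neighbor === target) return hops;
        if (!visited.has(neighbor)) {
          visited.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return -1;
}

// The example relationships, stored in both directions.
const graph = new Map([
  ["Database Cluster", ["HANADB01", "HANADB02", "SAP Production"]],
  ["HANADB01", ["Database Cluster"]],
  ["HANADB02", ["Database Cluster"]],
  ["SAP Production", ["Database Cluster"]],
]);

console.log(hopsBetween(graph, "HANADB01", "SAP Production")); // 2
console.log(hopsBetween(graph, "HANADB01", "HANADB02", 1));    // -1
```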

4. Pattern Correlation

Purpose: Groups events with identical correlation signatures

How it works:

  • Uses enhanced signature generation with CI context
  • Identifies recurring patterns across different systems
  • Learns from historical correlations
  • Detects known problem patterns

Signature Components:

  • Event title pattern
  • Severity
  • CI type
  • Service/application
  • Error codes or keywords
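One way to combine these components into a signature is sketched below. The field names (ciType, service, errorCodes), the digit-masking rule, and the "|" separator are assumptions for illustration; the actual signature format is not documented here.

```javascript
// Collapse volatile parts of a title (digits, extra whitespace) so that
// "Disk full on HANADB01" and "Disk full on HANADB02" share one pattern.
function titlePattern(title) {
  return title.toLowerCase().replace(/\d+/g, "#").replace(/\s+/g, " ").trim();
}

// Join the signature components in a fixed order. Error codes are sorted
// so that their order in the event does not change the signature.
function correlationSignature(event) {
  return [
    titlePattern(event.title || ""),
    event.severity || "",
    event.ciType || "",
    event.service || "",
    [...(event.errorCodes || [])].sort().join(","),
  ].join("|");
}

const a = correlationSignature({
  title: "Disk full on HANADB01", severity: "major", ciType: "server", service: "SAP",
});
const b = correlationSignature({
  title: "Disk full on HANADB02", severity: "major", ciType: "server", service: "SAP",
});

console.log(a === b); // true
console.log(a);       // "disk full on hanadb#|major|server|SAP|"
```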

Use Cases:

  • Multiple servers experiencing same error
  • Distributed system failures
  • Configuration-related issues
  • Known problem detection

5. Service Correlation

Purpose: Groups events affecting the same service or application

Configuration:

  • Time Window: 15 minutes
  • Matches on: service name, application name, or business service
  • Score: 0.8 for matches
  • Enhanced with service dependency mapping

Benefits:

  • Understands service-wide issues
  • Groups microservice failures
  • Identifies application-level problems
  • Tracks business service impacts

Correlation Scoring

Score Merging

When multiple strategies identify the same event pair:

  1. Maximum Score: Takes the highest score from all strategies
  2. Reason Tracking: Preserves all reasons for correlation
  3. Threshold: Final score > 0.7 triggers correlation
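The merging rules above can be sketched as follows. The result shape ({ eventId, score, strategy, reason }) and the function name mergeStrategyResults are assumptions; the max-score rule, reason tracking, and the 0.7 threshold come from the list above.

```javascript
const CORRELATION_THRESHOLD = 0.7;

// `strategyResults` is an array of arrays, one per strategy, holding
// { eventId, score, strategy, reason } objects (a hypothetical shape).
function mergeStrategyResults(strategyResults) {
  const merged = new Map();
  for (const match of strategyResults.flat()) {
    const entry =
      merged.get(match.eventId) || { eventId: match.eventId, score: 0, reasons: [] };
    entry.score = Math.max(entry.score, match.score); // keep the highest score
    entry.reasons.push({ strategy: match.strategy, reason: match.reason }); // keep every reason
    merged.set(match.eventId, entry);
  }
  return [...merged.values()].map((entry) => ({
    ...entry,
    isCorrelated: entry.score > CORRELATION_THRESHOLD,
  }));
}

const merged = mergeStrategyResults([
  [{ eventId: "evt-1", score: 0.72, strategy: "temporal", reason: "time_proximity" }],
  [{ eventId: "evt-1", score: 0.9, strategy: "dependency", reason: "upstream_dependency" }],
]);

console.log(merged[0].score, merged[0].reasons.length, merged[0].isCorrelated); // 0.9 2 true
```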

Score Interpretation

Score Range    Interpretation          Action
0.9 - 1.0      Very High Confidence    Definitely related
0.7 - 0.9      High Confidence         Likely related
0.5 - 0.7      Medium Confidence       Possibly related
< 0.5          Low Confidence          Not correlated

Correlation Process Flow

1. Event Ingestion

// After deduplication and saving
await eventIngestionService.queueForCorrelation(event);

2. Queue Processing

// Worker picks up job
{
  eventId: "507f1f77bcf86cd799439011",
  eventData: {
    ciId: "...",
    timestamp: "2025-06-17T10:00:00Z",
    severity: "critical",
    source: "nagios"
  }
}

3. Strategy Execution

  • All strategies run in parallel
  • Each returns scored events
  • Results are merged
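Running the strategies in parallel might look like the sketch below. The strategy interface ({ name, run(event) }) and the function name runStrategies are hypothetical. Using Promise.allSettled so that one failing strategy does not discard the others is a design assumption; the source does not say how failures are handled.

```javascript
// Each strategy is a hypothetical object: { name, run(event) }, where
// run returns (or resolves to) an array of { eventId, score } matches.
async function runStrategies(event, strategies) {
  const settled = await Promise.allSettled(
    // Promise.resolve().then(...) also converts synchronous throws into rejections.
    strategies.map((s) => Promise.resolve().then(() => s.run(event)))
  );
  const results = [];
  settled.forEach((outcome, i) => {
    const name = strategies[i].name;
    if (outcome.status === "fulfilled") {
      // Tag each match with the strategy that produced it.
      results.push(...outcome.value.map((m) => ({ ...m, strategy: name })));
    } else {
      console.warn(`Strategy ${name} failed: ${outcome.reason}`);
    }
  });
  return results; // flat list, ready for score merging
}
```

A caller would await runStrategies(event, strategies) and pass the flat result list to the merging step.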

4. Correlation Assignment

// Generate or reuse correlation ID
if (score > 0.7) {
  event.correlationId = "COR-1750135638797-abc123";
}
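The "generate or reuse" step can be sketched as follows. The ID format is inferred from the example IDs in this guide ("COR-", epoch milliseconds, short suffix) and is an assumption, as are the function names and the { event, score } match shape.

```javascript
// Assumption: IDs follow the "COR-<epoch millis>-<random suffix>" shape
// seen in the examples; the real generator may differ.
function newCorrelationId(now = Date.now()) {
  const suffix = Math.random().toString(36).slice(2, 8).padEnd(6, "0");
  return `COR-${now}-${suffix}`;
}

// `matches` are merged results of the form { event, score }. The new event
// joins the group of its best match above the threshold, or a new group is
// created for both events if that match has no correlation ID yet.
function assignCorrelationId(event, matches, threshold = 0.7) {
  const best = matches
    .filter((m) => m.score > threshold)
    .sort((a, b) => b.score - a.score)[0];
  if (!best) return null; // nothing scored above the threshold
  const id = best.event.correlationId || newCorrelationId();
  best.event.correlationId = id;
  event.correlationId = id;
  return id;
}
```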

5. Group Analysis

  • Severity breakdown
  • Affected CI analysis
  • Pattern detection
  • Root cause identification

Root Cause Analysis

The system identifies root cause candidates using:

  1. Earliest Severe Event: First critical/major event
  2. Topology Analysis: Events from upstream CIs
  3. Pattern Matching: Known root cause patterns
  4. AI Enhancement: ML-based root cause detection (if enabled)
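The first heuristic, earliest severe event, can be sketched in isolation. This covers only that one signal; the topology, pattern, and AI signals are not modelled, and the function name and event fields are assumptions. The sample data mirrors the database cascade scenario described later in this guide.

```javascript
const SEVERE = new Set(["critical", "major"]);

// Returns the earliest critical/major event in a correlation group,
// or null if the group has no severe events.
function earliestSevereEvent(events) {
  const severe = events.filter((e) => SEVERE.has(e.severity));
  if (severe.length === 0) return null;
  return severe.reduce((earliest, e) =>
    Date.parse(e.timestamp) < Date.parse(earliest.timestamp) ? e : earliest
  );
}

const group = [
  { ciName: "SAPPRD01", severity: "critical", timestamp: "2025-01-24T10:00:30Z" },
  { ciName: "HANADB01", severity: "critical", timestamp: "2025-01-24T10:00:00Z" },
  { ciName: "Web Portal", severity: "major", timestamp: "2025-01-24T10:01:30Z" },
];

console.log(earliestSevereEvent(group).ciName); // "HANADB01"
```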

Configuration Options

Environment Variables

# Enable/disable correlation worker
ENABLE_EVENT_WORKERS=true

# Redis configuration for queues
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=your-password

# Queue settings
QUEUE_PREFIX=killit-v3
MAX_CONCURRENT_JOBS=5

Correlation Settings

// In eventCorrelationService.js
const CORRELATION_CONFIG = {
  temporal: {
    timeWindow: 30 * 60 * 1000, // 30 minutes (extended from 5)
    scoreThreshold: 0.5
  },
  topology: {
    timeWindow: 10 * 60 * 1000, // 10 minutes
    maxHops: 2,
    baseScore: 0.7
  },
  pattern: {
    timeWindow: 60 * 60 * 1000, // 1 hour
    minScore: 0.9
  },
  service: {
    timeWindow: 15 * 60 * 1000, // 15 minutes
    score: 0.8
  }
};

Performance Considerations

Queue Optimization

  1. Batch Processing: Process multiple correlations together
  2. Caching: Cache CI relationships for faster topology correlation
  3. Indexing: Ensure proper indexes on correlation fields

Database Indexes

Required indexes for optimal performance:

// Event collection indexes
{ correlationId: 1 }
{ correlationSignature: 1, status: 1 }
{ timestamp: -1 }
{ ciId: 1, timestamp: -1 }
{ service: 1, timestamp: -1 }

Scaling Considerations

  1. Multiple Workers: Run multiple correlation workers
  2. Redis Cluster: Use Redis cluster for queue scaling
  3. Time Window Limits: Adjust windows based on event volume

Monitoring Correlation

Metrics to Track

  1. Correlation Rate: % of events that get correlated
  2. Processing Time: Average correlation processing time
  3. Queue Depth: Number of pending correlation jobs
  4. Strategy Effectiveness: Which strategies contribute most
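The correlation rate and strategy effectiveness can be computed from stored events, for example as sketched below. The function names and the correlationMetadata.correlationType field used for the strategy breakdown are assumptions based on the API responses shown later in this guide.

```javascript
// Percentage of events that carry a correlation ID.
function correlationRate(events) {
  if (events.length === 0) return 0;
  const correlated = events.filter((e) => Boolean(e.correlationId)).length;
  return (correlated / events.length) * 100;
}

// Counts correlated events per strategy, reading the (assumed)
// correlationMetadata.correlationType field.
function strategyBreakdown(events) {
  const counts = {};
  for (const e of events) {
    const type = e.correlationMetadata && e.correlationMetadata.correlationType;
    if (e.correlationId && type) counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}

const sample = [
  { correlationId: "COR-1", correlationMetadata: { correlationType: "dependency" } },
  { correlationId: "COR-1", correlationMetadata: { correlationType: "temporal" } },
  { correlationId: null },
];

console.log(correlationRate(sample).toFixed(1)); // "66.7"
console.log(strategyBreakdown(sample));          // { dependency: 1, temporal: 1 }
```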

Bull Dashboard

Monitor the correlation queue at /admin/queues:

  • Job counts
  • Processing times
  • Failed jobs
  • Queue health

Troubleshooting

Events Not Correlating

  1. Check Time Windows: Events might be too far apart
  2. Verify CI Relationships: Topology correlation needs CMDB data
  3. Review Correlation Scores: Check if scores are below threshold
  4. Queue Processing: Ensure worker is running

High Correlation Rate

If too many unrelated events are being correlated:

  1. Increase Score Threshold: Raise from 0.7 to 0.8
  2. Reduce Time Windows: Make windows smaller
  3. Adjust Strategy Weights: Disable less effective strategies

Performance Issues

  1. Add Queue Workers: Scale horizontally
  2. Optimize Queries: Add missing indexes
  3. Reduce Lookback: Smaller time windows
  4. Cache Results: Implement caching layer

API Integration

Get Event with Correlations

GET /api/events/:eventId

Response includes:
{
  "event": {
    "_id": "68fb2d086aa9698a52774328",
    "eventId": "EVT-1750135638797",
    "title": "Database Process Crashed",
    "correlationId": "COR-1750135638797-abc123",
    "parentEventId": null,
    "childEventIds": ["68fb2d0b6aa9698a5277434f"]
  },
  "correlationGroup": {
    "events": [...],
    "analysis": {
      "correlationId": "COR-1750135638797-abc123",
      "eventCount": 15,
      "rootCauseCandidate": {
        "eventId": "68fb2d086aa9698a52774328",
        "confidence": 0.85,
        "reasoning": "earliest_severe_event_upstream_dependency"
      }
    }
  }
}

Correlate Specific Event

POST /api/events/:eventId/correlate

Response:
{
  "success": true,
  "correlations": {
    "isCorrelated": true,
    "correlationId": "COR-1750135638797-abc123",
    "correlationScore": 0.9,
    "strategiesUsed": ["dependency", "temporal", "topology"],
    "correlatedEvents": [
      {
        "eventId": "68fb2d086aa9698a52774328",
        "score": 0.9,
        "reasons": [
          {
            "strategy": "dependency",
            "reason": "upstream_dependency",
            "details": "SAPPRD01 depends on HANADB01"
          }
        ]
      }
    ]
  }
}

Get Event Correlations

GET /api/events/:eventId/correlations

Response:
{
  "success": true,
  "event": {
    "_id": "68fb2d0b6aa9698a5277434f",
    "title": "Database Connection Lost",
    "correlationId": "COR-1750135638797-abc123",
    "correlationMetadata": {
      "correlationType": "dependency",
      "rootCauseEvent": "68fb2d086aa9698a52774328",
      "rootCauseCI": "HANADB01",
      "confidence": 85
    }
  },
  "correlationGroup": {
    "events": [...],
    "analysis": {
      "timeSpan": {
        "start": "2025-01-24T10:00:00Z",
        "end": "2025-01-24T10:05:00Z"
      },
      "rootCauseCandidate": {...}
    }
  }
}

Analyze Recent Correlations

POST /api/events/analyze-correlations
{
  "hours": 24,          // Look back period
  "autoCorrelate": true // Apply correlations automatically
}

Response:
{
  "success": true,
  "correlationGroups": 3,
  "eventsAnalyzed": 45,
  "eventsCorrelated": 12,
  "results": [
    {
      "rootCause": {
        "_id": "68fb2d086aa9698a52774328",
        "title": "Database Process Crashed",
        "severity": "critical",
        "ciName": "HANADB01",
        "timestamp": "2025-01-24T10:00:00Z"
      },
      "impactedEvents": [
        {
          "_id": "68fb2d0b6aa9698a5277434f",
          "title": "Database Connection Lost",
          "severity": "critical",
          "ciName": "SAPPRD01",
          "confidence": 0.9
        }
      ],
      "correlationId": "COR-1750135638797-abc123",
      "analysis": {
        "pattern": "database_cascade_failure",
        "totalImpact": "high",
        "affectedServices": ["SAP", "Order Processing"]
      }
    }
  ]
}

Manual Correlation

POST /api/events/correlate
{
  "primaryEventId": "68fb2d086aa9698a52774328",
  "relatedEventIds": ["68fb2d0b6aa9698a5277434f", "..."],
  "reason": "manual_correlation",
  "notes": "Related database failure events"
}
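A client could assemble this request as in the sketch below. The helper name buildManualCorrelationRequest, the base URL, and the client-side checks (deduplicating IDs, excluding the primary event) are assumptions for illustration. Authentication headers are omitted because they are not described in this guide.

```javascript
// Builds the URL and fetch options for a manual correlation request.
function buildManualCorrelationRequest(baseUrl, primaryEventId, relatedEventIds, notes = "") {
  if (!primaryEventId) throw new Error("primaryEventId is required");
  // Drop duplicates, empty IDs, and the primary event itself.
  const related = [...new Set(relatedEventIds)].filter(
    (id) => id && id !== primaryEventId
  );
  if (related.length === 0) throw new Error("at least one related event is required");
  return {
    url: `${baseUrl}/api/events/correlate`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        primaryEventId,
        relatedEventIds: related,
        reason: "manual_correlation",
        notes,
      }),
    },
  };
}

// "https://killit.example.com" is a placeholder host.
const request = buildManualCorrelationRequest(
  "https://killit.example.com",
  "68fb2d086aa9698a52774328",
  ["68fb2d0b6aa9698a5277434f"],
  "Related database failure events"
);
console.log(request.url); // "https://killit.example.com/api/events/correlate"
```

The result can be passed to fetch(request.url, request.options) in any runtime that provides the Fetch API.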

Practical Example: Database Cascade Failure

Here's a real-world example of how KillIT v3 correlates a database failure cascade:

Scenario

  1. 10:00:00 - HANADB01 database process crashes (Critical)
  2. 10:00:30 - SAPPRD01 loses database connection (Critical)
  3. 10:01:00 - Order Processing Service reports errors (Major)
  4. 10:01:30 - Web Portal shows 500 errors (Major)

Correlation Process

Step 1: Dependency Analysis

HANADB01 (Database)
└─provides_data_to→ SAPPRD01 (SAP Application)
   └─serves→ Order Processing Service
      └─used_by→ Web Portal

Step 2: Correlation Scoring

  • HANADB01 → SAPPRD01: 0.9 (direct dependency + 30s time diff)
  • SAPPRD01 → Order Service: 0.85 (one hop + 30s time diff)
  • Order Service → Web Portal: 0.75 (two hops + 30s time diff)

Step 3: Root Cause Identification

  • HANADB01 identified as root cause (earliest critical event in dependency chain)
  • Confidence: 85% (based on timing and dependency analysis)

Step 4: Result

All four events grouped under correlation ID COR-1750135638797-abc123 with HANADB01 as the root cause.

UI Display

  • Events show correlation badge in list view
  • Correlation tab shows full dependency chain
  • Root cause highlighted with special indicator
  • Impact analysis shows affected services

Best Practices

  1. CMDB Accuracy: Keep CI relationships up to date for accurate dependency correlation
  2. Service Tagging: Consistently tag events with service names
  3. Time Sync: Ensure all systems have synchronized clocks (NTP)
  4. Regular Reviews: Analyze correlation effectiveness monthly
  5. Tune Strategies: Adjust time windows and scores based on your environment
  6. Relationship Types: Use specific relationship types (depends_on, runs_on, etc.)
  7. Event Enrichment: Include CI names and service information in events
  8. Monitor Correlation Rate: Track percentage of correlated vs isolated events

Performance Tuning

Small (< 1000 events/day)

{
  temporal: { timeWindow: 30 * 60 * 1000 }, // 30 min
  workers: 1,
  batchSize: 10
}

Medium (1000-10000 events/day)

{
  temporal: { timeWindow: 15 * 60 * 1000 }, // 15 min
  workers: 3,
  batchSize: 25,
  cacheEnabled: true
}

Large (> 10000 events/day)

{
  temporal: { timeWindow: 10 * 60 * 1000 }, // 10 min
  workers: 5,
  batchSize: 50,
  cacheEnabled: true,
  redisCluster: true
}
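The three profiles can be selected by daily event volume, as in this sketch. The PROFILES object restates the values above; the selectProfile helper and its treatment of the boundaries (exactly 1000 and exactly 10000 events/day both map to medium) are assumptions.

```javascript
const PROFILES = {
  small: { temporal: { timeWindow: 30 * 60 * 1000 }, workers: 1, batchSize: 10 },
  medium: {
    temporal: { timeWindow: 15 * 60 * 1000 },
    workers: 3,
    batchSize: 25,
    cacheEnabled: true,
  },
  large: {
    temporal: { timeWindow: 10 * 60 * 1000 },
    workers: 5,
    batchSize: 50,
    cacheEnabled: true,
    redisCluster: true,
  },
};

// Maps a daily event volume to a profile name.
function selectProfile(eventsPerDay) {
  if (eventsPerDay < 1000) return "small";
  if (eventsPerDay <= 10000) return "medium";
  return "large";
}

console.log(selectProfile(4200));                    // "medium"
console.log(PROFILES[selectProfile(4200)].workers);  // 3
```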

Future Enhancements

  • Machine Learning: Pattern recognition for correlation prediction
  • Custom Strategies: Plugin architecture for industry-specific correlation
  • Real-time Visualization: Live correlation graph with drill-down
  • Predictive Correlation: Anticipate related events before they occur
  • Cross-tenant Correlation: For MSP environments with shared infrastructure
  • Correlation Templates: Pre-defined correlation patterns for common scenarios
  • Auto-remediation: Trigger automated fixes for correlated event patterns