Event Correlation Deep Dive

This guide provides an in-depth look at KillIT v3's event correlation engine, its strategies, and configuration options.

Architecture Overview

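Events flow through three pipeline components (the ingestion service, the correlation queue, and the correlation worker) into the correlation service, which dispatches them to the individual strategies: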

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Event Ingestion │────▶│ Correlation Queue │────▶│ Correlation     │
│ Service         │     │ (Bull/Redis)      │     │ Worker          │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                           │
                                                           ▼
                                                   ┌─────────────────┐
                                                   │ Correlation     │
                                                   │ Service         │
                                                   └─────────────────┘
                                                           │
                                                   ┌───────┴────────┐
                                                   ▼                ▼
                                             ┌──────────┐    ┌──────────┐
                                             │ Temporal │    │ Topology │
                                             │ Strategy │    │ Strategy │
                                             └──────────┘    └──────────┘
                                                   ▼                ▼
                                             ┌──────────┐    ┌──────────┐
                                             │ Pattern  │    │ Service  │
                                             │ Strategy │    │ Strategy │
                                             └──────────┘    └──────────┘

Correlation Strategies

1. Dependency Correlation (NEW)

Purpose: Identifies root causes and impacts based on CI dependencies

Features:

Analyzes CI relationship chains to find root causes
Identifies cascading failures and impact propagation
Uses relationship types to understand dependency direction
Considers both upstream (root cause) and downstream (impact) events

Relationship Analysis:

HANADB01 (Database - CRITICAL)
    └─◀depends_on─ SAPPRD01 (Application - CRITICAL)
                     └─impacts→ Order Processing Service

Scoring:

Direct dependency: 0.9
One hop dependency: 0.7
Two hop dependency: 0.5
Time proximity boost: up to +0.2
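
The scoring list above can be sketched as a small function. This is only an illustration: the guide states the base scores and the maximum boost, but not how the boost is computed, so the linear decay, the cap at 1.0, and the function and parameter names below are assumptions.

```javascript
// Base scores by dependency distance, taken from the Scoring list above.
// hops = number of intermediate CIs (0 = direct dependency).
const BASE_SCORE_BY_HOPS = { 0: 0.9, 1: 0.7, 2: 0.5 };
const MAX_TIME_BOOST = 0.2;

// ASSUMPTION: the boost decays linearly to zero at the edge of the time
// window and the result is capped at 1.0. The guide only says "up to +0.2".
function dependencyScore(hops, timeDiffMs, windowMs) {
  const base = BASE_SCORE_BY_HOPS[hops];
  if (base === undefined) return 0; // beyond two hops: not scored in this sketch
  const closeness = Math.max(0, 1 - timeDiffMs / windowMs);
  return Math.min(1, base + MAX_TIME_BOOST * closeness);
}

// One-hop dependency, events 5 minutes apart in a 10-minute window:
// base 0.7 plus half of the boost, approximately 0.8.
console.log(dependencyScore(1, 5 * 60 * 1000, 10 * 60 * 1000));
```

Under these assumptions a direct dependency with simultaneous events saturates at 1.0, while a two-hop dependency can never exceed 0.7.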

Key Relationship Types:

depends_on - Direct operational dependency
runs_on - Application/service runs on infrastructure
hosted_on - Virtual/containerized workloads
database_connection - Database dependencies
network_connection - Network dependencies
uses - Service dependencies

2. Temporal Correlation

Purpose: Groups events occurring close together in time

Configuration:

Time Window: 30 minutes (extended from 5)
Score Calculation: Exponential decay based on time difference
Threshold: Events with a temporal score > 0.5 are considered related (the final correlation threshold of 0.7 is applied after score merging)

Algorithm:

score = Math.exp(-(timeDifference / timeWindow) * 2)

Example:

Event A at 10:00:00
Event B at 10:05:00 (5 minutes later)
Score: e^(-0.33) ≈ 0.72 (strongly correlated)
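
The example above can be reproduced directly from the formula. The sketch below also solves the formula for the largest time gap that still clears the 0.5 threshold; the function names are illustrative and not taken from the KillIT codebase.

```javascript
const TIME_WINDOW_MS = 30 * 60 * 1000; // 30-minute window from the Configuration list
const SCORE_THRESHOLD = 0.5;

// Exponential decay: 1.0 for simultaneous events, falling as the gap grows.
function temporalScore(timeDifferenceMs, timeWindowMs = TIME_WINDOW_MS) {
  return Math.exp(-(timeDifferenceMs / timeWindowMs) * 2);
}

// Solving exp(-2 * d / w) > t for d gives d < w * ln(1 / t) / 2.
function maxCorrelatedGapMs(threshold = SCORE_THRESHOLD, timeWindowMs = TIME_WINDOW_MS) {
  return (timeWindowMs * Math.log(1 / threshold)) / 2;
}

const fiveMinutes = 5 * 60 * 1000;
console.log(temporalScore(fiveMinutes).toFixed(2));        // 0.72
console.log((maxCorrelatedGapMs() / 60000).toFixed(1));    // 10.4
```

With a 30-minute window and a 0.5 threshold, only events less than about 10.4 minutes apart pass the temporal threshold, which is worth keeping in mind when tuning the window.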

3. Topology Correlation

Purpose: Correlates events from CIs with discovered relationships

Features:

Uses CMDB CI relationships
Considers relationship types and importance
Searches up to 3 hops in the relationship graph
Integrates with dependency correlation for enhanced accuracy

Scoring:

Base score: 0.7 for related CIs
Critical relationships: 0.9
Time proximity adjustment: up to 30% reduction
AI-enhanced importance weighting

Example Relationships:

Database Cluster
    ├── Member Of → HANADB01
    ├── Member Of → HANADB02
    └── Provides Service To → SAP Production
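
A bounded graph search is one way to implement the "up to 3 hops" lookup described above. The following is a minimal breadth-first sketch over the example relationships; the in-memory edge list, the function names, and the choice to follow relationships in both directions are assumptions, not the actual CMDB query.

```javascript
// Example relationships from above as [from, to, type] tuples.
const edges = [
  ["Database Cluster", "HANADB01", "Member Of"],
  ["Database Cluster", "HANADB02", "Member Of"],
  ["Database Cluster", "SAP Production", "Provides Service To"],
];

function buildGraph(edgeList) {
  const graph = new Map();
  const link = (a, b) => {
    if (!graph.has(a)) graph.set(a, new Set());
    graph.get(a).add(b);
  };
  for (const [from, to] of edgeList) {
    link(from, to);
    link(to, from); // ASSUMPTION: relationships are searched in both directions
  }
  return graph;
}

// Breadth-first search returning every CI reachable within maxHops,
// mapped to its hop distance from the starting CI.
function relatedCIs(graph, startCi, maxHops = 3) {
  const distance = new Map([[startCi, 0]]);
  let frontier = [startCi];
  for (let hop = 1; hop <= maxHops && frontier.length > 0; hop++) {
    const next = [];
    for (const ci of frontier) {
      for (const neighbor of graph.get(ci) ?? []) {
        if (!distance.has(neighbor)) {
          distance.set(neighbor, hop);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  distance.delete(startCi); // the starting CI is not "related" to itself
  return distance;
}

const graph = buildGraph(edges);
console.log(relatedCIs(graph, "HANADB01"));
```

Starting from HANADB01, the cluster is one hop away and both HANADB02 and SAP Production are two hops away, so events on any of them would be candidates for topology correlation.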

4. Pattern Correlation

Purpose: Groups events with identical correlation signatures

How it works:

Uses enhanced signature generation with CI context
Identifies recurring patterns across different systems
Learns from historical correlations
Detects known problem patterns

Signature Components:

Event title pattern
Severity
CI type
Service/application
Error codes or keywords
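
To make the idea of a signature concrete, here is a rough sketch that combines the first four components into a string key. The event field names and the normalization rules are hypothetical, and error-code extraction is left out because the guide does not describe it.

```javascript
// Collapse the variable parts of a title so that "node 12" and "node 7"
// produce the same pattern. ASSUMPTION: numbers are the only variable part.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/\d+/g, "#")
    .replace(/\s+/g, " ")
    .trim();
}

// ASSUMPTION: events expose title, severity, ciType and service fields.
function correlationSignature(event) {
  return [
    normalizeTitle(event.title ?? ""),
    event.severity,
    event.ciType,
    event.service,
  ]
    .map((part) => part ?? "-") // placeholder for missing components
    .join("|");
}

const a = { title: "Connection timeout on node 12", severity: "major", ciType: "server", service: "SAP" };
const b = { title: "Connection timeout on node 7", severity: "major", ciType: "server", service: "SAP" };
console.log(correlationSignature(a)); // connection timeout on node #|major|server|SAP
console.log(correlationSignature(a) === correlationSignature(b)); // true
```

Events whose signatures are identical can then be grouped with a single indexed lookup instead of pairwise comparison.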

Use Cases:

Multiple servers experiencing same error
Distributed system failures
Configuration-related issues
Known problem detection

5. Service Correlation

Purpose: Groups events affecting the same service or application

Configuration:

Time Window: 15 minutes
Matches on: service name, application name, or business service
Score: 0.8 for matches
Enhanced with service dependency mapping
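
The matching rule above might look like the following in code. This is a sketch under the assumption that events carry optional service, application, and businessService string fields plus a millisecond timestamp; the service dependency mapping enhancement is not modeled.

```javascript
const SERVICE_WINDOW_MS = 15 * 60 * 1000; // 15-minute window from the list above
const SERVICE_MATCH_SCORE = 0.8;

// ASSUMPTION: field names and the numeric `timestamp` (ms) are illustrative.
function serviceScore(a, b) {
  if (Math.abs(a.timestamp - b.timestamp) > SERVICE_WINDOW_MS) return 0;
  const fields = ["service", "application", "businessService"];
  // A match on any one field is enough; empty or missing values never match.
  const matches = fields.some((field) => Boolean(a[field]) && a[field] === b[field]);
  return matches ? SERVICE_MATCH_SCORE : 0;
}

const t0 = Date.parse("2025-06-17T10:00:00Z");
console.log(serviceScore(
  { service: "Order Processing", timestamp: t0 },
  { service: "Order Processing", timestamp: t0 + 60 * 1000 },
)); // 0.8
```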

Benefits:

Understands service-wide issues
Groups microservice failures
Identifies application-level problems
Tracks business service impacts

Correlation Scoring

Score Merging

When multiple strategies identify the same event pair:

Maximum Score: Takes the highest score from all strategies
Reason Tracking: Preserves all reasons for correlation
Threshold: Final score > 0.7 triggers correlation
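
The merging rules above can be illustrated with a short function. The shape of a strategy result ({ eventId, score, strategy, reason }) is an assumption made for this sketch.

```javascript
const CORRELATION_THRESHOLD = 0.7;

// Merge per-strategy results for the same candidate event:
// keep the highest score and collect every reason.
function mergeResults(results) {
  const merged = new Map();
  for (const { eventId, score, strategy, reason } of results) {
    const entry = merged.get(eventId) ?? { eventId, score: 0, reasons: [] };
    entry.score = Math.max(entry.score, score);
    entry.reasons.push({ strategy, reason });
    merged.set(eventId, entry);
  }
  return [...merged.values()].map((entry) => ({
    ...entry,
    isCorrelated: entry.score > CORRELATION_THRESHOLD, // strictly greater than 0.7
  }));
}

console.log(mergeResults([
  { eventId: "evt-1", score: 0.72, strategy: "temporal", reason: "time_proximity" },
  { eventId: "evt-1", score: 0.9, strategy: "dependency", reason: "upstream_dependency" },
  { eventId: "evt-2", score: 0.6, strategy: "temporal", reason: "time_proximity" },
]));
```

Here evt-1 ends up with a score of 0.9 and two reasons, while evt-2 stays below the threshold. Because the comparison is strict, a score of exactly 0.7 does not trigger correlation.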

Score Interpretation

Score Range    Interpretation          Action
0.9 - 1.0      Very High Confidence    Definitely related
0.7 - 0.9      High Confidence         Likely related
0.5 - 0.7      Medium Confidence       Possibly related
< 0.5          Low Confidence          Not correlated

Correlation Process Flow

1. Event Ingestion

// After deduplication and saving
await eventIngestionService.queueForCorrelation(event);

2. Queue Processing

// Worker picks up job
{
  eventId: "507f1f77bcf86cd799439011",
  eventData: {
    ciId: "...",
    timestamp: "2025-06-17T10:00:00Z",
    severity: "critical",
    source: "nagios"
  }
}

3. Strategy Execution

All strategies run in parallel
Each returns scored events
Results are merged
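
As a rough illustration of the three steps above, the sketch below runs stub strategies concurrently with Promise.all and keeps the best score per candidate event. The strategy functions are placeholders; the real ones would query stored events and the CMDB.

```javascript
// Stub strategies for illustration only. Each returns scored candidate events.
const strategies = {
  temporal: async (event) => [{ eventId: "evt-2", score: 0.72 }],
  topology: async (event) => [
    { eventId: "evt-2", score: 0.7 },
    { eventId: "evt-3", score: 0.9 },
  ],
};

async function runStrategies(event, strategyMap) {
  const names = Object.keys(strategyMap);
  // Every strategy is started before any result is awaited.
  const perStrategy = await Promise.all(names.map((name) => strategyMap[name](event)));

  const best = new Map(); // eventId -> { score, strategies }
  perStrategy.forEach((scored, i) => {
    for (const { eventId, score } of scored) {
      const entry = best.get(eventId) ?? { score: 0, strategies: [] };
      entry.score = Math.max(entry.score, score);
      entry.strategies.push(names[i]);
      best.set(eventId, entry);
    }
  });
  return best;
}

runStrategies({ eventId: "evt-1" }, strategies).then((best) => console.log(best));
```

Promise.all rejects as soon as any strategy fails. If one failing strategy should not block the others, Promise.allSettled is the usual alternative.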

4. Correlation Assignment

// Generate or reuse correlation ID
if (score > 0.7) {
  event.correlationId = "COR-1750135638797-abc123";
}
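
The "generate or reuse" decision could be sketched as follows. The ID shape (COR-<epoch ms>-<6 characters>) is inferred from the sample IDs in this guide, and the rule of reusing the ID of the best-scoring match is an assumption.

```javascript
const CORRELATION_THRESHOLD = 0.7;

// ASSUMPTION: ID format inferred from samples such as COR-1750135638797-abc123.
function newCorrelationId(now = Date.now()) {
  const suffix = Math.random().toString(36).slice(2, 8).padEnd(6, "0");
  return `COR-${now}-${suffix}`;
}

// matches: [{ eventId, score, correlationId? }] as produced by the strategies.
function assignCorrelationId(event, matches) {
  const strong = matches
    .filter((m) => m.score > CORRELATION_THRESHOLD)
    .sort((a, b) => b.score - a.score);
  if (strong.length === 0) return null; // nothing above threshold: leave uncorrelated

  // Prefer joining an existing group over creating a new one.
  const existing = strong.find((m) => m.correlationId);
  event.correlationId = existing ? existing.correlationId : newCorrelationId();
  return event.correlationId;
}

const event = { eventId: "evt-9" };
assignCorrelationId(event, [
  { eventId: "evt-1", score: 0.9, correlationId: "COR-1750135638797-abc123" },
  { eventId: "evt-2", score: 0.6 },
]);
console.log(event.correlationId); // COR-1750135638797-abc123
```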

5. Group Analysis

Severity breakdown
Affected CI analysis
Pattern detection
Root cause identification

Root Cause Analysis

The system identifies root cause candidates using:

Earliest Severe Event: First critical/major event
Topology Analysis: Events from upstream CIs
Pattern Matching: Known root cause patterns
AI Enhancement: ML-based root cause detection (if enabled)
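
The first heuristic, "Earliest Severe Event", is simple enough to sketch. The event shape (severity plus an ISO 8601 timestamp) follows the queue payload shown earlier; the function name is illustrative, and the other three heuristics are not covered.

```javascript
const SEVERE = new Set(["critical", "major"]);

// Returns the earliest critical/major event in the group, or null if none is severe.
function earliestSevereEvent(events) {
  return events
    .filter((e) => SEVERE.has(e.severity))
    .reduce(
      (earliest, e) =>
        earliest === null || Date.parse(e.timestamp) < Date.parse(earliest.timestamp)
          ? e
          : earliest,
      null,
    );
}

const group = [
  { eventId: "evt-1", severity: "warning", timestamp: "2025-06-17T09:59:00Z" },
  { eventId: "evt-2", severity: "major", timestamp: "2025-06-17T10:00:30Z" },
  { eventId: "evt-3", severity: "critical", timestamp: "2025-06-17T10:00:00Z" },
];
console.log(earliestSevereEvent(group).eventId); // evt-3
```

The warning at 09:59 is earlier but is ignored because it is not critical or major.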

Configuration Options

Environment Variables

# Enable/disable correlation worker
ENABLE_EVENT_WORKERS=true

# Redis configuration for queues
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=your-password

# Queue settings
QUEUE_PREFIX=killit-v3
MAX_CONCURRENT_JOBS=5

Correlation Settings

// In eventCorrelationService.js
const CORRELATION_CONFIG = {
  temporal: {
    timeWindow: 30 * 60 * 1000, // 30 minutes (extended from 5)
    scoreThreshold: 0.5
  },
  topology: {
    timeWindow: 10 * 60 * 1000, // 10 minutes
    maxHops: 2,
    baseScore: 0.7
  },
  pattern: {
    timeWindow: 60 * 60 * 1000, // 1 hour
    minScore: 0.9
  },
  service: {
    timeWindow: 15 * 60 * 1000, // 15 minutes
    score: 0.8
  }
};

Performance Considerations

Queue Optimization

Batch Processing: Process multiple correlations together
Caching: Cache CI relationships for faster topology correlation
Indexing: Ensure proper indexes on correlation fields

Database Indexes

Required indexes for optimal performance:

// Event collection indexes
{ correlationId: 1 }
{ correlationSignature: 1, status: 1 }
{ timestamp: -1 }
{ ciId: 1, timestamp: -1 }
{ service: 1, timestamp: -1 }

Scaling Considerations

Multiple Workers: Run multiple correlation workers
Redis Cluster: Use Redis cluster for queue scaling
Time Window Limits: Adjust windows based on event volume

Monitoring Correlation

Metrics to Track

Correlation Rate: % of events that get correlated
Processing Time: Average correlation processing time
Queue Depth: Number of pending correlation jobs
Strategy Effectiveness: Which strategies contribute most
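
Two of these metrics can be derived from event data alone. The sketch below assumes that an event counts as correlated when its correlationId is set, and that correlated events carry a reasons array with a strategy field, as in the API responses later in this guide; the helper names are hypothetical.

```javascript
// Correlation Rate: percentage of events that carry a correlationId.
function correlationRate(events) {
  if (events.length === 0) return 0;
  const correlated = events.filter((e) => Boolean(e.correlationId)).length;
  return (correlated / events.length) * 100;
}

// Strategy Effectiveness: how often each strategy appears in correlation reasons.
function strategyContribution(correlatedEvents) {
  const counts = {};
  for (const { reasons = [] } of correlatedEvents) {
    for (const { strategy } of reasons) {
      counts[strategy] = (counts[strategy] ?? 0) + 1;
    }
  }
  return counts;
}

const events = [
  { correlationId: "COR-1750135638797-abc123" },
  { correlationId: null },
  { correlationId: "COR-1750135638797-abc123" },
  {},
];
console.log(correlationRate(events)); // 50
console.log(strategyContribution([
  { reasons: [{ strategy: "dependency" }, { strategy: "temporal" }] },
  { reasons: [{ strategy: "temporal" }] },
])); // { dependency: 1, temporal: 2 }
```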

Bull Dashboard

Monitor correlation queue at /admin/queues:

Job counts
Processing times
Failed jobs
Queue health

Troubleshooting

Events Not Correlating

Check Time Windows: Events might be too far apart
Verify CI Relationships: Topology correlation needs CMDB data
Review Correlation Scores: Check if scores are below threshold
Queue Processing: Ensure worker is running

High Correlation Rate

If too many unrelated events are being correlated:

Increase Score Threshold: Raise from 0.7 to 0.8
Reduce Time Windows: Make windows smaller
Adjust Strategy Weights: Disable less effective strategies

Performance Issues

Add Queue Workers: Scale horizontally
Optimize Queries: Add missing indexes
Reduce Lookback: Smaller time windows
Cache Results: Implement caching layer

API Integration

Get Event with Correlations

GET /api/events/:eventId

Response includes:
{
  "event": {
    "_id": "68fb2d086aa9698a52774328",
    "eventId": "EVT-1750135638797",
    "title": "Database Process Crashed",
    "correlationId": "COR-1750135638797-abc123",
    "parentEventId": null,
    "childEventIds": ["68fb2d0b6aa9698a5277434f"]
  },
  "correlationGroup": {
    "events": [...],
    "analysis": {
      "correlationId": "COR-1750135638797-abc123",
      "eventCount": 15,
      "rootCauseCandidate": {
        "eventId": "68fb2d086aa9698a52774328",
        "confidence": 0.85,
        "reasoning": "earliest_severe_event_upstream_dependency"
      }
    }
  }
}

Correlate Specific Event

POST /api/events/:eventId/correlate

Response:
{
  "success": true,
  "correlations": {
    "isCorrelated": true,
    "correlationId": "COR-1750135638797-abc123",
    "correlationScore": 0.9,
    "strategiesUsed": ["dependency", "temporal", "topology"],
    "correlatedEvents": [
      {
        "eventId": "68fb2d086aa9698a52774328",
        "score": 0.9,
        "reasons": [
          {
            "strategy": "dependency",
            "reason": "upstream_dependency",
            "details": "SAPPRD01 depends on HANADB01"
          }
        ]
      }
    ]
  }
}

Get Event Correlations

GET /api/events/:eventId/correlations

Response:
{
  "success": true,
  "event": {
    "_id": "68fb2d0b6aa9698a5277434f",
    "title": "Database Connection Lost",
    "correlationId": "COR-1750135638797-abc123",
    "correlationMetadata": {
      "correlationType": "dependency",
      "rootCauseEvent": "68fb2d086aa9698a52774328",
      "rootCauseCI": "HANADB01",
      "confidence": 85
    }
  },
  "correlationGroup": {
    "events": [...],
    "analysis": {
      "timeSpan": {
        "start": "2025-01-24T10:00:00Z",
        "end": "2025-01-24T10:05:00Z"
      },
      "rootCauseCandidate": {...}
    }
  }
}

Analyze Recent Correlations

POST /api/events/analyze-correlations
{
  "hours": 24,           // Look back period
  "autoCorrelate": true  // Apply correlations automatically
}

Response:
{
  "success": true,
  "correlationGroups": 3,
  "eventsAnalyzed": 45,
  "eventsCorrelated": 12,
  "results": [
    {
      "rootCause": {
        "_id": "68fb2d086aa9698a52774328",
        "title": "Database Process Crashed",
        "severity": "critical",
        "ciName": "HANADB01",
        "timestamp": "2025-01-24T10:00:00Z"
      },
      "impactedEvents": [
        {
          "_id": "68fb2d0b6aa9698a5277434f",
          "title": "Database Connection Lost",
          "severity": "critical",
          "ciName": "SAPPRD01",
          "confidence": 0.9
        }
      ],
      "correlationId": "COR-1750135638797-abc123",
      "analysis": {
        "pattern": "database_cascade_failure",
        "totalImpact": "high",
        "affectedServices": ["SAP", "Order Processing"]
      }
    }
  ]
}

Manual Correlation

POST /api/events/correlate
{
  "primaryEventId": "68fb2d086aa9698a52774328",
  "relatedEventIds": ["68fb2d0b6aa9698a5277434f", "..."],
  "reason": "manual_correlation",
  "notes": "Related database failure events"
}

Practical Example: Database Cascade Failure

Here's a real-world example of how KillIT v3 correlates a database failure cascade:

Scenario

10:00:00 - HANADB01 database process crashes (Critical)
10:00:30 - SAPPRD01 loses database connection (Critical)
10:01:00 - Order Processing Service reports errors (Major)
10:01:30 - Web Portal shows 500 errors (Major)

Correlation Process

Step 1: Dependency Analysis

HANADB01 (Database)
    └─provides_data_to→ SAPPRD01 (SAP Application)
        └─serves→ Order Processing Service
            └─used_by→ Web Portal

Step 2: Correlation Scoring

HANADB01 → SAPPRD01: 0.9 (direct dependency + 30s time diff)
SAPPRD01 → Order Service: 0.85 (one hop + 30s time diff)
Order Service → Web Portal: 0.75 (two hops + 30s time diff)

Step 3: Root Cause Identification

HANADB01 identified as root cause (earliest critical event in dependency chain)
Confidence: 85% (based on timing and dependency analysis)

Step 4: Result

All four events grouped under correlation ID COR-1750135638797-abc123 with HANADB01 as the root cause.

UI Display

Events show correlation badge in list view
Correlation tab shows full dependency chain
Root cause highlighted with special indicator
Impact analysis shows affected services

Best Practices

CMDB Accuracy: Keep CI relationships up-to-date for accurate dependency correlation
Service Tagging: Consistently tag events with service names
Time Sync: Ensure all systems have synchronized clocks (NTP)
Regular Reviews: Analyze correlation effectiveness monthly
Tune Strategies: Adjust time windows and scores based on your environment
Relationship Types: Use specific relationship types (depends_on, runs_on, etc.)
Event Enrichment: Include CI names and service information in events
Monitor Correlation Rate: Track percentage of correlated vs isolated events

Performance Tuning

Recommended Settings by Environment Size

Small (< 1000 events/day)

{
  temporal: { timeWindow: 30 * 60 * 1000 },  // 30 min
  workers: 1,
  batchSize: 10
}

Medium (1000-10000 events/day)

{
  temporal: { timeWindow: 15 * 60 * 1000 },  // 15 min
  workers: 3,
  batchSize: 25,
  cacheEnabled: true
}

Large (> 10000 events/day)

{
  temporal: { timeWindow: 10 * 60 * 1000 },  // 10 min
  workers: 5,
  batchSize: 50,
  cacheEnabled: true,
  redisCluster: true
}
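
If the settings are chosen programmatically, the three tiers above could be wrapped in a helper like the one below. The values are copied from the recommendations; the function name and the idea of selecting by measured events per day are assumptions.

```javascript
const MINUTE_MS = 60 * 1000;

// Returns the recommended settings for a given daily event volume.
function recommendedSettings(eventsPerDay) {
  if (eventsPerDay < 1000) {
    // Small
    return { temporal: { timeWindow: 30 * MINUTE_MS }, workers: 1, batchSize: 10 };
  }
  if (eventsPerDay <= 10000) {
    // Medium
    return {
      temporal: { timeWindow: 15 * MINUTE_MS },
      workers: 3,
      batchSize: 25,
      cacheEnabled: true,
    };
  }
  // Large
  return {
    temporal: { timeWindow: 10 * MINUTE_MS },
    workers: 5,
    batchSize: 50,
    cacheEnabled: true,
    redisCluster: true,
  };
}

console.log(recommendedSettings(4200).workers); // 3
```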

Future Enhancements

Machine Learning: Pattern recognition for correlation prediction
Custom Strategies: Plugin architecture for industry-specific correlation
Real-time Visualization: Live correlation graph with drill-down
Predictive Correlation: Anticipate related events before they occur
Cross-tenant Correlation: For MSP environments with shared infrastructure
Correlation Templates: Pre-defined correlation patterns for common scenarios
Auto-remediation: Trigger automated fixes for correlated event patterns
