Event Correlation Deep Dive
This guide provides an in-depth look at KillIT v3's event correlation engine, its strategies, and configuration options.
Architecture Overview
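The correlation engine is a three-stage pipeline (ingestion service, queue, worker). The worker hands each job to the correlation service, which runs the strategies: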
┌─────────────────┐     ┌───────────────────┐     ┌─────────────────┐
│ Event Ingestion │────▶│ Correlation Queue │────▶│   Correlation   │
│     Service     │     │   (Bull/Redis)    │     │     Worker      │
└─────────────────┘     └───────────────────┘     └─────────────────┘
                                                           │
                                                           ▼
                                                  ┌─────────────────┐
                                                  │   Correlation   │
                                                  │     Service     │
                                                  └─────────────────┘
                                                           │
                                                   ┌───────┴────────┐
                                                   ▼                ▼
                                              ┌──────────┐     ┌──────────┐
                                              │ Temporal │     │ Topology │
                                              │ Strategy │     │ Strategy │
                                              └──────────┘     └──────────┘
                                                   ▼                ▼
                                              ┌──────────┐     ┌──────────┐
                                              │ Pattern  │     │ Service  │
                                              │ Strategy │     │ Strategy │
                                              └──────────┘     └──────────┘
Correlation Strategies
1. Dependency Correlation (NEW)
Purpose: Identifies root causes and impacts based on CI dependencies
Features:
- Analyzes CI relationship chains to find root causes
- Identifies cascading failures and impact propagation
- Uses relationship types to understand dependency direction
- Considers both upstream (root cause) and downstream (impact) events
Relationship Analysis:
HANADB01 (Database - CRITICAL)
  └─depends_on→ SAPPRD01 (Application - CRITICAL)
      └─impacts→ Order Processing Service
Scoring:
- Direct dependency: 0.9
- One hop dependency: 0.7
- Two hop dependency: 0.5
- Time proximity boost: up to +0.2
Key Relationship Types:
- depends_on: Direct operational dependency
- runs_on: Application/service runs on infrastructure
- hosted_on: Virtual/containerized workloads
- database_connection: Database dependencies
- network_connection: Network dependencies
- uses: Service dependencies
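The scoring rules above can be sketched as a small function. The base scores and the +0.2 cap come from the Scoring list; how the time proximity boost is actually computed is not documented here, so the linear decay, the `boostWindowMs` parameter, and the function name `dependencyScore` are assumptions made for illustration. The sketch does not claim to reproduce the engine's exact figures.

```javascript
// Base scores by hop distance, taken from the Scoring list above.
// Index 0 = direct dependency, 1 = one hop, 2 = two hops.
const BASE_SCORE_BY_HOPS = [0.9, 0.7, 0.5];
const MAX_TIME_BOOST = 0.2;

// ASSUMPTION: the boost decays linearly from +0.2 (simultaneous events)
// to 0 at the edge of a hypothetical boost window.
function dependencyScore(hops, timeDiffMs, boostWindowMs = 5 * 60 * 1000) {
  const base = BASE_SCORE_BY_HOPS[hops];
  if (base === undefined) return 0; // beyond two hops: not scored in this sketch
  const closeness = Math.max(0, 1 - Math.abs(timeDiffMs) / boostWindowMs);
  // Cap at 1.0 so the boosted score stays a valid confidence value.
  return Math.min(1, base + MAX_TIME_BOOST * closeness);
}
```

Under these assumptions, a one-hop dependency whose events are a full boost window apart scores the plain base value of 0.7, which sits exactly on the final correlation threshold and would not be grouped by dependency alone.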
2. Temporal Correlation
Purpose: Groups events occurring close together in time
Configuration:
- Time Window: 30 minutes (extended from 5 minutes)
- Score Calculation: Exponential decay based on time difference
- Threshold: Events with score > 0.5 are considered related
Algorithm:
score = Math.exp(-(timeDifference / timeWindow) * 2)
Example:
- Event A at 10:00:00
- Event B at 10:05:00 (5 minutes later)
- Score: e^(-0.33) ≈ 0.72 (strongly correlated)
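As a worked check of the formula, the sketch below recomputes the example and also derives the largest time gap that still clears the 0.5 threshold. The function names are illustrative, not taken from the codebase.

```javascript
const TIME_WINDOW_MS = 30 * 60 * 1000; // 30-minute window from the configuration above
const SCORE_THRESHOLD = 0.5;

// Exponential decay, as in the Algorithm line above.
function temporalScore(timeDifferenceMs, timeWindowMs = TIME_WINDOW_MS) {
  return Math.exp(-(Math.abs(timeDifferenceMs) / timeWindowMs) * 2);
}

// Largest gap that still clears the threshold:
// exp(-2 * d / w) > t  <=>  d < w * ln(1 / t) / 2
function maxCorrelatedGapMs(threshold = SCORE_THRESHOLD, timeWindowMs = TIME_WINDOW_MS) {
  return (timeWindowMs * Math.log(1 / threshold)) / 2;
}

console.log(temporalScore(5 * 60 * 1000).toFixed(2)); // "0.72"
console.log((maxCorrelatedGapMs() / 60000).toFixed(1)); // "10.4" (minutes)
```

One consequence worth noting: with a 30-minute window and a 0.5 threshold, only events within roughly 10.4 minutes of each other are treated as temporally related, even though the window itself is wider.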
3. Topology Correlation
Purpose: Correlates events from CIs with discovered relationships
Features:
- Uses CMDB CI relationships
- Considers relationship types and importance
- Searches up to 3 hops in the relationship graph
- Integrates with dependency correlation for enhanced accuracy
Scoring:
- Base score: 0.7 for related CIs
- Critical relationships: 0.9
- Time proximity adjustment: up to 30% reduction
- AI-enhanced importance weighting
Example Relationships:
Database Cluster
├── Member Of → HANADB01
├── Member Of → HANADB02
└── Provides Service To → SAP Production
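The hop-limited search and the scoring rules above can be sketched as follows. The graph shape (a `Map` from CI name to `{ target, critical }` edges), the linear form of the time adjustment, and the 10-minute default window are assumptions for illustration; real CMDB data and the engine's own weighting will differ.

```javascript
// Breadth-first search over CI relationships, bounded by maxHops.
// `graph` is a hypothetical Map: CI name -> array of { target, critical }.
function findRelatedCIs(graph, startCi, maxHops = 3) {
  const found = new Map(); // CI name -> { hops, critical }
  const visited = new Set([startCi]);
  let frontier = [startCi];
  for (let hops = 1; hops <= maxHops && frontier.length > 0; hops++) {
    const next = [];
    for (const ci of frontier) {
      for (const edge of graph.get(ci) || []) {
        if (visited.has(edge.target)) continue;
        visited.add(edge.target);
        found.set(edge.target, { hops, critical: Boolean(edge.critical) });
        next.push(edge.target);
      }
    }
    frontier = next;
  }
  return found;
}

// Base 0.7 (0.9 for critical relationships), reduced by up to 30%.
// ASSUMPTION: the reduction grows linearly with the time gap.
function topologyScore(critical, timeDiffMs, timeWindowMs = 10 * 60 * 1000) {
  const base = critical ? 0.9 : 0.7;
  const distance = Math.min(1, Math.abs(timeDiffMs) / timeWindowMs);
  return base * (1 - 0.3 * distance);
}

// The example relationships above, expressed in the assumed graph shape.
const sampleGraph = new Map([
  ['Database Cluster', [
    { target: 'HANADB01', critical: true },
    { target: 'HANADB02', critical: true },
    { target: 'SAP Production', critical: false },
  ]],
]);
console.log([...findRelatedCIs(sampleGraph, 'Database Cluster').keys()]);
```

Breadth-first order guarantees that each CI is recorded with its shortest hop distance, which matters if scores are later weighted by distance.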
4. Pattern Correlation
Purpose: Groups events with identical correlation signatures
How it works:
- Uses enhanced signature generation with CI context
- Identifies recurring patterns across different systems
- Learns from historical correlations
- Detects known problem patterns
Signature Components:
- Event title pattern
- Severity
- CI type
- Service/application
- Error codes or keywords
Use Cases:
- Multiple servers experiencing same error
- Distributed system failures
- Configuration-related issues
- Known problem detection
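One way to turn the signature components above into a comparable value is to normalize the title and hash the parts together. This is a minimal sketch, not the engine's actual signature algorithm: the field names `ciType` and `errorCodes`, the digit-masking rule, and the choice of SHA-256 are all assumptions.

```javascript
const crypto = require('crypto');

// Mask volatile numbers so that "Connection timeout after 30s on port 8080"
// and "Connection timeout after 45s on port 8443" share one title pattern.
function titlePattern(title) {
  return title
    .toLowerCase()
    .replace(/\d+/g, '#')
    .replace(/\s+/g, ' ')
    .trim();
}

// Combine the signature components into one stable hex digest.
function correlationSignature(event) {
  const parts = [
    titlePattern(event.title || ''),
    event.severity || '',
    event.ciType || '',
    event.service || '',
    [...(event.errorCodes || [])].sort().join(','), // order-independent
  ];
  return crypto.createHash('sha256').update(parts.join('|')).digest('hex');
}
```

Events from different servers then group together whenever their digests are equal, which is what makes the "multiple servers experiencing same error" use case cheap to detect with an indexed lookup.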
5. Service Correlation
Purpose: Groups events affecting the same service or application
Configuration:
- Time Window: 15 minutes
- Matches on: service name, application name, or business service
- Score: 0.8 for matches
- Enhanced with service dependency mapping
Benefits:
- Understands service-wide issues
- Groups microservice failures
- Identifies application-level problems
- Tracks business service impacts
Correlation Scoring
Score Merging
When multiple strategies identify the same event pair:
- Maximum Score: Takes the highest score from all strategies
- Reason Tracking: Preserves all reasons for correlation
- Threshold: Final score > 0.7 triggers correlation
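The merge rules above can be sketched as a single pass over the strategy results. The input shape (a flat array of `{ eventId, score, strategy, reason }`) and the function name are assumptions for illustration.

```javascript
const FINAL_THRESHOLD = 0.7;

// Merge per-strategy results: keep the maximum score per event,
// keep every reason, then apply the final threshold (strictly greater).
function mergeStrategyResults(results, threshold = FINAL_THRESHOLD) {
  const merged = new Map();
  for (const { eventId, score, strategy, reason } of results) {
    const entry = merged.get(eventId) || { eventId, score: 0, reasons: [] };
    entry.score = Math.max(entry.score, score);
    entry.reasons.push({ strategy, reason });
    merged.set(eventId, entry);
  }
  return [...merged.values()].filter((entry) => entry.score > threshold);
}
```

Taking the maximum rather than a sum or average means one confident strategy is enough to correlate a pair, while weak agreement between several strategies never is.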
Score Interpretation
| Score Range | Interpretation | Action |
|---|---|---|
| 0.9 - 1.0 | Very High Confidence | Definitely related |
| 0.7 - 0.9 | High Confidence | Likely related |
| 0.5 - 0.7 | Medium Confidence | Possibly related |
| < 0.5 | Low Confidence | Not correlated |
Correlation Process Flow
1. Event Ingestion
// After deduplication and saving
await eventIngestionService.queueForCorrelation(event);
2. Queue Processing
// Worker picks up job
{
  eventId: "507f1f77bcf86cd799439011",
  eventData: {
    ciId: "...",
    timestamp: "2025-06-17T10:00:00Z",
    severity: "critical",
    source: "nagios"
  }
}
3. Strategy Execution
- All strategies run in parallel
- Each returns scored events
- Results are merged
4. Correlation Assignment
// Generate or reuse correlation ID
if (score > 0.7) {
  event.correlationId = "COR-1750135638797-abc123";
}
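The "generate or reuse" step might look like the sketch below. The ID shape follows the example value (`COR-<epoch millis>-<suffix>`), but the six-character random hex suffix, the reuse order, and both function names are assumptions.

```javascript
const crypto = require('crypto');

// ASSUMPTION: the suffix is 6 random hex characters.
function newCorrelationId(now = Date.now()) {
  return `COR-${now}-${crypto.randomBytes(3).toString('hex')}`;
}

// `correlated` is a hypothetical array of { eventId, score, correlationId? }.
// Returns the assigned ID, or null when nothing clears the threshold.
function assignCorrelationId(event, correlated, threshold = 0.7) {
  const matches = correlated.filter((c) => c.score > threshold);
  if (matches.length === 0) return null;
  // Prefer an ID the event or one of its matches already carries,
  // so new events join the existing group instead of splitting it.
  const existing = matches.find((c) => c.correlationId);
  const correlationId =
    event.correlationId || (existing && existing.correlationId) || newCorrelationId();
  event.correlationId = correlationId;
  return correlationId;
}
```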
5. Group Analysis
- Severity breakdown
- Affected CI analysis
- Pattern detection
- Root cause identification
Root Cause Analysis
The system identifies root cause candidates using:
- Earliest Severe Event: First critical/major event
- Topology Analysis: Events from upstream CIs
- Pattern Matching: Known root cause patterns
- AI Enhancement: ML-based root cause detection (if enabled)
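The first two heuristics can be combined in a few lines. This sketch picks the earliest critical/major event and uses upstream position only as a tie-breaker. That ordering, the `upstreamCiIds` parameter, and the reasoning strings are assumptions, and pattern matching and the AI step are left out.

```javascript
const SEVERE = new Set(['critical', 'major']);

// `events` is an array of { eventId, ciId, severity, timestamp (ISO 8601) }.
// `upstreamCiIds` is a hypothetical Set of CIs known to sit upstream.
function rootCauseCandidate(events, upstreamCiIds = new Set()) {
  const severe = events.filter((e) => SEVERE.has(e.severity));
  if (severe.length === 0) return null;
  const rank = (e) => (upstreamCiIds.has(e.ciId) ? 0 : 1);
  // Earliest first; on equal timestamps, upstream CIs win.
  const [winner] = [...severe].sort(
    (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp) || rank(a) - rank(b)
  );
  return {
    eventId: winner.eventId,
    ciId: winner.ciId,
    reasoning: upstreamCiIds.has(winner.ciId)
      ? 'earliest_severe_event_upstream_dependency'
      : 'earliest_severe_event',
  };
}
```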
Configuration Options
Environment Variables
# Enable/disable correlation worker
ENABLE_EVENT_WORKERS=true
# Redis configuration for queues
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=your-password
# Queue settings
QUEUE_PREFIX=killit-v3
MAX_CONCURRENT_JOBS=5
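A small loader can turn these variables into a typed configuration object and fail fast on malformed numbers. The function name, the returned shape, and the fallback defaults (which simply mirror the sample values above) are assumptions for illustration.

```javascript
// Parse the queue-related environment variables listed above.
// ASSUMPTION: unset variables fall back to the sample values.
function loadQueueConfig(env = process.env) {
  const port = Number.parseInt(env.REDIS_PORT || '6379', 10);
  const concurrency = Number.parseInt(env.MAX_CONCURRENT_JOBS || '5', 10);
  if (Number.isNaN(port) || Number.isNaN(concurrency)) {
    throw new Error('REDIS_PORT and MAX_CONCURRENT_JOBS must be integers');
  }
  return {
    // Only the literal string "true" enables the workers.
    workersEnabled: env.ENABLE_EVENT_WORKERS === 'true',
    redis: {
      host: env.REDIS_HOST || 'localhost',
      port,
      password: env.REDIS_PASSWORD || undefined,
    },
    prefix: env.QUEUE_PREFIX || 'killit-v3',
    concurrency,
  };
}
```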
Correlation Settings
// In eventCorrelationService.js
const CORRELATION_CONFIG = {
  temporal: {
    timeWindow: 30 * 60 * 1000, // 30 minutes
    scoreThreshold: 0.5
  },
  topology: {
    timeWindow: 10 * 60 * 1000, // 10 minutes
    maxHops: 2,
    baseScore: 0.7
  },
  pattern: {
    timeWindow: 60 * 60 * 1000, // 1 hour
    minScore: 0.9
  },
  service: {
    timeWindow: 15 * 60 * 1000, // 15 minutes
    score: 0.8
  }
};
Performance Considerations
Queue Optimization
- Batch Processing: Process multiple correlations together
- Caching: Cache CI relationships for faster topology correlation
- Indexing: Ensure proper indexes on correlation fields
Database Indexes
Required indexes for optimal performance:
// Event collection indexes
{ correlationId: 1 }
{ correlationSignature: 1, status: 1 }
{ timestamp: -1 }
{ ciId: 1, timestamp: -1 }
{ service: 1, timestamp: -1 }
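Assuming events are stored in a MongoDB collection (the index notation and ID format suggest this, but the guide does not say so explicitly), the indexes could be created at startup with a helper like the one below. `ensureEventIndexes` is a hypothetical name, and `collection` is expected to expose `createIndex(keys)` as the MongoDB Node.js driver's `Collection` does.

```javascript
// Index specifications copied from the list above.
const EVENT_INDEXES = [
  { correlationId: 1 },
  { correlationSignature: 1, status: 1 },
  { timestamp: -1 },
  { ciId: 1, timestamp: -1 },
  { service: 1, timestamp: -1 },
];

// Create each index in turn and return the resulting index names.
// createIndex is a no-op when an identical index already exists,
// so calling this on every startup is safe.
async function ensureEventIndexes(collection) {
  const names = [];
  for (const keys of EVENT_INDEXES) {
    names.push(await collection.createIndex(keys));
  }
  return names;
}
```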
Scaling Considerations
- Multiple Workers: Run multiple correlation workers
- Redis Cluster: Use Redis cluster for queue scaling
- Time Window Limits: Adjust windows based on event volume
Monitoring Correlation
Metrics to Track
- Correlation Rate: % of events that get correlated
- Processing Time: Average correlation processing time
- Queue Depth: Number of pending correlation jobs
- Strategy Effectiveness: Which strategies contribute most
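Two of these metrics can be derived directly from stored events. The sketch below assumes each event carries an optional `correlationId` and an optional `strategiesUsed` array; the latter field appears in the API responses, but whether it is persisted per event is an assumption.

```javascript
// Compute the correlation rate (as a percentage) and a per-strategy tally.
function correlationMetrics(events) {
  const total = events.length;
  const correlated = events.filter((e) => Boolean(e.correlationId)).length;
  const byStrategy = {};
  for (const event of events) {
    for (const strategy of event.strategiesUsed || []) {
      byStrategy[strategy] = (byStrategy[strategy] || 0) + 1;
    }
  }
  return {
    total,
    correlated,
    correlationRate: total === 0 ? 0 : (correlated / total) * 100,
    byStrategy,
  };
}
```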
Bull Dashboard
Monitor the correlation queue at /admin/queues:
- Job counts
- Processing times
- Failed jobs
- Queue health
Troubleshooting
Events Not Correlating
- Check Time Windows: Events might be too far apart
- Verify CI Relationships: Topology correlation needs CMDB data
- Review Correlation Scores: Check if scores are below threshold
- Queue Processing: Ensure worker is running
High Correlation Rate
If too many unrelated events are being correlated:
- Increase Score Threshold: Raise from 0.7 to 0.8
- Reduce Time Windows: Make windows smaller
- Adjust Strategy Weights: Disable less effective strategies
Performance Issues
- Add Queue Workers: Scale horizontally
- Optimize Queries: Add missing indexes
- Reduce Lookback: Smaller time windows
- Cache Results: Implement caching layer
API Integration
Get Event with Correlations
GET /api/events/:eventId
Response includes:
{
  "event": {
    "_id": "68fb2d086aa9698a52774328",
    "eventId": "EVT-1750135638797",
    "title": "Database Process Crashed",
    "correlationId": "COR-1750135638797-abc123",
    "parentEventId": null,
    "childEventIds": ["68fb2d0b6aa9698a5277434f"]
  },
  "correlationGroup": {
    "events": [...],
    "analysis": {
      "correlationId": "COR-1750135638797-abc123",
      "eventCount": 15,
      "rootCauseCandidate": {
        "eventId": "68fb2d086aa9698a52774328",
        "confidence": 0.85,
        "reasoning": "earliest_severe_event_upstream_dependency"
      }
    }
  }
}
Correlate Specific Event
POST /api/events/:eventId/correlate
Response:
{
  "success": true,
  "correlations": {
    "isCorrelated": true,
    "correlationId": "COR-1750135638797-abc123",
    "correlationScore": 0.9,
    "strategiesUsed": ["dependency", "temporal", "topology"],
    "correlatedEvents": [
      {
        "eventId": "68fb2d086aa9698a52774328",
        "score": 0.9,
        "reasons": [
          {
            "strategy": "dependency",
            "reason": "upstream_dependency",
            "details": "SAPPRD01 depends on HANADB01"
          }
        ]
      }
    ]
  }
}
Get Event Correlations
GET /api/events/:eventId/correlations
Response:
{
  "success": true,
  "event": {
    "_id": "68fb2d0b6aa9698a5277434f",
    "title": "Database Connection Lost",
    "correlationId": "COR-1750135638797-abc123",
    "correlationMetadata": {
      "correlationType": "dependency",
      "rootCauseEvent": "68fb2d086aa9698a52774328",
      "rootCauseCI": "HANADB01",
      "confidence": 85
    }
  },
  "correlationGroup": {
    "events": [...],
    "analysis": {
      "timeSpan": {
        "start": "2025-01-24T10:00:00Z",
        "end": "2025-01-24T10:05:00Z"
      },
      "rootCauseCandidate": {...}
    }
  }
}
Analyze Recent Correlations
POST /api/events/analyze-correlations
{
  "hours": 24,           // Look back period
  "autoCorrelate": true  // Apply correlations automatically
}
Response:
{
  "success": true,
  "correlationGroups": 3,
  "eventsAnalyzed": 45,
  "eventsCorrelated": 12,
  "results": [
    {
      "rootCause": {
        "_id": "68fb2d086aa9698a52774328",
        "title": "Database Process Crashed",
        "severity": "critical",
        "ciName": "HANADB01",
        "timestamp": "2025-01-24T10:00:00Z"
      },
      "impactedEvents": [
        {
          "_id": "68fb2d0b6aa9698a5277434f",
          "title": "Database Connection Lost",
          "severity": "critical",
          "ciName": "SAPPRD01",
          "confidence": 0.9
        }
      ],
      "correlationId": "COR-1750135638797-abc123",
      "analysis": {
        "pattern": "database_cascade_failure",
        "totalImpact": "high",
        "affectedServices": ["SAP", "Order Processing"]
      }
    }
  ]
}
Manual Correlation
POST /api/events/correlate
{
  "primaryEventId": "68fb2d086aa9698a52774328",
  "relatedEventIds": ["68fb2d0b6aa9698a5277434f", "..."],
  "reason": "manual_correlation",
  "notes": "Related database failure events"
}
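A client call for this endpoint might look like the sketch below. It assumes Node.js 18 or later (for the global `fetch`), a caller-supplied `baseUrl`, and no authentication; the response shape is not documented here, so the parsed body is returned as-is. The function name and the injectable `fetchImpl` parameter are illustrative.

```javascript
// POST a manual correlation request and return the parsed JSON response.
// `fetchImpl` is injectable so the function can be exercised without a server.
async function correlateManually(baseUrl, primaryEventId, relatedEventIds, notes, fetchImpl = globalThis.fetch) {
  if (!primaryEventId || !Array.isArray(relatedEventIds) || relatedEventIds.length === 0) {
    throw new Error('primaryEventId and at least one related event ID are required');
  }
  const response = await fetchImpl(`${baseUrl}/api/events/correlate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      primaryEventId,
      relatedEventIds,
      reason: 'manual_correlation',
      notes,
    }),
  });
  if (!response.ok) {
    throw new Error(`Manual correlation failed: HTTP ${response.status}`);
  }
  return response.json();
}
```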
Practical Example: Database Cascade Failure
Here's a real-world example of how KillIT v3 correlates a database failure cascade:
Scenario
- 10:00:00 - HANADB01 database process crashes (Critical)
- 10:00:30 - SAPPRD01 loses database connection (Critical)
- 10:01:00 - Order Processing Service reports errors (Major)
- 10:01:30 - Web Portal shows 500 errors (Major)
Correlation Process
Step 1: Dependency Analysis
HANADB01 (Database)
  └─provides_data_to→ SAPPRD01 (SAP Application)
      └─serves→ Order Processing Service
          └─used_by→ Web Portal
Step 2: Correlation Scoring
- HANADB01 → SAPPRD01: 0.9 (direct dependency + 30s time diff)
- SAPPRD01 → Order Service: 0.85 (one hop + 30s time diff)
- Order Service → Web Portal: 0.75 (two hops + 30s time diff)
Step 3: Root Cause Identification
- HANADB01 identified as root cause (earliest critical event in dependency chain)
- Confidence: 85% (based on timing and dependency analysis)
Step 4: Result
All four events grouped under correlation ID COR-1750135638797-abc123 with HANADB01 as the root cause.
UI Display
- Events show correlation badge in list view
- Correlation tab shows full dependency chain
- Root cause highlighted with special indicator
- Impact analysis shows affected services
Best Practices
- CMDB Accuracy: Keep CI relationships up-to-date for accurate dependency correlation
- Service Tagging: Consistently tag events with service names
- Time Sync: Ensure all systems have synchronized clocks (NTP)
- Regular Reviews: Analyze correlation effectiveness monthly
- Tune Strategies: Adjust time windows and scores based on your environment
- Relationship Types: Use specific relationship types (depends_on, runs_on, etc.)
- Event Enrichment: Include CI names and service information in events
- Monitor Correlation Rate: Track percentage of correlated vs isolated events
Performance Tuning
Recommended Settings by Environment Size
Small (< 1000 events/day)
{
  temporal: { timeWindow: 30 * 60 * 1000 }, // 30 min
  workers: 1,
  batchSize: 10
}
Medium (1000-10000 events/day)
{
  temporal: { timeWindow: 15 * 60 * 1000 }, // 15 min
  workers: 3,
  batchSize: 25,
  cacheEnabled: true
}
Large (> 10000 events/day)
{
  temporal: { timeWindow: 10 * 60 * 1000 }, // 10 min
  workers: 5,
  batchSize: 50,
  cacheEnabled: true,
  redisCluster: true
}
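The three profiles can be selected programmatically from a measured daily event volume. The values are the recommendations above; the function name and the idea of choosing a profile at startup are assumptions.

```javascript
const MINUTE_MS = 60 * 1000;

// Return the recommended settings for a given daily event volume.
function recommendedSettings(eventsPerDay) {
  if (eventsPerDay < 1000) {
    // Small
    return { temporal: { timeWindow: 30 * MINUTE_MS }, workers: 1, batchSize: 10 };
  }
  if (eventsPerDay <= 10000) {
    // Medium
    return {
      temporal: { timeWindow: 15 * MINUTE_MS },
      workers: 3,
      batchSize: 25,
      cacheEnabled: true,
    };
  }
  // Large
  return {
    temporal: { timeWindow: 10 * MINUTE_MS },
    workers: 5,
    batchSize: 50,
    cacheEnabled: true,
    redisCluster: true,
  };
}
```

The temporal window shrinks as volume grows because a wider window means more candidate pairs per event, and the cost of comparing them rises quickly with event rate.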
Future Enhancements
- Machine Learning: Pattern recognition for correlation prediction
- Custom Strategies: Plugin architecture for industry-specific correlation
- Real-time Visualization: Live correlation graph with drill-down
- Predictive Correlation: Anticipate related events before they occur
- Cross-tenant Correlation: For MSP environments with shared infrastructure
- Correlation Templates: Pre-defined correlation patterns for common scenarios
- Auto-remediation: Trigger automated fixes for correlated event patterns