Managing Events
This guide covers the day-to-day operations of managing events in KillIT v3's Event Management system.
Event Dashboard
The Event Dashboard provides a real-time view of your IT environment's health.
Key Metrics
- Total Events: Count of all events in the selected time range
- Critical/Major Open: High-severity events requiring attention
- Average Resolution Time: Mean time to resolve events
- Active Incidents: Currently open events
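As a rough illustration of how the four dashboard metrics relate to the underlying event data, here is a minimal sketch. The `Event` record and `dashboard_metrics` function are hypothetical, not part of KillIT v3, and the sketch assumes that "open" means status `open` only (acknowledged events are not counted).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    """Hypothetical event record; field names are assumptions."""
    severity: str                      # "critical", "major", "minor", "warning", "info"
    status: str                        # "open", "acknowledged", "resolved", "suppressed"
    created_at: datetime
    resolved_at: Optional[datetime] = None


def dashboard_metrics(events: List[Event]) -> dict:
    """Compute the four key metrics for events already limited to a time range."""
    open_events = [e for e in events if e.status == "open"]
    durations = [
        e.resolved_at - e.created_at
        for e in events
        if e.status == "resolved" and e.resolved_at is not None
    ]
    # Mean of the resolution durations; None when nothing has been resolved yet.
    average = sum(durations, timedelta()) / len(durations) if durations else None
    return {
        "total_events": len(events),
        "critical_major_open": sum(
            1 for e in open_events if e.severity in ("critical", "major")
        ),
        "average_resolution_time": average,
        "active_incidents": len(open_events),
    }
```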
Filtering Events
Use filters to focus on specific events:
- Status: Open, Acknowledged, Resolved, Suppressed
- Severity: Critical, Major, Minor, Warning, Info
- Source: Filter by monitoring tool
- Time Range: Last hour, 6 hours, 24 hours, 3 days, 7 days
Event Lifecycle
1. Open State
- New events enter the system in "Open" state
- Automatic correlation and AI analysis begin
- Notifications sent based on severity
2. Acknowledged State
- Indicates someone is investigating
- Stops escalation timers
- Records who acknowledged and when
3. Resolved State
- Issue has been fixed
- Resolution notes document the fix
- Metrics calculated for reporting
4. Suppressed State
- Event deemed not actionable
- Useful for known issues or maintenance
- Removes the event from the active event count
Working with Events
Viewing Event Details
Click any event to see:
Overview Tab
- Event title and description
- Severity and status indicators
- Source and timing information
- Related CI information
- Assignment details
AI Analysis Tab
- Anomaly score (0-100%)
- Root cause analysis
- Contributing factors
- Suggested remediation actions
- Automation opportunities
Correlation Tab
- Related events in the correlation group
- Affected Configuration Items
- Root cause candidate identification
- Pattern analysis
Timeline Tab
- Complete event history
- Status changes
- User actions
- Automation executions
Updating Event Status
- Click the Update Status button
- Select new status:
  - Acknowledged: "I'm working on this"
  - Resolved: "The issue is fixed"
  - Suppressed: "This is not actionable"
- Add notes explaining your action
- For resolved events, select resolution category:
  - Auto-resolved
  - Manual fix
  - False positive
  - Duplicate
  - No action needed
Bulk Operations
Select multiple events to:
- Acknowledge all
- Assign to team member
- Suppress similar events
- Export for reporting
Event Deduplication
How Deduplication Works
KillIT v3 automatically identifies and groups duplicate events to reduce alert noise. When multiple instances of the same issue occur, they are consolidated into a single event with an occurrence count.
Correlation Signature
Each event receives a correlation signature, and events that share a signature are treated as instances of the same issue. The signature is based on:
- Source: The monitoring system (Nagios, Zabbix, etc.)
- CI/Host: The Configuration Item ID, hostname, or IP address
- Normalized Title: Event title with dynamic parts removed
- Service: The affected service name
The signature is generated as an MD5 hash of these components joined with "-".
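A minimal sketch of that signature scheme follows. The function name and the component order (source, CI/host, normalized title, service, as listed above) are assumptions; only the MD5 hash and the "-" separator come from this guide.

```python
import hashlib


def correlation_signature(source: str, ci_or_host: str,
                          normalized_title: str, service: str) -> str:
    """Join the four components with "-" and return the MD5 hex digest."""
    # Component order is an assumption based on the order of the list above.
    raw = "-".join([source, ci_or_host, normalized_title, service])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


sig = correlation_signature(
    "nagios", "db-01", "database connection failed NUM times at TIME on DATE", "orders"
)
```

Two events only share a signature when all four components match, which is why the title is normalized before hashing.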
Title Normalization
To handle dynamic content in event titles, the system normalizes them by replacing:
- Dates (YYYY-MM-DD) → "DATE"
- Times (HH:MM:SS) → "TIME"
- Numbers → "NUM"
- UUIDs → "UUID"
- Multiple spaces → Single space
- Uppercase letters → Lowercase (as the example shows)
Example:
- Original: "Database connection failed 3 times at 14:35:20 on 2025-06-17"
- Normalized: "database connection failed NUM times at TIME on DATE"
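The replacement rules can be sketched with regular expressions. This is an illustration under stated assumptions, not the product's actual code: the patterns, the order in which they run, and the lowercasing step (inferred from the example above) are all assumptions.

```python
import re

# Hypothetical patterns. More specific patterns run first so that the
# generic number rule does not break up UUIDs, dates, or times.
_UUID = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
_TIME = re.compile(r"\b\d{2}:\d{2}:\d{2}\b")
_NUM = re.compile(r"\d+")
_SPACES = re.compile(r"\s+")


def normalize_title(title: str) -> str:
    """Replace dynamic parts of an event title with fixed placeholders."""
    text = title.lower()              # lowercase first so placeholders stay uppercase
    text = _UUID.sub("UUID", text)
    text = _DATE.sub("DATE", text)
    text = _TIME.sub("TIME", text)
    text = _NUM.sub("NUM", text)      # also rewrites digits inside words, e.g. "db01"
    return _SPACES.sub(" ", text).strip()


print(normalize_title("Database connection failed 3 times at 14:35:20 on 2025-06-17"))
# prints: database connection failed NUM times at TIME on DATE
```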
Deduplication Process
- New Event Arrives: System generates correlation signature
- Duplicate Check: Searches for existing events with:
  - Same correlation signature
  - Status is "open" or "acknowledged"
  - Created within the last hour (configurable window)
- If Duplicate Found:
  - Increment occurrenceCount
  - Update lastOccurrence timestamp
  - Update severity if new event is more severe
  - Merge additional details
  - Return existing event (no new event created)
- If No Duplicate:
  - Create new event with occurrenceCount = 1
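The deduplication steps can be sketched as a single function over an in-memory list. Everything here is illustrative: the `Event` record, the `ingest` function, the severity ranking, and the choice to measure the window against the incoming event's timestamp are assumptions, not the product's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed ordering, taken from the severity filter list (Critical is highest).
SEVERITY_RANK = {"info": 0, "warning": 1, "minor": 2, "major": 3, "critical": 4}


@dataclass
class Event:
    """Hypothetical event record."""
    signature: str
    severity: str
    created_at: datetime
    status: str = "open"
    details: dict = field(default_factory=dict)
    occurrence_count: int = 1
    last_occurrence: Optional[datetime] = None


def ingest(store: List[Event], incoming: Event,
           window: timedelta = timedelta(hours=1)) -> Event:
    """Merge the incoming event into a matching duplicate, or store it as new."""
    for existing in store:
        if (existing.signature == incoming.signature
                and existing.status in ("open", "acknowledged")
                and incoming.created_at - existing.created_at <= window):
            existing.occurrence_count += 1
            existing.last_occurrence = incoming.created_at
            if SEVERITY_RANK[incoming.severity] > SEVERITY_RANK[existing.severity]:
                existing.severity = incoming.severity
            # Simple merge: keys from the newer event overwrite older values.
            existing.details.update(incoming.details)
            return existing          # no new event is created
    incoming.occurrence_count = 1
    incoming.last_occurrence = incoming.created_at
    store.append(incoming)
    return incoming
```

A resolved or suppressed event never matches in this sketch, so a recurrence after resolution produces a fresh event rather than reopening the old one.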
Viewing Deduplicated Events
In the event list, deduplicated events show:
- Occurrence Badge: Shows the occurrence count when it is greater than 1
- First Occurrence: Original event timestamp
- Last Occurrence: Most recent duplicate timestamp
- Severity: Highest severity across all occurrences
Benefits of Deduplication
- Reduces Alert Fatigue: One alert instead of hundreds
- Preserves Information: Track frequency with occurrence count
- Smart Matching: Dynamic content doesn't prevent deduplication
- Time-Based Windows: Only recent events are considered duplicates
- Status-Aware: Resolved events won't be matched
Configuring Deduplication
Administrators can adjust deduplication behavior:
- Time Window: Default 1 hour (Settings → Event Management)
- Status Matching: Which statuses to consider for duplicates
- Field Weights: Customize signature generation
- Exclusion Patterns: Events to never deduplicate
Event Correlation
How Correlation Works
KillIT v3 uses a multi-strategy correlation engine that automatically groups related events into incidents. This helps identify root causes and understand the full impact of issues.
Correlation Strategies
The engine applies four correlation strategies:
1. Temporal Correlation
- Groups events occurring within 5-minute windows
- Scores based on time proximity (closer events = higher score)
- Catches cascading failures and alert storms
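This guide does not give the exact scoring formula. As one plausible reading of "closer events = higher score", the sketch below uses a linear decay across the 5-minute window. The function name and the linear shape are assumptions.

```python
from datetime import datetime, timedelta


def temporal_score(a: datetime, b: datetime,
                   window: timedelta = timedelta(minutes=5)) -> float:
    """Return 1.0 for simultaneous events, falling linearly to 0.0 at the window edge."""
    gap = abs(a - b)
    if gap > window:
        return 0.0                   # outside the window: not temporally related
    return 1.0 - gap / window        # timedelta / timedelta yields a float
```

With this shape, two events 2.5 minutes apart score 0.5, and any pair more than 5 minutes apart scores 0.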
2. Topology Correlation
- Uses CMDB relationships to find events from related CIs
- Considers relationship types (critical relationships get higher scores)
- Identifies impact across connected infrastructure
- Example: Database failure → Application errors → Web timeouts
3. Pattern Correlation
- Matches events with the same correlation signature
- Groups recurring instances of the same issue
- Differs from deduplication: these are related but distinct events
4. Service Correlation
- Groups events affecting the same service or application
- 15-minute time window for service-related issues
- Helps understand service-wide problems
Correlation Process
- Event Arrives → Saved and queued for correlation
- Worker Processing → Runs all 4 strategies in parallel
- Score Merging → Combines results, taking highest scores
- Correlation Assignment → Events with score > 0.7 are correlated
- Root Cause Analysis → Identifies the earliest critical/major event
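The score-merging and assignment steps can be sketched as follows. The data layout (one dictionary of candidate scores per strategy) and the function names are assumptions; the highest-score rule and the 0.7 threshold come from the steps above. The sketch runs sequentially and does not model the parallel worker.

```python
from typing import Dict, List

CORRELATION_THRESHOLD = 0.7  # events must score strictly above this value


def merge_scores(strategy_results: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine per-strategy scores, keeping the highest score for each event ID."""
    merged: Dict[str, float] = {}
    for scores in strategy_results.values():
        for event_id, score in scores.items():
            merged[event_id] = max(score, merged.get(event_id, 0.0))
    return merged


def correlated_events(merged: Dict[str, float]) -> List[str]:
    """Return the IDs of events whose merged score exceeds the threshold."""
    return sorted(eid for eid, score in merged.items() if score > CORRELATION_THRESHOLD)


results = {
    "temporal": {"e1": 0.9, "e2": 0.4},
    "topology": {"e2": 0.8, "e3": 0.7},
    "pattern": {},
    "service": {"e3": 0.6},
}
print(correlated_events(merge_scores(results)))
# prints: ['e1', 'e2']
```

Event `e3` is left out because its best score is exactly 0.7, and the rule requires a score greater than 0.7.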
Understanding Correlation Groups
When viewing correlated events, you'll see:
- Correlation ID: Unique identifier for the group (e.g., COR-123456789)
- Event Count: Number of related events
- Time Span: Duration from first to last event
- Severity Breakdown: Distribution of event severities
- Affected CIs: All Configuration Items involved
- Root Cause Candidate: Most likely originating event
Correlation Benefits
- Noise Reduction: See one incident instead of hundreds
- Root Cause Identification: Quickly find the source
- Impact Analysis: Understand full scope
- Faster Resolution: Address root cause, not symptoms
- Better Prioritization: Focus on critical issues
Viewing Correlation Information
- In Event List: Look for correlation badges
- In Event Details: Check the Correlation Tab
- Correlation Group View: See all related events together
Manual Correlation
If automatic correlation misses related events:
- Select primary event
- Click "Add to Correlation"
- Search for related events
- Confirm correlation
Correlation vs Deduplication
| Aspect | Deduplication | Correlation |
|---|---|---|
| Purpose | Reduce duplicate alerts | Group related incidents |
| Scope | Same event repeating | Different related events |
| Result | Single event, count increases | Multiple events, same correlation ID |
| Time Window | 1 hour | 5-15 minutes (varies by strategy) |
| Example | "DB down" × 100 → 1 event | DB down + App errors + User complaints |
Real-World Example
Database Incident Timeline:
10:00 - Database CPU hits 95% (Critical)
10:01 - Database connection pool exhausted (Major)
10:02 - App server connection errors (Major) × 5
10:03 - Web server timeouts (Warning) × 10
10:04 - User login failures (Minor) × 50
Correlation Result:
- Correlation ID: COR-20250617-abc123
- Total Events: 67
- Root Cause: Database CPU spike
- Affected Services: Database, Application, Web
- Recommended Action: Scale database resources
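To check the numbers in this example, the timeline can be encoded as data. The `Entry` type is hypothetical; the root-cause rule used here (the earliest critical or major event) is the one described in the correlation process.

```python
from datetime import time
from typing import NamedTuple


class Entry(NamedTuple):
    """One line of the incident timeline (hypothetical structure)."""
    at: time
    title: str
    severity: str
    count: int


timeline = [
    Entry(time(10, 0), "Database CPU hits 95%", "critical", 1),
    Entry(time(10, 1), "Database connection pool exhausted", "major", 1),
    Entry(time(10, 2), "App server connection errors", "major", 5),
    Entry(time(10, 3), "Web server timeouts", "warning", 10),
    Entry(time(10, 4), "User login failures", "minor", 50),
]

# Total events is the sum of the per-line occurrence counts: 1 + 1 + 5 + 10 + 50.
total_events = sum(entry.count for entry in timeline)

# Root cause candidate: the earliest event with critical or major severity.
candidates = [e for e in timeline if e.severity in ("critical", "major")]
root_cause = min(candidates, key=lambda e: e.at)

print(total_events, "-", root_cause.title)
# prints: 67 - Database CPU hits 95%
```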
AI-Powered Features
Anomaly Detection
The AI analyzes each event for anomalies:
- Score 0-30%: Normal behavior
- Score 30-70%: Unusual but not critical
- Score 70-100%: Highly anomalous, investigate
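The three bands map onto a simple classifier. The ranges above overlap at 30% and 70%, so this sketch assumes that a boundary value falls into the higher band. That choice and the function name are assumptions.

```python
def anomaly_band(score_percent: float) -> str:
    """Map an anomaly score (0-100) to one of the three bands described above."""
    if not 0 <= score_percent <= 100:
        raise ValueError("anomaly score must be between 0 and 100")
    if score_percent < 30:
        return "normal"
    if score_percent < 70:           # assumption: exactly 30 counts as unusual
        return "unusual"
    return "highly anomalous"        # assumption: exactly 70 counts as highly anomalous
```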
Root Cause Analysis
AI identifies probable root causes by analyzing:
- Event timing and sequence
- CI relationships and dependencies
- Historical patterns
- Current system state
Suggested Actions
For each event, AI may suggest:
- Immediate remediation steps
- Long-term fixes
- Automation opportunities
- Prevention strategies
Automation
Available Automations
- Service Restart: Safely restart failed services
- Resource Scaling: Add CPU/memory when needed
- Log Rotation: Rotate logs to free space on full disks
- Cache Clearing: Reset application caches
Enabling Automation
- Review suggested actions in event details
- Click "Enable Automation" for approved actions
- Monitor automation execution in timeline
- Automation results appear in event notes
Creating Custom Automations
- Navigate to Settings → Automation
- Create new runbook
- Define trigger conditions
- Add automation steps
- Test in non-production first
Best Practices
Response Times
Aim for these targets:
| Severity | Acknowledgment | Resolution |
|---|---|---|
| Critical | 5 minutes | 1 hour |
| Major | 15 minutes | 4 hours |
| Minor | 1 hour | 1 day |
| Warning | 4 hours | 1 week |
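If you track these targets outside the product, the table translates directly into a lookup. This is a minimal sketch: `TARGETS` and `meets_targets` are hypothetical names, and Info events are omitted because the table defines no target for them.

```python
from datetime import timedelta
from typing import Tuple

# (acknowledgment target, resolution target) per severity, from the table above.
TARGETS = {
    "critical": (timedelta(minutes=5), timedelta(hours=1)),
    "major": (timedelta(minutes=15), timedelta(hours=4)),
    "minor": (timedelta(hours=1), timedelta(days=1)),
    "warning": (timedelta(hours=4), timedelta(weeks=1)),
}


def meets_targets(severity: str, time_to_ack: timedelta,
                  time_to_resolve: timedelta) -> Tuple[bool, bool]:
    """Return (ack_on_time, resolved_on_time) for one event."""
    if severity not in TARGETS:
        raise ValueError(f"no response target defined for severity {severity!r}")
    ack_target, resolve_target = TARGETS[severity]
    return time_to_ack <= ack_target, time_to_resolve <= resolve_target


print(meets_targets("major", timedelta(minutes=10), timedelta(hours=5)))
# prints: (True, False)
```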
Event Hygiene
- Acknowledge Promptly: Shows you're aware and working
- Update Regularly: Add notes as you investigate
- Document Resolution: Help future troubleshooting
- Review Suppressed: Periodically check suppressed events
- Learn from Patterns: Use insights to prevent recurrence
Team Collaboration
- Use @mentions in notes to loop in experts
- Share Findings in resolution notes
- Create Knowledge Base entries for complex issues
- Review Post-Mortems for major incidents
Reporting
Available Reports
- Event Summary: Overview by severity and source
- MTTR Analysis: Resolution time trends
- Top Issues: Most frequent problems
- Team Performance: Who resolves what
- SLA Compliance: Meeting service levels
Creating Custom Reports
- Navigate to Reports → Event Reports
- Select report type
- Choose filters and date range
- Schedule or run immediately
- Export to PDF/Excel
Integration with Other Modules
CMDB Integration
- Events automatically linked to CIs
- CI health status updated
- Relationship impact analysis
Service Request Integration
- Convert events to service requests
- Track resolution through ITSM workflow
- Maintain audit trail
Change Management
- Link events to changes
- Identify change-related incidents
- Improve change planning
Troubleshooting Common Issues
Events Not Correlating
- Check time windows in correlation settings
- Verify CI relationships are discovered
- Review correlation patterns
- Manually correlate if needed
Missing AI Analysis
- Ensure AI service is enabled
- Check that the event has the required data
- Verify that the AI quota has not been exceeded
- Wait 2-5 minutes for analysis
Automation Not Executing
- Verify automation is enabled
- Check runbook permissions
- Review execution logs
- Test runbook manually