Managing Events

This guide covers the day-to-day operations of managing events in KillIT v3's Event Management system.

Event Dashboard

The Event Dashboard provides a real-time view of your IT environment's health.

Key Metrics

  • Total Events: Count of all events in the selected time range
  • Critical/Major Open: High-severity events requiring attention
  • Average Resolution Time: Mean time to resolve events
  • Active Incidents: Currently open events

Filtering Events

Use filters to focus on specific events:

  • Status: Open, Acknowledged, Resolved, Suppressed
  • Severity: Critical, Major, Minor, Warning, Info
  • Source: Filter by monitoring tool
  • Time Range: Last hour, 6 hours, 24 hours, 3 days, 7 days
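
As a rough illustration of how these filters combine, the sketch below applies every supplied filter with AND semantics. The `filter_events` helper and the field names (`status`, `severity`, `source`, `createdAt`) are hypothetical and are not a documented KillIT API.

```python
from datetime import datetime, timedelta

def filter_events(events, status=None, severity=None, source=None,
                  time_range=None, now=None):
    """Return the events that match every filter that was supplied."""
    now = now or datetime.now()
    matches = []
    for event in events:
        if status is not None and event["status"] != status:
            continue
        if severity is not None and event["severity"] != severity:
            continue
        if source is not None and event["source"] != source:
            continue
        # time_range is a timedelta such as timedelta(hours=6) for "6 hours"
        if time_range is not None and now - event["createdAt"] > time_range:
            continue
        matches.append(event)
    return matches

now = datetime(2025, 6, 17, 12, 0, 0)
events = [
    {"id": 1, "status": "open", "severity": "critical", "source": "nagios",
     "createdAt": now - timedelta(minutes=10)},
    {"id": 2, "status": "resolved", "severity": "minor", "source": "zabbix",
     "createdAt": now - timedelta(hours=30)},
    {"id": 3, "status": "open", "severity": "major", "source": "nagios",
     "createdAt": now - timedelta(hours=5)},
]

# Open events from the last 6 hours
recent_open = filter_events(events, status="open",
                            time_range=timedelta(hours=6), now=now)
print([e["id"] for e in recent_open])  # [1, 3]
```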

Event Lifecycle

1. Open State

  • New events enter the system in "Open" state
  • Automatic correlation and AI analysis begin
  • Notifications sent based on severity

2. Acknowledged State

  • Indicates someone is investigating
  • Stops escalation timers
  • Records who acknowledged and when

3. Resolved State

  • Issue has been fixed
  • Resolution notes document the fix
  • Metrics calculated for reporting

4. Suppressed State

  • Event deemed not actionable
  • Useful for known issues or maintenance
  • Removes the event from the active event count

Working with Events

Viewing Event Details

Click any event to see:

Overview Tab

  • Event title and description
  • Severity and status indicators
  • Source and timing information
  • Related CI information
  • Assignment details

AI Analysis Tab

  • Anomaly score (0-100%)
  • Root cause analysis
  • Contributing factors
  • Suggested remediation actions
  • Automation opportunities

Correlation Tab

  • Related events in the correlation group
  • Affected Configuration Items
  • Root cause candidate identification
  • Pattern analysis

Timeline Tab

  • Complete event history
  • Status changes
  • User actions
  • Automation executions

Updating Event Status

  1. Click the Update Status button
  2. Select new status:
    • Acknowledged: "I'm working on this"
    • Resolved: "The issue is fixed"
    • Suppressed: "This is not actionable"
  3. Add notes explaining your action
  4. For resolved events, select resolution category:
    • Auto-resolved
    • Manual fix
    • False positive
    • Duplicate
    • No action needed
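
The rules in these steps can be expressed as a small validation routine. This is only a sketch: `build_status_update` and the `resolutionCategory` key are hypothetical names, and treating notes as mandatory is an assumption rather than documented behavior.

```python
STATUSES = {"acknowledged", "resolved", "suppressed"}
RESOLUTION_CATEGORIES = {
    "auto-resolved", "manual fix", "false positive",
    "duplicate", "no action needed",
}

def build_status_update(status, notes, resolution_category=None):
    """Validate a status change and return it as a plain dict."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if not notes.strip():
        # Assumption: notes are required; the guide only says to add them.
        raise ValueError("notes are required to explain the action")
    if status == "resolved":
        if resolution_category not in RESOLUTION_CATEGORIES:
            raise ValueError("resolved events need a valid resolution category")
    elif resolution_category is not None:
        raise ValueError("resolution category applies only to resolved events")
    return {
        "status": status,
        "notes": notes,
        "resolutionCategory": resolution_category,
    }

update = build_status_update("resolved", "Restarted the pool", "manual fix")
print(update["status"], update["resolutionCategory"])  # resolved manual fix
```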

Bulk Operations

Select multiple events to:

  • Acknowledge all
  • Assign to team member
  • Suppress similar events
  • Export for reporting

Event Deduplication

How Deduplication Works

KillIT v3 automatically identifies and groups duplicate events to reduce alert noise. When multiple instances of the same issue occur, they are consolidated into a single event with an occurrence count.

Correlation Signature

Each event receives a correlation signature derived from the following components. Events that describe the same issue share the same signature:

  1. Source: The monitoring system (Nagios, Zabbix, etc.)
  2. CI/Host: The Configuration Item ID, hostname, or IP address
  3. Normalized Title: Event title with dynamic parts removed
  4. Service: The affected service name

The signature is generated as an MD5 hash of these components joined with "-".
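
A minimal sketch of that hashing step, using Python's standard `hashlib`. The function name, argument order, and sample values are assumptions for illustration; the actual component order and encoding used by KillIT v3 are not specified here.

```python
import hashlib

def correlation_signature(source, ci_or_host, normalized_title, service):
    """Join the four components with "-" and return the MD5 hex digest."""
    joined = "-".join([source, ci_or_host, normalized_title, service])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

sig_a = correlation_signature(
    "nagios", "db01", "database connection failed NUM times at TIME on DATE", "orders-db"
)
sig_b = correlation_signature(
    "nagios", "db01", "database connection failed NUM times at TIME on DATE", "orders-db"
)
sig_c = correlation_signature(
    "zabbix", "db01", "database connection failed NUM times at TIME on DATE", "orders-db"
)

print(sig_a == sig_b)  # True: identical components give an identical signature
print(sig_a == sig_c)  # False: a different source changes the signature
```

MD5 serves here as a compact fingerprint for matching, not as a security measure.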

Title Normalization

To handle dynamic content in event titles, the system normalizes them by lowercasing the text and replacing:

  • Dates (YYYY-MM-DD) → "DATE"
  • Times (HH:MM:SS) → "TIME"
  • Numbers → "NUM"
  • UUIDs → "UUID"
  • Multiple spaces → Single space

Example:

  • Original: "Database connection failed 3 times at 14:35:20 on 2025-06-17"
  • Normalized: "database connection failed NUM times at TIME on DATE"
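
The replacement rules can be sketched with regular expressions. The patterns and their order are assumptions: UUIDs, dates, and times are replaced before bare numbers so that their digits are not consumed first, and the title is lowercased up front to match the example output.

```python
import re

# Order matters: specific patterns run before the generic number pattern.
_RULES = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "UUID"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "DATE"),
    (re.compile(r"\b\d{2}:\d{2}:\d{2}\b"), "TIME"),
    (re.compile(r"\d+"), "NUM"),
    (re.compile(r"\s+"), " "),
]

def normalize_title(title):
    """Lowercase the title, then replace dynamic parts with placeholders."""
    text = title.lower()
    for pattern, placeholder in _RULES:
        text = pattern.sub(placeholder, text)
    return text.strip()

print(normalize_title("Database connection failed 3 times at 14:35:20 on 2025-06-17"))
# database connection failed NUM times at TIME on DATE
```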

Deduplication Process

  1. New Event Arrives: System generates correlation signature
  2. Duplicate Check: Searches for existing events with:
    • Same correlation signature
    • Status is "open" or "acknowledged"
    • Created within the last hour (configurable window)
  3. If Duplicate Found:
    • Increment occurrenceCount
    • Update lastOccurrence timestamp
    • Update severity if new event is more severe
    • Merge additional details
    • Return existing event (no new event created)
  4. If No Duplicate:
    • Create new event with occurrenceCount = 1
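
The steps above can be sketched as a single ingest function over an in-memory list. The `ingest_event` name, the `store` list, and the `signature`, `createdAt`, and `details` fields are hypothetical; only `occurrenceCount` and `lastOccurrence` come from this guide. The severity ranking follows the order of the Severity filter.

```python
from datetime import datetime, timedelta

SEVERITY_RANK = {"info": 0, "warning": 1, "minor": 2, "major": 3, "critical": 4}
ACTIVE_STATUSES = ("open", "acknowledged")

def ingest_event(store, incoming, now, window=timedelta(hours=1)):
    """Merge incoming into a matching active event, or create a new one."""
    for existing in store:
        is_duplicate = (
            existing["signature"] == incoming["signature"]
            and existing["status"] in ACTIVE_STATUSES
            and now - existing["createdAt"] <= window
        )
        if not is_duplicate:
            continue
        existing["occurrenceCount"] += 1
        existing["lastOccurrence"] = now
        # Keep the highest severity seen across all occurrences
        if SEVERITY_RANK[incoming["severity"]] > SEVERITY_RANK[existing["severity"]]:
            existing["severity"] = incoming["severity"]
        existing["details"].update(incoming.get("details", {}))
        return existing
    created = {
        "signature": incoming["signature"],
        "severity": incoming["severity"],
        "details": dict(incoming.get("details", {})),
        "status": "open",
        "createdAt": now,
        "lastOccurrence": now,
        "occurrenceCount": 1,
    }
    store.append(created)
    return created

store = []
t0 = datetime(2025, 6, 17, 14, 0, 0)
first = ingest_event(store, {"signature": "abc", "severity": "major"}, t0)
second = ingest_event(store, {"signature": "abc", "severity": "critical"},
                      t0 + timedelta(minutes=30))
third = ingest_event(store, {"signature": "abc", "severity": "minor"},
                     t0 + timedelta(hours=2))

print(second is first, first["occurrenceCount"], first["severity"])  # True 2 critical
print(len(store), third["occurrenceCount"])  # 2 1
```

The third event arrives outside the one-hour window, so it becomes a new event rather than another occurrence.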

Viewing Deduplicated Events

In the event list, deduplicated events show:

  • Occurrence Badge: Shows count when > 1
  • First Occurrence: Original event timestamp
  • Last Occurrence: Most recent duplicate timestamp
  • Severity: Highest severity across all occurrences

Benefits of Deduplication

  • Reduces Alert Fatigue: One alert instead of hundreds
  • Preserves Information: Track frequency with occurrence count
  • Smart Matching: Dynamic content doesn't prevent deduplication
  • Time-Based Windows: Only recent events are considered duplicates
  • Status-Aware: Resolved events won't be matched

Configuring Deduplication

Administrators can adjust deduplication behavior:

  1. Time Window: Default 1 hour (Settings → Event Management)
  2. Status Matching: Which statuses to consider for duplicates
  3. Field Weights: Customize signature generation
  4. Exclusion Patterns: Events to never deduplicate

Event Correlation

How Correlation Works

KillIT v3 uses a multi-strategy correlation engine that automatically groups related events into incidents. This helps identify root causes and understand the full impact of issues.

Correlation Strategies

The system employs four correlation strategies:

1. Temporal Correlation

  • Groups events occurring within 5-minute windows
  • Scores based on time proximity (closer events = higher score)
  • Catches cascading failures and alert storms
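
This guide does not give the exact scoring formula. As one plausible illustration, the sketch below scores two events linearly: 1.0 when they are simultaneous, falling to 0.0 at the edge of the 5-minute window. The linear decay and the `temporal_score` name are assumptions.

```python
from datetime import datetime, timedelta

def temporal_score(first, second, window=timedelta(minutes=5)):
    """Return a proximity score in [0, 1]; closer events score higher."""
    gap = abs(second - first)
    if gap > window:
        return 0.0
    return 1.0 - gap / window  # timedelta / timedelta yields a float

t = datetime(2025, 6, 17, 10, 0, 0)
print(temporal_score(t, t))                          # 1.0
print(temporal_score(t, t + timedelta(seconds=150))) # 0.5
print(temporal_score(t, t + timedelta(minutes=6)))   # 0.0
```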

2. Topology Correlation

  • Uses CMDB relationships to find events from related CIs
  • Considers relationship types (critical relationships get higher scores)
  • Identifies impact across connected infrastructure
  • Example: Database failure → Application errors → Web timeouts

3. Pattern Correlation

  • Matches events with the same correlation signature
  • Groups recurring instances of the same issue
  • Differs from deduplication: these are related but distinct events

4. Service Correlation

  • Groups events affecting the same service or application
  • 15-minute time window for service-related issues
  • Helps understand service-wide problems

Correlation Process

  1. Event Arrives → Saved and queued for correlation
  2. Worker Processing → Runs all 4 strategies in parallel
  3. Score Merging → Combines results, taking highest scores
  4. Correlation Assignment → Events with score > 0.7 are correlated
  5. Root Cause Analysis → Identifies the earliest critical/major event

Understanding Correlation Groups

When viewing correlated events, you'll see:

  • Correlation ID: Unique identifier for the group (e.g., COR-123456789)
  • Event Count: Number of related events
  • Time Span: Duration from first to last event
  • Severity Breakdown: Distribution of event severities
  • Affected CIs: All Configuration Items involved
  • Root Cause Candidate: Most likely originating event

Correlation Benefits

  • Noise Reduction: See one incident instead of hundreds
  • Root Cause Identification: Quickly find the source
  • Impact Analysis: Understand full scope
  • Faster Resolution: Address root cause, not symptoms
  • Better Prioritization: Focus on critical issues

Viewing Correlation Information

  1. In Event List: Look for correlation badges
  2. In Event Details: Check the Correlation Tab
  3. Correlation Group View: See all related events together

Manual Correlation

If automatic correlation misses related events:

  1. Select primary event
  2. Click "Add to Correlation"
  3. Search for related events
  4. Confirm correlation

Correlation vs Deduplication

| Aspect | Deduplication | Correlation |
|---|---|---|
| Purpose | Reduce duplicate alerts | Group related incidents |
| Scope | Same event repeating | Different related events |
| Result | Single event, count increases | Multiple events, same correlation ID |
| Time Window | 1 hour | 5-15 minutes (varies by strategy) |
| Example | "DB down" × 100 → 1 event | DB down + App errors + User complaints |

Real-World Example

Database Incident Timeline:
10:00 - Database CPU hits 95% (Critical)
10:01 - Database connection pool exhausted (Major)
10:02 - App server connection errors (Major) × 5
10:03 - Web server timeouts (Warning) × 10
10:04 - User login failures (Minor) × 50

Correlation Result:
- Correlation ID: COR-20250617-abc123
- Total Events: 67
- Root Cause: Database CPU spike
- Affected Services: Database, Application, Web
- Recommended Action: Scale database resources
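
The figures in this example can be checked with a few lines of arithmetic. The tuple layout below is only a convenient way to hold the timeline; the root cause rule applied is the one described under Correlation Process (earliest critical or major event).

```python
from datetime import datetime

# (time, title, severity, number of events)
timeline = [
    ("10:00", "Database CPU hits 95%", "critical", 1),
    ("10:01", "Database connection pool exhausted", "major", 1),
    ("10:02", "App server connection errors", "major", 5),
    ("10:03", "Web server timeouts", "warning", 10),
    ("10:04", "User login failures", "minor", 50),
]

total_events = sum(count for _, _, _, count in timeline)
times = [datetime.strptime(t, "%H:%M") for t, _, _, _ in timeline]
span_minutes = (max(times) - min(times)).total_seconds() / 60

high_severity = [row for row in timeline if row[2] in ("critical", "major")]
root_cause = min(high_severity, key=lambda row: datetime.strptime(row[0], "%H:%M"))

print(total_events)   # 67
print(span_minutes)   # 4.0
print(root_cause[1])  # Database CPU hits 95%
```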

AI-Powered Features

Anomaly Detection

The AI analyzes each event for anomalies:

  • Score 0-30%: Normal behavior
  • Score 30-70%: Unusual but not critical
  • Score 70-100%: Highly anomalous, investigate
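
The bands above share their boundary values (30% and 70%). The sketch below assumes each boundary belongs to the higher band; that choice, and the `anomaly_band` name, are assumptions rather than documented behavior.

```python
def anomaly_band(score_percent):
    """Map an anomaly score (0-100) to the band described above."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_percent < 30:
        return "normal"
    if score_percent < 70:
        return "unusual"
    return "highly anomalous"

for score in (12, 30, 69.9, 70, 100):
    print(score, anomaly_band(score))
```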

Root Cause Analysis

AI identifies probable root causes by analyzing:

  • Event timing and sequence
  • CI relationships and dependencies
  • Historical patterns
  • Current system state

Suggested Actions

For each event, AI may suggest:

  • Immediate remediation steps
  • Long-term fixes
  • Automation opportunities
  • Prevention strategies

Automation

Available Automations

  • Service Restart: Safely restart failed services
  • Resource Scaling: Add CPU/memory when needed
  • Log Rotation: Rotate logs to free space on full disks
  • Cache Clearing: Reset application caches

Enabling Automation

  1. Review suggested actions in event details
  2. Click "Enable Automation" for approved actions
  3. Monitor automation execution in timeline
  4. Automation results appear in event notes

Creating Custom Automations

  1. Navigate to Settings → Automation
  2. Create new runbook
  3. Define trigger conditions
  4. Add automation steps
  5. Test in non-production first

Best Practices

Response Times

Aim for these targets:

| Severity | Acknowledgment | Resolution |
|---|---|---|
| Critical | 5 minutes | 1 hour |
| Major | 15 minutes | 4 hours |
| Minor | 1 hour | 1 day |
| Warning | 4 hours | 1 week |
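
To check an individual event against these targets, something like the following could be used. It assumes both targets are measured from the event's creation time; the `check_targets` helper and its parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# severity -> (acknowledgment target, resolution target)
TARGETS = {
    "critical": (timedelta(minutes=5), timedelta(hours=1)),
    "major": (timedelta(minutes=15), timedelta(hours=4)),
    "minor": (timedelta(hours=1), timedelta(days=1)),
    "warning": (timedelta(hours=4), timedelta(weeks=1)),
}

def check_targets(severity, created_at, acknowledged_at, resolved_at):
    """Report whether acknowledgment and resolution met the targets."""
    ack_target, resolve_target = TARGETS[severity]
    return {
        "acknowledged_in_time": acknowledged_at - created_at <= ack_target,
        "resolved_in_time": resolved_at - created_at <= resolve_target,
    }

created = datetime(2025, 6, 17, 10, 0)
result = check_targets(
    "critical",
    created_at=created,
    acknowledged_at=created + timedelta(minutes=4),
    resolved_at=created + timedelta(minutes=90),
)
print(result)  # {'acknowledged_in_time': True, 'resolved_in_time': False}
```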

Event Hygiene

  1. Acknowledge Promptly: Shows you're aware and working
  2. Update Regularly: Add notes as you investigate
  3. Document Resolution: Help future troubleshooting
  4. Review Suppressed: Periodically check suppressed events
  5. Learn from Patterns: Use insights to prevent recurrence

Team Collaboration

  • Use @mentions in notes to loop in experts
  • Share Findings in resolution notes
  • Create Knowledge Base entries for complex issues
  • Review Post-Mortems for major incidents

Reporting

Available Reports

  1. Event Summary: Overview by severity and source
  2. MTTR Analysis: Resolution time trends
  3. Top Issues: Most frequent problems
  4. Team Performance: Who resolves what
  5. SLA Compliance: Meeting service levels

Creating Custom Reports

  1. Navigate to Reports → Event Reports
  2. Select report type
  3. Choose filters and date range
  4. Schedule or run immediately
  5. Export to PDF/Excel

Integration with Other Modules

CMDB Integration

  • Events automatically linked to CIs
  • CI health status updated
  • Relationship impact analysis

Service Request Integration

  • Convert events to service requests
  • Track resolution through ITSM workflow
  • Maintain audit trail

Change Management

  • Link events to changes
  • Identify change-related incidents
  • Improve change planning

Troubleshooting Common Issues

Events Not Correlating

  1. Check time windows in correlation settings
  2. Verify CI relationships are discovered
  3. Review correlation patterns
  4. Manually correlate if needed

Missing AI Analysis

  1. Ensure AI service is enabled
  2. Check event has required data
  3. Verify AI quota not exceeded
  4. Wait 2-5 minutes for analysis

Automation Not Executing

  1. Verify automation is enabled
  2. Check runbook permissions
  3. Review execution logs
  4. Test runbook manually

Next Steps