Troubleshooting Discovery
This guide helps you diagnose and resolve common discovery issues in Tripl-i. Work through the diagnostic framework first, then use the issue-specific sections and tools to identify and fix problems that arise during infrastructure discovery.
Diagnostic Framework
Troubleshooting Workflow
Discovery Health Check
The Tripl-i platform provides built-in health check capabilities to verify your discovery system is functioning properly:
- Discovery Services Status
  - Check if discovery agents are running
  - Verify service connectivity
  - Monitor background processes
- Network Connectivity
  - Test API endpoint accessibility
  - Verify network routes to target systems
  - Check firewall configurations
- Credential Vault
  - Test stored credentials
  - Verify credential access
  - Check for expired credentials
- Recent Discovery Activity
  - Review discovery run history
  - Check success/failure rates
  - Identify patterns in issues
- Error Analysis
  - Review recent error logs
  - Identify recurring problems
  - Track resolution progress
- Resource Usage
  - Monitor CPU and memory utilization
  - Check disk space availability
  - Track network bandwidth usage
Common Issues
No Discovery Data
Symptoms
- CIs not appearing in CMDB
- Discovery shows "No devices found"
- Empty discovery results
Diagnostic Steps
1. Network Connectivity:
Test: ping target_device
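Test: telnet target_device 22 (SSH) or telnet target_device 135 (WMI/RPC)
Test: snmpget -v2c -c <community> target_device 1.3.6.1.2.1.1.1.0 (SNMP uses UDP 161, which telnet cannot test)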
Check: Firewall rules
Check: Network ACLs
2. Discovery Service:
Check: Service status
Check: Worker processes
Check: Queue backlog
Review: Service logs
3. Target Availability:
Verify: Device is powered on
Verify: Services are running
Check: Local firewall
Check: SELinux/AppArmor
4. Discovery Scope:
Review: IP ranges
Review: Exclusion rules
Check: Discovery filters
Verify: Schedule active
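The TCP checks in step 1 can also be scripted when telnet is not installed. The following is a minimal sketch using only the Python standard library; the function name `tcp_port_open` and the 3-second default timeout are illustrative choices, not part of Tripl-i. It covers TCP ports only (such as 22 and 135), because SNMP on port 161 uses UDP.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and unreachable hosts
        return False
```

For example, `tcp_port_open("target_device", 22)` reports whether the SSH port accepts connections. A `False` result does not distinguish a closed port from a filtered one, so the firewall and ACL checks above still apply.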
Common Solutions
Firewall Configuration
# Windows - Allow WMI
netsh advfirewall firewall add rule name="WMI-In" dir=in action=allow protocol=TCP localport=135
netsh advfirewall firewall add rule name="WMI-Async-In" dir=in action=allow protocol=TCP localport=49152-65535
# Linux - Allow SSH
sudo ufw allow from 10.0.0.0/8 to any port 22
sudo iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 22 -j ACCEPT
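# Network Device (Cisco IOS syntax) - Allow SNMP from the discovery server (10.1.1.100)
# The ACL attached to a community is typically a standard ACL (1-99); replace "public" with a non-default community string
access-list 10 permit host 10.1.1.100
snmp-server community public RO 10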
Service Configuration
# Windows - Enable WMI
sc config winmgmt start= auto
net start winmgmt
# Enable Remote Registry
sc config RemoteRegistry start= auto
net start RemoteRegistry
# Linux - Configure SSH
sudo systemctl enable sshd
sudo systemctl start sshd
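# Configure sudo for discovery, then restrict and validate the new file
echo "discovery ALL=(ALL) NOPASSWD: /usr/bin/dmidecode, /bin/netstat, /sbin/ip" | sudo tee /etc/sudoers.d/discovery
sudo chmod 0440 /etc/sudoers.d/discovery
sudo visudo -cf /etc/sudoers.d/discovery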
Authentication Failures
Symptoms
- "Access denied" errors
- "Invalid credentials" messages
- Partial discovery with auth errors
Diagnostic Tests
# Test Windows credentials
$cred = Get-Credential
Test-WSMan -ComputerName target-server -Credential $cred -Authentication Negotiate
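# Test WMI access (Windows PowerShell 5.1; Get-WmiObject is not available in PowerShell 7, where Get-CimInstance replaces it)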
Get-WmiObject -Class Win32_OperatingSystem -ComputerName target-server -Credential $cred
# Test specific permissions
Get-WmiObject -Class Win32_Process -ComputerName target-server -Credential $cred
# Test Linux SSH
ssh -o PasswordAuthentication=yes discovery@target-host 'echo "Connection successful"'
# Test sudo permissions
ssh discovery@target-host 'sudo -l'
# Test specific commands
ssh discovery@target-host 'sudo dmidecode -t system'
Permission Requirements
Windows Requirements:
Local Groups:
- Performance Monitor Users
- Event Log Readers
- Distributed COM Users
User Rights:
- Log on as a service
- Access this computer from network
DCOM Permissions:
- Local Launch
- Remote Launch
- Local Activation
- Remote Activation
Linux Requirements:
SSH Access: Required
Sudo Commands:
- /usr/bin/dmidecode
- /bin/netstat or /sbin/ss
- /sbin/ip or /sbin/ifconfig
- /usr/bin/lsof (optional)
- /bin/ps
File Access:
- /proc/* (read)
- /sys/* (read)
- /etc/os-release (read)
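The Linux sudo requirements above can be compared against what `sudo -l` actually reports for the discovery user. The sketch below is a heuristic text check, not a sudoers parser: the function name `missing_sudo_commands` is hypothetical, and it assumes the `sudo -l` output lists permitted commands as comma-separated paths. It treats a bare `ALL` command entry as an unrestricted grant and skips the optional `lsof` entry.

```python
import re

# Each tuple lists acceptable alternatives for one requirement
REQUIRED = [
    ("/usr/bin/dmidecode",),
    ("/bin/netstat", "/sbin/ss"),
    ("/sbin/ip", "/sbin/ifconfig"),
    ("/bin/ps",),
]


def missing_sudo_commands(sudo_l_output):
    """Return the requirements that do not appear in the text of `sudo -l`."""
    tokens = set(re.split(r"[,\s]+", sudo_l_output))
    if "ALL" in tokens:
        # A bare ALL in the command list means every command is permitted
        return []
    return [" or ".join(group) for group in REQUIRED if not tokens.intersection(group)]
```

Feed it the captured output of `ssh discovery@target-host 'sudo -l'`; an empty list means every required command appears in the grant.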
Incomplete Discovery
Symptoms
- Missing software inventory
- Partial hardware information
- No relationship data
- Incomplete attributes
Root Cause Analysis
Check Collection Modules:
1. Agent Configuration:
- Verify enabled collectors
- Check module errors
- Review timeout settings
2. Data Collection:
- Process discovery enabled?
- Software scanning active?
- Network connections tracked?
3. Processing Pipeline:
- Normalization errors?
- Pattern matching failures?
- Enrichment timeouts?
Module-Specific Fixes
Software Discovery Issues
# Windows - Registry access
# Check if remote registry is enabled
sc \\target-server query RemoteRegistry
# Linux - Package manager access
# Verify package database readable
ssh discovery@target "rpm -qa | head -5"
ssh discovery@target "dpkg -l | head -5"
# Fix: if the package database is not readable, add the discovery user to the group that owns it
sudo usermod -a -G rpm discovery # For RPM-based systems where an "rpm" group exists
Network Connection Discovery
# Enable network discovery
# Windows
netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes
# Linux - ensure ss/netstat available
# Check if commands exist
which ss netstat lsof
# Install if missing
sudo yum install -y iproute # For ss
sudo apt install -y net-tools # For netstat
Performance Issues
Symptoms
- Slow discovery completion
- High CPU/memory usage
- Network congestion
- Timeout errors
Performance Diagnostics
-- Analyze discovery performance
WITH discovery_stats AS (
SELECT
discovery_method,
target_type,
AVG(duration_seconds) as avg_duration,
MAX(duration_seconds) as max_duration,
COUNT(*) as total_discoveries,
SUM(CASE WHEN status = 'timeout' THEN 1 ELSE 0 END) as timeouts
FROM discovery_runs
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY discovery_method, target_type
)
SELECT * FROM discovery_stats
ORDER BY avg_duration DESC;
Performance Tuning
Optimization Strategies:
1. Parallel Processing:
Default: 10 concurrent
High Performance: 50 concurrent
Conservative: 5 concurrent
2. Timeout Adjustments:
Network Devices: 30s → 60s
Busy Servers: 60s → 120s
Slow Links: 30s → 90s
3. Discovery Scope:
- Reduce frequency for stable devices
- Use incremental for frequent scans
- Limit deep discovery to off-hours
4. Resource Limits:
CPU: Max 70% utilization
Memory: Max 4GB per worker
Network: Max 100Mbps total
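To illustrate what the concurrency levels above mean in practice, here is a minimal sketch of bounded parallel discovery using a thread pool. It does not reflect how Tripl-i schedules work internally; `discover_all`, the `probe` callable, and the profile names are hypothetical, with the worker counts taken from the list above. The pool only caps concurrency, so each `probe` is expected to enforce its own timeout.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Worker counts from the "Parallel Processing" settings
PROFILES = {"conservative": 5, "default": 10, "high_performance": 50}


def discover_all(targets, probe, profile="default"):
    """Run probe(target) for every target with at most PROFILES[profile] in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=PROFILES[profile]) as pool:
        futures = {pool.submit(probe, target): target for target in targets}
        for future in as_completed(futures):
            target = futures[future]
            try:
                results[target] = ("ok", future.result())
            except Exception as exc:
                # Record the failure and keep processing the remaining targets
                results[target] = ("error", str(exc))
    return results
```

Lowering the profile reduces load on the network and on target systems at the cost of longer total run time, which is the trade-off behind the conservative setting.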
Data Quality Issues
Symptoms
- Duplicate CIs created
- Incorrect classifications
- Missing relationships
- Stale data
Data Validation
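# CI Deduplication Check (MongoDB aggregation, e.g. via PyMongo)
def check_duplicates(db, merge_duplicate_cis):
    """Report CIs that share a name and serial number, then merge them.

    db is a database handle exposing a "cis" collection; merge_duplicate_cis
    is your own routine that accepts a list of CI _id values.
    """
    duplicates = db.cis.aggregate([
        {"$group": {
            "_id": {"name": "$name", "serial": "$serialNumber"},
            "count": {"$sum": 1},
            "ids": {"$push": "$_id"}
        }},
        {"$match": {"count": {"$gt": 1}}}
    ])
    for dup in duplicates:
        print(f"Duplicate found: {dup['_id']} ({dup['count']} instances)")
        # Merge or remove duplicates
        merge_duplicate_cis(dup['ids'])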
Data Cleanup Procedures
Cleanup Tasks:
1. Remove Orphaned CIs:
- No discovery update > 30 days
- No relationships
- Status = "Unknown"
2. Fix Misclassified Items:
- Re-run AI classification
- Apply pattern matching
- Manual review flagged items
3. Rebuild Relationships:
- Clear stale connections
- Re-discover network topology
- Validate service dependencies
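The orphaned-CI criteria in task 1 can be expressed as a simple predicate. This is a sketch under stated assumptions: the field names `last_discovered`, `relationships`, and `status` are hypothetical, and it assumes all three criteria must hold together before a CI counts as orphaned. If your policy treats any single criterion as sufficient, replace the `and` with `or`.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)


def is_orphaned(ci: dict, now: datetime) -> bool:
    """Return True when a CI is stale, has no relationships, and has status Unknown."""
    stale = now - ci["last_discovered"] > STALE_AFTER
    unlinked = not ci.get("relationships")
    unknown = ci.get("status") == "Unknown"
    return stale and unlinked and unknown
```

Running the predicate in report-only mode first, and reviewing the matches before deleting anything, keeps the cleanup reversible.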
Advanced Diagnostics
Debug Mode
# Enable debug logging for specific target
nopesight discovery debug --target 10.1.1.50 --verbose
# Debug output example:
[DEBUG] Starting discovery for 10.1.1.50
[DEBUG] Using credential: windows_domain_cred
[DEBUG] Attempting WMI connection...
[DEBUG] WMI connection established
[DEBUG] Querying Win32_ComputerSystem...
[DEBUG] Result: {Name: "SERVER01", Domain: "CORP.LOCAL", ...}
[DEBUG] Querying Win32_OperatingSystem...
[ERROR] Access denied to Win32_Process class
[DEBUG] Falling back to limited discovery mode
Network Packet Analysis
# Capture discovery traffic
sudo tcpdump -i eth0 -w discovery.pcap \
'host 10.1.1.50 and (port 22 or port 135 or port 445 or port 161)'
# Analyze WMI traffic
sudo tcpdump -nn -r discovery.pcap 'port 135' | head -20
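# Check for SNMP timeouts
# tcpdump does not report timeouts itself; look for GetRequest packets
# that are never followed by a GetResponse from the target (SNMPv1/v2c)
sudo tcpdump -nn -r discovery.pcap 'udp port 161' | \
grep -E "GetRequest|GetResponse"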
Discovery Agent Diagnostics
# Agent health check
nopesight-agent diagnose
# Output:
=== Tripl-i Agent Diagnostics ===
Version: 3.2.1
Status: Running
Uptime: 5d 14h 23m
Configuration:
Server: https://nopesight.company.com ✓
API Key: ****1234 ✓
Department: IT ✓
Connectivity:
Server reachable: ✓
Last check-in: 2 minutes ago ✓
SSL Certificate: Valid ✓
Collectors:
System Info: ✓ Enabled
Software: ✓ Enabled
Network: ✗ Error: Permission denied on /proc/net/tcp
Processes: ✓ Enabled
Recent Errors:
[2024-01-15 10:23:45] NetworkCollector: Cannot read /proc/net/tcp
[2024-01-15 09:15:32] SoftwareCollector: dpkg timeout
Recommendations:
1. Add agent user to 'proc' group for network collection
2. Increase software scan timeout to 60s
Troubleshooting Tools
Built-in Diagnostics
Web UI Tools:
Discovery Test:
- Target specific device
- Test specific credential
- Use specific method
- View real-time logs
Credential Tester:
- Validate credentials
- Check permissions
- Test connectivity
- Show capabilities
Pattern Debugger:
- Test pattern matching
- View match details
- Debug regex
- Validate logic
Command Line Tools
# Discovery CLI toolkit
# Test specific discovery method
nopesight discover test \
--method wmi \
--target 10.1.1.50 \
--credential prod_windows \
--debug
# Validate discovery scope
nopesight discover validate-scope \
--ranges "10.0.0.0/16,192.168.0.0/24" \
--show-conflicts
# Analyze discovery queue
nopesight queue status --queue discovery
nopesight queue peek discovery --count 10
# Force discovery retry
nopesight discover retry --failed --last 1h
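Overlapping ranges are one kind of scope conflict that is easy to check independently of the CLI. The sketch below uses Python's standard `ipaddress` module to list overlapping CIDR pairs; it is an illustration of the idea and makes no claim about what `validate-scope --show-conflicts` reports. The function name `find_overlaps` is hypothetical.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of CIDR ranges that share at least one address."""
    nets = [ipaddress.ip_network(r.strip(), strict=False) for r in ranges]
    return [(str(a), str(b)) for a, b in combinations(nets, 2) if a.overlaps(b)]
```

For the ranges in the command above, `find_overlaps("10.0.0.0/16,192.168.0.0/24".split(","))` returns an empty list, since the two networks are disjoint.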
Log Analysis
# Common log locations
/var/log/nopesight/discovery.log # Main discovery log
/var/log/nopesight/agent.log # Agent logs
/var/log/nopesight/scheduler.log # Scheduler logs
/var/log/nopesight/error.log # Error aggregation
# Useful grep patterns
# Find authentication failures
grep -i "auth\|denied\|permission" discovery.log
# Find timeout issues
grep -i "timeout\|timed out" discovery.log
# Find network errors
grep -i "unreachable\|refused\|network" discovery.log
# Find pattern matching issues
grep -i "pattern\|match\|regex" discovery.log
# Performance issues
grep -i "slow\|performance\|exceeded" discovery.log
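When a log is large, counting matches per category gives a quicker overview than running each grep in turn. The following sketch reuses the same case-insensitive patterns as the grep commands above; the category names and the `categorize` function are illustrative. A line that matches several patterns is counted once in each matching category.

```python
import re
from collections import Counter

# Same alternations as the grep patterns, compiled case-insensitively
CATEGORIES = {
    "authentication": re.compile(r"auth|denied|permission", re.IGNORECASE),
    "timeout": re.compile(r"timeout|timed out", re.IGNORECASE),
    "network": re.compile(r"unreachable|refused|network", re.IGNORECASE),
    "pattern": re.compile(r"pattern|match|regex", re.IGNORECASE),
    "performance": re.compile(r"slow|performance|exceeded", re.IGNORECASE),
}


def categorize(lines):
    """Count how many lines match each issue category."""
    counts = Counter()
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Pass it an open file, for example `categorize(open("/var/log/nopesight/discovery.log"))`, and start with the category that has the highest count.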
Recovery Procedures
Emergency Recovery
When Discovery Completely Fails:
1. Stop all discovery:
systemctl stop nopesight-discovery
nopesight discovery pause --all
2. Clear stuck jobs:
nopesight queue clear discovery --stuck
redis-cli --scan --pattern 'bull:discovery:*' | xargs -r redis-cli DEL
3. Reset discovery state:
nopesight discovery reset --confirm
4. Restart services:
systemctl start nopesight-discovery
systemctl restart nopesight-scheduler
5. Test with single target:
nopesight discover now --target 10.1.1.1
6. Resume normal operations:
nopesight discovery resume --all
Data Recovery
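-- Restore CIs from discovery history (PostgreSQL syntax)
-- Uses the most recent successful record per hostname and skips hostnames
-- that already exist in cis for the tenant
INSERT INTO cis (name, type, attributes, last_discovered)
SELECT DISTINCT ON (h.raw_data->>'hostname')
h.raw_data->>'hostname' AS name,
h.raw_data->>'device_type' AS type,
h.raw_data->'attributes' AS attributes,
h.discovered_at AS last_discovered
FROM discovery_history h
WHERE h.discovered_at >= '2024-01-14'
AND h.status = 'success'
AND h.raw_data->>'hostname' IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM cis c
WHERE c.tenant_id = 'IT'
AND c.name = h.raw_data->>'hostname'
)
ORDER BY h.raw_data->>'hostname', h.discovered_at DESC;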
Prevention Strategies
Monitoring Setup
Proactive Monitoring:
Metrics to Track (investigate when any of these holds):
- Discovery success rate < 95%
- Average duration increasing
- Timeout rate > 5%
- Queue depth > 1000
- Error rate > 2%
Alerts:
- Discovery failures > 10 in 5 min
- No discoveries in expected window
- Credential failures spike
- Resource exhaustion warning
Dashboards:
- Real-time discovery status
- Success rate trends
- Performance metrics
- Error categorization
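The numeric thresholds under "Metrics to Track" translate directly into a small check that a monitoring job could run. This is a sketch only: the metric key names and the `breached` function are hypothetical, and the limits are copied from the list above. Trend-based conditions such as "average duration increasing" need historical data and are not covered here.

```python
# (comparison, limit) per metric, taken from the thresholds listed above
THRESHOLDS = {
    "success_rate_pct": ("<", 95.0),
    "timeout_rate_pct": (">", 5.0),
    "queue_depth": (">", 1000),
    "error_rate_pct": (">", 2.0),
}


def breached(metrics):
    """Return a description of every metric that crosses its threshold."""
    alerts = []
    for name, (op, limit) in THRESHOLDS.items():
        if name not in metrics:
            continue  # Metric not reported in this sample
        value = metrics[name]
        if (op == "<" and value < limit) or (op == ">" and value > limit):
            alerts.append(f"{name}={value} (threshold {op} {limit})")
    return alerts
```

The comparisons are strict, so a value exactly at the limit (for example an error rate of 2.0) does not raise an alert.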
Best Practices
- Regular Maintenance
  - Weekly credential validation
  - Monthly discovery audit
  - Quarterly pattern review
  - Annual architecture review
- Documentation
  - Document all custom patterns
  - Maintain troubleshooting runbook
  - Record common solutions
  - Update network diagrams
- Testing
  - Test credentials before production
  - Validate patterns in dev
  - Load test discovery system
  - Practice recovery procedures
Getting Help
Support Resources
Internal Resources:
- Discovery team Slack: #discovery-help
- Wiki: https://wiki.company.com/nopesight
- Runbooks: https://runbooks.company.com
Tripl-i Support:
- Email: support@nopesight.com
- Portal: https://support.nopesight.com
- Phone: 1-800-NOPESIGHT
Community:
- Forums: https://community.nopesight.com
- GitHub: https://github.com/nopesight/patterns
- Slack: nopesight-users.slack.com
Diagnostic Package
#!/bin/bash
# Create diagnostic package for support
DIAG_DIR="/tmp/nopesight-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$DIAG_DIR"
# Collect system info
nopesight system info > "$DIAG_DIR/system-info.txt"
nopesight discovery status > "$DIAG_DIR/discovery-status.txt"
# Collect recent logs
tail -n 10000 /var/log/nopesight/*.log > "$DIAG_DIR/recent-logs.txt"
# Collect configuration (sanitized)
nopesight config export --sanitize > "$DIAG_DIR/config.yaml"
# Create archive
tar -czf "$DIAG_DIR.tar.gz" -C /tmp "$(basename "$DIAG_DIR")"
echo "Diagnostic package created: $DIAG_DIR.tar.gz"
Next Steps
- 📖 Best Practices - CMDB best practices
- 📖 Performance Tuning - System optimization
- 📖 Support Guide - Getting help