Troubleshooting Discovery
This guide helps you diagnose and resolve common discovery issues in NopeSight. It takes a systematic approach: check overall health first, then use the symptom-specific steps and tools to identify and fix problems that arise during infrastructure discovery.
Diagnostic Framework
Troubleshooting Workflow
Discovery Health Check
#!/bin/bash
# NopeSight Discovery Health Check Script
echo "=== Discovery System Health Check ==="
echo
# Check discovery services
echo "1. Checking Discovery Services..."
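# --no-pager keeps the script from stopping in a pager when run from a terminal
systemctl status --no-pager nopesight-discovery
systemctl status --no-pager nopesight-agent
systemctl status --no-pager nopesight-scheduler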
# Check connectivity
echo -e "\n2. Checking Network Connectivity..."
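# -f makes curl return a non-zero exit code on HTTP errors (4xx/5xx), not only on connection failures
curl -sf https://nopesight.company.com/api/health || echo "API unreachable"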
# Check credentials
echo -e "\n3. Checking Credential Vault..."
nopesight credential test --all --summary
# Check recent discoveries
echo -e "\n4. Recent Discovery Status..."
nopesight discovery status --last 1h
# Check error logs
echo -e "\n5. Recent Errors..."
grep ERROR /var/log/nopesight/discovery.log | tail -20
# Check resource usage
echo -e "\n6. Resource Usage..."
top -bn1 | grep nopesight
df -h | grep -E "^/|nopesight"
echo -e "\n=== Health Check Complete ==="
Common Issues
No Discovery Data
Symptoms
- CIs not appearing in CMDB
- Discovery shows "No devices found"
- Empty discovery results
Diagnostic Steps
1. Network Connectivity:
Test: ping target_device
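Test: telnet target_device 22 (SSH) or 135 (WMI/RPC)
Test: SNMP uses UDP 161, which telnet cannot reach; run an SNMP query (for example snmpget) instead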
Check: Firewall rules
Check: Network ACLs
2. Discovery Service:
Check: Service status
Check: Worker processes
Check: Queue backlog
Review: Service logs
3. Target Availability:
Verify: Device is powered on
Verify: Services are running
Check: Local firewall
Check: SELinux/AppArmor
4. Discovery Scope:
Review: IP ranges
Review: Exclusion rules
Check: Discovery filters
Verify: Schedule active
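The TCP part of the connectivity check above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of NopeSight. It covers TCP ports only, so SNMP (UDP 161) still needs a real SNMP query.

```python
import socket


def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_discovery_ports(host: str, ports=(22, 135), timeout: float = 3.0) -> dict:
    """Map each TCP port (SSH and WMI/RPC by default) to its reachability."""
    return {port: check_tcp_port(host, port, timeout) for port in ports}
```

Calling `check_discovery_ports("target_device")` returns a dictionary such as `{22: True, 135: False}`. A `False` value means the port is closed, filtered, or the host name did not resolve, so follow up with the firewall and ACL checks.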
Common Solutions
Firewall Configuration
# Windows - Allow WMI
netsh advfirewall firewall add rule name="WMI-In" dir=in action=allow protocol=TCP localport=135
netsh advfirewall firewall add rule name="WMI-Async-In" dir=in action=allow protocol=TCP localport=49152-65535
# Linux - Allow SSH
sudo ufw allow from 10.0.0.0/8 to any port 22
sudo iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 22 -j ACCEPT
# Network Device - Allow SNMP (Cisco IOS syntax)
access-list 100 permit udp host 10.1.1.100 any eq 161
# Replace "public" with a non-default community string in production
snmp-server community public RO 100
Service Configuration
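# Windows - Enable WMI
# Use sc.exe so the command also works in Windows PowerShell, where "sc" is an alias for Set-Content
sc.exe config winmgmt start= auto
net start winmgmt
# Enable Remote Registry
sc.exe config RemoteRegistry start= auto
net start RemoteRegistry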
# Linux - Configure SSH
sudo systemctl enable sshd
sudo systemctl start sshd
# Configure sudo for discovery
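echo "discovery ALL=(ALL) NOPASSWD: /usr/bin/dmidecode, /bin/netstat, /sbin/ip" | sudo tee /etc/sudoers.d/discovery
# Restrict permissions and validate syntax so a bad entry cannot break sudo
sudo chmod 0440 /etc/sudoers.d/discovery
sudo visudo -cf /etc/sudoers.d/discovery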
Authentication Failures
Symptoms
- "Access denied" errors
- "Invalid credentials" messages
- Partial discovery with auth errors
Diagnostic Tests
# Test Windows credentials
$cred = Get-Credential
Test-WSMan -ComputerName target-server -Credential $cred -Authentication Negotiate
# Test WMI access
Get-WmiObject -Class Win32_OperatingSystem -ComputerName target-server -Credential $cred
# Test specific permissions
Get-WmiObject -Class Win32_Process -ComputerName target-server -Credential $cred
# Test Linux SSH
ssh -o PasswordAuthentication=yes discovery@target-host 'echo "Connection successful"'
# Test sudo permissions
ssh discovery@target-host 'sudo -l'
# Test specific commands
ssh discovery@target-host 'sudo dmidecode -t system'
Permission Requirements
Windows Requirements:
Local Groups:
- Performance Monitor Users
- Event Log Readers
- Distributed COM Users
User Rights:
- Log on as a service
- Access this computer from network
DCOM Permissions:
- Local Launch
- Remote Launch
- Local Activation
- Remote Activation
Linux Requirements:
SSH Access: Required
Sudo Commands:
- /usr/bin/dmidecode
- /bin/netstat or /sbin/ss
- /sbin/ip or /sbin/ifconfig
- /usr/bin/lsof (optional)
- /bin/ps
File Access:
- /proc/* (read)
- /sys/* (read)
- /etc/os-release (read)
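To compare what the discovery account may actually run against the Linux sudo requirements above, you can parse the text printed by `sudo -l`. This is a rough sketch: the parsing assumes the common `(runas) TAG: /path, /path` rule layout, and the helper names are hypothetical rather than part of NopeSight. `/usr/bin/lsof` is left out because it is listed as optional.

```python
import re

# Each tuple lists acceptable alternatives, taken from the requirements above.
REQUIRED_SUDO_COMMANDS = [
    ("/usr/bin/dmidecode",),
    ("/bin/netstat", "/sbin/ss"),
    ("/sbin/ip", "/sbin/ifconfig"),
    ("/bin/ps",),
]


def allowed_commands(sudo_l_output: str) -> set:
    """Extract command paths (or the keyword ALL) from `sudo -l` output."""
    commands = set()
    for line in sudo_l_output.splitlines():
        # Rule lines look like "(ALL) NOPASSWD: /path/a, /path/b".
        match = re.match(r"\s*\([^)]*\)\s*(?:[A-Z]+:\s*)*(.*)", line)
        if not match:
            continue
        for entry in match.group(1).split(","):
            entry = entry.strip()
            if entry == "ALL":
                commands.add("ALL")
            elif entry.startswith("/"):
                commands.add(entry.split()[0])  # drop argument restrictions
    return commands


def missing_requirements(sudo_l_output: str, required=REQUIRED_SUDO_COMMANDS) -> list:
    """Return the requirement tuples for which no alternative is permitted."""
    allowed = allowed_commands(sudo_l_output)
    if "ALL" in allowed:
        return []
    return [alts for alts in required if not any(cmd in allowed for cmd in alts)]
```

Feed it the output of `ssh discovery@target-host 'sudo -l'`. An empty list means every required command has at least one permitted alternative.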
Incomplete Discovery
Symptoms
- Missing software inventory
- Partial hardware information
- No relationship data
- Incomplete attributes
Root Cause Analysis
Check Collection Modules:
1. Agent Configuration:
- Verify enabled collectors
- Check module errors
- Review timeout settings
2. Data Collection:
- Process discovery enabled?
- Software scanning active?
- Network connections tracked?
3. Processing Pipeline:
- Normalization errors?
- Pattern matching failures?
- Enrichment timeouts?
Module-Specific Fixes
Software Discovery Issues
# Windows - Registry access
# Check if remote registry is enabled
sc.exe \\target-server query RemoteRegistry
# Linux - Package manager access
# Verify package database readable
ssh discovery@target "rpm -qa | head -5"
ssh discovery@target "dpkg -l | head -5"
# Fix: Add discovery user to required groups
sudo usermod -a -G rpm discovery # For RPM-based systems (requires an existing "rpm" group)
Network Connection Discovery
# Enable network discovery
# Windows
netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes
# Linux - ensure ss/netstat available
# Check if commands exist
which ss netstat lsof
# Install if missing
sudo yum install -y iproute # For ss (RHEL/CentOS)
sudo apt install -y net-tools # For netstat (Debian/Ubuntu)
Performance Issues
Symptoms
- Slow discovery completion
- High CPU/memory usage
- Network congestion
- Timeout errors
Performance Diagnostics
-- Analyze discovery performance
WITH discovery_stats AS (
SELECT
discovery_method,
target_type,
AVG(duration_seconds) as avg_duration,
MAX(duration_seconds) as max_duration,
COUNT(*) as total_discoveries,
SUM(CASE WHEN status = 'timeout' THEN 1 ELSE 0 END) as timeouts
FROM discovery_runs
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY discovery_method, target_type
)
SELECT * FROM discovery_stats
ORDER BY avg_duration DESC;
Performance Tuning
Optimization Strategies:
1. Parallel Processing:
Default: 10 concurrent
High Performance: 50 concurrent
Conservative: 5 concurrent
2. Timeout Adjustments:
Network Devices: 30s → 60s
Busy Servers: 60s → 120s
Slow Links: 30s → 90s
3. Discovery Scope:
- Reduce frequency for stable devices
- Use incremental for frequent scans
- Limit deep discovery to off-hours
4. Resource Limits:
CPU: Max 70% utilization
Memory: Max 4GB per worker
Network: Max 100Mbps total
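The concurrency levels above amount to a cap on how many targets are probed at once. The sketch below illustrates that idea with a thread pool from the Python standard library. It is not NopeSight's scheduler, and `discover_one` is a hypothetical callable you would supply. Worker threads cannot be interrupted from outside, so the per-target timeouts listed above have to be enforced inside `discover_one` itself (for example as socket timeouts).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Concurrency profiles taken from the tuning table above.
CONCURRENCY_PROFILES = {"conservative": 5, "default": 10, "high_performance": 50}


def run_discovery(targets, discover_one, max_workers=CONCURRENCY_PROFILES["default"]):
    """Run discover_one(target) for each target with at most max_workers in flight.

    Returns a dict mapping each target to ("ok", result) or ("error", message),
    so one failing target does not abort the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(discover_one, target): target for target in targets}
        for future in as_completed(futures):
            target = futures[future]
            try:
                results[target] = ("ok", future.result())
            except Exception as exc:
                results[target] = ("error", str(exc))
    return results
```

Start with the conservative profile when the symptom is network congestion, and raise the limit only while CPU, memory, and bandwidth stay under the resource limits.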
Data Quality Issues
Symptoms
- Duplicate CIs created
- Incorrect classifications
- Missing relationships
- Stale data
Data Validation
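# CI Deduplication Check
# Assumes `db` is a connected MongoDB database handle (e.g. from pymongo) and
# `merge_duplicate_cis` is defined elsewhere.
def check_duplicates():
    # Group CIs by name and serial number; keep groups with more than one member
    duplicates = db.cis.aggregate([
        {"$group": {
            "_id": {"name": "$name", "serial": "$serialNumber"},
            "count": {"$sum": 1},
            "ids": {"$push": "$_id"}
        }},
        {"$match": {"count": {"$gt": 1}}}
    ])
    for dup in duplicates:
        print(f"Duplicate found: {dup['_id']} ({dup['count']} instances)")
        # Merge or remove duplicates
        merge_duplicate_cis(dup['ids'])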
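Because the aggregation above groups on name and serial number, CIs that share a name but have no serial number fall into the same group and would be merged. Before merging anything, it can help to do a dry run on exported records. The sketch below works on plain dictionaries with the same field names (`_id`, `name`, `serialNumber`). Skipping records without a serial number and comparing names case-insensitively are assumptions made here, not documented NopeSight behavior.

```python
from collections import defaultdict


def find_duplicate_cis(cis):
    """Group CI records by (lower-cased name, serial number).

    Returns only the groups with more than one member. Records without a
    serial number are skipped so unrelated devices that merely share a name
    are not reported as duplicates.
    """
    groups = defaultdict(list)
    for ci in cis:
        serial = ci.get("serialNumber")
        if not serial:
            continue
        name = (ci.get("name") or "").lower()
        groups[(name, serial)].append(ci["_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Review the returned groups by hand first, and pass the ID lists to a merge routine only once they look right.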
Data Cleanup Procedures
Cleanup Tasks:
1. Remove Orphaned CIs:
- No discovery update > 30 days
- No relationships
- Status = "Unknown"
2. Fix Misclassified Items:
- Re-run AI classification
- Apply pattern matching
- Manual review flagged items
3. Rebuild Relationships:
- Clear stale connections
- Re-discover network topology
- Validate service dependencies
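The orphaned-CI criteria above can be written as a single predicate. This sketch assumes all three conditions must hold together, which is the conservative reading, and that each CI is a dictionary with `last_discovered` (a timezone-aware datetime), `relationships`, and `status` keys. Those field names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def is_orphaned(ci, now=None, max_age_days=30):
    """Return True if the CI is stale, has no relationships, and is 'Unknown'."""
    now = now or datetime.now(timezone.utc)
    stale = now - ci["last_discovered"] > timedelta(days=max_age_days)
    has_relationships = bool(ci.get("relationships"))
    return stale and not has_relationships and ci.get("status") == "Unknown"
```

Use it to build a candidate list for review rather than deleting directly. A CI that fails only one criterion, such as a stale but still connected device, is better handled by rediscovery.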
Advanced Diagnostics
Debug Mode
# Enable debug logging for specific target
nopesight discovery debug --target 10.1.1.50 --verbose
# Debug output example:
[DEBUG] Starting discovery for 10.1.1.50
[DEBUG] Using credential: windows_domain_cred
[DEBUG] Attempting WMI connection...
[DEBUG] WMI connection established
[DEBUG] Querying Win32_ComputerSystem...
[DEBUG] Result: {Name: "SERVER01", Domain: "CORP.LOCAL", ...}
[DEBUG] Querying Win32_OperatingSystem...
[ERROR] Access denied to Win32_Process class
[DEBUG] Falling back to limited discovery mode
Network Packet Analysis
# Capture discovery traffic
sudo tcpdump -i eth0 -w discovery.pcap \
'host 10.1.1.50 and (port 22 or port 135 or port 445 or port 161)'
# Analyze WMI traffic
sudo tcpdump -nn -r discovery.pcap 'port 135' | head -20
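# Check for SNMP timeouts (SNMP v1/v2c): tcpdump does not label timeouts,
# so compare request and response counts; requests without a response timed out
sudo tcpdump -nn -r discovery.pcap 'port 161' | grep -c "GetRequest"
sudo tcpdump -nn -r discovery.pcap 'port 161' | grep -c "GetResponse"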
Discovery Agent Diagnostics
# Agent health check
nopesight-agent diagnose
# Output:
=== NopeSight Agent Diagnostics ===
Version: 3.2.1
Status: Running
Uptime: 5d 14h 23m
Configuration:
Server: https://nopesight.company.com ✓
API Key: ****1234 ✓
Department: IT ✓
Connectivity:
Server reachable: ✓
Last check-in: 2 minutes ago ✓
SSL Certificate: Valid ✓
Collectors:
System Info: ✓ Enabled
Software: ✓ Enabled
Network: ✗ Error: Permission denied on /proc/net/tcp
Processes: ✓ Enabled
Recent Errors:
[2024-01-15 10:23:45] NetworkCollector: Cannot read /proc/net/tcp
[2024-01-15 09:15:32] SoftwareCollector: dpkg timeout
Recommendations:
1. Add agent user to 'proc' group for network collection
2. Increase software scan timeout to 60s
Troubleshooting Tools
Built-in Diagnostics
Web UI Tools:
Discovery Test:
- Target specific device
- Test specific credential
- Use specific method
- View real-time logs
Credential Tester:
- Validate credentials
- Check permissions
- Test connectivity
- Show capabilities
Pattern Debugger:
- Test pattern matching
- View match details
- Debug regex
- Validate logic
Command Line Tools
# Discovery CLI toolkit
# Test specific discovery method
nopesight discover test \
--method wmi \
--target 10.1.1.50 \
--credential prod_windows \
--debug
# Validate discovery scope
nopesight discover validate-scope \
--ranges "10.0.0.0/16,192.168.0.0/24" \
--show-conflicts
# Analyze discovery queue
nopesight queue status --queue discovery
nopesight queue peek discovery --count 10
# Force discovery retry
nopesight discover retry --failed --last 1h
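If you want to sanity-check a range list before handing it to the scope validation command, overlapping CIDR blocks are easy to detect with Python's `ipaddress` module. This sketch treats a "conflict" as two ranges that overlap, which is an assumption. It does not reproduce whatever checks `validate-scope` performs internally.

```python
import ipaddress
from itertools import combinations


def find_scope_conflicts(ranges: str) -> list:
    """Return pairs of overlapping networks from a comma-separated CIDR list."""
    networks = [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in ranges.split(",")
    ]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]
```

The two ranges in the example above, `10.0.0.0/16` and `192.168.0.0/24`, do not overlap, so the function returns an empty list for them.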
Log Analysis
# Common log locations
/var/log/nopesight/discovery.log # Main discovery log
/var/log/nopesight/agent.log # Agent logs
/var/log/nopesight/scheduler.log # Scheduler logs
/var/log/nopesight/error.log # Error aggregation
# Useful grep patterns
# Find authentication failures
grep -i "auth\|denied\|permission" discovery.log
# Find timeout issues
grep -i "timeout\|timed out" discovery.log
# Find network errors
grep -i "unreachable\|refused\|network" discovery.log
# Find pattern matching issues
grep -i "pattern\|match\|regex" discovery.log
# Performance issues
grep -i "slow\|performance\|exceeded" discovery.log
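When a log is too large to scan with individual grep commands, the same patterns can be applied in one pass to see which category dominates. This is a small sketch that reuses the grep expressions above. The category names are labels chosen here for illustration.

```python
import re
from collections import Counter

# Same expressions as the grep patterns above, matched case-insensitively.
CATEGORIES = {
    "authentication": re.compile(r"auth|denied|permission", re.IGNORECASE),
    "timeout": re.compile(r"timeout|timed out", re.IGNORECASE),
    "network": re.compile(r"unreachable|refused|network", re.IGNORECASE),
    "pattern": re.compile(r"pattern|match|regex", re.IGNORECASE),
    "performance": re.compile(r"slow|performance|exceeded", re.IGNORECASE),
}


def categorize(lines) -> Counter:
    """Count how many lines match each category (a line may match several)."""
    counts = Counter()
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Pass it an open file handle for `/var/log/nopesight/discovery.log` and start with the section of this guide that matches the largest count.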
Recovery Procedures
Emergency Recovery
When Discovery Completely Fails:
1. Stop all discovery:
systemctl stop nopesight-discovery
nopesight discovery pause --all
2. Clear stuck jobs:
nopesight queue clear discovery --stuck
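# DEL does not expand wildcards, so list the matching keys with --scan first
redis-cli --scan --pattern 'bull:discovery:*' | xargs -r redis-cli DEL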
3. Reset discovery state:
nopesight discovery reset --confirm
4. Restart services:
systemctl start nopesight-discovery
systemctl restart nopesight-scheduler
5. Test with single target:
nopesight discover now --target 10.1.1.1
6. Resume normal operations:
nopesight discovery resume --all
Data Recovery
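-- Restore CIs from discovery history
-- Takes the most recent successful record per hostname and skips
-- hostnames that already exist for the tenant (NOT EXISTS is NULL-safe)
INSERT INTO cis (name, type, attributes, last_discovered)
SELECT DISTINCT ON (h.raw_data->>'hostname')
    h.raw_data->>'hostname' as name,
    h.raw_data->>'device_type' as type,
    h.raw_data->'attributes' as attributes,
    h.discovered_at as last_discovered
FROM discovery_history h
WHERE h.discovered_at >= '2024-01-14'
    AND h.status = 'success'
    AND h.raw_data->>'hostname' IS NOT NULL
    AND NOT EXISTS (
        SELECT 1 FROM cis c
        WHERE c.tenant_id = 'IT'
            AND c.name = h.raw_data->>'hostname'
    )
ORDER BY h.raw_data->>'hostname', h.discovered_at DESC;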
Prevention Strategies
Monitoring Setup
Proactive Monitoring:
Metrics to Track:
- Discovery success rate < 95%
- Average duration increasing
- Timeout rate > 5%
- Queue depth > 1000
- Error rate > 2%
Alerts:
- Discovery failures > 10 in 5 min
- No discoveries in expected window
- Credential failures spike
- Resource exhaustion warning
Dashboards:
- Real-time discovery status
- Success rate trends
- Performance metrics
- Error categorization
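The numeric thresholds above translate directly into a small check that a monitoring job could run against current metrics. This is a sketch only: the metric names and the percentage scale are assumptions, and a real deployment would read the values from your monitoring system.

```python
# Thresholds from the list above: ("min", x) alerts below x, ("max", x) above x.
THRESHOLDS = {
    "success_rate": ("min", 95.0),   # percent
    "timeout_rate": ("max", 5.0),    # percent
    "error_rate": ("max", 2.0),      # percent
    "queue_depth": ("max", 1000),    # queued jobs
}


def evaluate(metrics: dict) -> list:
    """Return one alert message per metric that breaches its threshold."""
    alerts = []
    for name, (kind, limit) in THRESHOLDS.items():
        if name not in metrics:
            continue  # metric not collected; nothing to evaluate
        value = metrics[name]
        breached = value < limit if kind == "min" else value > limit
        if breached:
            alerts.append(f"{name}={value} breaches {kind} threshold {limit}")
    return alerts
```

Values exactly at a threshold do not alert, matching the strict `<` and `>` comparisons in the list.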
Best Practices
Regular Maintenance:
- Weekly credential validation
- Monthly discovery audit
- Quarterly pattern review
- Annual architecture review
Documentation:
- Document all custom patterns
- Maintain troubleshooting runbook
- Record common solutions
- Update network diagrams
Testing:
- Test credentials before production
- Validate patterns in dev
- Load test discovery system
- Practice recovery procedures
Getting Help
Support Resources
Internal Resources:
- Discovery team Slack: #discovery-help
- Wiki: https://wiki.company.com/nopesight
- Runbooks: https://runbooks.company.com
NopeSight Support:
- Email: support@nopesight.com
- Portal: https://support.nopesight.com
- Phone: 1-800-NOPESIGHT
Community:
- Forums: https://community.nopesight.com
- GitHub: https://github.com/nopesight/patterns
- Slack: nopesight-users.slack.com
Diagnostic Package
#!/bin/bash
# Create diagnostic package for support
DIAG_DIR="/tmp/nopesight-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$DIAG_DIR"
# Collect system info
nopesight system info > "$DIAG_DIR/system-info.txt"
nopesight discovery status > "$DIAG_DIR/discovery-status.txt"
# Collect recent logs
tail -n 10000 /var/log/nopesight/*.log > "$DIAG_DIR/recent-logs.txt"
# Collect configuration (sanitized)
nopesight config export --sanitize > "$DIAG_DIR/config.yaml"
# Create archive
tar -czf "$DIAG_DIR.tar.gz" -C /tmp "$(basename "$DIAG_DIR")"
echo "Diagnostic package created: $DIAG_DIR.tar.gz"
Next Steps
- 📖 Best Practices - CMDB best practices
- 📖 Performance Tuning - System optimization
- 📖 Support Guide - Getting help