Troubleshooting Discovery

This guide helps you diagnose and resolve common discovery issues in NopeSight. Use the systematic approach and tools provided to quickly identify and fix problems that may arise during infrastructure discovery.

Diagnostic Framework

Troubleshooting Workflow

Discovery Health Check

#!/bin/bash
# NopeSight Discovery Health Check Script

echo "=== Discovery System Health Check ==="
echo

# Check discovery services (--no-pager keeps output non-interactive)
echo "1. Checking Discovery Services..."
systemctl status nopesight-discovery --no-pager
systemctl status nopesight-agent --no-pager
systemctl status nopesight-scheduler --no-pager

# Check connectivity (-f fails on HTTP errors; --max-time avoids hanging)
echo -e "\n2. Checking Network Connectivity..."
curl -sf --max-time 10 https://nopesight.company.com/api/health || echo "API unreachable"

# Check credentials
echo -e "\n3. Checking Credential Vault..."
nopesight credential test --all --summary

# Check recent discoveries
echo -e "\n4. Recent Discovery Status..."
nopesight discovery status --last 1h

# Check error logs
echo -e "\n5. Recent Errors..."
grep ERROR /var/log/nopesight/discovery.log | tail -20

# Check resource usage
echo -e "\n6. Resource Usage..."
top -bn1 | grep nopesight
df -h | grep -E "^/|nopesight"

echo -e "\n=== Health Check Complete ==="

Common Issues

No Discovery Data

Symptoms

  • CIs not appearing in CMDB
  • Discovery shows "No devices found"
  • Empty discovery results

Diagnostic Steps

1. Network Connectivity:
  Test: ping target_device
  Test: telnet target_device 22 (SSH) or 135 (WMI/RPC)
  Test: SNMP query to UDP 161 (telnet only tests TCP ports)
  Check: Firewall rules
  Check: Network ACLs

2. Discovery Service:
  Check: Service status
  Check: Worker processes
  Check: Queue backlog
  Review: Service logs

3. Target Availability:
  Verify: Device is powered on
  Verify: Services are running
  Check: Local firewall
  Check: SELinux/AppArmor

4. Discovery Scope:
  Review: IP ranges
  Review: Exclusion rules
  Check: Discovery filters
  Verify: Schedule active
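
The TCP part of the connectivity check above can be scripted when telnet is not installed on the discovery host. The following is a minimal sketch, not part of NopeSight; the function names are hypothetical, and it covers only TCP ports (SSH on 22, WMI/RPC on 135), not SNMP on UDP 161.

```python
import socket


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_discovery_ports(host, ports=(22, 135)):
    """Map each TCP port to its reachability from this machine."""
    return {port: tcp_port_open(host, port) for port in ports}
```

Calling `check_discovery_ports("target_device")` returns a dictionary such as `{22: True, 135: False}`; a `False` entry points toward the firewall and ACL checks listed above.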

Common Solutions

Firewall Configuration
# Windows - Allow WMI
netsh advfirewall firewall add rule name="WMI-In" dir=in action=allow protocol=TCP localport=135
netsh advfirewall firewall add rule name="WMI-Async-In" dir=in action=allow protocol=TCP localport=49152-65535

# Linux - Allow SSH
sudo ufw allow from 10.0.0.0/8 to any port 22
sudo iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 22 -j ACCEPT

# Network Device - Allow SNMP (replace "public" with a non-default community string)
access-list 100 permit udp host 10.1.1.100 any eq 161
snmp-server community public RO 100

Service Configuration
# Windows - Enable WMI
sc config winmgmt start= auto
net start winmgmt

# Enable Remote Registry
sc config RemoteRegistry start= auto
net start RemoteRegistry

# Linux - Configure SSH
sudo systemctl enable sshd
sudo systemctl start sshd

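# Configure sudo for discovery, then restrict permissions and validate the file
echo "discovery ALL=(ALL) NOPASSWD: /usr/bin/dmidecode, /bin/netstat, /sbin/ip" | sudo tee /etc/sudoers.d/discovery
sudo chmod 0440 /etc/sudoers.d/discovery
sudo visudo -cf /etc/sudoers.d/discovery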

Authentication Failures

Symptoms

  • "Access denied" errors
  • "Invalid credentials" messages
  • Partial discovery with auth errors

Diagnostic Tests

# Test Windows credentials
$cred = Get-Credential
Test-WSMan -ComputerName target-server -Credential $cred -Authentication Negotiate

# Test WMI access (Get-WmiObject requires Windows PowerShell 5.1 or earlier)
Get-WmiObject -Class Win32_OperatingSystem -ComputerName target-server -Credential $cred

# Test specific permissions
Get-WmiObject -Class Win32_Process -ComputerName target-server -Credential $cred

# Test Linux SSH
ssh -o PasswordAuthentication=yes discovery@target-host 'echo "Connection successful"'

# Test sudo permissions
ssh discovery@target-host 'sudo -l'

# Test specific commands
ssh discovery@target-host 'sudo dmidecode -t system'

Permission Requirements

Windows Requirements:
  Local Groups:
    - Performance Monitor Users
    - Event Log Readers
    - Distributed COM Users

  User Rights:
    - Log on as a service
    - Access this computer from network

  DCOM Permissions:
    - Local Launch
    - Remote Launch
    - Local Activation
    - Remote Activation

Linux Requirements:
  SSH Access: Required
  Sudo Commands:
    - /usr/bin/dmidecode
    - /bin/netstat or /sbin/ss
    - /sbin/ip or /sbin/ifconfig
    - /usr/bin/lsof (optional)
    - /bin/ps

  File Access:
    - /proc/* (read)
    - /sys/* (read)
    - /etc/os-release (read)
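
To compare the Linux sudo requirements above against what a target actually grants, the output of `sudo -l` can be checked programmatically. This is a rough sketch with a hypothetical function name; it assumes each grant appears on a single line containing `NOPASSWD:` followed by a comma-separated command list, which may not hold for every sudoers layout.

```python
def missing_sudo_commands(sudo_l_output, required):
    """Return the required commands that do not appear in any NOPASSWD grant."""
    allowed = set()
    for line in sudo_l_output.splitlines():
        if "NOPASSWD:" in line:
            # Everything after the tag is a comma-separated command list.
            commands = line.split("NOPASSWD:", 1)[1]
            allowed.update(cmd.strip() for cmd in commands.split(",") if cmd.strip())
    return [cmd for cmd in required if cmd not in allowed]
```

Feed it the text returned by `ssh discovery@target-host 'sudo -l'` together with the command paths from the list above; an empty result means every required command is granted.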

Incomplete Discovery

Symptoms

  • Missing software inventory
  • Partial hardware information
  • No relationship data
  • Incomplete attributes

Root Cause Analysis

Check Collection Modules:
  1. Agent Configuration:
    - Verify enabled collectors
    - Check module errors
    - Review timeout settings

  2. Data Collection:
    - Process discovery enabled?
    - Software scanning active?
    - Network connections tracked?

  3. Processing Pipeline:
    - Normalization errors?
    - Pattern matching failures?
    - Enrichment timeouts?

Module-Specific Fixes

Software Discovery Issues
# Windows - Registry access
# Check if remote registry is enabled
sc \\target-server query RemoteRegistry

# Linux - Package manager access
# Verify package database readable
ssh discovery@target "rpm -qa | head -5"
ssh discovery@target "dpkg -l | head -5"

# Fix: Add discovery user to required groups (requires root)
sudo usermod -a -G rpm discovery # For RPM-based systems

Network Connection Discovery
# Enable network discovery
# Windows
netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes

# Linux - ensure ss/netstat available
# Check if commands exist
which ss netstat lsof

# Install if missing
sudo yum install -y iproute # For ss
sudo apt install -y net-tools # For netstat

Performance Issues

Symptoms

  • Slow discovery completion
  • High CPU/memory usage
  • Network congestion
  • Timeout errors

Performance Diagnostics

-- Analyze discovery performance
WITH discovery_stats AS (
    SELECT
        discovery_method,
        target_type,
        AVG(duration_seconds) AS avg_duration,
        MAX(duration_seconds) AS max_duration,
        COUNT(*) AS total_discoveries,
        SUM(CASE WHEN status = 'timeout' THEN 1 ELSE 0 END) AS timeouts
    FROM discovery_runs
    WHERE created_at >= NOW() - INTERVAL '24 hours'
    GROUP BY discovery_method, target_type
)
SELECT * FROM discovery_stats
ORDER BY avg_duration DESC;

Performance Tuning

Optimization Strategies:
  1. Parallel Processing:
    Default: 10 concurrent
    High Performance: 50 concurrent
    Conservative: 5 concurrent

  2. Timeout Adjustments:
    Network Devices: 30s → 60s
    Busy Servers: 60s → 120s
    Slow Links: 30s → 90s

  3. Discovery Scope:
    - Reduce frequency for stable devices
    - Use incremental for frequent scans
    - Limit deep discovery to off-hours

  4. Resource Limits:
    CPU: Max 70% utilization
    Memory: Max 4GB per worker
    Network: Max 100Mbps total
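
The interaction between the concurrency limit and per-target timeouts can be illustrated with a small sketch. It is not NopeSight's scheduler: `run_discovery`, `probe`, and the target-type keys are hypothetical names, the timeouts are the raised values from the list above, and the 30-second fallback is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# Raised timeouts (seconds) from the tuning list; the keys are illustrative.
TIMEOUTS_S = {"network_device": 60, "busy_server": 120, "slow_link": 90}
DEFAULT_TIMEOUT_S = 30  # assumed fallback for target types not listed


def run_discovery(targets, probe, max_workers=10):
    """Run probe(address, timeout_s) for each (address, target_type) pair.

    At most max_workers probes run at once. Results are returned in input
    order as (address, result_or_exception) tuples.
    """
    def run_one(target):
        address, target_type = target
        timeout_s = TIMEOUTS_S.get(target_type, DEFAULT_TIMEOUT_S)
        try:
            return address, probe(address, timeout_s)
        except Exception as exc:
            # One failing target should not abort the rest of the batch.
            return address, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, targets))
```

Passing `max_workers=5` or `max_workers=50` corresponds to the conservative and high-performance settings; the probe itself is responsible for honoring the timeout it receives.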

Data Quality Issues

Symptoms

  • Duplicate CIs created
  • Incorrect classifications
  • Missing relationships
  • Stale data

Data Validation

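# CI Deduplication Check
# Assumes `db` is a MongoDB database handle and `merge_duplicate_cis`
# is defined elsewhere; neither is shown here.
def check_duplicates():
    # Group CIs by name and serial number, keeping groups with more than one member
    duplicates = db.cis.aggregate([
        {"$group": {
            "_id": {"name": "$name", "serial": "$serialNumber"},
            "count": {"$sum": 1},
            "ids": {"$push": "$_id"}
        }},
        {"$match": {"count": {"$gt": 1}}}
    ])

    for dup in duplicates:
        print(f"Duplicate found: {dup['_id']} ({dup['count']} instances)")
        # Merge or remove duplicates
        merge_duplicate_cis(dup['ids'])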
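
The same grouping logic can be tried on exported records without a database connection. The sketch below is an in-memory variant with a hypothetical function name. Unlike the aggregation above, it skips records with no serial number; that is an assumption made here so that unrelated CIs sharing a name and a blank serial are not reported as duplicates.

```python
from collections import defaultdict


def find_duplicate_groups(cis):
    """Group CI dicts by (name, serialNumber); return groups with more than one id."""
    groups = defaultdict(list)
    for ci in cis:
        serial = ci.get("serialNumber")
        if not serial:
            # Assumption: a missing serial is too weak a key to call a duplicate.
            continue
        groups[(ci["name"], serial)].append(ci["_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Reviewing the returned groups before merging is a cheap safeguard, since a merge is usually harder to undo than a missed duplicate.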

Data Cleanup Procedures

Cleanup Tasks:
  1. Remove Orphaned CIs:
    - No discovery update > 30 days
    - No relationships
    - Status = "Unknown"

  2. Fix Misclassified Items:
    - Re-run AI classification
    - Apply pattern matching
    - Manual review flagged items

  3. Rebuild Relationships:
    - Clear stale connections
    - Re-discover network topology
    - Validate service dependencies
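
The orphaned-CI criteria above can be expressed as a single predicate. This sketch assumes that all three conditions must hold before a CI is treated as orphaned (the conservative reading of the list) and that each CI is a dict with `last_discovered`, `relationships`, and `status` keys; those field names are illustrative, not the actual schema.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)


def is_orphaned(ci, now=None):
    """Return True when a CI is stale, has no relationships, and has status Unknown."""
    now = now or datetime.now(timezone.utc)
    stale = now - ci["last_discovered"] > STALE_AFTER
    unlinked = not ci.get("relationships")
    unknown = ci.get("status") == "Unknown"
    return stale and unlinked and unknown
```

Listing the matches first, and deleting only after review, keeps the cleanup reversible.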

Advanced Diagnostics

Debug Mode

# Enable debug logging for specific target
nopesight discovery debug --target 10.1.1.50 --verbose

# Debug output example:
[DEBUG] Starting discovery for 10.1.1.50
[DEBUG] Using credential: windows_domain_cred
[DEBUG] Attempting WMI connection...
[DEBUG] WMI connection established
[DEBUG] Querying Win32_ComputerSystem...
[DEBUG] Result: {Name: "SERVER01", Domain: "CORP.LOCAL", ...}
[DEBUG] Querying Win32_OperatingSystem...
[ERROR] Access denied to Win32_Process class
[DEBUG] Falling back to limited discovery mode

Network Packet Analysis

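# Capture discovery traffic
sudo tcpdump -i eth0 -w discovery.pcap \
  'host 10.1.1.50 and (port 22 or port 135 or port 445 or port 161)'

# Analyze WMI traffic
sudo tcpdump -nn -r discovery.pcap 'port 135' | head -20

# Check for SNMP timeouts: a capture holds packets, not timeout messages,
# so compare the request count with the response count (SNMPv1/v2c)
sudo tcpdump -nn -r discovery.pcap 'udp port 161' | grep -cE "Get(Next)?Request|GetBulk"
sudo tcpdump -nn -r discovery.pcap 'udp port 161' | grep -c "GetResponse"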

Discovery Agent Diagnostics

# Agent health check
nopesight-agent diagnose

# Output:
=== NopeSight Agent Diagnostics ===
Version: 3.2.1
Status: Running
Uptime: 5d 14h 23m

Configuration:
Server: https://nopesight.company.com ✓
API Key: ****1234 ✓
Department: IT ✓

Connectivity:
Server reachable: ✓
Last check-in: 2 minutes ago ✓
SSL Certificate: Valid ✓

Collectors:
System Info: ✓ Enabled
Software: ✓ Enabled
Network: ✗ Error: Permission denied on /proc/net/tcp
Processes: ✓ Enabled

Recent Errors:
[2024-01-15 10:23:45] NetworkCollector: Cannot read /proc/net/tcp
[2024-01-15 09:15:32] SoftwareCollector: dpkg timeout

Recommendations:
1. Add agent user to 'proc' group for network collection
2. Increase software scan timeout to 60s
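
When diagnostics are collected from many agents, the failing collectors can be pulled out of the report text automatically. The sketch below assumes the layout shown in the sample output above (a `Collectors:` section with one `Name: ✓/✗ ...` line per collector); the function name is hypothetical and the real output format may differ between agent versions.

```python
def failing_collectors(diagnose_output):
    """Return {collector_name: error_text} for ✗ lines in the Collectors section."""
    failures = {}
    in_collectors = False
    for raw in diagnose_output.splitlines():
        line = raw.strip()
        is_header = line.endswith(":") and "✓" not in line and "✗" not in line
        if is_header:
            # Section headers such as "Configuration:" switch the current section.
            in_collectors = line == "Collectors:"
            continue
        if in_collectors and "✗" in line:
            name, _, detail = line.partition(":")
            failures[name.strip()] = detail.replace("✗", "", 1).strip()
    return failures
```

For the sample report above, the only entry returned is the Network collector with its permission error.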

Troubleshooting Tools

Built-in Diagnostics

Web UI Tools:
  Discovery Test:
    - Target specific device
    - Test specific credential
    - Use specific method
    - View real-time logs

  Credential Tester:
    - Validate credentials
    - Check permissions
    - Test connectivity
    - Show capabilities

  Pattern Debugger:
    - Test pattern matching
    - View match details
    - Debug regex
    - Validate logic

Command Line Tools

# Discovery CLI toolkit

# Test specific discovery method
nopesight discover test \
  --method wmi \
  --target 10.1.1.50 \
  --credential prod_windows \
  --debug

# Validate discovery scope
nopesight discover validate-scope \
  --ranges "10.0.0.0/16,192.168.0.0/24" \
  --show-conflicts

# Analyze discovery queue
nopesight queue status --queue discovery
nopesight queue peek discovery --count 10

# Force discovery retry
nopesight discover retry --failed --last 1h
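
The kind of conflict that scope validation looks for can be reproduced offline with Python's standard `ipaddress` module. This is an independent sketch, not the logic behind `validate-scope`; the function name is hypothetical and it only reports overlapping CIDR ranges.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of CIDR strings whose address ranges overlap."""
    networks = [ipaddress.ip_network(r.strip(), strict=False) for r in ranges]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        # Only compare networks of the same IP version.
        if a.version == b.version and a.overlaps(b)
    ]
```

The two ranges in the command above do not overlap, so `find_overlaps` returns an empty list for them; adding `10.0.5.0/24` would be reported as a conflict with `10.0.0.0/16`.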

Log Analysis

# Common log locations
# /var/log/nopesight/discovery.log    Main discovery log
# /var/log/nopesight/agent.log        Agent logs
# /var/log/nopesight/scheduler.log    Scheduler logs
# /var/log/nopesight/error.log        Error aggregation

# Useful grep patterns
LOG=/var/log/nopesight/discovery.log

# Find authentication failures
grep -iE "auth|denied|permission" "$LOG"

# Find timeout issues
grep -iE "timeout|timed out" "$LOG"

# Find network errors
grep -iE "unreachable|refused|network" "$LOG"

# Find pattern matching issues
grep -iE "pattern|match|regex" "$LOG"

# Performance issues
grep -iE "slow|performance|exceeded" "$LOG"
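
To see which problem category dominates a log, the grep patterns above can be applied in one pass and counted. The sketch below reuses those keywords verbatim; the category names and the function name are illustrative, and a line that matches several patterns is counted once per category.

```python
import re
from collections import Counter

# Keyword groups copied from the grep patterns above.
CATEGORIES = {
    "authentication": re.compile(r"auth|denied|permission", re.IGNORECASE),
    "timeout": re.compile(r"timeout|timed out", re.IGNORECASE),
    "network": re.compile(r"unreachable|refused|network", re.IGNORECASE),
    "pattern": re.compile(r"pattern|match|regex", re.IGNORECASE),
    "performance": re.compile(r"slow|performance|exceeded", re.IGNORECASE),
}


def categorize(lines):
    """Count how many log lines match each keyword category."""
    counts = Counter()
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

A typical call is `categorize(open("/var/log/nopesight/discovery.log"))`, followed by `most_common()` on the result to rank the categories.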

Recovery Procedures

Emergency Recovery

When Discovery Completely Fails:
  1. Stop all discovery:
    systemctl stop nopesight-discovery
    nopesight discovery pause --all

  2. Clear stuck jobs (DEL does not expand wildcards, so scan for matching keys first):
    nopesight queue clear discovery --stuck
    redis-cli --scan --pattern "bull:discovery:*" | xargs -r redis-cli DEL

  3. Reset discovery state:
    nopesight discovery reset --confirm

  4. Restart services:
    systemctl start nopesight-discovery
    systemctl restart nopesight-scheduler

  5. Test with single target:
    nopesight discover now --target 10.1.1.1

  6. Resume normal operations:
    nopesight discovery resume --all

Data Recovery

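-- Restore CIs from discovery history
-- DISTINCT ON keeps only the most recent successful record per hostname
INSERT INTO cis (name, type, attributes, last_discovered)
SELECT DISTINCT ON (h.raw_data->>'hostname')
    h.raw_data->>'hostname' AS name,
    h.raw_data->>'device_type' AS type,
    h.raw_data->'attributes' AS attributes,
    h.discovered_at AS last_discovered
FROM discovery_history h
WHERE h.discovered_at >= '2024-01-14'
  AND h.status = 'success'
  AND h.raw_data->>'hostname' IS NOT NULL
  -- NOT EXISTS stays correct when cis.name contains NULLs, unlike NOT IN
  AND NOT EXISTS (
      SELECT 1 FROM cis c
      WHERE c.tenant_id = 'IT'
        AND c.name = h.raw_data->>'hostname'
  )
ORDER BY h.raw_data->>'hostname', h.discovered_at DESC;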

Prevention Strategies

Monitoring Setup

Proactive Monitoring:
  Metrics to Track:
    - Discovery success rate < 95%
    - Average duration increasing
    - Timeout rate > 5%
    - Queue depth > 1000
    - Error rate > 2%

  Alerts:
    - Discovery failures > 10 in 5 min
    - No discoveries in expected window
    - Credential failures spike
    - Resource exhaustion warning

  Dashboards:
    - Real-time discovery status
    - Success rate trends
    - Performance metrics
    - Error categorization
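
The numeric thresholds in the metrics list can be checked with a few lines of code before a full monitoring stack is in place. This sketch assumes the metrics arrive as a dict of percentages and counts; the key names and the function name are hypothetical, and the "average duration increasing" trend is left out because it needs historical data.

```python
def breached_thresholds(metrics):
    """Return a message for each threshold from the metrics list that is crossed."""
    checks = [
        (metrics["success_rate_pct"] < 95, "success rate below 95%"),
        (metrics["timeout_rate_pct"] > 5, "timeout rate above 5%"),
        (metrics["queue_depth"] > 1000, "queue depth above 1000"),
        (metrics["error_rate_pct"] > 2, "error rate above 2%"),
    ]
    return [message for crossed, message in checks if crossed]
```

An empty list means no threshold is crossed; any returned message is a candidate for an alert.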

Best Practices

  1. Regular Maintenance

    • Weekly credential validation
    • Monthly discovery audit
    • Quarterly pattern review
    • Annual architecture review
  2. Documentation

    • Document all custom patterns
    • Maintain troubleshooting runbook
    • Record common solutions
    • Update network diagrams
  3. Testing

    • Test credentials before production
    • Validate patterns in dev
    • Load test discovery system
    • Practice recovery procedures

Getting Help

Support Resources

Internal Resources:
  - Discovery team Slack: #discovery-help
  - Wiki: https://wiki.company.com/nopesight
  - Runbooks: https://runbooks.company.com

NopeSight Support:
  - Email: support@nopesight.com
  - Portal: https://support.nopesight.com
  - Phone: 1-800-NOPESIGHT

Community:
  - Forums: https://community.nopesight.com
  - GitHub: https://github.com/nopesight/patterns
  - Slack: nopesight-users.slack.com

Diagnostic Package

#!/bin/bash
# Create diagnostic package for support

DIAG_DIR="/tmp/nopesight-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$DIAG_DIR"

# Collect system info
nopesight system info > "$DIAG_DIR/system-info.txt"
nopesight discovery status > "$DIAG_DIR/discovery-status.txt"

# Collect recent logs (last 10000 lines of each file)
tail -n 10000 /var/log/nopesight/*.log > "$DIAG_DIR/recent-logs.txt"

# Collect configuration (sanitized)
nopesight config export --sanitize > "$DIAG_DIR/config.yaml"

# Create archive
tar -czf "$DIAG_DIR.tar.gz" -C /tmp "$(basename "$DIAG_DIR")"
echo "Diagnostic package created: $DIAG_DIR.tar.gz"

Next Steps