Common Issues & Solutions¶
Quick reference guide for troubleshooting common WithinEarth infrastructure issues.
Database Connection Issues¶
Problem: "Cannot connect to SQL Server"¶
Symptoms: - Application throws SQL connection timeout errors - Database queries hang indefinitely - API returns 500 errors
Troubleshooting Steps:
# 1. Check if SQL Server is listening
telnet 10.32.8.130 1988
# 2. Check network connectivity
ping 10.32.8.130
# 3. Verify SQL Server service is running
# On Windows SQL Server:
services.msc
# Find "SQL Server (MSSQLSERVER)" - ensure it's Running
# 4. Check firewall
netsh advfirewall firewall show rule name="SQL Server"
# 5. Test with sqlcmd
sqlcmd -S 10.32.8.130,1988 -U sa -P 'password'
Common Solutions: - Restart SQL Server service - Check firewall rules - Verify connection string credentials - Use Hybrid Failover System for automatic failover
Problem: "MongoDB connection timeout"¶
Symptoms: - Cache lookups failing - Slow search responses - MongoDB connection errors in logs
Troubleshooting:
# 1. Check MongoDB is running
telnet 10.32.8.51 27017
# 2. Connect with mongo shell
mongosh mongodb://10.32.8.51:27017/
# 3. Check MongoDB logs
# On MongoDB server:
tail -f /var/log/mongodb/mongod.log
# 4. Check MongoDB status
systemctl status mongod
Solutions:
- Restart MongoDB: systemctl restart mongod
- Clear MongoDB locks: rm /var/lib/mongodb/mongod.lock && mongod --repair
- Check disk space: df -h
Performance Issues¶
Problem: "API response time > 2 seconds"¶
Symptoms: - Slow search results - High client receive time in HAProxy - User complaints about slowness
Diagnosis Steps:
# 1. Check HAProxy stats
curl http://10.32.8.209:8080/stats
# 2. Check server CPU/Memory
# On Windows API server:
Get-Counter '\Processor(_Total)\% Processor Time'
Get-Counter '\Memory\Available MBytes'
# 3. Check database query performance
# Run on SQL Server:
SELECT TOP 10
total_worker_time/execution_count AS avg_cpu_time,
total_elapsed_time/execution_count AS avg_elapsed_time,
SUBSTRING(text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY avg_elapsed_time DESC
Common Causes: 1. Cache miss - check cache hit rate 2. Slow supplier responses - check supplier timeout logs 3. Database lock contention 4. Network latency
Solutions: - Increase cache timeout - Review API-2 Slowness Guide - Optimize slow queries - Increase connection pool size
Problem: "High memory usage on API server"¶
Symptoms: - Application slowing down over time - Out of memory exceptions - Frequent application restarts
Check Memory Usage:
# On Windows API server
Get-Process w3wp | Select-Object -Property ProcessName, @{Name="Memory (MB)"; Expression={$_.WorkingSet64 / 1MB}}
# Check .NET GC stats
Get-Counter '\\.NET CLR Memory(*)\# Bytes in all Heaps'
Solutions: - Restart IIS Application Pool - Increase application pool memory limit - Fix memory leaks (review code) - Enable server GC in web.config
Cache Issues¶
Problem: "Cache hit rate < 50%"¶
Symptoms: - Excessive supplier API calls - High database load - Slow response times
Diagnosis:
# Check Redis cache statistics
redis-cli -h 10.32.8.205 INFO stats
# Check MongoDB cache collection size
mongosh mongodb://10.32.8.51:27017/withinearth --eval "db.XConnect_Live.count()"
# Check cache expiry settings
# Review appsettings.json:
"CacheTimeOut": "35" # Should be 30-60 minutes
Solutions: - Increase cache timeout from 35 to 60 minutes - Pre-warm cache for popular routes - Implement cache stampede protection - Review Caching Strategy
SSL Certificate Issues¶
Problem: "SSL certificate expired"¶
Symptoms: - Browser shows "Your connection is not private" - API calls fail with SSL errors - Email delivery failures
Check Certificate Expiry:
# Check certificate expiry
echo | openssl s_client -connect mail.withinearth.com:465 2>/dev/null | openssl x509 -noout -dates
# Check all certificates
/home/monitor/ssl_monitor.sh
Solutions: - Renew certificate before expiry - Update certificate in IIS/Apache - Update certificate in email server - Restart web server after renewal
Network Issues¶
Problem: "Intermittent connection failures"¶
Symptoms: - Random database connection drops - API requests timeout sporadically - HAProxy showing failed health checks
Troubleshooting:
# 1. Check for packet loss
ping -c 100 10.32.8.130
# 2. Check network latency
mtr 10.32.8.130
# 3. Check for port exhaustion (Windows)
netstat -ano | find /c "TIME_WAIT"
# If > 10,000, you have port exhaustion
# 4. Check TCP connections
netstat -ano | find "ESTABLISHED" | find "1988"
Solutions: - Reduce TIME_WAIT timeout - Increase ephemeral port range - Enable TCP connection pooling - Use connection string pooling options
HAProxy Issues¶
Problem: "HAProxy not routing requests"¶
Symptoms: - 503 Service Unavailable - All backend servers marked as DOWN - No traffic reaching API servers
Check HAProxy Status:
# Check HAProxy is running
systemctl status haproxy
# Check HAProxy stats page
curl http://10.32.8.209:8080/stats
# Check HAProxy logs
tail -f /var/log/haproxy.log
# Test backend connectivity
curl -v http://10.32.8.134/health
Solutions:
- Restart HAProxy: systemctl restart haproxy
- Check backend health
- Verify haproxy.cfg configuration
- Check firewall between HAProxy and API servers
RabbitMQ Issues¶
Problem: "Messages not being processed"¶
Symptoms: - Queue depth increasing - Background jobs not running - Delayed notifications
Check RabbitMQ:
# Check RabbitMQ status
systemctl status rabbitmq-server
# Check queue depth
curl -u admin:password http://10.32.8.90:15672/api/queues
# Check RabbitMQ logs
tail -f /var/log/rabbitmq/rabbit@hostname.log
Solutions:
- Restart RabbitMQ: systemctl restart rabbitmq-server
- Purge stuck queues
- Increase consumer count
- Check disk space (RabbitMQ stops accepting messages when disk < 50MB)
Deployment Issues¶
Problem: "Application won't start after deployment"¶
Symptoms: - IIS Application Pool crashes - 500 errors on all endpoints - Application event log errors
Troubleshooting:
# Check IIS Application Pool status
Get-WebAppPoolState "WithinEarthAppPool"
# Check event logs
Get-EventLog -LogName Application -Newest 50 | Where-Object {$_.Source -like "*ASP.NET*"}
# Check IIS logs
Get-Content C:\inetpub\logs\LogFiles\W3SVC1\*.log -Tail 50
Common Causes: 1. Missing DLL dependencies 2. Wrong .NET runtime version 3. Invalid appsettings.json 4. Database connection failure 5. Missing file permissions
Solutions: - Verify .NET 9 runtime is installed - Check appsettings.json syntax - Test database connectivity - Grant IIS_IUSRS permissions to application folder - Restart Application Pool
Emergency Procedures¶
Complete System Outage¶
-
Check HAProxy - Is load balancer up?
-
Check API Servers - Are all API servers down?
-
Check Databases - Is primary database accessible?
-
Failover Sequence:
- HAProxy down → Restart:
systemctl restart haproxy - All APIs down → Restart IIS on each server
- Primary DB down → Activate Hybrid Failover
Monitoring & Alerts¶
Setup Alert Notifications¶
Critical Alerts: - Database down - All API servers down - SSL certificate expiring in < 7 days - Disk space < 10%
Warning Alerts: - High CPU/Memory usage - Cache hit rate < 50% - Response time > 2 seconds - Failed health checks
Tools: - Pushover for mobile alerts - Slack for team notifications - Email for non-critical alerts
Useful Commands Reference¶
Database¶
# SQL Server - Check version
sqlcmd -S 10.32.8.130,1988 -Q "SELECT @@VERSION"
# MongoDB - Check replica set status
mongosh mongodb://10.32.8.51:27017/ --eval "rs.status()"
# Redis - Check memory usage
redis-cli -h 10.32.8.205 INFO memory
Network¶
# Check open ports
netstat -tulpn | grep LISTEN
# Check firewall rules
iptables -L -n
# Test HTTP endpoint
curl -v http://10.32.8.134/health
System¶
# Check disk space
df -h
# Check memory
free -m
# Check CPU
top
# Check logs
tail -f /var/log/messages
Getting Help¶
If you can't resolve an issue:
- Check this troubleshooting guide
- Review related documentation
- Check application logs
- Contact DevOps team
- Create incident ticket with:
- Problem description
- Steps taken so far
- Error messages
- Screenshots/logs
Last Updated: 2025-11-15