Common Issues & Solutions

Quick reference guide for troubleshooting common WithinEarth infrastructure issues.


Database Connection Issues

Problem: "Cannot connect to SQL Server"

Symptoms:
  - Application throws SQL connection timeout errors
  - Database queries hang indefinitely
  - API returns 500 errors

Troubleshooting Steps:

# 1. Check if SQL Server is listening
telnet 10.32.8.130 1988

# 2. Check network connectivity
ping 10.32.8.130

# 3. Verify SQL Server service is running
# On Windows SQL Server:
services.msc
# Find "SQL Server (MSSQLSERVER)" - ensure it's Running

# 4. Check firewall
netsh advfirewall firewall show rule name="SQL Server"

# 5. Test with sqlcmd ('password' is a placeholder; omit -P to be prompted instead)
sqlcmd -S 10.32.8.130,1988 -U sa -P 'password'

Common Solutions:
  - Restart SQL Server service
  - Check firewall rules
  - Verify connection string credentials
  - Use Hybrid Failover System for automatic failover
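
Where telnet is not installed, the port check in step 1 can be scripted. The following is a minimal sketch using only the Python standard library; the function name port_is_open and the 3-second default timeout are illustrative choices, not part of the WithinEarth tooling.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection performs the same TCP handshake that telnet would.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, port_is_open("10.32.8.130", 1988) mirrors the telnet test above. A False result does not distinguish a stopped service from a blocked firewall, so the remaining steps still apply.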


Problem: "MongoDB connection timeout"

Symptoms:
  - Cache lookups failing
  - Slow search responses
  - MongoDB connection errors in logs

Troubleshooting:

# 1. Check MongoDB is running
telnet 10.32.8.51 27017

# 2. Connect with mongo shell
mongosh mongodb://10.32.8.51:27017/

# 3. Check MongoDB logs
# On MongoDB server:
tail -f /var/log/mongodb/mongod.log

# 4. Check MongoDB status
systemctl status mongod

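Solutions:
  - Restart MongoDB: systemctl restart mongod
  - Clear a stale lock after an unclean shutdown (only while mongod is stopped, and after backing up the data directory): rm /var/lib/mongodb/mongod.lock && mongod --repair --dbpath /var/lib/mongodb
  - Check disk space: df -h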
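
The disk space check can also be automated. This is a small sketch built on the standard library; the function names are hypothetical, and the 10% default simply reuses the "Disk space < 10%" critical alert threshold from the Monitoring & Alerts section.

```python
import shutil


def free_disk_percent(path: str = "/") -> float:
    """Return the free space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total of zero; treat them as full.
        return 0.0
    return 100 * usage.free / usage.total


def is_disk_low(path: str = "/", threshold_percent: float = 10.0) -> bool:
    """Return True if free space is below `threshold_percent` (assumed default: 10)."""
    return free_disk_percent(path) < threshold_percent
```

On the MongoDB server, is_disk_low("/var/lib/mongodb") would check the volume that holds the lock file path shown above.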


Performance Issues

Problem: "API response time > 2 seconds"

Symptoms:
  - Slow search results
  - High client receive time in HAProxy
  - User complaints about slowness

Diagnosis Steps:

# 1. Check HAProxy stats
curl http://10.32.8.209:8080/stats

# 2. Check server CPU/Memory
# On Windows API server:
Get-Counter '\Processor(_Total)\% Processor Time'
Get-Counter '\Memory\Available MBytes'

# 3. Check database query performance
# Run on SQL Server (times are reported in microseconds):
SELECT TOP 10
    total_worker_time/execution_count AS avg_cpu_time,
    total_elapsed_time/execution_count AS avg_elapsed_time,
    SUBSTRING(text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
ORDER BY avg_elapsed_time DESC;

Common Causes:
  1. Cache miss - check cache hit rate
  2. Slow supplier responses - check supplier timeout logs
  3. Database lock contention
  4. Network latency

Solutions:
  - Increase cache timeout
  - Review API-2 Slowness Guide
  - Optimize slow queries
  - Increase connection pool size


Problem: "High memory usage on API server"

Symptoms:
  - Application slowing down over time
  - Out of memory exceptions
  - Frequent application restarts

Check Memory Usage:

# On Windows API server
Get-Process w3wp | Select-Object -Property ProcessName, @{Name="Memory (MB)"; Expression={$_.WorkingSet64 / 1MB}}

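# Check .NET GC stats (this counter category is published by .NET Framework apps only;
# for apps on the modern .NET runtime, use the dotnet-counters tool instead)
Get-Counter '\.NET CLR Memory(*)\# Bytes in all Heaps'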

Solutions:
  - Restart IIS Application Pool
  - Increase application pool memory limit
  - Fix memory leaks (review code)
  - Enable server GC in web.config


Cache Issues

Problem: "Cache hit rate < 50%"

Symptoms:
  - Excessive supplier API calls
  - High database load
  - Slow response times

Diagnosis:

# Check Redis cache statistics
redis-cli -h 10.32.8.205 INFO stats

# Check MongoDB cache collection size (document count)
mongosh mongodb://10.32.8.51:27017/withinearth --eval "db.XConnect_Live.estimatedDocumentCount()"

# Check cache expiry settings
# Review appsettings.json (value is in minutes; should be 30-60):
"CacheTimeOut": "35"

Solutions:
  - Increase cache timeout from 35 to 60 minutes
  - Pre-warm cache for popular routes
  - Implement cache stampede protection
  - Review Caching Strategy
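
The hit rate behind the 50% threshold can be derived from the keyspace_hits and keyspace_misses counters that Redis reports in INFO stats. The sketch below parses that text output; the function name is hypothetical. Both counters are cumulative since the server started (or since the last CONFIG RESETSTAT), so the result is a long-run average rather than a current rate.

```python
from typing import Dict, Optional


def cache_hit_rate(info_stats: str) -> Optional[float]:
    """Return the Redis cache hit rate in percent, or None if no lookups were recorded."""
    fields: Dict[str, str] = {}
    for line in info_stats.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    if total == 0:
        return None
    return 100 * hits / total
```

The input is the raw text printed by redis-cli -h 10.32.8.205 INFO stats, for example captured with subprocess.run(..., capture_output=True, text=True).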


SSL Certificate Issues

Problem: "SSL certificate expired"

Symptoms:
  - Browser shows "Your connection is not private"
  - API calls fail with SSL errors
  - Email delivery failures

Check Certificate Expiry:

# Check certificate expiry
echo | openssl s_client -connect mail.withinearth.com:465 2>/dev/null | openssl x509 -noout -dates

# Check all certificates
/home/monitor/ssl_monitor.sh

Solutions:
  - Renew certificate before expiry
  - Update certificate in IIS/Apache
  - Update certificate in email server
  - Restart web server after renewal
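
To turn the notAfter date printed by openssl into a number of days remaining, something like the following sketch could be used. It relies on ssl.cert_time_to_seconds from the standard library; the function name days_until_expiry is hypothetical, and this is not the logic of /home/monitor/ssl_monitor.sh, whose contents are not shown here.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Return days until the certificate expires (negative if already expired).

    `not_after` is the date from `openssl x509 -noout -dates`,
    e.g. "notAfter=Nov 15 12:00:00 2026 GMT" (the "notAfter=" prefix is optional).
    """
    value = not_after.strip()
    if value.startswith("notAfter="):
        value = value[len("notAfter="):]
    # cert_time_to_seconds parses the OpenSSL date format and returns a UTC epoch time.
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return (expires - current).total_seconds() / 86400
```

A result below 7 corresponds to the "SSL certificate expiring in < 7 days" critical alert listed under Monitoring & Alerts.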


Network Issues

Problem: "Intermittent connection failures"

Symptoms:
  - Random database connection drops
  - API requests timeout sporadically
  - HAProxy showing failed health checks

Troubleshooting:

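# 1. Check for packet loss (Linux syntax; on Windows use: ping -n 100 10.32.8.130)
ping -c 100 10.32.8.130

# 2. Check network latency (Linux; on Windows use: pathping 10.32.8.130)
mtr 10.32.8.130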

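# 3. Check for port exhaustion (Windows)
netstat -ano | find /c "TIME_WAIT"
# A count above 10,000 points to ephemeral port exhaustion
# (the default Windows dynamic range, 49152-65535, holds 16,384 ports)

# 4. Check TCP connections to SQL Server
netstat -ano | find "ESTABLISHED" | find ":1988"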

Solutions:
  - Reduce TIME_WAIT timeout
  - Increase ephemeral port range
  - Enable TCP connection pooling
  - Use connection string pooling options
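
The TIME_WAIT count from step 3 can be computed from captured netstat output, which is handy when the check needs to run on a schedule. This is a rough sketch; the function names are hypothetical, and the 10,000 limit is the threshold quoted in step 3.

```python
def count_tcp_state(netstat_output: str, state: str = "TIME_WAIT") -> int:
    """Count lines of `netstat -ano` output whose columns include the given TCP state."""
    count = 0
    for line in netstat_output.splitlines():
        # Splitting on whitespace avoids matching the state inside other text.
        if state in line.split():
            count += 1
    return count


def port_exhaustion_suspected(time_wait_count: int, limit: int = 10000) -> bool:
    """Return True when the TIME_WAIT count exceeds the limit used in this guide."""
    return time_wait_count > limit
```

Passing state="ESTABLISHED" gives the equivalent of step 4 for all remote ports; filtering to port 1988 would need an extra check on the address column.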


HAProxy Issues

Problem: "HAProxy not routing requests"

Symptoms:
  - 503 Service Unavailable
  - All backend servers marked as DOWN
  - No traffic reaching API servers

Check HAProxy Status:

# Check HAProxy is running
systemctl status haproxy

# Check HAProxy stats page
curl http://10.32.8.209:8080/stats

# Check HAProxy logs
tail -f /var/log/haproxy.log

# Test backend connectivity
curl -v http://10.32.8.134/health

Solutions:
  - Restart HAProxy: systemctl restart haproxy
  - Check backend health
  - Verify haproxy.cfg configuration
  - Check firewall between HAProxy and API servers
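
HAProxy can serve its statistics as CSV when ";csv" is appended to the stats URI, which makes it easier to list servers marked DOWN than reading the HTML page. The sketch below assumes the stats page above is reachable as http://10.32.8.209:8080/stats;csv (an assumption about this deployment) and parses the CSV text; the function name is hypothetical.

```python
import csv
import io
from typing import List, Tuple


def servers_down(stats_csv: str) -> List[Tuple[str, str]]:
    """Return (proxy, server) pairs whose status starts with DOWN in HAProxy CSV stats."""
    text = stats_csv.lstrip()
    # The header row is prefixed with "# " (e.g. "# pxname,svname,...").
    if text.startswith("# "):
        text = text[2:]
    down = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row.get("svname") or ""
        # FRONTEND and BACKEND rows are aggregates, not individual servers.
        if name in ("FRONTEND", "BACKEND"):
            continue
        if (row.get("status") or "").startswith("DOWN"):
            down.append((row.get("pxname") or "", name))
    return down
```

An empty result while clients still receive 503 responses suggests looking at the frontend configuration or the firewall rather than backend health.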


RabbitMQ Issues

Problem: "Messages not being processed"

Symptoms:
  - Queue depth increasing
  - Background jobs not running
  - Delayed notifications

Check RabbitMQ:

# Check RabbitMQ status
systemctl status rabbitmq-server

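# Check queue depth (admin:password is a placeholder for the management credentials)
curl -u admin:password http://10.32.8.90:15672/api/queues

# Check RabbitMQ logs (replace "hostname" with the node's host name)
tail -f /var/log/rabbitmq/rabbit@hostname.log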

Solutions:
  - Restart RabbitMQ: systemctl restart rabbitmq-server
  - Purge stuck queues
  - Increase consumer count
  - Check disk space (RabbitMQ blocks publishers when free disk space falls below its disk_free_limit, 50 MB by default)
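
The /api/queues response is a JSON array with one object per queue, including its name and current message count. A short sketch for picking out backlogged queues follows; the function name and the default limit of 1,000 messages are illustrative assumptions, not values from this guide.

```python
import json
from typing import Dict


def backlogged_queues(queues_json: str, max_messages: int = 1000) -> Dict[str, int]:
    """Return {queue name: depth} for queues holding more than `max_messages` messages."""
    backlog = {}
    for queue in json.loads(queues_json):
        depth = queue.get("messages", 0)
        if depth > max_messages:
            backlog[queue["name"]] = depth
    return backlog
```

Running this against two samples taken a few minutes apart shows whether a queue is growing or merely large, which is the symptom described above.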


Deployment Issues

Problem: "Application won't start after deployment"

Symptoms:
  - IIS Application Pool crashes
  - 500 errors on all endpoints
  - Application event log errors

Troubleshooting:

# Check IIS Application Pool status
Get-WebAppPoolState "WithinEarthAppPool"

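# Check event logs (ASP.NET Core apps hosted in IIS log under "IIS AspNetCore Module V2" and ".NET Runtime")
Get-EventLog -LogName Application -Newest 50 | Where-Object {$_.Source -match 'ASP\.NET|AspNetCore|\.NET Runtime'}

# Check the most recent IIS log file
Get-ChildItem C:\inetpub\logs\LogFiles\W3SVC1\*.log | Sort-Object LastWriteTime | Select-Object -Last 1 | Get-Content -Tail 50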

Common Causes:
  1. Missing DLL dependencies
  2. Wrong .NET runtime version
  3. Invalid appsettings.json
  4. Database connection failure
  5. Missing file permissions

Solutions:
  - Verify .NET 9 runtime is installed
  - Check appsettings.json syntax
  - Test database connectivity
  - Grant IIS_IUSRS permissions to application folder
  - Restart Application Pool
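
A quick way to check appsettings.json syntax before deployment is to run it through a strict JSON parser. The sketch below uses Python's json module; the function name is hypothetical. Note that the .NET configuration loader is more lenient than this check (it accepts comments, for example), so a failure here is a prompt to look at the reported position rather than proof that the application cannot start.

```python
import json
from typing import Optional


def check_json_syntax(text: str) -> Optional[str]:
    """Return None if `text` is strict JSON, otherwise a message locating the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return "line {}, column {}: {}".format(exc.lineno, exc.colno, exc.msg)
    return None
```

When reading the file from disk, opening it with encoding="utf-8-sig" avoids a false error from a byte order mark at the start of the file.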


Emergency Procedures

Complete System Outage

  1. Check HAProxy - Is the load balancer up?

    systemctl status haproxy

  2. Check API Servers - Are all API servers down?

    Test-NetConnection -ComputerName 10.32.8.134 -Port 80

  3. Check Databases - Is the primary database accessible?

    telnet 10.32.8.130 1988

  4. Failover Sequence:

    - HAProxy down → Restart: systemctl restart haproxy
    - All APIs down → Restart IIS on each server
    - Primary DB down → Activate Hybrid Failover
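
The failover sequence amounts to a small decision table, sketched below for use in a runbook script. The function name and its inputs are hypothetical; the returned actions are the ones listed in the sequence, in the same order.

```python
from typing import List


def outage_actions(haproxy_up: bool, apis_up: List[bool], primary_db_up: bool) -> List[str]:
    """Map the results of the three outage checks to the recovery actions in this guide."""
    actions = []
    if not haproxy_up:
        actions.append("Restart HAProxy: systemctl restart haproxy")
    # Only the "all API servers down" case has a listed action.
    if apis_up and not any(apis_up):
        actions.append("Restart IIS on each server")
    if not primary_db_up:
        actions.append("Activate Hybrid Failover")
    return actions
```

An empty list with the outage still in progress means none of the three documented causes applies and the incident should be escalated.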

Monitoring & Alerts

Setup Alert Notifications

Critical Alerts:
  - Database down
  - All API servers down
  - SSL certificate expiring in < 7 days
  - Free disk space < 10%

Warning Alerts:
  - High CPU/Memory usage
  - Cache hit rate < 50%
  - Response time > 2 seconds
  - Failed health checks

Tools:
  - Pushover for mobile alerts
  - Slack for team notifications
  - Email for non-critical alerts
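
The numeric thresholds above can be expressed as a simple classifier, sketched here as one possible starting point. The metric key names are hypothetical, and "High CPU/Memory usage" is left out because this guide gives no threshold for it. Delivery through Pushover, Slack or email is not shown.

```python
from typing import Any, Dict, List


def classify_alerts(metrics: Dict[str, Any]) -> Dict[str, List[str]]:
    """Split a metrics snapshot into critical and warning alerts using this guide's thresholds."""
    critical: List[str] = []
    warning: List[str] = []

    def below(key: str, limit: float) -> bool:
        # Missing metrics never raise an alert.
        value = metrics.get(key)
        return value is not None and value < limit

    if metrics.get("database_up") is False:
        critical.append("Database down")
    if metrics.get("api_servers_up") == 0:
        critical.append("All API servers down")
    if below("cert_days_left", 7):
        critical.append("SSL certificate expiring in < 7 days")
    if below("disk_free_percent", 10):
        critical.append("Disk space < 10%")

    if below("cache_hit_rate", 50):
        warning.append("Cache hit rate < 50%")
    if (metrics.get("response_time_s") or 0) > 2:
        warning.append("Response time > 2 seconds")
    if (metrics.get("failed_health_checks") or 0) > 0:
        warning.append("Failed health checks")

    return {"critical": critical, "warning": warning}
```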


Useful Commands Reference

Database

# SQL Server - Check version
sqlcmd -S 10.32.8.130,1988 -Q "SELECT @@VERSION"

# MongoDB - Check replica set status
mongosh mongodb://10.32.8.51:27017/ --eval "rs.status()"

# Redis - Check memory usage
redis-cli -h 10.32.8.205 INFO memory

Network

# Check open ports
netstat -tulpn | grep LISTEN

# Check firewall rules
iptables -L -n

# Test HTTP endpoint
curl -v http://10.32.8.134/health

System

# Check disk space
df -h

# Check memory
free -m

# Check CPU
top

# Check logs (/var/log/messages on RHEL-family systems; /var/log/syslog on Debian/Ubuntu)
tail -f /var/log/messages

Getting Help

If you can't resolve an issue:

  1. Check this troubleshooting guide
  2. Review related documentation
  3. Check application logs
  4. Contact DevOps team
  5. Create incident ticket with:
    - Problem description
    - Steps taken so far
    - Error messages
    - Screenshots/logs

Last Updated: 2025-11-15