Failed migration
Symptoms
- Init container fails during deployment
- Logs show migration errors
- New tasks never become healthy
Common causes
1. Duplicate records violating unique constraint
Example: Adding a unique constraint to a table that already contains duplicate records.
Error message:
Resolution:
- Connect to the database (see Database connection)
- Find duplicates:
- Remove duplicates (keep the first record):
- Update Atlas migration table:
- Redeploy the service
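The find-and-remove steps above can be sketched in Python. This is an illustration only: the `addresses` table, the `place_id` column, and both helper functions are hypothetical, and an in-memory SQLite database stands in for the real database so the sketch runs anywhere. The SQL may need adjusting for the production dialect.

```python
import sqlite3


def find_duplicates(conn, table, column):
    """Return (value, count) pairs for values that occur more than once."""
    # Identifiers are interpolated, so only pass trusted table/column names.
    query = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1 ORDER BY {column}"
    )
    return conn.execute(query).fetchall()


def remove_duplicates(conn, table, column, key="id"):
    """Delete duplicate rows, keeping the row with the smallest key per value."""
    query = (
        f"DELETE FROM {table} WHERE {key} NOT IN "
        f"(SELECT MIN({key}) FROM {table} GROUP BY {column})"
    )
    with conn:  # commits on success, rolls back on error
        return conn.execute(query).rowcount


# Hypothetical demo data: 'a' appears three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, place_id TEXT)")
conn.executemany(
    "INSERT INTO addresses (place_id) VALUES (?)",
    [("a",), ("b",), ("a",), ("a",)],
)

duplicates = find_duplicates(conn, "addresses", "place_id")
removed = remove_duplicates(conn, "addresses", "place_id")
```

Running the find step again after the delete is a cheap way to confirm the table is clean before redeploying.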
2. Migration file syntax error
Error message:
Resolution:
- Fix the migration file in the repository
- Commit and push the fix
- Create a new release to redeploy
3. Database connection timeout
Error message:
Resolution:
- Check RDS status: Go to RDS → Databases
- Verify security groups: Ensure ECS tasks can reach RDS
- Check RDS Performance Insights: Look for connection issues
- Restart RDS if necessary (last resort)
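A quick TCP probe can help separate a network or security group problem from a database problem. The sketch below is an assumption-laden illustration: `can_reach` is a hypothetical helper, and the result only means something when it runs from the same network as the ECS tasks.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `can_reach(db_host, 5432)` would test a PostgreSQL endpoint, where `db_host` is a placeholder for the RDS endpoint and the port depends on the database engine. A timeout here usually points at security groups or routing, while a refused connection suggests the instance is not accepting connections.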
Google API issues
1. Missing API key
Symptoms:
- Server fails to start
- Logs show “GOOGLE_GEOCODING_API_KEY is required”
Resolution:
- Add the API key to Secrets Manager:
- Restart the ECS service:
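The "is required" log line suggests a fail-fast configuration check at startup. The service's actual implementation is not shown in this runbook, so the following Python sketch only illustrates the pattern; `require_env` is a hypothetical helper.

```python
import os


def require_env(name, env=None):
    """Return the value of a required setting, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # Failing at startup surfaces the misconfiguration immediately
        # instead of on the first geocoding request.
        raise RuntimeError(f"{name} is required")
    return value
```

With a check like this, a task started before the secret exists keeps failing until the key is added and the service is restarted, which matches the symptoms above.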
2. Google API is down
Symptoms:
- Geocoding requests fail with 500 errors
- Cached addresses still work
- Logs show Google API errors
Behavior:
- Cached addresses: Return successfully from cache
- New addresses: Fail with 500 error
Resolution:
- Check Google Cloud Status: Visit the Google Cloud Status Dashboard
- Monitor the situation: Wait for Google to resolve the issue
- Notify users: If prolonged, inform users of the outage
- Check API quotas: Ensure you haven’t exceeded quota limits
Prevention:
- Implement retry logic with exponential backoff
- Consider a fallback geocoding provider
- Increase cache hit rate by pre-caching common addresses
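Retry with exponential backoff could look roughly like the Python sketch below. The function name, defaults, and the choice of retryable exceptions are all assumptions for illustration, not the service's actual code.

```python
import random
import time


def retry_with_backoff(
    func,
    retry_on=(ConnectionError, TimeoutError),
    max_attempts=5,
    base_delay=0.5,
    max_delay=8.0,
    sleep=time.sleep,
):
    """Call func(), retrying transient failures with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt: 0.5s, 1s, 2s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from many clients hitting the API at once.
            sleep(delay + random.uniform(0, delay / 2))
```

Only errors that are likely transient should be listed in `retry_on`. Retrying quota or authentication errors just adds load without helping.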
RDS issues
1. Database is down
Symptoms:
- /healthz endpoint returns unhealthy
- Better Stack alerts triggered
- All API requests fail
Resolution:
- Check RDS status:
- If stopped, start it:
- If failed, check RDS events:
- Restore from backup if necessary:
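The triage order above can be sketched in Python against a boto3 RDS client. The function name and the returned action strings are hypothetical; `describe_db_instances` and `start_db_instance` are the boto3 calls assumed here, and the client is passed in so the logic can be exercised without AWS credentials.

```python
def triage_rds(rds, instance_id):
    """Look up the instance status and take or suggest the matching runbook step."""
    response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    status = response["DBInstances"][0]["DBInstanceStatus"]

    if status == "stopped":
        # A stopped instance can simply be started again.
        rds.start_db_instance(DBInstanceIdentifier=instance_id)
        return status, "start requested"
    if status == "failed":
        return status, "check RDS events, restore from backup if necessary"
    if status == "available":
        return status, "no action needed"
    return status, "investigate manually"
```

In practice the client would come from `boto3.client("rds")`, and `instance_id` is whatever identifier the instance has in the RDS console.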
2. Connection pool exhausted
Symptoms:
- Slow response times
- Timeouts on requests
- Logs show “connection pool exhausted”
Resolution:
- Check connection pool metrics in /healthz:
- Increase max connections in Terraform:
- Scale up RDS instance if needed:
- Check for connection leaks in application code
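A connection leak usually means a code path acquires a connection and never returns it, often on an error branch. The toy pool below is a hypothetical Python illustration, not the application's real pool. It shows how a context manager returns the connection even when the query raises.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """A tiny fixed-size pool, just large enough to demonstrate leak-safe usage."""

    def __init__(self, factory, size):
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(factory())

    @contextmanager
    def connection(self, timeout=1.0):
        try:
            conn = self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("connection pool exhausted") from None
        try:
            yield conn
        finally:
            # Runs on success and on error, so the connection is never leaked.
            self._free.put(conn)
```

When reviewing application code, look for acquisitions that are not wrapped in an equivalent guaranteed-release construct. Raising the pool size only delays exhaustion if such a path exists.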
3. Storage full
Symptoms:
- Write operations fail
- RDS status shows “storage-full”
- CloudWatch alarm triggered
Resolution:
- Check storage usage:
- Increase max allocated storage:
- Clean up old data if appropriate:
ISO 3166 content issues
Missing state code
Symptoms:
- Geocoding returns state code but not full state name
- No error, just incomplete data
Behavior:
- The API doesn’t fail
- Returns the short name from Google
- Logs a warning
Resolution:
- Ingest the missing state:
- Verify the data:
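The fallback described in this section (no failure, short name returned, warning logged) can be sketched in Python as follows. The function, the logger name, and the dictionary keyed by ISO 3166-2 codes such as `US-CA` are assumptions for illustration. The service's real lookup is not shown here.

```python
import logging

logger = logging.getLogger("geocoding")


def resolve_state_name(country, short_name, subdivisions):
    """Return the full subdivision name, or the short name if it is not ingested."""
    code = f"{country}-{short_name}"  # ISO 3166-2 style code, e.g. "US-CA"
    name = subdivisions.get(code)
    if name is None:
        # Degrade gracefully: the response stays usable, and the warning
        # tells operators which subdivision still needs to be ingested.
        logger.warning("Subdivision %s not found; returning short name", code)
        return short_name
    return name
```

Searching the logs for this kind of warning is a practical way to find out which state codes are missing before users report incomplete data.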
Deployment failures
1. Circuit breaker triggered
Symptoms:
- Deployment fails
- ECS automatically rolls back
- New tasks fail health checks
Resolution:
- Check CloudWatch logs for errors
- Verify environment variables are correct
- Test Docker image locally:
- Fix the issue and redeploy
2. Terraform apply failed
Symptoms:
- GitHub Actions workflow fails at Terraform step
- Infrastructure changes not applied
Resolution:
- Review Terraform error in GitHub Actions logs
- Check for resource conflicts
- Manually apply if needed:
Health check failures
Symptoms
- ALB marks tasks as unhealthy
- Tasks are repeatedly replaced
- Service is unstable
Common causes
- Database connection issues
- Slow startup time (exceeds grace period)
- Application crashes
Resolution
- Check health endpoint:
- Increase health check grace period:
- Fix application issues causing slow startup
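To judge whether the grace period is too short, it helps to measure how long a fresh task takes to report healthy. The Python sketch below is a rough illustration: both function names are hypothetical, and the probe assumes the health endpoint answers with HTTP 200 when healthy.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def startup_time(probe, grace_period, interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll probe() and return seconds until healthy, or None if the grace period runs out."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start + interval > grace_period:
            return None  # the next poll would land after the grace period
        sleep(interval)
```

A call such as `startup_time(lambda: http_probe(url), grace_period=60)`, with `url` pointing at the task's health endpoint, returns the measured startup time. A result of `None`, or a value close to the configured grace period, suggests raising the grace period or shortening startup.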