Troubleshooting
This guide covers common issues encountered when using Stackbooster.io and provides solutions to help you resolve them quickly.
AWS Integration
Access Denied Errors
If you encounter "Access Denied" errors when connecting your AWS account:
Check IAM Role Permissions
- Verify that the IAM role has all required permissions listed in the AWS Integration Guide
- Ensure the role trust relationship includes the correct Stackbooster.io account ID
- Confirm the external ID matches exactly what's displayed in the Stackbooster.io dashboard
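One way to compare the role's trust relationship with the values shown in the dashboard is sketched below. It assumes the AWS CLI is installed and configured for the account that owns the role. The function name, the role name, and the external ID argument are placeholders for illustration, not values defined by Stackbooster.io.

```shell
# Sketch: print an IAM role's trust policy and report whether an expected
# external ID appears in it. Both arguments are placeholders you supply.
check_trust_policy() {
  role_name=$1
  external_id=$2
  # "aws iam get-role" returns the role, including its trust policy document.
  policy=$(aws iam get-role --role-name "$role_name" \
    --query 'Role.AssumeRolePolicyDocument' --output json) || return 1
  printf '%s\n' "$policy"
  # Fixed-string match on the quoted ID, so a partial ID does not match.
  if printf '%s\n' "$policy" | grep -qF "\"$external_id\""; then
    echo "external ID found in trust policy"
  else
    echo "external ID NOT found in trust policy"
    return 2
  fi
}
```

For example, `check_trust_policy my-stackbooster-role my-external-id` (both names hypothetical) prints the policy so you can also confirm the trusted account ID by eye.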
Organizational Policies
- Check if your AWS organization has Service Control Policies (SCPs) that might restrict certain actions
- Review any permission boundaries applied to the IAM role
- Ensure your AWS account isn't restricting service-linked role creation
Region Restrictions
- Verify that the regions you're trying to access aren't restricted in the IAM policy
AWS Account Connection Failures
If your AWS account fails to connect:
CloudFormation Stack Issues
- Check the CloudFormation events log for specific error messages
- Verify that the user creating the stack has sufficient permissions
- Try deleting the stack and recreating it from the Stackbooster.io dashboard
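The CloudFormation events can also be read from the command line. The following is a minimal sketch, assuming the AWS CLI is configured for the region where the stack was created. The function name and the stack name argument are placeholders; use the stack name shown in your AWS console.

```shell
# Sketch: show only the failed events for a CloudFormation stack.
# describe-stack-events returns events in reverse chronological order,
# so the most recent failure appears first.
failed_stack_events() {
  stack_name=$1
  aws cloudformation describe-stack-events \
    --stack-name "$stack_name" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason]" \
    --output table
}
```

The `ResourceStatusReason` column usually carries the specific error message, such as a missing permission for the user creating the stack.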
Manual Setup Problems
- Double-check the ARN entered in the Stackbooster.io dashboard
- Ensure the custom policy document matches exactly what's provided
- Verify that all required managed policies are attached
Cross-Account Role Issues
- Check that the trust relationship is correctly configured
- Ensure MFA or source IP restrictions aren't blocking Stackbooster.io's access
Kubernetes Agent
Agent Installation Failures
If the Kubernetes agent fails to install:
RBAC Permissions
- Ensure your kubectl context has cluster-admin privileges
- Check for any admission controllers that might block the installation
- Verify that the namespace for the agent isn't subject to restrictive policies
Networking Issues
- Check if your cluster has outbound internet access
- Verify that firewall rules allow connections to the Stackbooster.io API endpoints
- Ensure DNS resolution is working correctly within the cluster
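DNS resolution and outbound access can be tested from inside the cluster with a short-lived pod. This is a sketch under assumptions: the pod name `dns-check` and the `busybox:1.36` image are arbitrary choices, and the hostname argument is a placeholder for whichever Stackbooster.io API endpoint applies to your installation.

```shell
# Sketch: resolve a hostname from a temporary pod inside the cluster.
# The hostname argument is a placeholder; this guide does not list endpoints.
check_dns_from_cluster() {
  host=$1
  # --rm deletes the pod when the command exits; --restart=Never creates a
  # bare pod rather than a Deployment-managed one.
  kubectl run dns-check --rm -i --restart=Never \
    --image=busybox:1.36 -- nslookup "$host"
}
```

If the lookup fails here but succeeds from your workstation, the problem is likely in cluster DNS or egress rules rather than in the agent itself.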
Resource Constraints
- Check if the cluster has sufficient resources for the agent pods
- Verify that there are no strict resource quotas preventing deployment
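The permission and quota checks in this section could be gathered into one helper, roughly as follows. The function name is hypothetical, and the namespace defaults to `stackbooster-system` on the assumption that the agent is installed there, as in the commands later in this guide.

```shell
# Sketch: pre-install checks for permissions, quotas, and recent events.
agent_preflight() {
  ns=${1:-stackbooster-system}
  echo "== Can the current context perform any action on any resource? =="
  # Prints "yes" for a cluster-admin context, "no" otherwise.
  kubectl auth can-i '*' '*' --all-namespaces
  echo "== Resource quotas in $ns =="
  kubectl get resourcequota -n "$ns"
  echo "== Recent events in $ns =="
  # Admission controller rejections and quota errors usually show up here.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```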
Agent Communication Problems
If the agent installs but shows "Disconnected" status:
Network Connectivity
- Verify outbound connectivity from the cluster to Stackbooster.io API endpoints
- Check for any proxies or firewalls that might be blocking traffic
- Ensure the agent pod has the correct API key in its configuration
Pod Status Issues
- Check the agent pod logs for error messages:
  kubectl logs -n stackbooster-system -l app=stackbooster-agent
- Verify that the pod is in Running state:
  kubectl get pods -n stackbooster-system
- Check for any CrashLoopBackOff or other pod status issues
API Key Problems
- Verify that the API key used during installation is valid and active
- Try reinstalling the agent with a newly generated API key
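For the pod status checks in this section, a small filter can surface only the pods that need attention. The sketch below reuses the `stackbooster-system` namespace and the `app=stackbooster-agent` label from the commands in this guide; the function name is hypothetical.

```shell
# Sketch: list agent pods whose STATUS is anything other than Running.
unhealthy_agent_pods() {
  ns=${1:-stackbooster-system}
  # With --no-headers the columns are NAME READY STATUS RESTARTS AGE.
  kubectl get pods -n "$ns" -l app=stackbooster-agent --no-headers |
    awk '$3 != "Running" { print $1, $3 }'
}
```

For a pod reported in CrashLoopBackOff, `kubectl logs -n stackbooster-system <pod-name> --previous` shows the logs of the container instance that crashed, which is often where an invalid API key or a connection error is recorded.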
Optimization Issues
No Cost Savings Recommendations
If you don't see any cost optimization recommendations:
Insufficient Data Collection
- Ensure the agent has been running for at least 24 hours to collect sufficient data
- Check if the agent is reporting metrics correctly in the dashboard
- Verify that you have workloads running on the cluster
Already Optimized Cluster
- Your cluster might already be well optimized, leaving little to recommend
- Try running a test workload to generate more varied usage patterns
- Check if cost-saving features are already enabled on your cluster
Configuration Issues
- Review your optimization settings and adjust the aggressiveness level
- Ensure that the right namespaces are included in the analysis
- Check if any exclusion rules might be preventing recommendations
Unexpected Scaling Behavior
If your cluster is scaling in unexpected ways:
Competing Autoscalers
- Check if the native Kubernetes Cluster Autoscaler is also running
- Look for other tools that might be modifying node groups
- Review AWS Auto Scaling Group settings for manual modifications
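A quick way to look for competing autoscalers is to search Deployment names across all namespaces. This is only a sketch: the function name is hypothetical, and the name patterns (`cluster-autoscaler`, `karpenter`) are common examples rather than a complete list, so extend them for any other tools you run.

```shell
# Sketch: list Deployments whose names suggest a node autoscaler.
find_node_autoscalers() {
  # With --all-namespaces and --no-headers, column 1 is the namespace and
  # column 2 is the Deployment name.
  kubectl get deployments --all-namespaces --no-headers |
    awk 'tolower($2) ~ /cluster-autoscaler|karpenter/ { print $1 "/" $2 }'
}
```

An empty result does not rule out a competing autoscaler, since one could be deployed under a different name or as another workload type.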
Misconfigured Scaling Settings
- Review your scaling configuration in the Stackbooster.io dashboard
- Check workload priority settings and adjust if needed
- Verify that scale-up and scale-down thresholds are appropriate
Workload Spikes
- Examine usage patterns for unexpected load spikes
- Check for cron jobs or batch processes that might cause sudden resource demands
- Review application logs for issues causing excessive resource consumption
Application Performance
Pods Failing to Schedule
If pods can't be scheduled after optimization:
Resource Availability
- Check if node resources are too fragmented to schedule your pods
- Verify that the cluster has enough nodes to accommodate your workloads
- Review pod resource requests and limits for accuracy
Node Selectors and Taints
- Ensure your pods' node selectors match available nodes
- Check for taints on nodes that might prevent scheduling
- Verify that node affinity and anti-affinity rules can be satisfied
PodDisruptionBudgets
- Review PDBs to ensure they're not too restrictive
- Verify that enough replicas exist to satisfy PDB requirements
- Check if kube-system pods have appropriate PDBs
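The scheduling checks above can be collected into a single report, for example as sketched below. The function name is hypothetical; the commands are standard kubectl calls and assume your context can read pods, nodes, and PodDisruptionBudgets cluster-wide.

```shell
# Sketch: gather the information needed to diagnose unschedulable pods.
pending_pod_report() {
  echo "== Pending pods =="
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
  echo "== Node taints =="
  # Quoted so the shell does not try to expand the [*] as a glob.
  kubectl get nodes -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
  echo "== PodDisruptionBudgets =="
  kubectl get pdb --all-namespaces
}
```

For any pod listed as Pending, `kubectl describe pod <pod-name> -n <namespace>` shows the scheduler's events, which typically state whether the cause is insufficient resources, an untolerated taint, or an unsatisfied affinity rule.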
Degraded Application Performance
If applications perform poorly after optimization:
Resource Contention
- Check if pods are experiencing CPU throttling
- Look for memory pressure on nodes
- Verify that pods have appropriate resource requests and limits
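Memory pressure is reported as a node condition, so it can be queried directly. The following sketch (function name hypothetical) prints the names of nodes whose `MemoryPressure` condition is currently `True`.

```shell
# Sketch: list nodes that currently report the MemoryPressure condition.
nodes_under_memory_pressure() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,PRESSURE:.status.conditions[?(@.type=="MemoryPressure")].status' |
    awk '$2 == "True" { print $1 }'
}
```

Current per-container usage is available through `kubectl top pods --all-namespaces --containers` when the metrics server is installed. CPU throttling is not reported by kubectl; it generally has to be read from container-level metrics in your monitoring stack.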
Node Type Mismatches
- Ensure that workloads are running on suitable instance types
- Check if CPU- or memory-intensive apps are placed appropriately
- Verify that specialized workloads (GPU, high I/O) are on appropriate nodes
Aggressive Consolidation
- Review workload consolidation settings and reduce aggressiveness
- Check if pod anti-affinity rules need to be added
- Consider adjusting bin-packing settings to leave more headroom
Dashboard and Reporting
Missing or Incorrect Data
If your dashboard shows incomplete or incorrect data:
Agent Connection Issues
- Check agent connectivity status in the dashboard
- Verify that all agent pods are running correctly
- Review agent logs for data collection errors
Metrics Collection Problems
- Ensure the Kubernetes metrics server is running properly
- Check if custom metrics sources are correctly configured
- Verify that cloud provider metrics are being collected
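Whether the Kubernetes metrics server is working can be checked through the API service it registers. This is a minimal sketch with a hypothetical function name; it assumes the standard `v1beta1.metrics.k8s.io` API service that the metrics server provides.

```shell
# Sketch: report whether the resource metrics API is registered and available.
metrics_api_available() {
  status=$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' \
    2>/dev/null) || status=""
  if [ "$status" = "True" ]; then
    echo "metrics API is available"
  else
    echo "metrics API is not available (status: ${status:-unknown})"
    return 1
  fi
}
```

If the API is available, `kubectl top nodes` should return current usage figures; an error from that command points to a metrics server problem rather than an agent problem.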
Dashboard Caching
- Try refreshing the page or clearing the browser cache
- Check the data timestamp to verify recency
- Log out and back in to refresh your session
Report Generation Failures
If cost or optimization reports fail to generate:
Data Availability
- Ensure sufficient data exists for the requested time period
- Check if the cluster was connected during the entire reporting period
- Verify that cost data from AWS is being collected properly
Report Configurations
- Check for invalid filters or groupings in report settings
- Verify that the report timeframe is valid
- Try generating a report with default settings
Browser Issues
- Try a different browser or clear the browser cache
- Ensure your browser is updated to the latest version
- Check for browser console errors when generating reports
Account and Billing
Access and Permission Issues
If users can't access features or clusters:
User Permissions
- Review role assignments in the account settings
- Check if the user's role has permission for the specific action
- Verify cluster-specific access controls
Team Configuration
- Ensure the user is assigned to the correct team
- Check if team permissions are correctly configured
- Verify that team resource limits aren't restricting access
Session Issues
- Try logging out and back in to refresh the session
- Clear browser cookies and cache
- Verify that the user's account is active and not suspended
Billing and Subscription Problems
If you encounter billing or subscription issues:
Payment Method
- Verify that the payment method on file is current and valid
- Check for any failed payment notifications
- Update payment details if necessary
Subscription Status
- Verify that your subscription is active
- Check if you've reached any limits in your current plan
- Review subscription renewal dates
Usage Accounting
- Verify that cluster and node counts match your expectations
- Check for any unexpected changes in resource usage
- Review billing statements for accuracy
Still Need Help?
If you're still experiencing issues after trying these troubleshooting steps:
- Email the Stackbooster.io support team with details of your issue
- Include relevant logs, screenshots, and steps to reproduce
- Provide your account ID and affected cluster names
Community Forums
- Check our community forums for similar issues
- Post a detailed description of your problem for community assistance
Office Hours
- Join our weekly office hours for live troubleshooting
- Schedule a one-on-one session with our support engineers
