Downscaling

Intelligent downscaling is essential for optimizing Kubernetes costs without compromising application performance. This guide explains Stackbooster.io's approach to reducing cluster size when resources are underutilized.

How Stackbooster.io Downscaling Works

Stackbooster.io employs sophisticated algorithms to safely reduce cluster size when extra capacity is no longer needed:

Smart Capacity Reduction

Our platform goes beyond simple utilization thresholds:

  • Analyzes sustained underutilization patterns across multiple metrics
  • Considers pod distribution and resource requirements
  • Identifies nodes that can be safely drained and removed
  • Avoids disruptive scaling that could impact application performance

Workload Consolidation

Before removing nodes, Stackbooster.io optimizes pod placement:

  • Identifies pods that can be relocated to increase node efficiency
  • Uses bin-packing algorithms to maximize resource utilization
  • Respects pod affinity/anti-affinity rules and node selectors (see the example after this list)
  • Considers performance implications of pod migrations
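
For example, consolidation will never co-locate replicas that carry a required anti-affinity rule. The excerpt below is standard Kubernetes; the `app: web` label is illustrative:

```yaml
# Deployment pod-template excerpt: required anti-affinity keeps replicas on
# separate nodes, so consolidation cannot pack them onto one node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web            # illustrative label
        topologyKey: kubernetes.io/hostname
```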

Graceful Node Decommissioning

When removing nodes, our system ensures minimal disruption:

  • Cordons nodes to prevent new pod scheduling
  • Gradually drains pods, honoring each pod's termination grace period (see the example after this list)
  • Monitors pod migrations to ensure successful rescheduling
  • Aborts the process if any critical issues are detected
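
How gracefully a drain proceeds ultimately depends on each pod's own termination settings. The fields below are standard Kubernetes; the names and the sleep-based preStop hook are illustrative placeholders:

```yaml
# Pod template excerpt: give in-flight work time to finish before the
# drain force-kills the container.
spec:
  terminationGracePeriodSeconds: 60   # time allowed after SIGTERM
  containers:
    - name: api                       # illustrative name
      image: example/api:1.0          # illustrative image
      lifecycle:
        preStop:
          exec:
            # Illustrative hook: pause so load balancers stop routing here
            command: ["sh", "-c", "sleep 10"]
```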

Configuring Downscaling

Basic Configuration

To set up basic downscaling parameters:

  1. Navigate to your cluster in the Stackbooster.io dashboard
  2. Select "Scaling Configuration" > "Downscaling"
  3. Configure the following settings:
    • Underutilization Threshold: Resource level that triggers downscaling (default: 40%)
    • Scale-down Delay: Time a node must be underutilized before removal (default: 10 minutes)
    • Max Scale-down Rate: Maximum nodes to remove in a single scaling action
    • Workload Respect Level: How strictly to honor workload constraints
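
Taken together, these settings can be pictured as a single configuration block. The sketch below is hypothetical YAML that only mirrors the four dashboard settings above; the field names are illustrative, not a documented Stackbooster.io format:

```yaml
# Hypothetical downscaling configuration mirroring the dashboard settings;
# field names are illustrative only.
downscaling:
  underutilizationThreshold: 40   # percent (dashboard default)
  scaleDownDelay: 10m             # sustained underutilization before removal
  maxScaleDownRate: 2             # nodes removed per scaling action
  workloadRespectLevel: strict    # how strictly to honor workload constraints
```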

Advanced Settings

For more granular control, configure:

Node Protection Rules

Protect specific nodes from downscaling:

  1. Navigate to "Node Management" > "Protection Rules"
  2. Create rules based on:
    • Node labels or names
    • Time-based protection windows
    • Importance of the workloads running on the node (see the sketch after this list)
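
As a sketch, such a rule might look like the following hypothetical YAML (field names are illustrative; the dashboard exposes the equivalent controls):

```yaml
# Hypothetical protection rule; names and fields are illustrative.
protectionRules:
  - name: protect-stateful-tier
    nodeLabels:
      workload-tier: stateful     # never remove nodes carrying this label
    window: "Mon-Fri 08:00-20:00" # optional time-based protection window
```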

Pod Disruption Budgets

Honor Kubernetes PodDisruptionBudgets for controlled downscaling:

  1. Navigate to "Workload Settings" > "Disruption Controls"
  2. Configure how strictly to adhere to PDBs:
    • Strict: Never violate PDBs, even if it means delaying downscaling
    • Balanced: Respect PDBs but proceed after a reasonable waiting period
    • Relaxed: Consider PDBs as guidelines only
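
PDBs themselves are standard Kubernetes objects. For example, the following PodDisruptionBudget keeps at least two `app: payments` pods available during voluntary evictions such as node drains (the name and label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb       # illustrative name
spec:
  minAvailable: 2          # evictions pause once only two pods remain
  selector:
    matchLabels:
      app: payments        # illustrative label
```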

Time-Based Rules

Create rules that modify downscaling behavior based on time:

  1. Go to "Scheduling Rules"
  2. Create rules with:
    • Time windows for aggressive downscaling (e.g., nights, weekends)
    • Special handling for maintenance windows
    • Freeze periods when downscaling should be avoided
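
A hypothetical rendering of such rules in YAML (field names are illustrative only, not a documented Stackbooster.io format):

```yaml
# Hypothetical time-based rules; field names are illustrative.
schedulingRules:
  - name: weekend-savings
    window: "Sat-Sun 00:00-24:00"
    scaleDownDelay: 3m            # reclaim idle nodes faster off-hours
  - name: maintenance-freeze
    window: "Tue 02:00-04:00"
    action: freeze                # no downscaling during maintenance
```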

Downscaling Strategies

Stackbooster.io offers several downscaling strategies to match your operational needs:

Balanced (Default)

  • Moderate approach to node removal
  • Waits for sustained underutilization before taking action
  • Considers both resource efficiency and operational stability

Aggressive Cost Optimization

  • Prioritizes cost savings with more rapid downscaling
  • Removes nodes more quickly when underutilized
  • Maintains minimal excess capacity
  • Best for non-critical environments or dev/test clusters

Conservative

  • Takes a cautious approach to node removal
  • Requires longer periods of underutilization before downscaling
  • Maintains higher buffer capacity
  • Ideal for production environments with stringent reliability requirements

Custom

  • Define your own parameters for underutilization thresholds and timing
  • Create different strategies for different environments
  • Implement special handling for specific use cases
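
For illustration, a custom strategy set split by environment might be sketched like this (hypothetical YAML; field names are illustrative):

```yaml
# Hypothetical per-environment strategies; illustrative only.
strategies:
  - name: prod
    scaleDownDelay: 30m     # long evaluation window for stability
    maxScaleDownRate: 1     # remove at most one node per action
  - name: dev
    scaleDownDelay: 5m      # reclaim idle capacity quickly
    maxScaleDownRate: 5
```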

Best Practices

Determining Optimal Underutilization Thresholds

The ideal underutilization threshold depends on your workload characteristics:

  • Stable workloads: 30-40% is typically appropriate
  • Variable workloads: 40-50% provides a better buffer for fluctuations
  • Critical applications: 50-60% leaves headroom for unexpected demand

Scheduling Downscaling Windows

For predictable cost optimization:

  1. Identify low-usage periods in your application traffic patterns
  2. Create scheduled downscaling windows during these periods
  3. Configure more aggressive thresholds during these times
  4. Return to normal settings when usage typically increases

Handling Stateful Workloads

When your cluster runs stateful applications:

  1. Create node protection rules for nodes running stateful workloads
  2. Configure longer grace periods for pods with persistent storage
  3. Set stricter PDB adherence for database or caching systems
  4. Consider manual approval for downscaling actions affecting critical stateful services
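
For step 2, the grace period lives on the workload itself. A StatefulSet excerpt might look like this (the 300-second value is only an example):

```yaml
# StatefulSet pod-template excerpt: databases flushing state to persistent
# volumes often need far more than the default 30s to shut down cleanly.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 300
```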

Monitoring Downscaling Performance

To ensure your downscaling is effective:

  1. Monitor the "Scaling Performance" dashboard
  2. Review metrics such as:
    • Cost savings from downscaling actions
    • Application performance impact during node removals
    • Failed pod migrations during node draining
    • Frequency of scaling reversals (down then up quickly)
  3. Adjust configuration based on observations:
    • Increase the buffer if applications experience resource pressure
    • Decrease the scale-down delay if nodes remain idle too long
    • Modify protection rules if certain workloads are affected

Troubleshooting

Common Downscaling Issues

Nodes Not Scaling Down Despite Low Utilization

Potential causes and solutions:

  • DaemonSets: Check if DaemonSets are blocking node removal
  • PodDisruptionBudgets: Review if strict PDBs are preventing pod evictions
  • Node Selectors/Taints: Verify if pods require specific nodes
  • Protection Rules: Check for active protection rules blocking downscaling
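
Two of these blockers are visible directly in workload manifests. A PDB that permits zero disruptions makes every eviction fail, and an overly narrow nodeSelector can leave a pod with nowhere to reschedule (names and labels below are illustrative):

```yaml
# A PDB like this blocks all voluntary evictions, so its nodes never drain:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cache-pdb
spec:
  maxUnavailable: 0        # zero disruptions allowed; scale-down stalls
  selector:
    matchLabels:
      app: cache
---
# Pod spec excerpt: pinning to a single hostname prevents rescheduling
# when that node is cordoned for removal.
nodeSelector:
  kubernetes.io/hostname: node-7
```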

Workload Disruption During Downscaling

If applications are negatively impacted:

  • Increase the scale-down delay to ensure stability before node removal
  • Configure stricter PDB adherence in the downscaling settings
  • Add protection for sensitive workload nodes
  • Increase the pod eviction timeout to allow proper termination

Rapid Scale Down/Up Cycles

If your cluster shows "thrashing" between scaling down and up:

  • Increase the buffer threshold to maintain more spare capacity
  • Lengthen the scale-down evaluation period
  • Implement cooldown periods between scaling actions
  • Review your application's resource requests for accuracy
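
On the last point: schedulers and autoscalers typically weigh pod resource requests when judging how full a node is, so inflated requests make nodes look fuller than they are and missing requests make them look idle. A container excerpt with requests sized from observed usage (values are illustrative):

```yaml
# Container excerpt: requests close to steady-state usage keep node
# utilization accounting honest; limits cap worst-case spikes.
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
```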

Advanced Topics

Cost-Aware Node Selection

When determining which nodes to remove, Stackbooster.io considers:

  • Instance pricing and billing cycle position
  • Reserved instance coverage and commitment
  • Spot instance interruption probability
  • Node age and maintenance status

To optimize this selection:

  1. Navigate to "Cost Settings" > "Node Removal Priorities"
  2. Configure priorities based on your cost structure
  3. Adjust weighting between different cost factors
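
A hypothetical weighting across these factors (illustrative only; the dashboard exposes the equivalent controls):

```yaml
# Hypothetical removal-priority weights; a higher weight means the factor
# counts more when ranking nodes for removal. Illustrative only.
nodeRemovalPriorities:
  spotInterruptionRisk: 0.4
  billingCyclePosition: 0.3
  reservedInstanceCoverage: 0.2
  nodeAge: 0.1
```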

Integration with Cluster Autoscaler

If you're also using Kubernetes Cluster Autoscaler:

  1. Navigate to "Integration Settings" > "Kubernetes Autoscaler"
  2. Configure coordination mode:
    • Replace: Let Stackbooster.io handle all scaling (recommended)
    • Complement: Define separate responsibilities
    • Observe: Run alongside the existing autoscaler without interfering with it
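
In Complement or Observe mode it can help to mark pods that Cluster Autoscaler must never evict, using its standard pod annotation (whether Stackbooster.io honors the same annotation is not covered here):

```yaml
# Pod metadata excerpt: Cluster Autoscaler will not remove a node that
# runs a pod carrying this annotation.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```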

Downscaling with Spot Instances

For clusters using Spot instances:

  1. Configure "Spot Management" settings
  2. Set preferences for:
    • Spot vs On-Demand priority in downscaling
    • Handling of Spot termination notices
    • Replacement strategies when Spot availability changes
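
A hypothetical sketch of such preferences, assuming AWS-style Spot interruption notices (field names are illustrative):

```yaml
# Hypothetical Spot management settings; field names are illustrative.
spotManagement:
  downscalePriority: spot-first    # prefer removing Spot nodes first
  onTerminationNotice: drain       # start draining on the interruption
                                   # notice (two minutes on AWS EC2 Spot)
  replacement: on-demand-fallback  # backfill with On-Demand if Spot
                                   # capacity disappears
```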

By implementing these downscaling strategies, your Kubernetes clusters can stay right-sized, reducing costs while preserving application performance and reliability.
