Cluster Hibernation

Cluster hibernation cuts costs by automatically scaling down your Kubernetes clusters during periods of inactivity. This guide explains how Stackbooster.io's hibernation works and how to configure it to save as much as possible without compromising availability.

Understanding Cluster Hibernation

Cluster hibernation is the process of scaling a Kubernetes cluster to zero or near-zero resources during periods of inactivity, then restoring it when needed. This approach can provide significant cost savings, especially for non-production environments that don't require 24/7 availability.

Key Benefits

Cost Reduction: Save up to 70% on infrastructure costs by eliminating worker node charges during inactive periods
Automated Management: Schedule hibernation based on usage patterns without manual intervention
State Preservation: Maintain cluster configuration and state during hibernation
Fast Recovery: Quickly restore clusters to operational state when needed

How Stackbooster.io Hibernation Works

Stackbooster.io hibernation runs in three stages: detection, shutdown, and wake-up.

Intelligent Detection

The platform determines when hibernation is appropriate by:

Monitoring overall cluster utilization across all resources
Analyzing usage patterns to identify predictable inactive periods
Detecting when active workloads have completed their tasks
Considering configured hibernation policies and schedules
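
The pattern analysis described above can be illustrated with a small sketch. It is not Stackbooster.io's actual detection logic; the function name, the 5% CPU threshold, and the minimum of three observations are assumptions chosen for illustration. It groups historical CPU readings by weekday and hour and keeps only the slots that were quiet every time they were observed.

```python
from collections import defaultdict
from datetime import datetime


def predictable_idle_slots(samples, cpu_threshold=5.0, min_observations=3):
    """Return the (weekday, hour) slots that look reliably idle.

    samples: iterable of (datetime, cpu_percent) readings.
    A slot qualifies when it has at least min_observations readings and
    every one of them is below cpu_threshold. Both values are illustrative.
    """
    by_slot = defaultdict(list)
    for at, cpu_percent in samples:
        # weekday(): Monday is 0, Sunday is 6
        by_slot[(at.weekday(), at.hour)].append(cpu_percent)
    return {
        slot
        for slot, readings in by_slot.items()
        if len(readings) >= min_observations and max(readings) < cpu_threshold
    }


# Three weeks of hypothetical readings (1 January 2024 is a Monday).
history = [
    (datetime(2024, 1, 1, 2), 1.0),    # Mondays 02:00: always quiet
    (datetime(2024, 1, 8, 2), 2.0),
    (datetime(2024, 1, 15, 2), 1.5),
    (datetime(2024, 1, 1, 10), 60.0),  # Mondays 10:00: busy
    (datetime(2024, 1, 8, 10), 55.0),
    (datetime(2024, 1, 15, 10), 70.0),
    (datetime(2024, 1, 2, 2), 1.0),    # Tuesdays 02:00: one spike
    (datetime(2024, 1, 9, 2), 1.0),
    (datetime(2024, 1, 16, 2), 40.0),
]
print(predictable_idle_slots(history))  # {(0, 2)}
```

Requiring every observation to be below the threshold is deliberately conservative: a single spike, as on the Tuesday slot, is enough to keep that hour out of the candidate set.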

Graceful Shutdown

When initiating hibernation, the system:

Marks the cluster as entering hibernation mode
Captures the current state for future restoration
Scales down workloads according to configured policies
Drains and terminates worker nodes in a controlled manner
Preserves control plane state (for EKS) or scales the control plane down (for self-managed clusters)
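
As a rough illustration of the "capture state, then scale down" steps above, the sketch below records each workload's replica count and derives a scale-down order. The `Workload` type, its `priority` field, and the "least important first" ordering are assumptions for this example, not the platform's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical view of a workload; higher priority means more important."""
    name: str
    replicas: int
    priority: int


def plan_shutdown(workloads):
    """Return (snapshot, order).

    snapshot maps each workload to the replica count to restore later.
    order lists workload names for scale-down, least important first,
    so critical services stay up for as long as possible.
    """
    snapshot = {w.name: w.replicas for w in workloads}
    order = [w.name for w in sorted(workloads, key=lambda w: w.priority)]
    return snapshot, order


workloads = [
    Workload("api", replicas=3, priority=100),
    Workload("batch", replicas=5, priority=10),
    Workload("db", replicas=1, priority=1000),
]
snapshot, order = plan_shutdown(workloads)
print(order)  # ['batch', 'api', 'db']
```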

Smart Wake-Up

When the cluster needs to be restored, the system:

Detects wake-up triggers (schedule, API calls, webhooks)
Restores the control plane if needed
Provisions required worker nodes
Restores workloads according to priority settings
Verifies cluster health before marking the cluster as fully operational
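
The priority-based restoration step can be sketched as follows. This is a simplified model under stated assumptions: the `snapshot` and `priorities` dictionaries are hypothetical inputs, and workloads without an explicit priority default to 0.

```python
def plan_wake_up(snapshot, priorities):
    """Return (name, replicas) pairs to restore, most important first.

    snapshot: replica counts captured before hibernation.
    priorities: name -> priority, where a higher value restores earlier;
    names missing from priorities default to 0.
    Workloads that had zero replicas before hibernation are skipped.
    """
    ordered = sorted(
        snapshot, key=lambda name: priorities.get(name, 0), reverse=True
    )
    return [(name, snapshot[name]) for name in ordered if snapshot[name] > 0]


saved = {"api": 3, "batch": 5, "db": 1, "retired": 0}
plan = plan_wake_up(saved, {"db": 1000, "api": 100})
print(plan)  # [('db', 1), ('api', 3), ('batch', 5)]
```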

Hibernation Types

Stackbooster.io offers several hibernation approaches:

Full Hibernation

Scales down all nodes to zero
Provides maximum cost savings
Longer recovery time (typically 3-5 minutes)
Best for dev/test environments with predictable usage

Partial Hibernation

Maintains minimal node count (typically 1-2 nodes)
Preserves critical services in running state
Faster recovery time (typically 1-2 minutes)
Good balance between savings and availability

Selective Hibernation

Scales down specific node groups or namespaces
Keeps essential services running while hibernating others
Customizable approach based on workload priorities
Ideal for mixed-use clusters whose workloads vary in importance

Configuring Hibernation

Basic Configuration

To set up basic hibernation parameters:

Navigate to your cluster in the Stackbooster.io dashboard
Select "Cost Optimization" > "Hibernation"
Configure the following settings:
- Hibernation Type: Full, Partial, or Selective
- Inactivity Threshold: Time without activity before hibernation (default: 60 minutes)
- Minimum Active Period: Time to stay active after wake-up (default: 30 minutes)
- Priority Workloads: Services to restore first on wake-up
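
If you keep a record of these settings alongside your infrastructure code, a small data model helps catch invalid values early. The sketch below is only an illustration: the class and field names are hypothetical and do not correspond to a documented Stackbooster.io API. The defaults mirror the dashboard defaults listed above.

```python
from dataclasses import dataclass, field
from enum import Enum


class HibernationType(Enum):
    FULL = "full"
    PARTIAL = "partial"
    SELECTIVE = "selective"


@dataclass
class HibernationConfig:
    """Hypothetical model of the basic hibernation settings."""
    hibernation_type: HibernationType = HibernationType.FULL
    inactivity_threshold_minutes: int = 60   # dashboard default
    minimum_active_period_minutes: int = 30  # dashboard default
    priority_workloads: list = field(default_factory=list)

    def __post_init__(self):
        # Reject values that could never describe a usable policy.
        if self.inactivity_threshold_minutes <= 0:
            raise ValueError("inactivity threshold must be positive")
        if self.minimum_active_period_minutes < 0:
            raise ValueError("minimum active period cannot be negative")


config = HibernationConfig(
    hibernation_type=HibernationType.PARTIAL,
    priority_workloads=["db", "api"],
)
print(config.inactivity_threshold_minutes)  # 60
```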

Schedule-Based Hibernation

Create time-based rules for predictable hibernation:

Navigate to "Hibernation" > "Schedules"
Create schedules such as:
- Nightly hibernation (e.g., 8 PM - 7 AM on weekdays)
- Weekend hibernation (e.g., Friday 6 PM - Monday 6 AM)
- Holiday periods or company-wide time off
Configure exceptions for maintenance windows or planned usage
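
To reason about how schedules and exceptions interact, it can help to model them as weekly time windows. The following sketch assumes windows expressed in minutes since Monday 00:00 and naive timestamps in the cluster's local time; the function names and the rule that exceptions always win are assumptions for illustration. The windows mirror the example schedules above.

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60


def week_minute(weekday, hour, minute=0):
    """Minutes elapsed since Monday 00:00 (Monday is weekday 0)."""
    return weekday * MINUTES_PER_DAY + hour * 60 + minute


def in_window(now, start, end):
    """True if now falls in [start, end); a window may wrap past Sunday."""
    t = week_minute(now.weekday(), now.hour, now.minute)
    if start <= end:
        return start <= t < end
    return t >= start or t < end


# Monday to Thursday nights: 20:00 until 07:00 the next day.
NIGHTLY = [(week_minute(d, 20), week_minute(d + 1, 7)) for d in range(4)]
# Weekend: Friday 18:00 until Monday 06:00 (wraps around the week).
WEEKEND = [(week_minute(4, 18), week_minute(0, 6))]


def should_hibernate(now, windows=NIGHTLY + WEEKEND, exceptions=()):
    """Exceptions (e.g. maintenance windows) override hibernation windows."""
    if any(in_window(now, s, e) for s, e in exceptions):
        return False
    return any(in_window(now, s, e) for s, e in windows)


# 6 January 2024 is a Saturday; 3 January 2024 is a Wednesday.
print(should_hibernate(datetime(2024, 1, 6, 12, 0)))  # True
print(should_hibernate(datetime(2024, 1, 3, 12, 0)))  # False
```

Treating each window as half-open (start included, end excluded) avoids ambiguity at the boundaries: at exactly 07:00 on a weekday the cluster counts as awake.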

Usage-Based Hibernation

Configure dynamic hibernation based on actual usage:

Navigate to "Hibernation" > "Usage Triggers"
Define conditions such as:
- CPU utilization below X% for Y minutes
- No incoming traffic for Z minutes
- Completion of specific batch jobs or workflows
Set confirmation requirements before hibernation occurs
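
The "below X% for Y minutes" style of condition can be sketched as a check over recent readings. The `Reading` type, the default thresholds, and the rule that missing history keeps the cluster awake are all assumptions made for this example, not documented platform behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reading:
    """Hypothetical metrics sample."""
    at: datetime
    cpu_percent: float
    requests: int  # incoming requests since the previous reading


def usage_trigger_met(
    readings,
    now,
    cpu_below=5.0,
    cpu_for=timedelta(minutes=30),
    no_traffic_for=timedelta(minutes=15),
):
    """True when CPU stayed under cpu_below for cpu_for AND no traffic
    arrived for no_traffic_for. Insufficient history counts as 'not idle'."""

    def quiet(window, predicate):
        start = now - window
        # Require history that reaches back to the start of the window.
        if not any(r.at <= start for r in readings):
            return False
        recent = [r for r in readings if start <= r.at <= now]
        return bool(recent) and all(predicate(r) for r in recent)

    return quiet(cpu_for, lambda r: r.cpu_percent < cpu_below) and quiet(
        no_traffic_for, lambda r: r.requests == 0
    )


now = datetime(2024, 1, 1, 12, 0)
# One quiet reading every 5 minutes over the last hour.
base = [Reading(now - timedelta(minutes=m), 2.0, 0) for m in range(0, 61, 5)]
print(usage_trigger_met(base, now))  # True
```

Failing closed when history is missing means a metrics outage delays hibernation rather than triggering it.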

Wake-Up Triggers

Configure methods to wake up hibernated clusters:

Navigate to "Hibernation" > "Wake-Up Triggers"
Configure triggers such as:
- Scheduled times (aligned with work hours)
- API calls or webhook endpoints
- CI/CD pipeline triggers
- Manual wake-up via dashboard or CLI

Best Practices

Identifying Hibernation Candidates

Not all clusters are equally suitable for hibernation. Good candidates include:

Development and testing environments
Demo or sandbox clusters
Training environments
Batch processing clusters with predictable run times
CI/CD build clusters with intermittent usage

Preparing Workloads for Hibernation

To ensure smooth hibernation and recovery:

Use persistent storage for critical state information
Implement proper startup and shutdown procedures in your applications
Set appropriate readiness and liveness probes
Configure workload priority classes for orderly restoration
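
A proper shutdown procedure usually starts with handling SIGTERM, which Kubernetes sends to a container's main process before the pod's termination grace period expires. The sketch below shows one common pattern for a Python worker: the signal handler only sets a flag, and the main loop finishes the current item and saves its progress. The `run_worker` function and its `process` and `save_state` callbacks are hypothetical names for this example.

```python
import signal
import threading

# Set when the container is asked to stop (SIGTERM on pod termination).
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record the request; the loop does the real work."""
    stop_requested.set()


def install_sigterm_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(items, process, save_state):
    """Process items until finished or asked to stop, then persist progress."""
    completed = 0
    for item in items:
        if stop_requested.is_set():
            break
        process(item)
        completed += 1
    # Persist progress so the workload can resume after wake-up.
    save_state(completed)
    return completed
```

Call `install_sigterm_handler()` once at startup, before entering the work loop. In a real workload, `save_state` would write to persistent storage so the progress survives node termination.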

Testing Hibernation

Before enabling hibernation in production:

Test hibernation in a non-critical environment
Verify all services restore correctly after wake-up
Measure actual recovery time to set expectations
Create runbooks for manual intervention if needed
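
Recovery time can be measured with a simple polling loop around whatever health check you already trust. In this sketch, `is_healthy` is a caller-supplied function (for example, one that queries your service's health endpoint); the timeout and interval defaults are illustrative assumptions.

```python
import time


def measure_recovery(
    is_healthy,
    timeout_s=600.0,
    interval_s=5.0,
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Poll is_healthy() and return the seconds elapsed until it first
    returns True, or None if timeout_s passes without a healthy result.

    clock and sleep are injectable so the function can be tested
    without real waiting.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Start the measurement at the moment you trigger the wake-up and record the result for several runs; a single measurement says little about the worst case.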

Monitoring and Optimization

To maximize hibernation benefits:

Monitor hibernation patterns in the "Cost Optimization" dashboard
Review metrics such as:
- Cost savings achieved through hibernation
- Hibernation frequency and duration
- Recovery time and success rate
Adjust configuration based on observations:
- Fine-tune schedules based on actual usage patterns
- Modify inactivity thresholds if triggering is too sensitive
- Adjust wake-up procedures if recovery time is too long

Troubleshooting

Common Hibernation Issues

Unexpected Wake-Ups

If your cluster wakes up when it should be hibernating:

Review the wake-up logs to identify the trigger
Check for scheduled tasks or external systems calling your APIs
Verify webhook configurations aren't causing unintended activations
Review monitoring systems that might be probing endpoints

Slow Recovery

If wake-up takes longer than expected:

Check if required instance types have limited availability
Review workload startup sequences for bottlenecks
Verify that persistent volumes are mounting correctly
Consider partial hibernation instead of full hibernation

Failed State Restoration

If workloads don't properly restore:

Review application logs for startup errors
Check for services with dependencies on external systems
Verify persistent volume claims are being bound correctly
Ensure stateful services have appropriate startup procedures

Advanced Topics

Stateful Services Handling

For clusters with stateful services:

Navigate to "Hibernation" > "Stateful Services"
Configure special handling for databases and other stateful components:
- Pre-hibernation backup procedures
- Controlled shutdown sequences
- Prioritized restoration order
- Health verification before accepting traffic

Hibernation with GitOps

If you use GitOps tools like Flux or Argo CD:

Configure your GitOps tools to be hibernation-aware
Ensure application reconciliation occurs after cluster wake-up
Set appropriate retry mechanisms for the recovery period
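
Flux and Argo CD each provide their own retry settings, which are the right place to configure this in practice. For custom scripts that run during the recovery period, a generic retry with exponential backoff is a common approach. The sketch below is tool-agnostic; `operation` stands for any hypothetical callable, such as one that asks your GitOps tool to reconcile, and the delay values are assumptions.

```python
import time


def retry_with_backoff(
    operation,
    attempts=5,
    base_delay_s=2.0,
    max_delay_s=60.0,
    sleep=time.sleep,
):
    """Call operation() until it succeeds or attempts run out.

    The delay doubles after each failure, capped at max_delay_s.
    The last failure is re-raised so callers can alert on it.
    """
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

Backing off gives the API server and freshly provisioned nodes time to become ready instead of flooding them with requests right after wake-up.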

Cost-Benefit Analysis

Track and optimize the financial impact of hibernation:

Navigate to "Cost Reports" > "Hibernation Savings"
Review detailed breakdown of:
- Hourly cost of running vs. hibernated state
- Total monthly savings from hibernation
- Overhead costs of hibernation/wake-up cycles
- Optimization recommendations for higher savings
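
The items in this breakdown combine into a simple net-savings calculation. The sketch below uses made-up prices and hours purely to show the arithmetic; the function names and all figures are hypothetical, and real numbers should come from your own cost reports.

```python
def hibernation_savings(
    running_hourly, hibernated_hourly, hibernated_hours, cycles, overhead_per_cycle
):
    """Net savings over a period: avoided cost minus cycle overhead."""
    avoided = (running_hourly - hibernated_hourly) * hibernated_hours
    return avoided - cycles * overhead_per_cycle


def savings_fraction(net_savings, running_hourly, total_hours):
    """Net savings as a fraction of the always-on cost for the same period."""
    return net_savings / (running_hourly * total_hours)


# Hypothetical month: 730 hours in total, 450 of them hibernated,
# 30 hibernate/wake-up cycles at 0.50 overhead each.
net = hibernation_savings(2.00, 0.10, 450, cycles=30, overhead_per_cycle=0.50)
print(round(net, 2))                               # 840.0
print(round(savings_fraction(net, 2.00, 730), 3))  # 0.575
```

With these numbers the cluster saves about 58% of its always-on cost. The overhead term matters most for clusters that hibernate and wake up many times a day, where frequent cycles can erode the savings.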

By implementing these hibernation strategies, you can achieve significant cost savings during periods of inactivity while keeping your Kubernetes clusters readily available when needed.
