
Cluster Hibernation

Cluster hibernation is a cost-saving feature that automatically scales down your Kubernetes clusters during periods of inactivity. This guide explains how Stackbooster.io hibernation works and how to configure it to maximize savings without compromising availability.

Understanding Cluster Hibernation

Cluster hibernation is the process of scaling a Kubernetes cluster to zero or near-zero resources during periods of inactivity, then restoring it when needed. This approach can provide significant cost savings, especially for non-production environments that don't require 24/7 availability.

Key Benefits

  • Cost Reduction: Save up to 70% on infrastructure costs by eliminating charges for idle resources during inactive periods
  • Automated Management: Schedule hibernation based on usage patterns without manual intervention
  • State Preservation: Maintain cluster configuration and state during hibernation
  • Fast Recovery: Quickly restore clusters to operational state when needed

How Stackbooster.io Hibernation Works

Stackbooster.io hibernation runs in three phases: detecting when a cluster can hibernate, shutting it down gracefully, and waking it up on demand.

Intelligent Detection

The platform determines when hibernation is appropriate by:

  • Monitoring overall cluster utilization across all resources
  • Analyzing usage patterns to identify predictable inactive periods
  • Detecting when active workloads have completed their tasks
  • Considering configured hibernation policies and schedules
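As an illustration of the pattern-analysis idea, the following minimal sketch finds hours of the day in which utilization never rose above a threshold. It is not Stackbooster.io's detection logic: the function name `quiet_hours`, the sample format, and the 5% threshold are all assumptions chosen for the example.

```python
from collections import defaultdict


def quiet_hours(samples, threshold=5.0):
    """Return hours of the day (0-23) whose peak utilization stayed below threshold.

    samples: iterable of (hour_of_day, utilization_percent) pairs gathered over
    several days. This input format is hypothetical, chosen for illustration.
    """
    peak = defaultdict(float)
    for hour, util in samples:
        # Track the highest utilization ever observed in each hour of the day.
        peak[hour] = max(peak[hour], util)
    return sorted(h for h, p in peak.items() if p < threshold)


samples = [(2, 1.0), (2, 3.5), (9, 40.0), (14, 62.0), (23, 2.0), (23, 4.9)]
print(quiet_hours(samples))  # [2, 23]
```

Using the peak rather than the average is a deliberately conservative choice: a single busy sample is enough to rule an hour out as a hibernation candidate.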

Graceful Shutdown

When initiating hibernation, the system:

  1. Marks the cluster as entering hibernation mode
  2. Captures the current state for future restoration
  3. Scales down workloads according to configured policies
  4. Drains and terminates worker nodes in a controlled manner
  5. Preserves control plane state (for EKS clusters) or scales the control plane down (for self-managed clusters)
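The shutdown sequence above can be sketched against a simple in-memory model. This is only a conceptual illustration, assuming a hypothetical dictionary shape for the cluster; it does not call Kubernetes or any Stackbooster.io API, and control plane handling (step 5) is left out.

```python
import copy


def hibernate(cluster):
    """Return (snapshot, hibernated_cluster) without mutating the input.

    cluster: {"workloads": {name: replicas}, "nodes": int}. This is an
    illustrative in-memory model, not a real cluster object.
    """
    snapshot = copy.deepcopy(cluster)          # capture state for later restoration
    hibernated = copy.deepcopy(cluster)
    hibernated["status"] = "hibernating"       # mark the cluster as hibernating
    for name in hibernated["workloads"]:
        hibernated["workloads"][name] = 0      # scale each workload down to zero
    hibernated["nodes"] = 0                    # worker nodes drained and terminated
    return snapshot, hibernated


cluster = {"workloads": {"api": 3, "worker": 2}, "nodes": 4}
snapshot, hibernated = hibernate(cluster)
print(hibernated)
```

The snapshot is taken before anything is scaled down, so the original replica counts remain available for the wake-up phase.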

Smart Wake-Up

When the cluster needs to be restored, the system:

  1. Detects wake-up triggers (schedule, API calls, webhooks)
  2. Restores the control plane if needed
  3. Provisions required worker nodes
  4. Restores workloads according to priority settings
  5. Verifies cluster health before marking the cluster as fully operational
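To make the priority-ordered restore concrete, here is a minimal sketch. The names `restore_order` and `wake_up`, the numeric priority scheme, and the `is_healthy` callback are hypothetical; the actual restore mechanism is internal to the platform.

```python
def restore_order(snapshot, priorities):
    """Order workloads for restoration: higher priority first, ties by name.

    snapshot: {name: replicas} captured before hibernation.
    priorities: {name: int}; workloads not listed default to priority 0.
    """
    return sorted(snapshot, key=lambda name: (-priorities.get(name, 0), name))


def wake_up(snapshot, priorities, is_healthy):
    """Restore replica counts in priority order, then run a health check."""
    restored = {}
    for name in restore_order(snapshot, priorities):
        restored[name] = snapshot[name]
    # Only report "operational" once the (caller-supplied) health check passes.
    status = "operational" if is_healthy(restored) else "degraded"
    return restored, status


snapshot = {"web": 3, "db": 1, "cache": 2}
priorities = {"db": 100, "cache": 50}
print(restore_order(snapshot, priorities))  # ['db', 'cache', 'web']
```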

Hibernation Types

Stackbooster.io offers several hibernation approaches:

Full Hibernation

  • Scales down all nodes to zero
  • Provides maximum cost savings
  • Longer recovery time (typically 3-5 minutes)
  • Best for dev/test environments with predictable usage

Partial Hibernation

  • Maintains minimal node count (typically 1-2 nodes)
  • Preserves critical services in running state
  • Faster recovery time (typically 1-2 minutes)
  • Good balance between savings and availability

Selective Hibernation

  • Scales down specific node groups or namespaces
  • Keeps essential services running while hibernating others
  • Customizable approach based on workload priorities
  • Ideal for mixed-use clusters whose workloads vary in importance

Configuring Hibernation

Basic Configuration

To set up basic hibernation parameters:

  1. Navigate to your cluster in the Stackbooster.io dashboard
  2. Select "Cost Optimization" > "Hibernation"
  3. Configure the following settings:
    • Hibernation Type: Full, Partial, or Selective
    • Inactivity Threshold: Time without activity before hibernation (default: 60 minutes)
    • Minimum Active Period: Time to stay active after wake-up (default: 30 minutes)
    • Priority Workloads: Services to restore first on wake-up

Schedule-Based Hibernation

Create time-based rules for predictable hibernation:

  1. Navigate to "Hibernation" > "Schedules"
  2. Create schedules such as:
    • Nightly hibernation (e.g., 8 PM - 7 AM on weekdays)
    • Weekend hibernation (e.g., Friday 6 PM - Monday 6 AM)
    • Holiday periods or company-wide time off
  3. Configure exceptions for maintenance windows or planned usage
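To estimate how much of the week such schedules cover, the sketch below encodes the two example windows above (nightly 8 PM to 7 AM on weekdays, and Friday 6 PM to Monday 6 AM) and counts the hibernated hours. The function names are illustrative, time zones and exceptions are ignored, and the nightly window is assumed to start on Monday through Friday evenings and end the following morning.

```python
from datetime import datetime, timedelta


def is_hibernation_time(dt):
    """True if dt falls inside the example nightly or weekend window."""
    wd, hour = dt.weekday(), dt.hour          # Monday is 0, Sunday is 6
    weekend = (wd == 4 and hour >= 18) or wd in (5, 6) or (wd == 0 and hour < 6)
    night_evening = wd <= 4 and hour >= 20    # Mon-Fri from 8 PM
    night_morning = 1 <= wd <= 5 and hour < 7 # the following morning until 7 AM
    return weekend or night_evening or night_morning


def weekly_hibernated_hours(week_start):
    """Count hibernated hours in the 168 hours starting at week_start."""
    return sum(
        is_hibernation_time(week_start + timedelta(hours=h)) for h in range(168)
    )


monday = datetime(2024, 1, 1)                 # 1 January 2024 was a Monday
hours = weekly_hibernated_hours(monday)
print(hours, round(hours / 168, 3))           # 104 0.619
```

With both schedules active, the cluster is hibernated for 104 of 168 hours, roughly 62% of the week. Actual savings depend on which costs continue to accrue while hibernated.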

Usage-Based Hibernation

Configure dynamic hibernation based on actual usage:

  1. Navigate to "Hibernation" > "Usage Triggers"
  2. Define conditions such as:
    • CPU utilization below X% for Y minutes
    • No incoming traffic for Z minutes
    • Completion of specific batch jobs or workflows
  3. Set confirmation requirements before hibernation occurs
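A condition such as "CPU utilization below X% for Y minutes" can be read as "every sample in the last Y minutes is below X". The sketch below implements that reading, assuming one utilization sample per minute; the function name and input format are hypothetical, and the platform may evaluate the condition differently.

```python
def should_hibernate(cpu_samples, threshold_percent, window_minutes):
    """Return True if CPU stayed below the threshold for the whole window.

    cpu_samples: per-minute CPU utilization percentages, oldest first
    (an assumed format for this example).
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    if len(cpu_samples) < window_minutes:
        return False                          # not enough history to decide
    recent = cpu_samples[-window_minutes:]
    return all(sample < threshold_percent for sample in recent)


history = [35.0, 12.0, 4.0, 3.0, 2.5, 3.1]
print(should_hibernate(history, threshold_percent=5.0, window_minutes=4))  # True
print(should_hibernate(history, threshold_percent=5.0, window_minutes=5))  # False
```

Requiring every sample in the window to be below the threshold, rather than the average, keeps a short spike from being masked by surrounding idle minutes.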

Wake-Up Triggers

Configure methods to wake up hibernated clusters:

  1. Navigate to "Hibernation" > "Wake-Up Triggers"
  2. Configure triggers such as:
    • Scheduled times (aligned with work hours)
    • API calls or webhook endpoints
    • CI/CD pipeline triggers
    • Manual wake-up via dashboard or CLI

Best Practices

Identifying Hibernation Candidates

Not all clusters are equally suitable for hibernation. Good candidates include:

  • Development and testing environments
  • Demo or sandbox clusters
  • Training environments
  • Batch processing clusters with predictable run times
  • CI/CD build clusters with intermittent usage

Preparing Workloads for Hibernation

To ensure smooth hibernation and recovery:

  • Use persistent storage for critical state information
  • Implement proper startup and shutdown procedures in your applications
  • Set appropriate readiness and liveness probes
  • Configure workload priority classes for orderly restoration

Testing Hibernation

Before enabling hibernation in production:

  1. Test hibernation in a non-critical environment
  2. Verify all services restore correctly after wake-up
  3. Measure actual recovery time to set expectations
  4. Create runbooks for manual intervention if needed

Monitoring and Optimization

To maximize hibernation benefits:

  1. Monitor hibernation patterns in the "Cost Optimization" dashboard
  2. Review metrics such as:
    • Cost savings achieved through hibernation
    • Hibernation frequency and duration
    • Recovery time and success rate
  3. Adjust configuration based on observations:
    • Fine-tune schedules based on actual usage patterns
    • Modify inactivity thresholds if triggering is too sensitive
    • Adjust wake-up procedures if recovery time is too long

Troubleshooting

Common Hibernation Issues

Unexpected Wake-Ups

If your cluster wakes up when it should be hibernating:

  • Review the wake-up logs to identify the trigger
  • Check for scheduled tasks or external systems calling your APIs
  • Verify webhook configurations aren't causing unintended activations
  • Review monitoring systems that might be probing endpoints

Slow Recovery

If wake-up takes longer than expected:

  • Check if required instance types have limited availability
  • Review workload startup sequences for bottlenecks
  • Verify that persistent volumes are mounting correctly
  • Consider partial hibernation instead of full hibernation

Failed State Restoration

If workloads don't properly restore:

  • Review application logs for startup errors
  • Check for services with dependencies on external systems
  • Verify persistent volume claims are being bound correctly
  • Ensure stateful services have appropriate startup procedures

Advanced Topics

Stateful Services Handling

For clusters with stateful services:

  1. Navigate to "Hibernation" > "Stateful Services"
  2. Configure special handling for databases and other stateful components:
    • Pre-hibernation backup procedures
    • Controlled shutdown sequences
    • Prioritized restoration order
    • Health verification before accepting traffic

Hibernation with GitOps

If you use GitOps tools like Flux or ArgoCD:

  1. Configure your GitOps tools to be hibernation-aware
  2. Ensure application reconciliation occurs after cluster wake-up
  3. Set appropriate retry mechanisms for the recovery period
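Flux and ArgoCD have their own retry and sync settings, which this guide does not cover. For scripts or pipelines of your own that must wait for a cluster to finish waking up, a generic retry with exponential backoff might look like the sketch below. The `check` callable is a placeholder for whatever readiness test you use, and the function name and defaults are assumptions.

```python
import time


def retry_until_ready(check, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call check() until it returns True, doubling the wait after each failure.

    Returns True as soon as check() succeeds, or False after all attempts fail.
    The sleep parameter is injectable so the function can be tested without waiting.
    """
    for n in range(attempts):
        if check():
            return True
        if n < attempts - 1:
            sleep(base_delay * 2 ** n)        # 1s, 2s, 4s, ... with the defaults
    return False


# Usage with a stand-in readiness check that succeeds on the third call.
results = iter([False, False, True])
print(retry_until_ready(lambda: next(results), base_delay=0.01))  # True
```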

Cost-Benefit Analysis

Track and optimize the financial impact of hibernation:

  1. Navigate to "Cost Reports" > "Hibernation Savings"
  2. Review detailed breakdown of:
    • Hourly cost of running vs. hibernated state
    • Total monthly savings from hibernation
    • Overhead costs of hibernation/wake-up cycles
    • Optimization recommendations for higher savings

By implementing these hibernation strategies, your Kubernetes clusters can achieve significant cost savings during periods of inactivity while remaining readily available when needed.

Released under the MIT License. Contact us at [email protected]