CloudHub 2.0 High Availability and Disaster Recovery

CloudHub 2.0 provides high availability (HA) and disaster recovery (DR) capabilities for applications and protection against hardware failures.

CloudHub 2.0 leverages Amazon AWS for its cloud infrastructure, so availability is directly dependent on Amazon services. CloudHub 2.0 deployments and availability are region-based and correspond to Amazon regions. If an Amazon region goes down, the applications within that region are unavailable and not automatically replicated in other regions.

In the event of a network partition between the control plane and the runtime plane, applications in the runtime plane continue to run. The runtime plane continues to locally buffer log and telemetry data until control plane availability is restored.

CloudHub 2.0 leverages external messaging services, such as Anypoint MQ, to ensure message reliability through persistent queues. These external queues are highly available within a region, but during a regional outage they can be inaccessible for short periods, generally seconds to minutes, which can lead to some data loss. Communication with the queues resumes after the region becomes available. For more information, see Configuring Cross-Region Failover for Standard Queues.
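One common way for a client to ride out a short period of queue inaccessibility is to retry with exponential backoff. The following Python sketch illustrates the idea only. The `send` callable, the use of `ConnectionError`, and the attempt and delay values are assumptions made for this example; they are not an Anypoint MQ API, and retrying does not by itself rule out data loss during a longer outage.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], None],
                    max_attempts: int = 5,
                    base_delay_seconds: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it succeeds and return the number of attempts used.

    `send` is a hypothetical callable that raises ConnectionError while the
    queue is unreachable. It stands in for whatever client you actually use.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Give up and surface the error to the caller.
            # Exponential backoff: 1s, 2s, 4s, ... between attempts.
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Simulated sender that fails twice before succeeding.
calls = {"count": 0}


def flaky_send() -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("queue temporarily unreachable")


delays = []  # Record the delays instead of actually sleeping.
attempts = send_with_retry(flaky_send, sleep=delays.append)
print(attempts, delays)  # 3 [1.0, 2.0]
```

If the outage outlasts the retry budget, the final `ConnectionError` propagates, so the caller still needs a plan for messages that could not be delivered.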

Anypoint Object Store v2 enables CloudHub 2.0 applications to store data and states across various components and applications. Object Store v2 is a regional service that is maintained in the same region as the deployed CloudHub 2.0 application. After a regional outage, data persists and becomes available when the region returns to service. Object Store v2 doesn’t provide region failover functionality or periodic backups, and you can’t access its data from an application in a different region. For cross-region data persistence and access, implement your own external storage solution.

High Availability Versus Disaster Recovery

High availability (HA) is the measure of a system’s ability to remain accessible despite a system component failure. You generally implement HA by building in multiple levels of fault tolerance or load balancing capabilities into a system. In CloudHub 2.0, you can achieve high availability by deploying your application with multiple replicas.

Disaster recovery (DR) refers to the process of restoring a system to an acceptable previous state after a natural or man-made disaster, such as flooding, tornadoes, earthquakes, fires, power failures, server failures, and misconfigurations.

Both HA and DR increase overall availability. The difference is that HA retains the service, generally with no loss of service, whereas DR retains the data, usually with a slight loss of service while the DR plan executes and the system restores.

These key terms are essential to understanding HA and DR strategies in CloudHub 2.0:

Recovery Time Objective (RTO): The maximum amount of downtime a business can tolerate. The RTO sets the target for how long the system can take to recover after a business disruption.
Recovery Point Objective (RPO): The maximum amount of data loss, measured in time, that is acceptable after a disaster. The RPO influences the frequency of data backups.
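As a rough illustration of how these two objectives are evaluated, the following Python sketch compares a hypothetical incident timeline against RTO and RPO targets. The function name, the dictionary keys, and all timestamps are assumptions made for this example; they aren't part of CloudHub 2.0.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_backup: datetime,
                      outage_start: datetime,
                      service_restored: datetime,
                      rto: timedelta,
                      rpo: timedelta) -> dict:
    """Compare one incident against RTO and RPO targets (illustrative only)."""
    downtime = service_restored - outage_start      # measured against the RTO
    data_loss_window = outage_start - last_backup   # measured against the RPO
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident: last backup at 10:00, outage at 10:40, restored at 11:05.
result = evaluate_recovery(
    last_backup=datetime(2024, 1, 1, 10, 0),
    outage_start=datetime(2024, 1, 1, 10, 40),
    service_restored=datetime(2024, 1, 1, 11, 5),
    rto=timedelta(minutes=30),
    rpo=timedelta(minutes=15),
)
print(result["rto_met"], result["rpo_met"])  # True False
```

In this example the 25-minute downtime meets the 30-minute RTO, but the 40 minutes of changes since the last backup exceed the 15-minute RPO, so backups would need to run more often.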

High Availability in CloudHub 2.0

CloudHub 2.0 implements high availability through built-in mechanisms that MuleSoft manages automatically and that require no additional configuration.

Multiple-Replica Deployment: If an application uses multiple replicas, CloudHub 2.0 deploys these replicas across two or more availability zones (AZs) by default. In case of an AZ failure, CloudHub 2.0 automatically restarts the application in a different AZ to maintain availability.

Disaster Recovery in CloudHub 2.0

Disaster recovery in CloudHub 2.0 focuses on the process of restoring systems after significant disruptions.

Global Distribution

You can deploy applications to CloudHub 2.0 in various global regions, including North America, South America, the European Union, and Asia-Pacific.

If you have the Global Deployment entitlement and are in the US cloud or EU cloud, you can deploy applications to CloudHub 2.0 in more than one runtime plane region.

You can host integrations in a runtime plane region that is closer to your services to reduce latency.

For a list of available runtime plane regions based on the control plane region where your organization is provisioned, see Runtime Plane Regions and DNS Records.

The runtime plane region is the region where you deploy your CloudHub 2.0 applications and create CloudHub 2.0 private spaces.

Regional Infrastructure

You can create your private space in a region of your choice.

Customer Responsibility for Disaster Recovery

If your organization has cross-region DR requirements for your applications, build your applications accordingly and consider deploying them across multiple regions. You can use a load balancer, either cloud-based or on-premises, for applications deployed to different regions to switch traffic to your backup region as part of your DR strategy.

For more information about how to deploy for HA and DR strategies, see High Availability and Disaster Recovery.

Disaster Recovery Use Cases and Architecture for the Runtime Plane

CloudHub 2.0 supports various multi-region disaster recovery strategies based on application impact and statefulness. These strategies typically involve a bring-your-own (BYO) global load balancer to manage traffic distribution and failover across regions.

To decouple your runtime plane application availability from the control plane region, configure your primary and backup runtime plane regions to be different from the region where your Anypoint Platform control plane is located. This is applicable to EU, with the control plane in eu-central-1, and US, with the control plane in us-east-1.
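The region guidance above can be expressed as a small pre-deployment check. This Python sketch is only an illustration: the function name is hypothetical, and the runtime plane regions passed in the example calls are placeholder values, so confirm which regions are actually available to your organization before relying on them.

```python
from typing import List

# Control plane regions named in the text above.
CONTROL_PLANE_REGIONS = {"US": "us-east-1", "EU": "eu-central-1"}


def validate_dr_regions(cloud: str, primary: str, backup: str) -> List[str]:
    """Return a list of problems with a proposed primary/backup region pair."""
    problems = []
    control_plane = CONTROL_PLANE_REGIONS[cloud]
    if primary == backup:
        problems.append("primary and backup regions must differ")
    for role, region in (("primary", primary), ("backup", backup)):
        if region == control_plane:
            problems.append(
                f"{role} region {region} matches the control plane region"
            )
    return problems


# Placeholder region choices for illustration.
print(validate_dr_regions("US", "us-east-2", "us-west-2"))
print(validate_dr_regions("EU", "eu-central-1", "eu-west-1"))
```

An empty list means the pair keeps both runtime plane regions separate from the control plane region. The second call reports that the primary region coincides with the EU control plane region.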

You can use a load balancer, either cloud-based or on-premises, for applications deployed to different regions to provide a better disaster recovery strategy.

Figure: Basic disaster recovery configuration.

Use Case 1: Active-Active Configuration

When to Use

High-impact, stateless applications that require continuous availability with minimal downtime.

Application State

Set up the infrastructure with applications in different regions, with these applications running simultaneously and fully scaled up.

Traffic Management

Use a BYO global load balancer to distribute traffic between both regions.
Perform health checks to detect disasters.
Applications continue to run even if the control plane region is unavailable, as long as these applications are deployed in different regions than the control plane.
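The traffic management behavior described above can be sketched in a few lines of Python. This is a simplified model of what a global load balancer does, not a replacement for one. The region names, the health check callables, and the round-robin policy are all assumptions made for this example.

```python
from itertools import cycle
from typing import Callable, Dict, List


def healthy_regions(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each region's health check and keep the regions that pass."""
    healthy = []
    for region, check in checks.items():
        try:
            if check():
                healthy.append(region)
        except Exception:
            # A failing or unreachable health endpoint counts as unhealthy.
            pass
    return healthy


def route_requests(request_ids: List[str], regions: List[str]) -> Dict[str, str]:
    """Assign requests round-robin across the currently healthy regions."""
    if not regions:
        raise RuntimeError("no healthy region available")
    targets = cycle(regions)
    return {request_id: next(targets) for request_id in request_ids}


def region_b_down() -> bool:
    raise ConnectionError("simulated regional outage")


# Hypothetical regions; both are active and receive traffic.
checks = {"region-a": lambda: True, "region-b": lambda: True}
both_up = route_requests(["r1", "r2", "r3", "r4"], healthy_regions(checks))
print(both_up)

# Simulate a disaster in region-b: all traffic moves to region-a.
checks["region-b"] = region_b_down
one_down = route_requests(["r5", "r6"], healthy_regions(checks))
print(one_down)
```

Because both regions are already running and fully scaled up, losing one region only changes where requests are sent. Nothing has to be started or scaled during the outage.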

Licensing

For organizations using UBP, this configuration distributes Mule messages and Data throughput between the two regions. It consumes Mule flows based on the number of running replicas per application deployment.
For organizations not using UBP, this configuration consumes additional vCores.

Use Case 2: Warm Standby Configuration

When to Use

Medium-impact stateless or stateful APIs that tolerate brief downtime during failover.

Application State

Set up the infrastructure across multiple regions. In the passive region, you can deploy your backup applications either in a fully scaled-up state or in a scaled-down state, with fewer replicas or smaller replica sizes. Scaled-down applications increase reliance on the control plane to redeploy and scale up during an outage in the primary runtime plane region.
If CPU-based Horizontal Pod Autoscaling (HPA) is available to your organization, you can configure it where applicable. For more information, see Configuring Horizontal Autoscaling (HPA) for CloudHub 2.0 Deployments.

Traffic Management

Use a BYO global load balancer to route traffic to only one region at a time.
Perform health checks to determine DR and switch routing.
For non-HTTP use cases, such as Schedulers, externalize a DR flag in your application and use a Groovy script to enable and disable scheduler flows directly in the runtime plane. This reduces reliance on the control plane to trigger your backup applications via a cold standby approach.
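The decision logic behind routing to one region at a time can be sketched as follows. This Python example models only the failover decision and a derived DR flag. It is not the Groovy script mentioned above, and the class name, the failure threshold, and the region names are assumptions made for illustration.

```python
class ActivePassiveRouter:
    """Send traffic to one region at a time and fail over after repeated failures."""

    def __init__(self, primary: str, backup: str, failure_threshold: int = 3):
        self.primary = primary
        self.backup = backup
        self.failure_threshold = failure_threshold
        self.active = primary
        self._consecutive_failures = 0

    def record_health_check(self, primary_healthy: bool) -> str:
        """Record one health check of the primary region and return the active region."""
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                # Failback is left as a deliberate manual step in this sketch.
                self.active = self.backup
        return self.active

    def dr_flag(self) -> bool:
        """True once traffic has moved to the backup region."""
        return self.active == self.backup


router = ActivePassiveRouter("region-a", "region-b")  # hypothetical regions
checks = [True, False, False, True, False, False, False, True]
history = [router.record_health_check(ok) for ok in checks]
print(history)
print(router.dr_flag())  # True
```

Requiring several consecutive failures avoids switching regions on a single failed probe. In this run, the two early failures are reset by a healthy check, and the router switches only after three failures in a row.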

Licensing

For organizations using UBP, this configuration doesn’t consume messages or throughput. The configuration consumes fewer flows if you configure the application with CPU-based HPA.
For organizations not using UBP, this configuration consumes fewer vCores if the backup deployment is in a scaled-down state before it scales up in response to an outage in the primary region.

Recovery Time Objective

If the backup application is in a fully scaled-up state, you experience less disruption in the event of an outage. If the backup application is in a scaled-down state, the RTO depends on the time it takes to scale the application up, either through control-plane-triggered redeployments or through CPU-based HPA.

To allow integrating systems to resend events or data from in-flight transactions, capture recovery points in logs and monitoring. This builds reliability into your applications.

Use Case 3: Cold Standby Configuration

When to Use

Low-impact stateless or stateful applications that tolerate longer downtime during recovery.

Application State

Set up the infrastructure, but configure your application to a stopped state in the passive region.
Start the application only if you detect an outage.

Traffic Management

Use a BYO global load balancer to route traffic only to the region that is up.
Perform health checks to determine DR, and use scripts to start your backup applications and switch traffic routing.
If the control plane is unavailable, you can’t start the backup application in the runtime plane.
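The cold standby sequence above can be outlined as a small orchestration function. In this Python sketch, every callable passed in is a hypothetical placeholder for your own health check, control plane call, and load balancer update. No real Anypoint Platform API is shown, and the application names are invented for the example.

```python
from typing import Callable, List


def run_cold_standby_failover(primary_healthy: Callable[[], bool],
                              control_plane_available: Callable[[], bool],
                              start_backup_app: Callable[[str], None],
                              switch_traffic_to_backup: Callable[[], None],
                              backup_apps: List[str]) -> str:
    """Start stopped backup applications and reroute traffic during an outage."""
    if primary_healthy():
        return "primary-healthy"
    if not control_plane_available():
        # Stopped applications can only be started through the control plane.
        return "blocked-control-plane-unavailable"
    for app in backup_apps:
        start_backup_app(app)
    # Switch traffic only after the start requests have been issued.
    switch_traffic_to_backup()
    return "failed-over"


# Simulated outage with the control plane reachable.
started = []
events = []
outcome = run_cold_standby_failover(
    primary_healthy=lambda: False,
    control_plane_available=lambda: True,
    start_backup_app=started.append,
    switch_traffic_to_backup=lambda: events.append("traffic-switched"),
    backup_apps=["orders-api-dr", "billing-api-dr"],  # hypothetical app names
)
print(outcome, started, events)

# Same outage, but the control plane is also unavailable.
blocked = run_cold_standby_failover(
    primary_healthy=lambda: False,
    control_plane_available=lambda: False,
    start_backup_app=started.append,
    switch_traffic_to_backup=lambda: events.append("traffic-switched"),
    backup_apps=["orders-api-dr"],
)
print(blocked)
```

The second call shows the main limitation of this configuration: when the control plane is unavailable, the backup applications stay stopped and no traffic is switched.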

Licensing

Because the backup application is in a stopped state, this configuration doesn’t consume additional vCores, Mule flows, Mule messages, or Data throughput until the backup application starts in the event of a disaster.

Recovery Time Objective

The RTO depends on the time it takes to start applications via the control plane.

Runtime Plane High Availability and Disaster Recovery Summary

High-Impact Stateless API

Use an active-active setup with active replicas in two regions and an external load balancer.

High-Impact Stateful API

Use an external load balancer to route traffic, and perform health checks to determine DR and switch traffic.

Medium-Impact Stateless and Stateful API

Use a cold or warm standby configuration, with applications deployed to both regions.
Stop or scale down the applications or replicas in the secondary region.