CloudHub High Availability and Disaster Recovery

CloudHub provides high availability (HA) and disaster recovery for application and hardware failures.

CloudHub uses Amazon Web Services (AWS) for its cloud infrastructure, so its availability depends on AWS. CloudHub runs deployments in different regions that map to AWS regions. If an AWS region goes down, applications in that region become unavailable. CloudHub doesn’t replicate them to other regions.

For example, when the US East region is down, the CloudHub management UI and the REST services that enable deployments, which are part of the control plane, are unavailable until the region recovers. You can’t deploy new applications while US East is down.

While the control plane is unavailable, the runtime plane continues to send log data and other telemetry data. The worker buffers up to 1 GB of data until the control plane recovers.

CloudHub provides persistent queues for message reliability. Within a region, persistent queues are highly available. When the region or part of it is down, usually for a few seconds or minutes, the queues become inaccessible and you can lose data. When the region recovers, CloudHub resumes communication with the queues.

Some CloudHub modules, such as Anypoint Object Store v1, application settings, and Insight-related information, reside in the US East region for all applications. Anypoint Object Store v2 resides in the same region as the deployed application. For both Object Store v1 and v2, when a region is down, data persists and becomes available again when the region returns to service.

Anypoint Virtual Private Cloud (Anypoint VPC) applies at the region level. When a region is down, that region’s VPC is down unless you’ve set up a VPC instance in another region.

High Availability Versus Disaster Recovery

High availability (HA) is the measure of a system’s ability to remain accessible despite a system component failure. You generally implement HA by building multiple levels of fault tolerance or load balancing into a system. In CloudHub, you can achieve high availability by deploying your application with multiple workers and enabling persistent queues where appropriate.

Disaster recovery (DR) refers to the process of restoring a system to an acceptable previous state after a natural or man-made disaster, such as flooding, fires, power failures, server failures, or misconfigurations.

Both HA and DR increase availability, but they work differently: HA keeps the service up, and DR preserves data. With HA, you typically see no loss of service. With DR, you usually see a brief loss of service while the DR plan runs and the system is restored.

These terms help you plan HA and DR on CloudHub:

Recovery Time Objective (RTO): The maximum downtime a business tolerates. RTO is the target time within which the system must recover after a disruption.
Recovery Point Objective (RPO): The maximum acceptable data loss after a disaster. RPO drives how often you back up data.
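
As a rough illustration of how these two objectives turn into checks, the following Python sketch compares a backup interval against an RPO and a rehearsed recovery time against an RTO. The function names and the sample figures are hypothetical and aren’t CloudHub settings.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the disaster strikes just before the next backup runs."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(measured_recovery: timedelta, rto: timedelta) -> bool:
    """True when a measured or rehearsed recovery time stays within the RTO."""
    return measured_recovery <= rto


# Hypothetical targets: lose at most 1 hour of data, be back within 30 minutes.
rpo = timedelta(hours=1)
rto = timedelta(minutes=30)
print(meets_rpo(timedelta(minutes=15), rpo))  # backups every 15 minutes
print(meets_rto(timedelta(minutes=45), rto))  # last DR drill took 45 minutes
```

In this sketch, a shorter backup interval is the only lever for RPO, while RTO is checked against a recovery time you have actually measured, for example in a DR drill.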

Anypoint CloudHub Default Deployment Model

If the application uses multiple workers, CloudHub deploys the workers in separate availability zones by default, providing HA across availability zones. The distance between the availability zones is variable and generally doesn’t exceed 350 miles.

If an application uses a single worker and that availability zone goes down, CloudHub restarts the application in a different availability zone. The application can experience downtime during the restart.

You can subscribe to notifications at status.mulesoft.com to receive alerts when a failure occurs in an availability zone or region.

Shared Responsibility for Disaster Recovery

MuleSoft manages the CloudHub control plane and worker infrastructure within each region. You’re responsible for cross-region strategy, application-level failover, and data synchronization. This table lists who does what for disaster recovery on CloudHub.

Party	Responsibility
MuleSoft	Control plane availability: Anypoint Platform UI, deployment APIs, and platform services within the provisioned region.
MuleSoft	Infrastructure patching, security updates, and maintenance of the worker cloud.
MuleSoft	Multi-AZ worker distribution: when you use multiple workers, CloudHub deploys them across two or more availability zones within the same region.
MuleSoft	Automatic restart of applications in a different availability zone when one worker or AZ fails.
You	Define and implement a cross-region DR strategy for primary and backup regions.
You	Decide when to trigger regional failover—for example, based on health checks or business criteria.
You	Configure Global Server Load Balancing (GSLB) or a dedicated load balancer (DLB) and routing rules to direct traffic to a backup region during a disaster.
You	Implement application-level failover strategy: deploy and maintain applications in more than one region when you need cross-region DR.
You	Replicate and back up external data stores such as databases, object stores, and other systems that your applications use across regions.
You	Set up Anypoint VPC in each region when you need network connectivity there for DR.

Your Responsibilities for Disaster Recovery

If your organization needs cross-region DR, design and operate your applications for it. MuleSoft doesn’t automatically replicate applications or fail over traffic to another region. You’re responsible for:

Regional failover strategy: Decide when to switch traffic to a backup region—for example, after a region outage or based on health checks.
Traffic management: Use a load balancer, cloud-based or on-premises, such as a Dedicated Load Balancer (DLB) or external GSLB, to route traffic to applications in different regions and to switch to the backup region as part of your DR plan.
Application deployment: Deploy the same or equivalent applications in a backup region and keep them in sync with configuration and code.
Data and state: Replicate or back up external data stores such as databases, caches, and object stores that your integrations use so applications in a DR region can access the data they need. Anypoint Object Store v1 and v2 are regional; they don’t provide cross-region failover.
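
One way to make the "when to switch" decision concrete is to require several consecutive failed health checks before triggering failover, so that a single transient error doesn’t move traffic. The sketch below is a minimal illustration of that idea. The class name, the threshold of 3, and the way results are fed in are all assumptions, not a CloudHub or load balancer API.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    """Trigger failover after N consecutive failed health checks (hypothetical policy)."""

    failure_threshold: int = 3      # assumed value; tune it to your RTO
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True when failover should trigger."""
        if healthy:
            # Any successful check resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


policy = FailoverPolicy(failure_threshold=3)
for result in [False, False, True, False, False, False]:
    if policy.record(result):
        print("trigger failover to backup region")
```

A higher threshold reduces false failovers but adds detection time, which counts against your RTO.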

For guidance on designing HA and DR topologies, including active-active, warm standby, and cold standby, see High Availability and Disaster Recovery.

Restoring After a Disaster

Restoration depends on the DR strategy you put in place. In general, after you confirm the primary region or application is unavailable:

1. Switch traffic to the backup region.

   Use your load balancer, such as a GSLB or Dedicated Load Balancer (DLB), to route traffic to the backup region. The health checks you configured earlier mark the primary as unhealthy and direct traffic to the backup endpoints.

2. Bring backup applications online when you use cold or warm standby.

   If the control plane is available, use Anypoint Runtime Manager or the CloudHub API to start the backup application or scale it up. If the control plane is in the same region as the failed primary, it is also unavailable. You can’t start or scale apps until the control plane recovers, unless you use automation that doesn’t depend on the control plane.

3. Verify that the backup region is serving traffic and that dependent systems use the correct endpoints or data stores.

4. When the primary region recovers, optionally fail back by switching traffic from the backup region back to the primary and resyncing data when needed.
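
The steps above can be sketched as a small runbook function. This is only an outline under stated assumptions: the three callables stand in for whatever your load balancer and deployment tooling actually expose, and the function name, the `standby_mode` values, and the log messages are hypothetical. It doesn’t call any Anypoint or CloudHub API, and it doesn’t cover fail-back.

```python
from typing import Callable, List


def run_failover(
    switch_traffic: Callable[[], None],
    start_backup_apps: Callable[[], None],
    verify_backup: Callable[[], bool],
    standby_mode: str,
    control_plane_available: bool,
) -> List[str]:
    """Run the failover steps in order and return a log of what happened.

    standby_mode is "cold" or "warm" when the backup applications must be
    started or scaled up; any other value means they are already running.
    """
    log: List[str] = []

    # Step 1: route traffic to the backup region.
    switch_traffic()
    log.append("traffic switched to backup region")

    # Step 2: cold and warm standby need the backup apps started or scaled up.
    if standby_mode in ("cold", "warm"):
        if not control_plane_available:
            # Simplification: without the control plane this sketch stops here.
            log.append("control plane unavailable; backup apps not started")
            return log
        start_backup_apps()
        log.append("backup applications online")

    # Step 3: confirm the backup region is actually serving traffic.
    if verify_backup():
        log.append("backup region verified")
    else:
        log.append("verification failed")
    return log
```

Passing the actions in as callables keeps the ordering logic testable without touching real infrastructure, which is useful when you rehearse the plan.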

Your RTO depends on how quickly you complete these steps and, for cold or warm standby, on how long it takes to start or scale the backup application. For active-active setups, traffic continues on the remaining region without a switch. For more on recovery types and topologies, see High Availability and Disaster Recovery and CloudHub 2.0 High Availability and Disaster Recovery.

Suggested Alternative Deployment Model

You can use a cloud-based or on-premises load balancer for applications deployed to different regions to improve your disaster recovery strategy. Configure your load balancer to perform health checks and to route traffic to your backup region when the primary region goes down. For CloudHub-specific load balancing options, see CloudHub Load Balancers and Dedicated Load Balancers.
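
The routing rule described here, prefer the primary region and fall back when its health check fails, can be sketched in a few lines of Python. This is an illustration of the decision logic only, not a load balancer configuration. The endpoint URLs, the `/health` path, and the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(url: str, timeout_seconds: float = 2.0) -> bool:
    """Return True when the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and non-2xx HTTP errors count as unhealthy.
        return False


def choose_endpoint(
    endpoints_in_priority_order: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint, or None when every endpoint is down."""
    for endpoint in endpoints_in_priority_order:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical health-check URLs for a primary and a backup deployment.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://backup.example.com/health",
]
# target = choose_endpoint(ENDPOINTS, http_probe)  # performs real HTTP requests
```

The probe is passed in as a parameter so that the selection logic can be exercised with a stub instead of live requests.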

Keep Integrations Stateless

Keep integrations stateless. Don’t share transactional information between client invocations or scheduled runs. When the middleware needs to keep data because of a system limitation, store it in an external store such as a database or messaging queue, not in the middleware infrastructure or memory.

As you scale, especially in the cloud, keep each worker’s state and resources independent of other workers. This model gives you better performance, scalability, and reliability.
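
To illustrate the idea, the following Python sketch keeps the only piece of state a scheduled job needs, a "last processed" watermark, in an external store rather than in worker memory. The `ExternalStore` interface, the key name, and the in-memory stand-in are assumptions made for the example. In practice the store would be a database or another external system.

```python
from typing import Dict, List, Optional, Protocol


class ExternalStore(Protocol):
    """Minimal key-value interface that an external store is assumed to offer."""

    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...


class InMemoryStandIn:
    """Stand-in for an external store, for demonstration and tests only."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


def process_batch(store: ExternalStore, record_ids: List[str]) -> List[str]:
    """Process only records newer than the watermark held in the external store."""
    watermark = store.get("last_processed_id") or ""
    new_ids = sorted(r for r in record_ids if r > watermark)
    if new_ids:
        # Persist progress externally so any worker can pick up from here.
        store.put("last_processed_id", new_ids[-1])
    return new_ids


store = InMemoryStandIn()
print(process_batch(store, ["001", "002"]))  # first run, e.g. on worker A
print(process_batch(store, ["002", "003"]))  # next run, possibly on worker B
```

Because `process_batch` holds nothing between calls, it doesn’t matter which worker, or which region, runs the next invocation, as long as it can reach the same store.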
