
Scalability Benchmarks

Runtime Fabric is Kubernetes (K8s) based software and works the same way in all K8s environments that comply with the Runtime Fabric hardware and software requirements. The following scale limits were benchmarked on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

| Limitation | Description | Enforced by Runtime Fabric |
| --- | --- | --- |
| Deployments per Runtime Fabric | For optimal performance, the number of deployments managed by Runtime Fabric at any given time should not exceed 8000. | Recommended |
| Horizontal Pod Autoscaling (HPA) | With Runtime Fabric on Amazon EKS, occasional traffic failures of approximately 0.09% may occur when scaling replicas from 1 to 8. These errors are specific to the pod scaling window, so you may experience traffic failures during scaling operations. | |
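To put the benchmarked HPA failure rate in perspective, the following minimal sketch converts the approximate 0.09% figure into an expected number of failed requests. The function name and the request count are hypothetical, and the calculation assumes the rate applies uniformly to all traffic sent during the scaling window, which the benchmark does not state.

```python
def expected_failures(total_requests: int, failure_rate_percent: float = 0.09) -> float:
    """Estimate failed requests during a scaling window.

    failure_rate_percent defaults to the approximate rate benchmarked
    when scaling replicas from 1 to 8 on Amazon EKS.
    """
    return total_requests * failure_rate_percent / 100.0


if __name__ == "__main__":
    # Hypothetical load: 100,000 requests sent while the pods are scaling.
    requests_during_scaling = 100_000
    failures = expected_failures(requests_during_scaling)
    print(f"Expected failures: about {failures:.0f} of {requests_during_scaling}")
```

An estimate like this can help decide whether clients need retry logic around scaling events.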

Runtime Fabric functionality and management of application deployments is not affected by the following:

  • Size of individual applications

  • Size of worker nodes

  • Number of pods running per worker node

Because Runtime Fabric is K8s based and uses cloud-native orchestration principles, it works according to the specified configuration of deployments.

Review the following benchmarked metrics to see how management operations compare between a nearly empty cluster and a cluster running 8000 deployments. The benchmarks used Runtime Fabric on an Amazon EKS cluster with the following configuration:

| Configuration | Description |
| --- | --- |
| VPC and node groups | Default configuration created by eksctl. 3 subnets (one subnet per node group), NAT Gateway. |
| AWS NLB | Provisioned through the Helm chart when NGINX is created. |
| Node Auto Scaler | Recommended 3 GB (Memory)/200m vCore (request == limit) |
| Workers | m5.4xlarge, worker node group |
| Monitoring | Fixed, 1 node, Prometheus, monitoring node group, m5.2xlarge |
| Ingress | Fixed, 2 nodes, NGINX, ingress node group, m5.4xlarge |

The following graphic shows how long the Runtime Fabric agent takes to restart when the restart is triggered by the customer or by a cluster upgrade:

Diagram that shows restart latency between near empty cluster and loaded cluster

Runtime Fabric takes approximately 190 seconds to create apps, irrespective of deployment volume. The following table shows how long it takes to start, stop, or delete a Mule app:

| App Operations | Near Empty Cluster | Loaded Cluster |
| --- | --- | --- |
| Start Latency | ~110 seconds | 158 seconds |
| Stop Latency | ~10 seconds | 20 seconds |
| Delete Latency | ~9 seconds | ~28 seconds |
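As a worked comparison of the app operation latencies above, the following sketch computes how much slower each operation is on the loaded cluster than on the near empty cluster. The latency values come from the table. The dictionary and function names are hypothetical, and the approximate (~) figures are treated as exact for the arithmetic.

```python
# Benchmarked latencies in seconds: (near empty cluster, loaded cluster).
LATENCY_SECONDS = {
    "start": (110, 158),
    "stop": (10, 20),
    "delete": (9, 28),
}


def slowdown(operation: str) -> float:
    """Return the loaded-cluster latency divided by the near-empty latency."""
    near_empty, loaded = LATENCY_SECONDS[operation]
    return loaded / near_empty


if __name__ == "__main__":
    for operation in LATENCY_SECONDS:
        print(f"{operation}: {slowdown(operation):.2f}x on the loaded cluster")
```

Under these assumptions, delete operations slow down the most in relative terms, while start operations add the most absolute time.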

The following table shows how long node cordon and drain operations take if there is a bad node:

| Node Operation | Empty Cluster | Loaded Cluster |
| --- | --- | --- |
| Cordon | ~2 seconds | ~2 seconds |
| Drain process | ~4 seconds | ~35 seconds |

Aspects to consider:

  • The Runtime Fabric agent requires more than the default memory and CPU to manage clusters with large-scale deployments. You can modify these values using the information in the following table. For a cluster with 8000 deployments, use at least:

    | Cluster Recommendations | Default | 200 Node/8000 App Resource Settings |
    | --- | --- | --- |
    | Agent Resources | Memory: 200 MB request, 500 MB limit. CPU: 100m request, 1 vCore limit. | 3 GB (Memory)/2 vCore (request == limit) |

  • In Helm, you must explicitly disable the node watcher (nodeWatcherEnabled: false) when you operate at a large scale. The node watcher is enabled by default (nodeWatcherEnabled: true).

  • The performance of large-scale deployment orchestration through Runtime Fabric is inversely proportional to the volume of deployments.

  • RTFCTL clusters are supported only for certain benchmarks.

  • Using Helm enables you to manage any cluster that requires customized resources for Runtime Fabric agent due to the size of the cluster.

  • When you scale up the Runtime Fabric agent resource limit, consider increasing the request accordingly.

  • Avoid running a container unbounded by removing the limit completely. For OpenShift operators, removing the limit sets the agent to the default limit setting.
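The sizing guidance above can be sketched as a small script that builds a Kubernetes-style resources block with request equal to limit and rejects an unbounded configuration. This is an illustration only: the `agent.resources` key path, the top-level placement of `nodeWatcherEnabled`, and the use of `3Gi` to express 3 GB are assumptions, not the documented Helm values layout, so check the chart's own values file before applying anything.

```python
import json


def agent_resources(memory: str = "3Gi", cpu: str = "2") -> dict:
    """Build a Kubernetes-style resources block with request == limit.

    Defaults follow the 8000-deployment recommendation; "3Gi" is an
    assumed spelling of the documented 3 GB.
    """
    sizing = {"memory": memory, "cpu": cpu}
    return {"requests": dict(sizing), "limits": dict(sizing)}


def is_bounded(resources: dict) -> bool:
    """Return True only when both CPU and memory limits are set."""
    limits = resources.get("limits") or {}
    return all(limits.get(key) for key in ("cpu", "memory"))


# Hypothetical overrides structure; the real chart's key names may differ.
overrides = {
    "nodeWatcherEnabled": False,  # setting name taken from the text above
    "agent": {"resources": agent_resources()},  # assumed key path
}

if __name__ == "__main__":
    if not is_bounded(overrides["agent"]["resources"]):
        raise SystemExit("Refusing to emit an unbounded agent configuration")
    print(json.dumps(overrides, indent=2))
```

Generating the block from a single sizing value keeps the request and limit in step when either is scaled up.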