Rate Limiting: SLA-Based Policy

Policy name

Rate-limiting SLA


Monitors access to an API by defining the maximum number of requests processed within a timespan, based on SLAs


Quality of Service

First Mule version available


First Flex version available


Returned Status Codes

400 - Invalid client credentials provided or quota exceeded by WSDL APIs that use SOAP v1.2 (Mule only)

401 - Request blocked due to invalid client credentials (client ID or client secret) provided

429 - Quota exceeded; requests blocked until the current window finishes

500 - Invalid client credentials provided or quota exceeded by WSDL APIs that use SOAP v1.1 (Mule only)


The Rate-Limiting Service Level Agreement (SLA) policy enables you to control incoming traffic to an API by limiting the number of requests that the API can receive within a given timespan. When the limit is reached before the time expires, the policy rejects all requests, thereby avoiding any additional load on the backend API.

To apply the Rate-Limiting SLA policy to an API, you must first create a contract between the API and a registered client application. The number of requests that an API can receive within a given time is defined in the contracts section in API Manager.

Each request must be identified by a client ID and an optional client secret (depending on the policy configuration). To review how to obtain the client credentials of a registered client application, see Obtaining Client Credentials of a Registered Application.

Additionally, you can configure the policy to run in a Mule runtime engine (Mule) cluster. In a clusterized policy configuration, the quota is shared among all nodes in the cluster. For more information about how these options work, see Examples.

A Rate-Limiting SLA policy applied to a Flex Gateway API is scoped to replicas, not the gateway.

Configuring Policy Parameters

Flex Gateway Local Mode

The Rate-Limiting SLA policy is not supported in Local Mode.

Flex Gateway Connected Mode and Mule Gateway

When you apply the policy to your API from the UI, a list of parameters are displayed based on whether your environment includes Mule, Flex or non-Mule applications.

Configuring Policy Parameters

The following parameters are displayed:

Parameter Description Example

Client ID Expression

A DataWeave expression that resolves the Client Application’s ID for a Contract of the API

Only one rate-limiting algorithm is created per ID: #[attributes.headers['client_id']] This example looks for an HTTP header called ‘client_id’ and uses its value.

Client Secret Expression

A DataWeave expression that resolves the Client application’s client secret for a contract of the API

This is an optional value. Example: #[attributes.headers['client_secret']]


When using interconnected runtimes with this flag enabled, quota will be shared among all nodes

Expose Headers

Distributed Rate Limit SLA Resource Configuration Example

Distributed Rate-Limiting requires Shared Storage to be enabled. For more information about Shared Storage, refer to Configuring Shared Storage for Flex Gateway in Connected Mode.

Configuring Policy Parameters for Anypoint Service Mesh (Non-Mule Applications)

The Rate-Limiting SLA policy works in the same way for Anypoint Service Mesh (non-Mule applications), excluding the following differences:

  • The policy does not offer the option to expose headers

  • The client ID and client secret retrieving parameters differ, as explained in the following table.

Element Description Required

Credentials Origin

Origin of the client ID and client secret credentials:

  • client_id and client_secret headers

  • Custom headers


Client ID Header

The header name from where to extract the client ID in API requests

Yes, if you set the credentials origin to Custom Headers

Client Secret Header

The header name from where to extract the client secret in API requests


How This Policy Works

The Rate-Limiting SLA policy monitors the number of requests made in the current window (the available quota), allowing the requests to reach the backend only if the available quota is greater than zero.

Because each client defines a separate available quota for their window, the client application must define an SLA with the API (a contract). Therefore, to verify whether the request is within the SLA limit, you must define a way to obtain the client ID from the request, and optionally the client Secret.

After a contract is created between a client application and an API, API gateway automatically manages these contracts by monitoring your API Manager configurations. Additionally, API gateway implements high-availability strategies in case of unexpected downtime in the Anypoint Platform management plane.

To understand how the Rate-Limiting SLA policy works, consider an example in which the configuration of an SLA of 3 requests every 10 seconds for the client with ID “ID#1” allows or restricts the request, based on the quota available in that window:

Rate-Limiting SLA Policy

In the example:

  • Requests of client with ID “ID#1”:

In the first window, because the quota is reached with the third request, all subsequent requests are rejected until the window closes. In the second window, only two of the three requests are processed. Because the time elapses in this window, the remaining quota is unused.

An accepted request passes through the API to the backend. Conversely, a rejected request displays a 429 status for HTTP (or 400, or 500 if the API is WSDL) and does not reach the backend.

  • Requests of Client with ID “ID#2”:

Because the client has no contract defined for the API, every request is rejected , and therefore no request is allowed.

An API might have several contracts, each with its own SLA. The Rate-Limiting SLA policy independently monitors the quota (and windows) available for each client by creating one rate-limiting algorithm per contract. With the first request to the API, algorithms are created using the lazy creation strategy.


Consider how the same contract of 3 requests every 10 seconds for client ID#1 works when the configuration is clusterized (Mule only).

Consider an example of a window that is reset exactly at 12:00:00, is 10 seconds long, and operates in a 2-node Mule cluster. Both nodes start and end their windows at the same time, and the cluster allows 3 requests per window for client ID#1:

Rate-Limiting SLA Clusterizalbe Configuration

In the example:

  • Requests of client ID#1:

Because the policy is clusterized, the whole cluster accepts three requests. If clusterization of the policy is disabled, the Mule cluster can accept six requests per window: three per node.

  • Requests of client ID#2:

Because there is no contract defined for this client in this specific API, every request is rejected.

Distributed counters impact performance due to the need for synchronization between the nodes. The policy uses caching mechanisms to predict the behavior and maximize throughput. However, in a worst case scenario, you can expect higher latency. Therefore, whether you use the clusterizable configuration must depend solely on your use case.

Configuring the Policy

When you configure your Rate-Limiting SLA policy, you must consider certain aspects of your environment to help you derive the most value from the policy.

Choosing Between a Cluster or a Standalone Configuration

If you use only one Client application as an authorization mechanism in your server, the same recommendations as in the Rate-Limiting policy applies to your use case. You might have several Mule instances and you might map different client applications to different servers in the load balancer.

In such a scenario, you do not need to use a Mule cluster. Each standalone node creates a rate-limiting algorithm for the clients they exclusively serve. Conversely, you might want to achieve high availability for your servers and have several client applications configured.

Mule clusters count the SLA quota for each client throughout the cluster and thus are a best fit for this use case. The Rate-Limiting SLA policy is designed to work under both balanced and unbalanced workloads. Because the backend does not receive any extra requests, its maximum capacity is not exceeded.

Choosing Window Sizes for Cluster Nodes

In a cluster configuration, the nodes must share information across the cluster for consistency. The sharing process adds latency that must be taken into account when reviewing performance.

In a worst case scenario, the number of penalized requests with latency due to cluster consistency is constant and independent from the actual size of the configured quota. Consequently, the smaller the window, the greater the percentage of potentially delayed requests.

Therefore, you must set up window sizes greater than one minute in Rate Limiting and Rate-Limiting SLA policy configurations only for a cluster scenario.

Choosing Persistence for Your Rate-Limiting SLA Policy (Mule only)

You can configure your Rate-Limiting SLA policy to use windows that persist as long as days, months, and years. For example, suppose you want to allow your user to consume 1 million requests per year, but you cannot ensure that the node will be up the entire period or will need maintenance, which may result in restarting the Mule runtime engine.

The algorithm has been running for several months, so the client will lose critical information. Persistence solves this problem by periodically saving the current policy state. In case of a redeployment or a restart, the algorithms are recreated from the last known persisted state or started from a clean state.

Although persistence is enabled by default, you can turn it off by setting the following property to false:


You can also tweak the persistence frequency rate, which has a default of 10 seconds:


Persistence is not available on CloudHub.


When does the window start?

The window starts with the first request after the policy is successfully applied.

What type of window does the algorithm use?

It uses a fixed window.

What happens when the quota is exhausted?

The algorithm is created on demand, when the first request is received. This event fixes the time window. Each request consumes the request quota from the current window until the time expires.

When the request quota is exhausted, the Rate-Limiting SLA policy rejects the request. When the time window closes, the request quota is reset and a new window of the same fixed size starts.

What happens if I define multiple limits within an SLA ?

The policy creates one algorithm for each Limit with the request quota per time window configuration. Therefore, when multiple limits are configured, every algorithm must have its own available quota within the current window for the request to be accepted.

What does each response header mean?

Each response header has information about the current state of the request:

  • X-Ratelimit-Remaining: The amount of available quota

  • X-Ratelimit-Limit: The maximum available requests per window

  • X-Ratelimit-Reset: The remaining time, in milliseconds, until a new window starts

By default, the X-RateLimit headers are disabled in the response. You can enable these headers by selecting Expose Headers when you configure the policy.

Can I configure a Mule Cluster in CloudHub?

No, the feature is available only for Runtime Fabric, hybrid and standalone Mule setups.

When should I use Rate Limiting instead of Rate-Limiting SLA or Spike Control?

Use Rate Limiting and Rate-Limiting SLA policies for accountability and to enforce a hard limit to a group (using the identifier in Rate Limiting) or to a client application (using Rate Limiting-SLA). If you want to protect your backend, use the Spike Control policy instead.

Was this article helpful? Thanks for your feedback!