Rate-limiting SLA
Rate Limiting: SLA-Based Policy
Policy name |
|
Summary |
Monitors access to an API by defining the maximum number of requests processed within a timespan, based on SLAs |
Category |
Quality of Service |
First Mule version available |
v4.1.0 |
Returned Status Codes |
400 - Invalid client credentials provided or quota exceeded by WSDL APIs that use SOAP v1.2 (Mule only) |
401 - Request blocked due to invalid client credentials (client ID or client secret) provided |
|
429 - Quota exceeded; requests blocked until the current window finishes |
|
500 - Invalid client credentials provided or quota exceeded by WSDL APIs that use SOAP v1.1 (Mule only) |
Summary
The Rate-Limiting Service Level Agreement (SLA) policy enables you to control incoming traffic to an API by limiting the number of requests that the API can receive within a given timespan. When the limit is reached before the time expires, the policy rejects all requests, thereby avoiding any additional load on the backend API.
To apply the Rate-Limiting SLA policy to an API, you must first create a contract between the API and a registered client application. The number of requests that an API can receive within a given time is defined in the contracts section in API Manager.
Each request must be identified by a client ID and an optional client secret (depending on the policy configuration). To review how to obtain the client credentials of a registered client application, see Obtaining Client Credentials of a Registered Application.
Additionally, you can configure the policy to run in a Mule runtime engine (Mule) cluster. In a clusterized policy configuration, the quota is shared among all nodes in the cluster. For more information about how these options work, see Examples.
Configuring Policy Parameters
Mule Gateway
When you apply the policy to your API from the UI, a list of parameters are displayed based on whether your environment includes Mule or non-Mule applications.
Configuring Policy Parameters
The following parameters are displayed:
Parameter | Description | Example |
---|---|---|
Client ID Expression |
A DataWeave expression that resolves the Client Application’s ID for a Contract of the API |
Only one rate-limiting algorithm is created per ID:
|
Client Secret Expression |
A DataWeave expression that resolves the Client application’s client secret for a contract of the API |
This is an optional value.
Example: |
Distributed |
When using interconnected runtimes with this flag enabled, quota will be shared among all nodes |
Expose Headers |
Configuring Policy Parameters for Anypoint Service Mesh (Non-Mule Applications)
The Rate-Limiting SLA policy works in the same way for Anypoint Service Mesh (non-Mule applications), excluding the following differences:
-
The policy does not offer the option to expose headers
-
The client ID and client secret retrieving parameters differ, as explained in the following table.
Element | Description | Required |
---|---|---|
Credentials Origin |
Origin of the client ID and client secret credentials:
|
Yes |
Client ID Header |
The header name from where to extract the client ID in API requests |
Yes, if you set the credentials origin to Custom Headers |
Client Secret Header |
The header name from where to extract the client secret in API requests |
No |
How This Policy Works
The Rate-Limiting SLA policy monitors the number of requests made in the current window (the available quota), allowing the requests to reach the backend only if the available quota is greater than zero.
Because each client defines a separate available quota for their window, the client application must define an SLA with the API (a contract). Therefore, to verify whether the request is within the SLA limit, you must define a way to obtain the client ID from the request, and optionally the client Secret.
After a contract is created between a client application and an API, API gateway automatically manages these contracts by monitoring your API Manager configurations. Additionally, API gateway implements high-availability strategies in case of unexpected downtime in the Anypoint Platform management plane.
To understand how the Rate-Limiting SLA policy works, consider an example in which the configuration of an SLA of 3 requests every 10 seconds for the client with ID “ID#1” allows or restricts the request, based on the quota available in that window:
In the example:
-
Requests of client with ID “ID#1”:
In the first window, because the quota is reached with the third request, all subsequent requests are rejected until the window closes. In the second window, only two of the three requests are processed. Because the time elapses in this window, the remaining quota is unused.
An accepted request passes through the API to the backend. Conversely, a rejected request displays a 429 status for HTTP
(or 400, or 500 if the API is WSDL) and does not reach the backend.
-
Requests of Client with ID “ID#2”:
Because the client has no contract defined for the API, every request is rejected , and therefore no request is allowed.
An API might have several contracts, each with its own SLA. The Rate-Limiting SLA policy independently monitors the quota (and windows) available for each client by creating one rate-limiting algorithm per contract. With the first request to the API, algorithms are created using the lazy creation strategy.
Examples
Consider how the same contract of 3 requests every 10 seconds for client ID#1 works when the configuration is clusterized (Mule only).
Consider an example of a window that is reset exactly at 12:00:00, is 10 seconds long, and operates in a 2-node Mule cluster. Both nodes start and end their windows at the same time, and the cluster allows 3 requests per window for client ID#1:
In the example:
-
Requests of client ID#1:
Because the policy is clusterized, the whole cluster accepts three requests. If clusterization of the policy is disabled, the Mule cluster can accept six requests per window: three per node.
-
Requests of client ID#2:
Because there is no contract defined for this client in this specific API, every request is rejected.
Distributed counters impact performance due to the need for synchronization between the nodes. The policy uses caching mechanisms to predict the behavior and maximize throughput. However, in a worst case scenario, you can expect higher latency. Therefore, whether you use the clusterizable configuration must depend solely on your use case.
Configuring the Policy
When you configure your Rate-Limiting SLA policy, you must consider certain aspects of your environment to help you derive the most value from the policy.
Choosing Between a Cluster or a Standalone Configuration
If you use only one Client application as an authorization mechanism in your server, the same recommendations as in the Rate-Limiting policy applies to your use case. You might have several Mule instances and you might map different client applications to different servers in the load balancer.
In such a scenario, you do not need to use a Mule cluster. Each standalone node creates a rate-limiting algorithm for the clients they exclusively serve. Conversely, you might want to achieve high availability for your servers and have several client applications configured.
Mule clusters count the SLA quota for each client throughout the cluster and thus are a best fit for this use case. The Rate-Limiting SLA policy is designed to work under both balanced and unbalanced workloads. Because the backend does not receive any extra requests, its maximum capacity is not exceeded.
Choosing Window Sizes for Cluster Nodes
In a cluster configuration, the nodes must share information across the cluster for consistency. The sharing process adds latency that must be taken into account when reviewing performance.
In a worst case scenario, the number of penalized requests with latency due to cluster consistency is constant and independent from the actual size of the configured quota. Consequently, the smaller the window, the greater the percentage of potentially delayed requests.
Therefore, you must set up window sizes greater than one minute in Rate Limiting and Rate-Limiting SLA policy configurations only for a cluster scenario.
Choosing Persistence for Your Rate-Limiting SLA Policy (Mule only)
You can configure your Rate-Limiting SLA policy to use windows that persist as long as days, months, and years. For example, suppose you want to allow your user to consume 1 million requests per year, but you cannot ensure that the node will be up the entire period or will need maintenance, which may result in restarting Mule.
The algorithm has been running for several months, so the client will lose critical information. Persistence solves this problem by periodically saving the current policy state. In case of a redeployment or a restart, the algorithms are recreated from the last known persisted state or started from a clean state.
Although persistence is enabled by default, you can turn it off by setting the following property to false:
throttling.persistence_enabled
You can also tweak the persistence frequency rate, which has a default of 10 seconds:
throttling.persistent_data_update_freq
Persistence is not available on CloudHub. |
FAQ
When does the window start?
The window starts with the first request after the policy is successfully applied.
What type of window does the algorithm use?
It uses a fixed window.
What happens when the quota is exhausted?
The algorithm is created on demand, when the first request is received. This event fixes the time window. Each request consumes the request quota from the current window until the time expires.
When the request quota is exhausted, the Rate-Limiting SLA policy rejects the request. When the time window closes, the request quota is reset and a new window of the same fixed size starts.
What happens if I define multiple limits within an SLA ?
The policy creates one algorithm for each Limit with the request quota per time window configuration. Therefore, when multiple limits are configured, every algorithm must have its own available quota within the current window for the request to be accepted.
What does each response header mean?
Each response header has information about the current state of the request:
-
X-Ratelimit-Remaining: The amount of available quota
-
X-Ratelimit-Limit: The maximum available requests per window
-
X-Ratelimit-Reset: The remaining time, in milliseconds, until a new window starts
By default, the X-RateLimit headers are disabled in the response. You can enable these headers by selecting Expose Headers when you configure the policy.
Can I configure a Mule Cluster in CloudHub?
No, the feature is available only for Runtime Fabric, hybrid and standalone Mule setups.
When should I use Rate Limiting instead of Rate-Limiting SLA or Spike Control?
Use Rate Limiting and Rate-Limiting SLA policies for accountability and to enforce a hard limit to a group (using the identifier in Rate Limiting) or to a client application (using Rate Limiting-SLA). If you want to protect your backend, use the Spike Control policy instead.