A2A Token Based Rate Limit Policy

| Policy Name | A2A Token Based Rate Limit |
|---|---|
| Summary | Rate limits agent usage of upstream resources |
| Category | A2A |
| First Flex Gateway version available | v1.11.0 |
| Returned Status Codes | 429 - Too Many Requests: Token limit exceeded; requests are blocked until the current window finishes |

This policy supports Agent2Agent Protocol (A2A) version v0.3.0. To learn more about A2A versions, see A2A Releases.
Summary
The A2A Token Based Rate Limit policy enforces rate limiting on agent requests based on the number of tokens the agents consume. The policy counts tokens using the GPT-4o-mini tokenizer model, providing control over API usage based on the content being processed.
The policy counts tokens from both request and response payloads. Tokens generated by a response count toward the rate limit of the next request, so total token consumption across the conversation is properly tracked.
If a request exceeds the allowed token limit, the policy blocks the call and returns a 429 status code.
Configuring Policy Parameters
Managed Flex Gateway and Flex Gateway Connected Mode
When you apply the policy to your API instance from the UI, the following parameters are displayed:
| Element | Required | Description |
|---|---|---|
| Maximum Tokens | Yes | Maximum number of tokens allowed within the specified time period. Must be a positive integer of one or greater. |
| Time Period (ms) | Yes | Time period in milliseconds for the rate limit window. The minimum time period is 1000 milliseconds. |
| Key Selector | Yes | DataWeave expression that selects the key for rate limiting. This creates independent rate limit counters for each unique value resolved by the expression. For example: `#[attributes.headers['ClientId']]` |
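In Flex Gateway Local Mode, policies are typically applied with a PolicyBinding resource. The fragment below is a sketch only: the `policyRef` name and the `config` keys (`maximumTokens`, `timePeriodInMilliseconds`, `keySelector`) are assumptions for illustration; consult the policy reference for the exact identifiers.

```yaml
# Illustrative Local Mode binding; field names under `config` are assumed.
apiVersion: gateway.mulesoft.com/v1alpha1
kind: PolicyBinding
metadata:
  name: a2a-token-based-rate-limit
spec:
  targetRef:
    name: my-api-instance              # your API instance resource
  policyRef:
    name: a2a-token-based-rate-limit   # assumed policy asset name
  config:
    maximumTokens: 100                        # Maximum Tokens
    timePeriodInMilliseconds: 5000            # Time Period (ms)
    keySelector: "#[attributes.headers['ClientId']]"  # Key Selector
```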
How This Policy Works
The A2A Token Based Rate Limit policy monitors token consumption in the current window, allowing requests to reach the backend only if the available token quota is greater than zero.
The policy uses a fixed-window rate limiting algorithm. The window starts when the policy receives the first request. When the quota is exhausted, requests are rejected with a 429 status code until the time window resets.
The `keySelector` parameter enables you to create independent rate limit counters for different groups of requests. Each unique value resolved by the DataWeave expression has its own token quota and time window. For example, if you configure `keySelector: "#[attributes.headers['ClientId']]"` with a limit of 100 tokens per 5 seconds, each individual client has its own 100-token quota in a 5-second window. If you don't configure a key selector, all requests are counted together.
Token Counting
The policy uses the GPT-4o-mini tokenizer model to count tokens in:
- Request payloads: Tokens are counted from the `params.message.parts` array in JSON-RPC requests. The policy extracts text from parts with `kind: "text"` or `type: "text"`, and data from parts with `kind: "data"` or `type: "data"`.
- Response payloads: Tokens are counted from response content, including:
  - Direct message parts in `result.parts`
  - Status message parts in `result.status.message.parts`
  - Artifact parts in `result.artifacts[*].parts`
  - Server-Sent Events (SSE) stream data
To learn more about the GPT-4o-mini tokenizer model, see OpenAI Platform Tokenizer.
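As an illustration of the request-side extraction described above, the sketch below walks `params.message.parts` and gathers the countable content. A naive whitespace split stands in for the GPT-4o-mini tokenizer that the policy actually uses, so real token counts will differ; the function names are invented for the example.

```python
import json

def extract_request_text(body: str) -> str:
    """Collect countable content from a JSON-RPC request body
    (params.message.parts, per the fields described above)."""
    doc = json.loads(body)
    parts = doc.get("params", {}).get("message", {}).get("parts", [])
    chunks = []
    for part in parts:
        kind = part.get("kind") or part.get("type")
        if kind == "text":
            chunks.append(part.get("text", ""))
        elif kind == "data":
            # Serialize structured data so its tokens are counted too.
            chunks.append(json.dumps(part.get("data", {})))
    return "\n".join(chunks)

def count_tokens(text: str) -> int:
    """Naive stand-in for the GPT-4o-mini tokenizer; real counts differ."""
    return len(text.split())

request = json.dumps({
    "jsonrpc": "2.0",
    "method": "message/send",
    "params": {"message": {"parts": [
        {"kind": "text", "text": "hello world"},
        {"kind": "data", "data": {"limit": 10}},
    ]}},
})
tokens = count_tokens(extract_request_text(request))
```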
Response Headers
When a request is processed, the policy adds the following headers to the response:
- `x-token-limit`: The maximum number of tokens allowed per window
- `x-token-remaining`: The number of tokens remaining in the current window
- `x-token-reset`: The remaining time, in milliseconds, until a new window starts
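A minimal sketch of how these headers could be derived from a window's state. Only the header names come from the policy documentation; the helper and its parameters are illustrative assumptions.

```python
def rate_limit_headers(limit: int, used: int, window_start_ms: int,
                       window_ms: int, now_ms: int) -> dict[str, str]:
    """Build the three response headers described above from window state."""
    remaining = max(limit - used, 0)
    reset_ms = max(window_start_ms + window_ms - now_ms, 0)
    return {
        "x-token-limit": str(limit),
        "x-token-remaining": str(remaining),
        "x-token-reset": str(reset_ms),
    }
```

For example, with a 100-token limit, 30 tokens used, and 1200 ms elapsed in a 5000 ms window, a client would see `x-token-remaining: 70` and `x-token-reset: 3800`.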



