A2A Token Based Rate Limit Policy

| Policy Name | A2A Token Based Rate Limit |
|---|---|
| Summary | Rate limits agent usage of upstream resources |
| Category | A2A |
| First Flex Gateway version available | v1.11.0 |
| Returned Status Codes | 429 - Too Many Requests: Token limit exceeded; requests are blocked until the current window finishes |

This policy supports Agent2Agent Protocol (A2A) version v0.3.0. To learn more about A2A versions, see A2A Releases.
Summary
The A2A Token Based Rate Limit policy enforces rate limiting on agent requests based on the number of tokens the agents consume. The policy counts tokens using the GPT-4o-mini tokenizer model, providing control over API usage based on the content being processed.
The policy counts tokens from both request and response payloads. Tokens generated by a response count toward the rate limit of the next request, so total token consumption across the conversation is properly tracked.
If a request exceeds the allowed token limit, the policy blocks the call and returns a 429 status code.
Configuring Policy Parameters
Managed Flex Gateway and Flex Gateway Connected Mode
When you apply the policy to your API instance from the UI, the following parameters are displayed:
| Element | Required | Description |
|---|---|---|
| Maximum Tokens | Yes | Maximum number of tokens allowed within the specified time period. Must be a positive integer of one or greater. |
| Time Period (ms) | Yes | Time period in milliseconds for the rate limit window. The minimum time period is 1000 milliseconds. |
| Key Selector | Yes | DataWeave expression that selects the key for rate limiting. This creates independent rate limit counters for each unique value resolved by the expression. For example: `#[attributes.headers['ClientId']]` |
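In Flex Gateway Local Mode, policies are typically applied with a PolicyBinding resource. The fragment below is a sketch only: the `policyRef` name and the `config` keys (`maximumTokens`, `timePeriodInMilliseconds`, `keySelector`) are assumptions for illustration; consult the policy reference for the exact identifiers.

```yaml
# Illustrative Local Mode binding; field names under `config` are assumed.
apiVersion: gateway.mulesoft.com/v1alpha1
kind: PolicyBinding
metadata:
  name: a2a-token-based-rate-limit
spec:
  targetRef:
    name: my-api-instance              # your API instance resource
  policyRef:
    name: a2a-token-based-rate-limit   # assumed policy asset name
  config:
    maximumTokens: 100                        # Maximum Tokens
    timePeriodInMilliseconds: 5000            # Time Period (ms)
    keySelector: "#[attributes.headers['ClientId']]"  # Key Selector
```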
How This Policy Works
The A2A Token Based Rate Limit policy monitors token consumption in the current window, allowing requests to reach the backend only if the available token quota is greater than zero.
The policy uses a fixed-window rate limiting algorithm. The window starts when the policy receives the first request. When the quota is exhausted, requests are rejected with a 429 status code until the time window resets.
The `keySelector` parameter enables you to create independent rate limit counters for different groups of requests. Each unique value resolved by the DataWeave expression has its own token quota and time window. For example, if you configure `keySelector: "#[attributes.headers['ClientId']]"` with a limit of 100 tokens per 5 seconds, each individual client has its own 100-token quota in a 5-second window. If you don't configure a key selector, all requests are counted together.
Token Counting
The policy uses the GPT-4o-mini tokenizer model to count tokens in:
- Request payloads: Tokens are counted from the `params.message.parts` array in JSON-RPC requests. The policy extracts text from parts with `kind: "text"` or `type: "text"`, and data from parts with `kind: "data"` or `type: "data"`.
- Response payloads: Tokens are counted from response content, including:
  - Direct message parts in `result.parts`
  - Status message parts in `result.status.message.parts`
  - Artifact parts in `result.artifacts[*].parts`
  - Server-Sent Events (SSE) stream data
To learn more about the GPT-4o-mini tokenizer model, see OpenAI Platform Tokenizer.
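As an illustration of the request-side extraction described above, the sketch below walks `params.message.parts` and gathers the countable content. A naive whitespace split stands in for the GPT-4o-mini tokenizer that the policy actually uses, so real token counts will differ; the function names are invented for the example.

```python
import json

def extract_request_text(body: str) -> str:
    """Collect countable content from a JSON-RPC request body
    (params.message.parts, per the fields described above)."""
    doc = json.loads(body)
    parts = doc.get("params", {}).get("message", {}).get("parts", [])
    chunks = []
    for part in parts:
        kind = part.get("kind") or part.get("type")
        if kind == "text":
            chunks.append(part.get("text", ""))
        elif kind == "data":
            # Serialize structured data so its tokens are counted too.
            chunks.append(json.dumps(part.get("data", {})))
    return "\n".join(chunks)

def count_tokens(text: str) -> int:
    """Naive stand-in for the GPT-4o-mini tokenizer; real counts differ."""
    return len(text.split())

request = json.dumps({
    "jsonrpc": "2.0",
    "method": "message/send",
    "params": {"message": {"parts": [
        {"kind": "text", "text": "hello world"},
        {"kind": "data", "data": {"limit": 10}},
    ]}},
})
tokens = count_tokens(extract_request_text(request))
```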
Response Headers
When a request is processed, the policy adds the following headers to the response:
- `x-token-limit`: The maximum number of tokens allowed per window
- `x-token-remaining`: The number of tokens remaining in the current window
- `x-token-reset`: The remaining time, in milliseconds, until a new window starts
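A minimal sketch of how these headers could be derived from a window's state. Only the header names come from the policy documentation; the helper and its parameters are illustrative assumptions.

```python
def rate_limit_headers(limit: int, used: int, window_start_ms: int,
                       window_ms: int, now_ms: int) -> dict[str, str]:
    """Build the three response headers described above from window state."""
    remaining = max(limit - used, 0)
    reset_ms = max(window_start_ms + window_ms - now_ms, 0)
    return {
        "x-token-limit": str(limit),
        "x-token-remaining": str(remaining),
        "x-token-reset": str(reset_ms),
    }
```

For example, with a 100-token limit, 30 tokens used, and 1200 ms elapsed in a 5000 ms window, a client would see `x-token-remaining: 70` and `x-token-reset: 3800`.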



