Indexed Readers in DataWeave

In DataWeave, some readers of input data (such as XML, JSON, and CSV) support an indexed strategy to avoid loading the whole document in memory while still allowing random access just like in the in-memory strategy. DataWeave uses indexed readers when the input is larger than a certain configurable threshold or if the indexedReader setting is set to true.

The trade-off for managing files that wouldn’t fit in memory is to write the files to a temporary file in disk and index the document content beforehand, which also takes time and disk space.

Indexed readers can process input files up to 20 GB. For bigger files use the streaming strategy, which is faster. Despite this mode being more limited in functionality, there’s no maximum input size limitation for streaming readers.

The actual max input file size supported is difficult to estimate because it depends on the content of the input.

The following list shows the limits on different parts of the input file:

  • Max nesting depth: 4096

  • Max individual value size: 4 GB

  • Tokens: ​​2^32-1 (approximately 4 billion)
    A token is either a key/value pair, object-start or array-start, that limits the max input file size. The amount of the max input file size is irrespective of the size of each token in the file. The bigger your average value length, the bigger your input file size can be because what is important is the amount of tokens, not their size.

The following examples show input file size samples:

In this example, the max input file size is between 40 GB and 50 GB:

[
  {
    "name": "Mariano",
    "lastName": "Achaval"
  },
  ...
]

The following example is similar to the previous one but minified where the max file size is approximately 30 GB:

[{"name":"Mariano","lastName":"Achaval"},...]

The following example shows a minified array of numbers 1. DataWeave supports only approximately 8 GB of such file:

[1,1,1,1,1,1,...]

Large Strings Management

When processing a String with a size larger than 1.5 MB, DataWeave automatically splits the value in chunks to avoid out-of-memory issues. This feature works only with JSON and XML input data.

You can configure the threshold that DataWeave uses to determine when to process a string using this splitting strategy by modifying the following system property:

com.mulesoft.dw.max_memory_allocation

Note that using this feature has a negative impact on performance due to splitting strings and accessing them through local storage. If your environment has enough resources to load all content in memory, you can also disable this string management feature completely by setting the value of the following system property to false:

com.mulesoft.dw.buffered_char_sequence.enabled

Was this article helpful?

💙 Thanks for your feedback!

Edit on GitHub