Hdfs Connector Reference

Configurations


Hdfs

Name Type Description Default Value Required

Name

String

The name for this configuration. Connectors reference the configuration with this name.

x

Connection

The connection types that can be provided to this configuration.

x

Expiration Policy

Configures the minimum amount of time that a dynamic configuration instance can remain idle before the runtime considers it eligible for expiration. This does not mean that the platform expires the instance at the exact moment it becomes eligible. The runtime purges instances when it sees fit.

Connection Types

Kerberos
Name Type Description Default Value Required

Username

String

Kerberos principal. It is passed to the HDFS client as the "hadoop.job.ugi" configuration entry. It can be overridden by values in configurationResources and configurationEntries. The field is called username for backward compatibility with the name that appears in the XML.

Keytab Path

String

Path to the keytab file (https://web.mit.edu/kerberos/krb5-1.12/doc/basic/keytab_def.html) associated with username. It is used to obtain a TGT from the "Authorization server". If not provided, the connector looks for a TGT associated with username in your local Kerberos cache.

Name Node Uri

String

The name of the file system to connect to. It is passed to the HDFS client as the {FileSystem#FS_DEFAULT_NAME_KEY} configuration entry. It can be overridden by values in configurationResources and configurationEntries.

x

Configuration Resources

Array of String

A java.util.List of configuration resource files to be loaded by the HDFS client. Here you can provide additional configuration files (for example, core-site.xml).

Configuration Entries

Object

A java.util.Map of configuration entries to be used by the HDFS client. Here you can provide additional configuration entries as key/value pairs.

Reconnection

When the application is deployed, a connectivity test is performed on all connectors. If set to true, deployment will fail if the test doesn’t pass after exhausting the associated reconnection strategy
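The Kerberos fields above could be combined into a global configuration roughly as follows. This is a minimal sketch, not taken from the connector schema: the element names `hdfs:hdfs-config` and `hdfs:kerberos-connection` and the camelCase attribute names are assumed from the usual Mule naming convention, and the configuration name, principal, keytab path, and NameNode address are hypothetical.

```xml
<!-- Fragment: assumes the enclosing <mule> root declares the hdfs namespace. -->
<!-- Element and attribute names are assumptions; verify them against the connector schema. -->
<hdfs:hdfs-config name="HDFS_Kerberos_Config">
    <hdfs:kerberos-connection
        username="hdfs-user@EXAMPLE.COM"
        keytabPath="/etc/security/keytabs/hdfs-user.keytab"
        nameNodeUri="hdfs://namenode.example.com:8020" />
</hdfs:hdfs-config>
```

If keytabPath were omitted, the description above suggests the connector would fall back to a TGT in the local Kerberos cache.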

Simple
Name Type Description Default Value Required

Username

String

User identity that Hadoop uses for permissions in HDFS. When Simple Authentication is used, Hadoop requires the user to be set as a System Property called HADOOP_USER_NAME. If you fill in this field, the connector sets the property for you; otherwise you can set it yourself. If the variable is not set, Hadoop uses the currently logged-in OS user.

Name Node Uri

String

The name of the file system to connect to. It is passed to the HDFS client as the {FileSystem#FS_DEFAULT_NAME_KEY} configuration entry. It can be overridden by values in configurationResources and configurationEntries.

x

Configuration Resources

Array of String

A java.util.List of configuration resource files to be loaded by the HDFS client. Here you can provide additional configuration files (for example, core-site.xml).

Configuration Entries

Object

A java.util.Map of configuration entries to be used by the HDFS client. Here you can provide additional configuration entries as key/value pairs.

Reconnection

When the application is deployed, a connectivity test is performed on all connectors. If set to true, deployment will fail if the test doesn’t pass after exhausting the associated reconnection strategy
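A Simple connection might be declared as in the sketch below. The element names `hdfs:hdfs-config` and `hdfs:simple-connection` are assumed from the usual Mule naming convention rather than confirmed by this reference, and the configuration name `HDFS_Config`, the user, and the NameNode address are hypothetical.

```xml
<!-- Fragment: assumes the enclosing <mule> root declares the hdfs namespace. -->
<!-- Element and attribute names are assumptions; verify them against the connector schema. -->
<hdfs:hdfs-config name="HDFS_Config">
    <hdfs:simple-connection
        username="hdfs"
        nameNodeUri="hdfs://localhost:8020" />
</hdfs:hdfs-config>
```

Operations would then reference this configuration by its name through their Configuration parameter.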

Associated Sources

Operations

Append

<hdfs:append>

Append the current payload to a file located at the designated path. Note: by default, the Hadoop server has the append option disabled. To append data to an existing file, refer to the dfs.support.append configuration parameter.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path of the file to write to.

x

Buffer Size

Number

the buffer size to use when appending to the file.

4096

Payload

Binary

the payload to append to the file.

#[payload]

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
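As an illustration of the parameters above, an append call might look like the following sketch. The attribute names `config-ref`, `path`, and `bufferSize` are assumed from the parameter names, and `HDFS_Config` and the file path are hypothetical. The payload is left at its default of `#[payload]`.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- Attribute names are assumptions; verify them against the connector schema. -->
<hdfs:append config-ref="HDFS_Config" path="/logs/app.log" bufferSize="4096" />
```

On a cluster where append is disabled, the dfs.support.append parameter mentioned above would typically be set on the server side in hdfs-site.xml:

```xml
<property>
    <name>dfs.support.append</name>
    <value>true</value>
</property>
```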

Copy From Local File

<hdfs:copy-from-local-file>

Copy the source file on the local disk to the FileSystem at the given target path. Set deleteSource if the source should be removed.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Delete Source

Boolean

whether to delete the source.

false

Overwrite

Boolean

whether to overwrite destination content.

true

Source

String

the source path on the local disk.

x

Destination

String

the target path on the File System.

x

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
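A hedged sketch of this operation, with the two Boolean parameters spelled out at their documented defaults. Attribute names are assumed from the parameter names (deleteSource appears in the description above), and `HDFS_Config` and both paths are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- source is a path on the local disk, destination a path in HDFS. -->
<hdfs:copy-from-local-file config-ref="HDFS_Config"
    source="/tmp/report.csv"
    destination="/data/incoming/report.csv"
    deleteSource="false"
    overwrite="true" />
```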

Copy To Local File

<hdfs:copy-to-local-file>

Copy the source file on the FileSystem to the local disk at the given target path. Set deleteSource if the source should be removed. useRawLocalFileSystem indicates whether to use RawLocalFileSystem, which is a non-CRC file system, as the local file system.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Delete Source

Boolean

whether to delete the source.

false

Use Raw Local File System

Boolean

whether to use RawLocalFileSystem as local file system or not.

false

Source

String

the source path on the File System.

x

Destination

String

the target path on the local disk.

x

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
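The reverse copy could be written as in this sketch. The attribute names come from the parameter names (deleteSource and useRawLocalFileSystem appear in the description above, the rest are assumed), and `HDFS_Config` and both paths are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- source is a path in HDFS, destination a path on the local disk. -->
<hdfs:copy-to-local-file config-ref="HDFS_Config"
    source="/data/incoming/report.csv"
    destination="/tmp/report-copy.csv"
    deleteSource="false"
    useRawLocalFileSystem="true" />
```

Setting useRawLocalFileSystem to true is shown here only to illustrate the flag; its default is false.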

Delete Directory

<hdfs:delete-directory>

Delete the file or directory located at the designated path.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path of the directory to delete.

x

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN

Delete File

<hdfs:delete-file>

Delete the file or directory located at the designated path.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path of the file to delete.

x

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
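Both delete operations take the same two parameters, so a sketch of either looks alike. Attribute names are assumed from the parameter names, and `HDFS_Config` and the paths are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<hdfs:delete-file config-ref="HDFS_Config" path="/data/incoming/report.csv" />
<hdfs:delete-directory config-ref="HDFS_Config" path="/data/incoming" />
```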

Get Metadata

<hdfs:get-metadata>

Get the metadata of a path.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path whose metadata is retrieved.

x

Target Variable

String

The name of a variable on which the operation’s output will be placed

Target Value

String

An expression that will be evaluated against the operation’s output and the outcome of that expression will be stored in the target variable

#[payload]

Reconnection Strategy

A retry strategy in case of connectivity errors

Output

Type

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
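The Target Variable parameter can keep the metadata out of the payload, as in this sketch. The `target` attribute is the standard Mule form of Target Variable; the `pathExists` field name is an assumption derived from the Path Exists field of the Meta Data type, and `HDFS_Config`, the path, and the variable name are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- Stores the operation output in vars.fileMeta and leaves the payload untouched. -->
<hdfs:get-metadata config-ref="HDFS_Config"
    path="/data/incoming/report.csv"
    target="fileMeta" />
<!-- pathExists is an assumed field name; check the Meta Data type in your version. -->
<logger level="INFO" message="#[vars.fileMeta.pathExists]" />
```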

Glob Status

<hdfs:glob-status>

Return all the files that match file pattern and are not checksum files. Results are sorted by their names.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path Pattern

String

a regular expression specifying the path pattern.

x

Filter

String

the user supplied path filter

Target Variable

String

The name of a variable on which the operation’s output will be placed

Target Value

String

An expression that will be evaluated against the operation’s output and the outcome of that expression will be stored in the target variable

#[payload]

Reconnection Strategy

A retry strategy in case of connectivity errors

Output

Type

Array of File Status

For Configurations

Throws

  • HDFS:CONNECTIVITY

  • HDFS:RETRY_EXHAUSTED
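A sketch of a pattern-based listing follows. The `pathPattern` attribute name is assumed from the parameter name, the wildcard syntax shown is an assumption about what the pattern accepts, and `HDFS_Config` and the pattern itself are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<hdfs:glob-status config-ref="HDFS_Config" pathPattern="/logs/*.log" />
<!-- The output is an array of File Status, so its length is the number of matches. -->
<logger level="INFO" message="#[sizeOf(payload)]" />
```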

List Status

<hdfs:list-status>

List the statuses of the files/directories in the given path if the path is a directory

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the given path

x

Filter

String

the user supplied path filter

Target Variable

String

The name of a variable on which the operation’s output will be placed

Target Value

String

An expression that will be evaluated against the operation’s output and the outcome of that expression will be stored in the target variable

#[payload]

Reconnection Strategy

A retry strategy in case of connectivity errors

Output

Type

Array of File Status

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
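Listing a directory and keeping the result in a variable might look like this sketch. Attribute names are assumed from the parameter names, the `directory` and `path` field names are assumed from the File Status type, and `HDFS_Config`, the path, and the variable name are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<hdfs:list-status config-ref="HDFS_Config" path="/data/incoming" target="entries" />
<!-- Keeps only non-directory entries and logs their paths (field names are assumptions). -->
<logger level="INFO"
    message="#[(vars.entries filter ((entry) -> not entry.directory)) map ((entry) -> entry.path)]" />
```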

Make Directories

<hdfs:make-directories>

Make the given file and all non-existent parents into directories. Has the semantics of Unix 'mkdir -p'. Existence of the directory hierarchy is not an error.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path to create directories for.

x

Permission

String

the file system permission to use when creating the directories, either in octal or symbolic format (umask).

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
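Because the operation has `mkdir -p` semantics, a single call can create a nested hierarchy, as sketched here. Attribute names are assumed from the parameter names, and `HDFS_Config`, the path, and the octal permission value are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- Creates /data, /data/archive, and /data/archive/2024 as needed. -->
<hdfs:make-directories config-ref="HDFS_Config"
    path="/data/archive/2024"
    permission="755" />
```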

Read Operation

<hdfs:read-operation>

Read the content of a file designated by its path and streams it to the rest of the flow.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path of the file to read.

x

Buffer Size

Number

the buffer size to use when reading the file.

4096

Streaming Strategy

Configure if repeatable streams should be used and their behavior

Target Variable

String

The name of a variable on which the operation’s output will be placed

Target Value

String

An expression that will be evaluated against the operation’s output and the outcome of that expression will be stored in the target variable

#[payload]

Reconnection Strategy

A retry strategy in case of connectivity errors

Output

Type

Binary

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
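A minimal read might look like the sketch below, with the buffer size raised above its 4096 default. Attribute names are assumed from the parameter names, and `HDFS_Config`, the path, and the variable name are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- The Binary output is placed in vars.fileContent instead of the payload. -->
<hdfs:read-operation config-ref="HDFS_Config"
    path="/data/incoming/report.csv"
    bufferSize="8192"
    target="fileContent" />
```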

Rename

<hdfs:rename>

Renames the source path to the destination path.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Source

String

the source path to be renamed.

x

Destination

String

new path after rename.

x

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
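A rename could be expressed as in this sketch. Attribute names are assumed from the parameter names, and `HDFS_Config` and both paths are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<hdfs:rename config-ref="HDFS_Config"
    source="/data/incoming/report.csv"
    destination="/data/processed/report.csv" />
```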

Set Owner

<hdfs:set-owner>

Set owner of a path (i.e., a file or a directory). The parameters username and groupname cannot both be null.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path of the file or directory to set owner.

x

Ownername

String

If it is null, the original username remains unchanged.

x

Groupname

String

If it is null, the original groupname remains unchanged.

x

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
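The sketch below changes both owner and group in one call. The attribute names `ownername` and `groupname` are assumptions based on the parameter display names and may be spelled differently in the schema; `HDFS_Config`, the path, and the user and group values are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- ownername and groupname are assumed attribute spellings; verify against the schema. -->
<hdfs:set-owner config-ref="HDFS_Config"
    path="/data/processed/report.csv"
    ownername="etl"
    groupname="analytics" />
```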

Set Permission

<hdfs:set-permission>

Set permission of a path (i.e., a file or a directory).

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path of the file or directory to set permission.

x

Permission

String

the file system permission to be set.

x

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
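A permission change might be written as follows. Attribute names are assumed from the parameter names, the octal form is assumed to be accepted here as it is for Make Directories, and `HDFS_Config` and the path are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- 640 = owner read/write, group read, others none. -->
<hdfs:set-permission config-ref="HDFS_Config"
    path="/data/processed/report.csv"
    permission="640" />
```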

Write

<hdfs:write>

Write the current payload to the designated path, either creating a new file or appending to an existing one.

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

the path of the file to write to.

x

Permission

String

the file system permission to use if a new file is created, either in octal or symbolic format (umask).

700

Overwrite

Boolean

whether a pre-existing file should be overwritten with the new content.

true

Buffer Size

Number

the buffer size to use when writing to the file.

4096

Replication

Number

block replication for the file.

1

Block Size

Number

the block size to use for the file.

1048576

Owner User Name

String

the username owner of the file.

Owner Group Name

String

the group owner of the file.

Payload

Binary

the payload to write to the file.

#[payload]

Reconnection Strategy

A retry strategy in case of connectivity errors

For Configurations

Throws

  • HDFS:INVALID_STRUCTURE_FOR_INPUT_DATA

  • HDFS:CONNECTIVITY

  • HDFS:INVALID_REQUEST_DATA

  • HDFS:RETRY_EXHAUSTED

  • HDFS:UNKNOWN
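The sketch below writes the current payload with every documented default spelled out, which makes the table above easier to map onto XML. Attribute names are assumed camelCase forms of the parameter names, and `HDFS_Config`, the path, and the owner values are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<!-- Values for permission, overwrite, bufferSize, replication, and blockSize are the documented defaults. -->
<hdfs:write config-ref="HDFS_Config"
    path="/data/incoming/report.csv"
    permission="700"
    overwrite="true"
    bufferSize="4096"
    replication="1"
    blockSize="1048576"
    ownerUserName="etl"
    ownerGroupName="analytics" />
```

In practice, only the attributes that differ from the defaults would need to be set.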

Sources

Read

<hdfs:read>

Name Type Description Default Value Required

Configuration

String

The name of the configuration to use.

x

Path

String

The path of the file whose content is read.

x

Buffer Size

Number

4096

Redelivery Policy

Defines a policy for processing the redelivery of the same message

Streaming Strategy

Configure if repeatable streams should be used and their behavior

Reconnection Strategy

A retry strategy in case of connectivity errors

Output

Type

Any

Attributes Type

Any

For Configurations
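Used as a message source, Read would sit at the start of a flow, as in this sketch. Attribute names are assumed from the parameter names, and the flow name, `HDFS_Config`, and the path are hypothetical. How often the source emits messages is not stated in this reference.

```xml
<!-- Fragment: assumes the enclosing <mule> root declares the hdfs namespace -->
<!-- and that a configuration named HDFS_Config exists. -->
<flow name="hdfs-read-flow">
    <hdfs:read config-ref="HDFS_Config" path="/data/incoming/report.csv" bufferSize="4096" />
    <logger level="INFO" message="#[payload]" />
</flow>
```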

Types

Reconnection

Field Type Description Default Value Required

Fails Deployment

Boolean

When the application is deployed, a connectivity test is performed on all connectors. If set to true, deployment will fail if the test doesn’t pass after exhausting the associated reconnection strategy

Reconnection Strategy

The reconnection strategy to use

Reconnect

Field Type Description Default Value Required

Frequency

Number

How often (in ms) to reconnect

Count

Number

How many reconnection attempts to make

Reconnect Forever

Field Type Description Default Value Required

Frequency

Number

How often (in ms) to reconnect
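The Reconnection, Reconnect, and Reconnect Forever types nest as in the sketch below. The `reconnection`, `reconnect`, and `reconnect-forever` elements follow the usual Mule form; the enclosing `hdfs:simple-connection` element name is an assumption, and the connection values are hypothetical.

```xml
<!-- Fragment: belongs inside the connector's global configuration element. -->
<hdfs:simple-connection username="hdfs" nameNodeUri="hdfs://localhost:8020">
    <!-- Fail deployment if the connectivity test still fails after 3 attempts, 2000 ms apart. -->
    <reconnection failsDeployment="true">
        <reconnect frequency="2000" count="3" />
        <!-- Alternative strategy: <reconnect-forever frequency="5000" /> -->
    </reconnection>
</hdfs:simple-connection>
```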

Expiration Policy

Field Type Description Default Value Required

Max Idle Time

Number

A scalar time value for the maximum amount of time a dynamic configuration instance should be allowed to be idle before it’s considered eligible for expiration

Time Unit

Enumeration, one of:

  • NANOSECONDS

  • MICROSECONDS

  • MILLISECONDS

  • SECONDS

  • MINUTES

  • HOURS

  • DAYS

A time unit that qualifies the maxIdleTime attribute
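An expiration policy only matters for dynamic configurations, that is, ones whose parameters contain expressions. The sketch below assumes the `hdfs:hdfs-config` and `hdfs:simple-connection` element names and the position of `expiration-policy` among the child elements; the variable name is hypothetical.

```xml
<!-- Fragment: assumes the enclosing <mule> root declares the hdfs namespace. -->
<!-- nameNodeUri is an expression, so the runtime creates configuration instances dynamically. -->
<hdfs:hdfs-config name="HDFS_Dynamic_Config">
    <hdfs:simple-connection username="hdfs" nameNodeUri="#[vars.nameNodeUri]" />
    <!-- An instance becomes eligible for expiration after 5 idle minutes. -->
    <expiration-policy maxIdleTime="5" timeUnit="MINUTES" />
</hdfs:hdfs-config>
```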

Redelivery Policy

Field Type Description Default Value Required

Max Redelivery Count

Number

The maximum number of times a message can be redelivered and processed unsuccessfully before triggering process-failed-message

Use Secure Hash

Boolean

Whether to use a secure hash algorithm to identify a redelivered message

Message Digest Algorithm

String

The secure hashing algorithm to use. If not set, the default is SHA-256.

Id Expression

String

Defines one or more expressions to use to determine when a message has been redelivered. This property may only be set if useSecureHash is false.

Object Store

The object store where the redelivery counter for each message is going to be stored.
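A redelivery policy is attached to a message source, as in this sketch. The `redelivery-policy` element and its attributes follow the usual Mule form and the field names above; the `hdfs:read` attributes, `HDFS_Config`, and the path are assumed or hypothetical.

```xml
<!-- Fragment: belongs at the start of a <flow>; assumes a configuration named HDFS_Config exists. -->
<hdfs:read config-ref="HDFS_Config" path="/data/incoming/report.csv">
    <!-- Identify redelivered messages by a SHA-256 hash; give up after 3 failed redeliveries. -->
    <redelivery-policy
        maxRedeliveryCount="3"
        useSecureHash="true"
        messageDigestAlgorithm="SHA-256" />
</hdfs:read>
```

Because useSecureHash is true here, idExpression is left unset, in line with the constraint stated above.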

Repeatable In Memory Stream

Field Type Description Default Value Required

Initial Buffer Size

Number

This is the amount of memory that is allocated to consume the stream and provide random access to it. If the stream contains more data than fits into this buffer, the buffer is expanded according to the bufferSizeIncrement attribute, with an upper limit of maxInMemorySize.

Buffer Size Increment

Number

This is by how much the buffer size expands if it exceeds its initial size. A value of zero or lower means that the buffer does not expand, so a STREAM_MAXIMUM_SIZE_EXCEEDED error is raised when the buffer gets full.

Max Buffer Size

Number

This is the maximum amount of memory that will be used. If more than that is used, a STREAM_MAXIMUM_SIZE_EXCEEDED error is raised. A value lower than or equal to zero means no limit.

Buffer Unit

Enumeration, one of:

  • BYTE

  • KB

  • MB

  • GB

The unit in which all these attributes are expressed
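The interaction of the three sizes is easier to see with numbers. In the sketch below the buffer starts at 512 KB and grows in 256 KB steps up to 2048 KB, so it can expand at most (2048 - 512) / 256 = 6 times before a STREAM_MAXIMUM_SIZE_EXCEEDED error is raised. The attribute names are assumed camelCase forms of the field names above (in particular `maxBufferSize`), and the enclosing operation, `HDFS_Config`, and the path are hypothetical.

```xml
<!-- Fragment: belongs inside a <flow>; assumes a configuration named HDFS_Config exists. -->
<hdfs:read-operation config-ref="HDFS_Config" path="/data/incoming/report.csv">
    <!-- All three sizes are expressed in the unit given by bufferUnit. -->
    <repeatable-in-memory-stream
        initialBufferSize="512"
        bufferSizeIncrement="256"
        maxBufferSize="2048"
        bufferUnit="KB" />
</hdfs:read-operation>
```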

Repeatable File Store Stream

Field Type Description Default Value Required

Max In Memory Size

Number

Defines the maximum memory that the stream should use to keep data in memory. If more than that is consumed then it will start to buffer the content on disk.

Buffer Unit

Enumeration, one of:

  • BYTE

  • KB

  • MB

  • GB

The unit in which maxInMemorySize is expressed

Meta Data

Field Type Description Default Value Required

Check Summary

Content Summary

File Status

Path Exists

Boolean

Check Summary

Field Type Description Default Value Required

Bytes Per CRC

Number

Crc Per Block

Number

Md5

String

Content Summary

Field Type Description Default Value Required

Directory Count

Number

File Count

Number

Length

Number

Snapshot Directory Count

Number

Snapshot File Count

Number

Snapshot Length

Number

Snapshot Space Consumed

Number

File Status

Field Type Description Default Value Required

Access Time

Number

Block Replication

Number

Block Size

Number

Directory

Boolean

Group

String

Length

Number

Modification Time

Number

Owner

String

Path

String

Permission

String

Symbolic Link

Boolean