
Hadoop HDFS Connector

Support Category: Select

The Anypoint Connector for the Hadoop Distributed File System (HDFS) is used as a bi-directional gateway between Mule applications and HDFS.

Read through this user guide to understand how to set up and configure a basic flow using the connector. Track feature additions, compatibility, limitations and API version updates with each release of the connector using the Connector Release Notes.

Review the connector operations and functionality using the Technical Reference.

Starting in version 5.0.0, MuleSoft maintains this connector under the Select support policy.

Before You Begin

To use HDFS Connector, you need:

  • Anypoint Studio - If you do not use Anypoint Studio for development, follow the instructions in Configuring Maven to Work with Mule to set up your project.

  • Hadoop - An instance of the Hadoop Distributed File System, up and running. You can download it from the Apache Hadoop website.

Hardware and Software Requirements

For hardware and software requirements, please visit the Hardware and Software Requirements page.

Compatibility

HDFS Connector is compatible with the following:

Application/Service      Version
Mule Runtime             3.6 or newer
Apache Hadoop            2.7.1 or newer

Starting with v5.0.0, HDFS Connector is licensed commercially with Anypoint Platform as are other Select connectors. Earlier versions remain freely available to the community.

To Install this Connector

  1. In Anypoint Studio, click the Exchange icon in the Studio taskbar.

  2. Click Login in Anypoint Exchange.

  3. Search for the connector and click Install.

  4. Follow the prompts to install the connector.

When Studio has an update, a message displays in the lower right corner, which you can click to install the update.

Configure the Connector Global Element

To use HDFS Connector in your Mule application, configure a global HDFS element that the connector can reference. The connector offers two global configuration options, Simple Authentication and Kerberos Authentication, each requiring its own credentials:

Simple Authentication Configuration

NameNode URI

    The URI of the file system to connect to. This is passed to the HDFS client as the FileSystem#FS_DEFAULT_NAME_KEY configuration entry. It can be overridden by values in Configuration Resources and Configuration Entries.

Username

    The user identity that Hadoop uses for permissions in HDFS. When Simple Authentication is used, Hadoop requires the user identity to be set in a system property named HADOOP_USER_NAME. If you fill in this field, the connector sets the property for you; you can also set it yourself. If the property is not set, Hadoop uses the currently logged-in OS user.

Configuration Resources

    A list of configuration resource files to be loaded by the HDFS client. Use this field to provide additional configuration files, such as core-site.xml.

Configuration Entries

    A map of configuration entries to be used by the HDFS client. Use this field to provide additional configuration entries as key/value pairs.

Figure: HDFS global element properties configuration window
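For reference, a Simple Authentication global element might look like the following in the XML editor. This is a minimal sketch with placeholder host, port, and user values; the username attribute name is assumed from the field above, so verify it against the connector's schema.

<!-- Minimal sketch: placeholder values; "username" attribute name assumed -->
<hdfs:config name="HDFS_Simple_Configuration"
             nameNodeUri="hdfs://localhost:9000"
             username="hadoop-user"
             doc:name="HDFS: Configuration"/>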

Kerberos Authentication Configuration

NameNode URI

    The URI of the file system to connect to. This is passed to the HDFS client as the FileSystem#FS_DEFAULT_NAME_KEY configuration entry. It can be overridden by values in Configuration Resources and Configuration Entries.

Username

    The Kerberos principal. This is passed to the HDFS client as the "hadoop.job.ugi" configuration entry. It can be overridden by values in Configuration Resources and Configuration Entries. If not provided, the connector uses the currently logged-in user.

KeytabPath

    The path to the keytab file associated with Username. The keytab is used to obtain a ticket-granting ticket (TGT) from the authorization server. If not provided, the connector looks for a TGT associated with Username in your local Kerberos cache.

Configuration Resources

    A list of configuration resource files to be loaded by the HDFS client. Use this field to provide additional configuration files, such as core-site.xml.

Configuration Entries

    A map of configuration entries to be used by the HDFS client. Use this field to provide additional configuration entries as key/value pairs.

Figure: HDFS Kerberos configuration window
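Similarly, a Kerberos global element might look like the following. This is a minimal sketch with placeholder principal, host, and keytab values; the keytabPath attribute name is assumed from the field above, so verify it against the connector's schema.

<!-- Minimal sketch: placeholder values; "keytabPath" attribute name assumed -->
<hdfs:config name="HDFS_Kerberos_Configuration"
             nameNodeUri="hdfs://namenode.example.com:8020"
             username="hdfs-user@EXAMPLE.COM"
             keytabPath="/opt/keytabs/hdfs-user.keytab"
             doc:name="HDFS: Configuration"/>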

Using the Connector

You can use this connector as an inbound endpoint that polls the content of a file at a configurable interval, or as an outbound connector that writes data to and manages files on the HDFS server.
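For example, a polling flow might look like the following minimal sketch, which assumes the connector's Read from path operation is exposed as an <hdfs:read> message source; check the Technical Reference for the exact element name and its polling attributes.

<!-- Minimal sketch: polls the content of /tmp/test.txt and logs each payload.
     The <hdfs:read> element name and the config name are assumptions. -->
<flow name="hdfs-poll-flow">
    <hdfs:read config-ref="HDFS__Configuration" path="/tmp/test.txt" doc:name="HDFS"/>
    <logger level="INFO" message="#[payload]" doc:name="Logger"/>
</flow>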

Connector Namespace and Schema

When designing your application in Studio, the act of dragging the connector from the palette onto the Anypoint Studio canvas should automatically populate the XML code with the connector namespace and schema location.

If you are manually coding the Mule application in Studio’s XML editor or other text editor, define the namespace and schema location in the header of your Configuration XML, inside the <mule> tag.
<mule xmlns="http://www.mulesoft.org/schema/mule/core"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:hdfs="http://www.mulesoft.org/schema/mule/hdfs"
      xsi:schemaLocation="
               http://www.mulesoft.org/schema/mule/core
               http://www.mulesoft.org/schema/mule/core/current/mule.xsd
               http://www.mulesoft.org/schema/mule/hdfs
               http://www.mulesoft.org/schema/mule/hdfs/current/mule-hdfs.xsd">

      <!-- put your global configuration elements and flows here -->

</mule>

Using the Connector in a Mavenized Mule App

If you are coding a Mavenized Mule application, this XML snippet must be included in your pom.xml file.

<dependency>
  <groupId>org.mule.modules</groupId>
  <artifactId>mule-module-hdfs</artifactId>
  <version>x.x.x</version>
</dependency>

Replace x.x.x with the version that corresponds to the connector you are using.

To obtain the most up-to-date pom.xml file information, access the connector in Anypoint Exchange and click Dependency Snippets.

Demo Mule Application Using Connector

Existing demos demonstrate how to use the connector for basic file system operations and how to poll data from a file at a specific interval.

Example Use Case

The following example shows how to create a text file in HDFS using the connector:

  1. In Anypoint Studio, click File > New > Mule Project, name the project, and click OK.

  2. In the search field, type "http" and drag the HTTP connector to the canvas. Click the green plus sign to the right of Connector Configuration, and in the next screen, click OK to accept the default settings. Set the Path to /createFile.

  3. In the search bar, type "HDFS" and drag the HDFS connector onto the canvas. Configure it as explained in Configure the Connector Global Element.

  4. Choose Write to path as the operation. Set Path to /test.txt (the path of the file that is created in HDFS) and leave the other options at their default values.

  5. The flow should look like this:

    Figure: Create file flow
  6. Run the application. From your favorite HTTP client, make a POST request with "Content-Type: text/plain" to localhost:8081/createFile with the content that you want to write as the payload. (e.g. curl -X POST -H "Content-Type: text/plain" -d "payload to write to file" localhost:8081/createFile)

  7. Using the Hadoop explorer, check that /test.txt has been created and contains your content.

Create a File in HDFS - XML

Paste this into Anypoint Studio to recreate the example use case discussed in this guide.

<?xml version="1.0" encoding="UTF-8"?>

<mule xmlns:hdfs="http://www.mulesoft.org/schema/mule/hdfs" xmlns:http="http://www.mulesoft.org/schema/mule/http" xmlns="http://www.mulesoft.org/schema/mule/core" xmlns:doc="http://www.mulesoft.org/schema/mule/documentation"
	xmlns:spring="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-current.xsd
http://www.mulesoft.org/schema/mule/core http://www.mulesoft.org/schema/mule/core/current/mule.xsd
http://www.mulesoft.org/schema/mule/http http://www.mulesoft.org/schema/mule/http/current/mule-http.xsd
http://www.mulesoft.org/schema/mule/hdfs http://www.mulesoft.org/schema/mule/hdfs/current/mule-hdfs.xsd">
    <http:listener-config name="HTTP_Listener_Configuration" host="0.0.0.0" port="8081" doc:name="HTTP Listener Configuration"/>
    <hdfs:config name="HDFS__Configuration" nameNodeUri="hdfs://localhost:9000" doc:name="HDFS: Configuration"/>
    <flow name="hdfs-example-use-caseFlow">
        <http:listener config-ref="HTTP_Listener_Configuration" path="/createFile" doc:name="HTTP"/>
        <hdfs:write config-ref="HDFS__Configuration" path="/test.txt" doc:name="HDFS"/>
    </flow>
</mule>

CloudHub Configuration

Additional configuration parameters are required when using the HDFS connector on CloudHub with Kerberos authentication, because the CloudHub worker is not a member of the Kerberos realm. Under Settings > Properties for the CloudHub worker, define and set the following properties:

Property Name                 Value
java.security.krb5.kdc        <kdc server name>
java.security.krb5.realm      <kerberos realm>

The KDC server name and Kerberos realm values should be provided by the HDFS administrator for your organization.
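In the CloudHub console these are plain key/value properties. For example, with placeholder values (your administrator supplies the real ones):

java.security.krb5.kdc=kdc.example.com
java.security.krb5.realm=EXAMPLE.COM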

Connector Performance

To define the pooling profile for the connector manually, access the Pooling Profile tab in the applicable global element for the connector.

For background information on pooling, see Tuning Performance.

View on GitHub