Set Up Apache Hadoop HDFS

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It includes a distributed file system (HDFS) that provides high-throughput access to application data, the YARN framework for job scheduling and cluster resource management, and MapReduce for parallel processing of large data sets.


I installed hadoop-2.7.5.

Java 8 is required. I prefer to use a separately installed JDK:

# export JAVA_HOME=/opt/jdk1.8.0_151

Download Hadoop (the URL below points at the Apache archive; substitute a closer mirror if you prefer):

# curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz

Many manuals suggest creating a new system user just for Hadoop, but I will just use the root account.

Extract it into /opt:

# tar xfz hadoop-2.7.5.tar.gz -C /opt

If necessary, change ownership to root:

# chown -R root:root /opt/hadoop-2.7.5

I also prefer creating a symbolic link to the Hadoop directory:

# ln -s /opt/hadoop-2.7.5 /opt/hadoop



Important: Hadoop manages its components via SSH. You need to generate a public/private RSA key pair so that passphrase authentication will not be required:

# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test that SSH works as expected, i.e. that connecting to localhost does not prompt for a passphrase:
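
# ssh localhost
# exit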


Hadoop will run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
Configuration files for Hadoop are in the $HADOOP_HOME/etc/hadoop/ directory.

The first file to edit is core-site.xml. This file contains information about the port number used by the Hadoop instance, the memory allocated for the file system, the memory limit for the data store, and the size of the read/write buffers.

$ vi etc/hadoop/core-site.xml
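
A minimal configuration, following the Apache single-node setup docs and assuming HDFS on the default port 9000 of localhost:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>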


Next, open and edit the hdfs-site.xml file. This file defines the data replication factor and the namenode and datanode paths on the local file system. We’ll use the /opt/volume/ directory to store our Hadoop file system.

$ vi etc/hadoop/hdfs-site.xml

<!-- this will temporarily disable dfs permissions (for writing) -->
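
The comment belongs with the dfs.permissions property. A minimal sketch of the whole file, assuming a replication factor of 1 (appropriate for a single node) and using the /opt/volume/ paths created in the next step:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value> <!-- assumption: a single copy on a single node -->
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/volume/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/volume/datanode</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value> <!-- temporarily disable dfs permissions (for writing) -->
    </property>
</configuration>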

Because we’ve specified /opt/volume/ as our Hadoop file system storage, we need to create the two directories (namenode and datanode):

# mkdir -p /opt/volume/namenode
# mkdir -p /opt/volume/datanode

Next, edit the mapred-site.xml file to specify that we are using the YARN MapReduce framework. In Hadoop 2.7 this file does not ship by default, so create it from the provided template first:
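
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml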

$ vi etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
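<!-- a minimal sketch per the Apache single-node docs: select YARN as the MapReduce framework -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>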

Now, edit the yarn-site.xml file with the following:

$ vi etc/hadoop/yarn-site.xml
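
A minimal configuration, per the Apache single-node setup docs, enables the auxiliary shuffle service that MapReduce jobs need:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>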


Start Hadoop

Once the Hadoop single-node cluster has been set up, it’s time to initialize the HDFS file system by formatting the /opt/volume/namenode storage directory with the following command:

# /opt/hadoop/bin/hdfs namenode -format

The Hadoop service scripts are located in the $HADOOP_HOME/sbin directory. To start the Hadoop services, run the commands below in your console:
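
# /opt/hadoop/sbin/start-dfs.sh
# /opt/hadoop/sbin/start-yarn.sh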


Check the status of the services with the following command:

$ /opt/jdk1.8.0_151/bin/jps
3536 NodeManager
3090 DataNode
3443 ResourceManager
3252 SecondaryNameNode
2999 NameNode
3577 Jps

Alternatively, you can view the list of open sockets for Apache Hadoop on your system using the ss command:

$ ss -tul
$ ss -tuln # Numerical output

Check the Hadoop cluster GUI:
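
In Hadoop 2.x the NameNode web UI listens on port 50070 and the YARN ResourceManager UI on port 8088 by default:

http://localhost:50070/
http://localhost:8088/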

To test the Hadoop file system cluster, create a directory in the HDFS file system and copy a file from the local file system to HDFS storage (i.e., insert data into HDFS).

$ hdfs dfs -mkdir /my_storage
$ hdfs dfs -put LICENSE.txt /my_storage

To view a file’s contents or to list a directory inside the HDFS file system, issue the commands below:

$ hdfs dfs -cat /my_storage/LICENSE.txt
$ hdfs dfs -ls /my_storage/

To retrieve data from HDFS into our local file system, use the command below:

$ hdfs dfs -get /my_storage/ ./

To stop all Hadoop instances, run the commands below:
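
# /opt/hadoop/sbin/stop-yarn.sh
# /opt/hadoop/sbin/stop-dfs.sh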


Read on: using Hadoop HDFS from Java code.