Set Up Apache Hadoop HDFS

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It includes the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data; YARN, a framework for job scheduling and cluster resource management; and MapReduce, a system for parallel processing of large data sets.

Installation

I installed hadoop-2.7.5.

Java 8 is required. I prefer to use a separately installed JDK:

JAVA_HOME=/opt/jdk1.8.0_151
export JAVA_HOME

Download Hadoop:

# curl -O https://www.apache.si/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
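
Optionally, verify the integrity of the archive, for example by computing its SHA-256 hash and comparing it against the checksum published for the release on the Apache site:

# sha256sum hadoop-2.7.5.tar.gz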

Many manuals suggest creating a new system user just for Hadoop, but I will simply use the root account.

Extract into /opt:

# tar xfz hadoop-2.7.5.tar.gz -C /opt

If necessary, change the ownership to root:

# chown -R root:root /opt/hadoop-2.7.5

I also prefer creating a symbolic link to the Hadoop directory:

# ln -s /opt/hadoop-2.7.5 /opt/hadoop

Set HADOOP_HOME:

HADOOP_HOME=/opt/hadoop
export HADOOP_HOME
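
The service scripts and the hdfs command used below live in $HADOOP_HOME/sbin and $HADOOP_HOME/bin respectively, so it is convenient to put both on PATH:

PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH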

Important: Hadoop manages its daemons over SSH. You need to generate a public/private RSA key pair and authorize it, so that no passphrase will be required:

# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
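
Depending on your umask, you may also need to restrict the permissions of the authorized_keys file, otherwise sshd may ignore it:

# chmod 0600 ~/.ssh/authorized_keys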

Test that SSH works as expected, i.e. without asking for a passphrase, by logging in to localhost:
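
# ssh localhost
# exit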

Configuration

Hadoop will run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
Configuration files for Hadoop are in the $HADOOP_HOME/etc/hadoop/ directory.
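
Note: the daemons are started over SSH and may not inherit your shell's environment. If startup fails because JAVA_HOME is not set, set it explicitly in etc/hadoop/hadoop-env.sh as well (the path below matches the JDK installed earlier):

export JAVA_HOME=/opt/jdk1.8.0_151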

The first file to edit is core-site.xml. In this minimal setup it only defines fs.defaultFS, the URI (host name and port) of the default file system, i.e. of the NameNode.

$ vi etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://orion.home.si:9000/</value>
  </property>
</configuration>

Next, open and edit the hdfs-site.xml file. It defines the replication factor and the local file system paths where the NameNode and DataNode keep their data. We’ll use the /opt/volume/ directory to store the Hadoop file system.

$ vi etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/volume/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/volume/namenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- this temporarily disables HDFS permission checking (for writing) -->
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>

Because we’ve specified /opt/volume/ as the Hadoop file system storage, we need to create those two directories (datanode and namenode):

# mkdir -p /opt/volume/namenode
# mkdir -p /opt/volume/datanode

Next, edit the mapred-site.xml file to specify that we are using the YARN MapReduce framework.
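
In Hadoop 2.7.x this file does not exist out of the box; create it from the bundled template first:

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml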

$ vi etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Now, edit the yarn-site.xml file as below:

$ vi etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Start Hadoop

Once the Hadoop single-node cluster has been set up, it’s time to initialize the HDFS file system by formatting the /opt/volume/namenode storage directory with the following command (a one-time step; re-formatting erases the NameNode metadata):

# /opt/hadoop/bin/hdfs namenode -format

The Hadoop service scripts are located in the $HADOOP_HOME/sbin directory. To start the Hadoop services, run the commands below on your console:

# start-dfs.sh
# start-yarn.sh

Check the status of the running daemons with the jps tool that ships with the JDK:

$ /opt/jdk1.8.0_151/bin/jps
3536 NodeManager
3090 DataNode
3443 ResourceManager
3252 SecondaryNameNode
2999 NameNode
3577 Jps

Alternatively, you can check which ports Apache Hadoop is listening on using the ss command:

$ ss -tul
$ ss -tuln # numeric output (no name resolution)
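
For example, to check the two ports used in this setup (9000, the NameNode RPC port from core-site.xml, and 50070, the default port of the NameNode web UI):

$ ss -tuln | grep -E ':(9000|50070)'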

Check the NameNode web UI (served on port 50070 by default; replace the IP address with your server’s):

http://192.168.1.115:50070/
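
The YARN ResourceManager web UI is served on port 8088 by default:

http://192.168.1.115:8088/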

To test the Hadoop file system, create a directory in HDFS and copy a file from the local file system into it (insert data into HDFS):

$ hdfs dfs -mkdir /my_storage
$ hdfs dfs -put LICENSE.txt /my_storage

To view the content of a file or to list a directory inside the HDFS file system, issue the commands below:

$ hdfs dfs -cat /my_storage/LICENSE.txt
$ hdfs dfs -ls /my_storage/

To retrieve data from HDFS to the local file system, use the command below:

$ hdfs dfs -get /my_storage/ ./
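
To remove the test data from HDFS afterwards:

$ hdfs dfs -rm -r /my_storage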

To stop all Hadoop daemons, run the commands below:

$ stop-yarn.sh
$ stop-dfs.sh

Read on: using Hadoop HDFS from Java code.