Setup Apache Hadoop HDFS
What is Apache Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It includes HDFS, a distributed file system that provides high-throughput access to application data; YARN, a framework for job scheduling and cluster resource management; and MapReduce, for parallel processing of large data sets.
Installation
I installed hadoop-2.7.5.
Java 8 is required. I prefer to use a separately installed JDK:
JAVA_HOME=/opt/jdk1.8.0_151
export JAVA_HOME
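A quick sanity check that the variable points to a Java 8 installation:
# $JAVA_HOME/bin/java -version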
Download Hadoop:
# curl -O https://www.apache.si/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
Many manuals suggest creating a new system user just for Hadoop, but I will just use the root account.
Unpack the archive into /opt:
# tar xfz hadoop-2.7.5.tar.gz -C /opt
If necessary, change ownership to root:
# chown -R root:root /opt/hadoop-2.7.5
I also prefer creating a symbolic link to the Hadoop directory:
# ln -s /opt/hadoop-2.7.5 /opt/hadoop
Set HADOOP_HOME:
HADOOP_HOME=/opt/hadoop
export HADOOP_HOME
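It is also handy to put the Hadoop bin and sbin directories on the PATH, so that the hdfs command and the start/stop scripts used later can be called without the full path:
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH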
Important: Hadoop manages its components via SSH. You need to generate a public/private RSA key pair so that SSH logins do not prompt for a passphrase:
# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test that SSH works as expected, i.e. without prompting for a passphrase.
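For example, logging in to the local machine should succeed without any prompt:
# ssh localhost
# exit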
Configuration
Hadoop will run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
Configuration files for Hadoop are in the $HADOOP_HOME/etc/hadoop/ directory.
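Hadoop's own scripts read JAVA_HOME from etc/hadoop/hadoop-env.sh, and since the daemons are started over SSH they do not necessarily inherit the shell environment, so it is safer to also set the JDK path from above there:
$ vi etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_151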
The next file to edit is core-site.xml. It defines core settings such as the default file system URI (the host name and port the NameNode listens on) and the size of Read/Write buffers.
$ vi etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://orion.home.si:9000/</value>
    </property>
</configuration>
Next, open and edit the hdfs-site.xml file. It defines the data replication factor and the NameNode and DataNode directories on the local file system. We’ll use the /opt/volume/ directory to store our Hadoop file system data.
$ vi etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///opt/volume/datanode</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///opt/volume/namenode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- this will temporarily disable DFS permission checking (for writing) -->
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
Because we’ve specified /opt/volume/ as our Hadoop file system storage, we need to create those two directories (datanode and namenode):
# mkdir -p /opt/volume/namenode
# mkdir -p /opt/volume/datanode
Next, edit the mapred-site.xml file to specify that we are using the YARN MapReduce framework.
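In Hadoop 2.7.x this file usually ships only as a template, so create it from the template first:
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml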
$ vi etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Now, edit the yarn-site.xml file with the following:
$ vi etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Start Hadoop
Once the Hadoop single-node cluster has been set up, it’s time to initialize the HDFS file system by formatting the /opt/volume/namenode storage directory with the following command:
# /opt/hadoop/bin/hdfs namenode -format
The Hadoop start/stop scripts are located in the $HADOOP_HOME/sbin directory; with sbin on the PATH they can be called directly. To start the Hadoop services, run the commands below on your console:
# start-dfs.sh
# start-yarn.sh
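If one of the daemons fails to come up, check its log file; by default the logs are written to the $HADOOP_HOME/logs directory:
$ ls $HADOOP_HOME/logs/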
Check the status of the services with the following command:
$ /opt/jdk1.8.0_151/bin/jps
3536 NodeManager
3090 DataNode
3443 ResourceManager
3252 SecondaryNameNode
2999 NameNode
3577 Jps
Alternatively, you can view a list of all open sockets for Apache Hadoop on your system using the ss command:
$ ss -tul
$ ss -tuln # Numerical output
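For example, to check only the ports relevant to this setup, assuming the defaults of 50070 for the NameNode web UI and 8088 for the YARN ResourceManager web UI, plus port 9000 from core-site.xml:
$ ss -tuln | grep -E ':(9000|50070|8088)'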
Check the Hadoop cluster GUI:
http://192.168.1.115:50070/
To test the Hadoop file system cluster, create a directory in the HDFS file system and copy a file from the local file system to HDFS storage (insert data into HDFS):
$ hdfs dfs -mkdir /my_storage
$ hdfs dfs -put LICENSE.txt /my_storage
To view a file’s content or to list a directory inside the HDFS file system, issue the commands below:
$ hdfs dfs -cat /my_storage/LICENSE.txt
$ hdfs dfs -ls /my_storage/
To retrieve data from HDFS to our local file system, use the command below:
$ hdfs dfs -get /my_storage/ ./
To stop all Hadoop instances, run the commands below:
$ stop-yarn.sh
$ stop-dfs.sh