What is Apache Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It includes HDFS, a distributed file system that provides high-throughput access to application data; YARN, a framework for job scheduling and cluster resource management; and MapReduce, a system for parallel processing of large data sets.
I installed hadoop-2.7.5. Java 8 is required; I prefer to use a separately installed JDK.
Download the release from an Apache mirror:
# curl -O https://www.apache.si/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
Many manuals suggest creating a new system user just for Hadoop, but I will simply use the root account.
Unpack into /opt:
# tar xfz hadoop-2.7.5.tar.gz -C /opt
If necessary, change ownership to root:
# chown -R root:root /opt/hadoop-2.7.5
I also prefer creating a symbolic link to the versioned directory:
# ln -s /opt/hadoop-2.7.5 /opt/hadoop
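With the symlink in place, it is convenient to export the usual Hadoop variables so the bin/ and sbin/ tools are on the PATH. A sketch for a Bash profile (the variable names are the conventional ones, not something the installer sets up for you):

```shell
# Append Hadoop environment variables to the shell profile;
# paths assume the /opt/hadoop symlink created above
cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
. ~/.bashrc
```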
Important: Hadoop manages its components via SSH. You need to generate a public/private RSA key pair so that passphrase authentication will not be required:
# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test that SSH works as expected, i.e. without a passphrase prompt:
Hadoop will be run on a single-node in a pseudo-distributed mode where each
Hadoop daemon runs in a separate Java process.
Configuration files for Hadoop are in $HADOOP_HOME/etc/hadoop/ directory.
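Since we use a separately installed JDK, it is worth pointing Hadoop at it explicitly in hadoop-env.sh before touching the XML files. The JDK path below is an assumption; adjust it to your installation:

```shell
# Set JAVA_HOME in hadoop-env.sh
# (the JDK path is an example -- replace it with your own JDK location)
echo 'export JAVA_HOME=/usr/java/jdk1.8.0_152' >> /opt/hadoop/etc/hadoop/hadoop-env.sh
```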
The first file to edit is core-site.xml. This file contains information such as the port number used by the Hadoop instance, the memory allocated for the file system, the memory limit for the data store, and the size of read/write buffers.
$ vi etc/hadoop/core-site.xml
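A minimal body for this setup might look like the following; the address 192.168.1.115 and port 9000 are assumptions based on the host used later in this article (on a single node, localhost works equally well):

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.1.115:9000</value>
        <description>NameNode URI; use your host address or localhost</description>
    </property>
</configuration>
```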
Next, open and edit the hdfs-site.xml file. The file contains the replication factor and the namenode and datanode paths on the local file system. We’ll use the /opt/volume/ directory to store our Hadoop file system.
$ vi etc/hadoop/hdfs-site.xml
<!-- this will temporarily disable dfs permissions (for writing) -->
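As a sketch, assuming /opt/volume/ as storage (the directories are created below) and a replication factor of 1 for a single node; the last property is the temporary permissions bypass the comment above refers to:

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/volume/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/volume/datanode</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>
```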
Because we’ve specified /opt/volume/ as our Hadoop file system storage, we need to create the two directories (datanode and namenode):
# mkdir -p /opt/volume/namenode
# mkdir -p /opt/volume/datanode
Next, edit the mapred-site.xml file to specify that we are using the YARN MapReduce framework.
$ vi etc/hadoop/mapred-site.xml
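In Hadoop 2.7 this file ships only as mapred-site.xml.template, so copy it first (cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml). A minimal body:

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```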
Now edit the yarn-site.xml file:
$ vi etc/hadoop/yarn-site.xml
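A minimal body that enables the shuffle service MapReduce jobs need:

```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```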
Once the Hadoop single-node cluster has been set up, it’s time to initialize the HDFS file system by formatting the /opt/volume/namenode storage directory with the following command:
# /opt/hadoop/bin/hdfs namenode -format
The Hadoop service scripts are located in the $HADOOP_HOME/sbin directory. In order to start the Hadoop services, run the below commands on your console:
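Assuming the /opt/hadoop symlink created earlier, the two start scripts are:

```shell
# Start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
/opt/hadoop/sbin/start-dfs.sh
# Start the YARN daemons (ResourceManager, NodeManager)
/opt/hadoop/sbin/start-yarn.sh
```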
Check the services status with the following command:
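One simple check is jps, which ships with the JDK and lists running JVM processes; on a healthy pseudo-distributed node you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager entries:

```shell
# List running JVM processes (jps ships with the JDK)
jps
```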
Alternatively, you can view a list of all open sockets for Apache Hadoop on your system using the ss command:
$ ss -tul
$ ss -tuln # Numerical output
Check the Hadoop cluster GUI (the NameNode web UI): http://192.168.1.115:50070/
To test the Hadoop file system cluster, create a directory in HDFS and copy a file from the local file system into HDFS storage (insert data into HDFS):
$ hdfs dfs -mkdir /my_storage
$ hdfs dfs -put LICENSE.txt /my_storage
To view the contents of a file or list a directory inside the HDFS file system, issue the below commands:
$ hdfs dfs -cat /my_storage/LICENSE.txt
$ hdfs dfs -ls /my_storage/
To retrieve data from HDFS to our local file system use the below command:
$ hdfs dfs -get /my_storage/ ./
To stop all Hadoop instances, run the below commands:
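Assuming the same /opt/hadoop symlink, stop YARN first and then HDFS:

```shell
# Stop the YARN daemons
/opt/hadoop/sbin/stop-yarn.sh
# Stop the HDFS daemons
/opt/hadoop/sbin/stop-dfs.sh
```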