Welcome to our guide on installing Hadoop in five simple steps.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. (Source: Apache Hadoop documentation)
At the time of writing, the Apache Hadoop project consists of the following modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- YARN: A framework for job scheduling and cluster resource management.
- MapReduce: A YARN-based system for parallel processing of large data sets.
In this tutorial, I am going to install Hadoop 2.9.1 on a single node running Ubuntu.
Step 1: Setting up the instance on CloudSigma
I am using a machine with the following resources:
- 12 GHz CPU
- 16 GB RAM
- 100 GB SSD
I am cloning Ubuntu 18.04 from the library and resizing the drive to 100 GB. The Ubuntu 18.04 image in the library comes with VirtIO drivers, Python 3 and Python 2.7.15, Pip 10.0.1, OpenSSL 1.1.0g, and the latest updates as of 2018-06-11.
I am creating a password for the default user, cloudsigma, using the following command:
sudo passwd cloudsigma
To configure SSH, I run the following commands to generate SSH keys and copy them to localhost:
ssh-keygen
ssh-copy-id cloudsigma@localhost
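To confirm that passwordless SSH works, a quick check (the first connection may ask you to accept the host key):
# Should run without prompting for a password
ssh localhost 'echo SSH to localhost works'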
Step 2: Installing Prerequisites
On the server, I will first update the package list and then upgrade the already installed packages. This ensures we get the latest versions of all installed software.
sudo apt update && sudo apt upgrade
For Hadoop to be installed, the Java, SSH, and rsync packages are required. Once these are in place and all software packages are at their latest versions, we can proceed with the rest of the process.
sudo apt install openjdk-8-jdk-headless/bionic-updates -y
sudo apt install ssh
sudo apt install rsync
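To confirm the installations, you can check the versions of the three tools (a quick sanity check):
# Verify that Java, SSH, and rsync are available
java -version
ssh -V
rsync --version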
Now, I am going to set the JAVA_HOME directory. In my case it is /usr/lib/jvm/java-8-openjdk-amd64/jre.
To find the JAVA_HOME directory, enter the command:
which java
It gives me /usr/bin/java as a result. However, this is a symlink rather than the actual location of the Java installation. Next, I enter the following command:
ls -l `which java`
This command shows where /usr/bin/java points. I get the following result:
lrwxrwxrwx 1 root root 22 Sep 13 13:19 /usr/bin/java -> /etc/alternatives/java
This shows that /usr/bin/java points to /etc/alternatives/java. I will now run the following command:
ls -l /etc/alternatives/java
The above command gives the following result:
lrwxrwxrwx 1 root root 46 Sep 13 13:19 /etc/alternatives/java -> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
This shows that /etc/alternatives/java points to /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java. Strip the trailing /bin/java from that path; the remainder, /usr/lib/jvm/java-8-openjdk-amd64/jre, is my JAVA_HOME directory.
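As a shortcut (my own convenience, not part of the original walkthrough), the symlink chain can be resolved and trimmed in one step:
# Resolve /usr/bin/java through all symlinks and strip the trailing /bin/java
readlink -f $(which java) | sed 's|/bin/java$||'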
To set JAVA_HOME, use the following command:
echo 'export JAVA_HOME=YOUR_JAVA_HOME' >> ~/.bashrc
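Replace YOUR_JAVA_HOME with the path found above. To make the change take effect in the current shell and confirm it (a quick check):
# Reload the shell configuration and print the value
source ~/.bashrc
echo $JAVA_HOME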
Step 3: Downloading Hadoop
To download Hadoop, go to the Apache Hadoop releases page. I am choosing an HTTP mirror and then selecting the stable folder for a stable release. Under the stable folder, version 2.9.1 is available. I will copy the link for hadoop-2.9.1.tar.gz (the binary release), not the source archive.
Now, on my instance, I will download the entire file using the following command:
wget http://mirrors.fibergrid.in/apache/hadoop/common/stable/hadoop-2.9.1.tar.gz
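Optionally, you can verify the download against the checksum published on the Apache release page (shown as a sketch; compare the output manually with the published value):
# Compute the archive's SHA-256 hash for comparison with the official checksum
sha256sum hadoop-2.9.1.tar.gz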
Now that I have downloaded the archive, I will extract it and change into the new directory:
tar -xvf hadoop-2.9.1.tar.gz
cd hadoop-2.9.1/
Next, I set the Hadoop home directory. In my case, /home/cloudsigma/hadoop-2.9.1 is where I have extracted the files:
echo 'export HADOOP_HOME=/home/cloudsigma/hadoop-2.9.1' >> ~/.bashrc
Step 4: Hadoop Configurations
Now that HADOOP_HOME is set, I am going to make some configuration changes.
Firstly, I will add the fs.defaultFS property in HADOOP_HOME/etc/hadoop/core-site.xml.
The final file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
This property provides the default HDFS address for our dfs commands. If it is not specified, we would need to give the full HDFS address in every dfs command.
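For illustration (these commands will only work once HDFS is running in Step 5), the two forms below are equivalent when fs.defaultFS is set to hdfs://localhost:9000:
# Fully qualified URI, needed if fs.defaultFS were not set
hdfs dfs -ls hdfs://localhost:9000/
# Short form, resolved against fs.defaultFS
hdfs dfs -ls /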
Further, I am creating two folders, namenode and datanode, to serve as the respective storage directories.
mkdir /home/cloudsigma/namenode
mkdir /home/cloudsigma/datanode
Next, I am going to edit HADOOP_HOME/etc/hadoop/hdfs-site.xml, adding the replication factor, the namenode directory, and the datanode directory.
The final file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/cloudsigma/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/cloudsigma/datanode</value>
  </property>
</configuration>
The replication factor controls how many copies of each data block HDFS stores. I have specified 1, which means only a single copy is kept and no additional replicas are made.
dfs.namenode.name.dir determines where the DFS name node should store the name table (fsimage) on the local filesystem.
dfs.datanode.data.dir determines where the DFS data node should store its blocks on the local filesystem.
In HADOOP_HOME/etc/hadoop/hadoop-env.sh, I am hard-coding JAVA_HOME, since Hadoop sometimes fails to pick this value up from the local session. I will edit the line:
export JAVA_HOME=${JAVA_HOME}
to make it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
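If you prefer making this edit from the shell instead of a text editor, a sed one-liner like the following should work (a convenience sketch; adjust the path to your own JAVA_HOME):
# Replace the JAVA_HOME line in hadoop-env.sh in place
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/|' \
  /home/cloudsigma/hadoop-2.9.1/etc/hadoop/hadoop-env.sh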
To make the hdfs and hadoop commands easily accessible from anywhere, I am adding their directories to the PATH variable:
echo 'export PATH=$PATH:/home/cloudsigma/hadoop-2.9.1/bin' >> ~/.bashrc
echo 'export PATH=$PATH:/home/cloudsigma/hadoop-2.9.1/sbin' >> ~/.bashrc
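Reload ~/.bashrc and confirm the commands are found (a quick sanity check):
# Pick up the new PATH entries and locate the Hadoop binaries
source ~/.bashrc
which hdfs hadoop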
Since we will be starting HDFS for the first time, I am formatting the namenode:
hdfs namenode -format
However, I am getting this error while running the command:
18/09/13 14:22:39 WARN net.DNS: Unable to determine address of the host-falling back to “localhost” address
java.net.UnknownHostException: rev-xxx.xxx.xx.xxx-static.atman.pl: rev-xxx.xxx.xx.xxx-static.atman.pl: Name or service not known
To resolve this issue, add a line to /etc/hosts (with sudo permissions) in the format:
IP_Address rev-xxx.xxx.xx.xxx-static.atman.pl
xxx.xxx.xx.xxx rev-xxx.xxx.xx.xxx-static.atman.pl
This resolves the issue. Now, run the namenode format command again.
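To confirm the new entry resolves before re-running the format (a quick check; replace the placeholder with your actual hostname):
# Look up the hostname you just added to /etc/hosts
getent hosts rev-xxx.xxx.xx.xxx-static.atman.pl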
Step 5: Starting Hadoop
Firstly, from anywhere on the machine, enter the command:
start-dfs.sh
The output will be like this:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/cloudsigma/hadoop-2.9.1/logs/hadoop-cloudsigma-namenode-Hadoop5.out
localhost: starting datanode, logging to /home/cloudsigma/hadoop-2.9.1/logs/hadoop-cloudsigma-datanode-Hadoop5.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/cloudsigma/hadoop-2.9.1/logs/hadoop-cloudsigma-secondarynamenode-Hadoop5.out
Now, we will start YARN using the command:
start-yarn.sh
The output will be like this:
starting yarn daemons
starting resourcemanager, logging to /home/cloudsigma/hadoop-2.9.1/logs/yarn-cloudsigma-resourcemanager-Hadoop5.out
localhost: starting nodemanager, logging to /home/cloudsigma/hadoop-2.9.1/logs/yarn-cloudsigma-nodemanager-Hadoop5.out
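To verify that all the daemons came up, the jps tool that ships with the JDK lists the running Java processes; you should see entries such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (the exact process IDs will differ):
# List running Java processes by name
jps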
Secondly, we can check the installation using some hdfs commands:
hdfs dfs -mkdir /my-first-folder
hdfs dfs -ls /
It will give us an output like this:
Found 1 items
drwxr-xr-x   - cloudsigma supergroup          0 2018-09-21 03:59 /my-first-folder
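If you want to exercise the filesystem a little further (an optional extra, not part of the original steps), you can copy a local file into the new folder and read it back:
# Copy a local file into HDFS and print its contents
hdfs dfs -put /etc/hosts /my-first-folder/
hdfs dfs -cat /my-first-folder/hosts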
Finally, we’ve reached the end of this tutorial. Hadoop is now installed and fully operational!