Welcome to our guide on installing Hadoop in five simple steps.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. (Source: Apache Hadoop documentation)
At the time of writing, the Apache Hadoop project consists of the following modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- YARN: A framework for job scheduling and cluster resource management.
- MapReduce: A YARN-based system for parallel processing of large data sets.
In this tutorial, I am going to install Hadoop 2.9.1 on a single node running Ubuntu.
Step 1: Setting up the instance on CloudSigma
I am using a machine with the following resources:
- 12 GHz CPU
- 16 GB RAM
- 100 GB SSD
I am cloning Ubuntu 18.04 from the library and resizing the drive to 100 GB. The Ubuntu 18.04 image in the library comes with VirtIO drivers, Python 3 and Python 2.7.15, Pip 10.0.1, OpenSSL 1.1.0g, and the latest updates as of 2018-06-11.
I am creating a password for the default user, cloudsigma, using the following command:
sudo passwd cloudsigma
To configure SSH, I run the following commands to generate SSH keys and copy them to localhost:
ssh-keygen
ssh-copy-id cloudsigma@localhost
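To confirm that passwordless SSH works, a quick check (the first connection may ask you to accept the host key):
# Should run without prompting for a password
ssh localhost 'echo SSH to localhost works'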
Step 2: Installing Prerequisites
On the server, I will first update the package list and then upgrade the already installed packages. This ensures we get the latest versions of all installed software.
sudo apt update && sudo apt upgrade
For Hadoop to be installed, the Java, SSH, and rsync packages are required. Once these are in place and all software packages are at their latest versions, we can proceed with the rest of the process.
sudo apt install openjdk-8-jdk-headless/bionic-updates -y
sudo apt install ssh
sudo apt install rsync
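To confirm the installations, you can check the versions of the three tools (a quick sanity check):
# Verify that Java, SSH, and rsync are available
java -version
ssh -V
rsync --version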
Now, I am going to set the JAVA_HOME directory. In my case it is /usr/lib/jvm/java-8-openjdk-amd64/jre.
To find the JAVA_HOME directory, enter the command:
which java
It gives me /usr/bin/java as a result. However, this is a symlink rather than the actual location of the Java installation. Next, I enter the following command:
ls -l `which java`
This command shows where /usr/bin/java points. I get the following result:
lrwxrwxrwx 1 root root 22 Sep 13 13:19 /usr/bin/java -> /etc/alternatives/java
This shows that /usr/bin/java points to /etc/alternatives/java. I will now run the following command:
ls -l /etc/alternatives/java
The above command gives the following result:
lrwxrwxrwx 1 root root 46 Sep 13 13:19 /etc/alternatives/java -> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
This shows that /etc/alternatives/java points to /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java. Strip the trailing /bin/java from that path; the remainder, /usr/lib/jvm/java-8-openjdk-amd64/jre, is my JAVA_HOME directory.
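As a shortcut (my own convenience, not part of the original walkthrough), the symlink chain can be resolved and trimmed in one step:
# Resolve /usr/bin/java through all symlinks and strip the trailing /bin/java
readlink -f $(which java) | sed 's|/bin/java$||'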
To set JAVA_HOME, use the following command:
echo 'export JAVA_HOME=YOUR_JAVA_HOME' >> ~/.bashrc
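Replace YOUR_JAVA_HOME with the path found above. To make the change take effect in the current shell and confirm it (a quick check):
# Reload the shell configuration and print the value
source ~/.bashrc
echo $JAVA_HOME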
Step 3: Downloading Hadoop
To download Hadoop, go to the Apache Hadoop releases page. I am choosing an HTTP mirror and then selecting the stable folder for a stable release. Under the stable folder, version 2.9.1 is available. I will copy the link for hadoop-2.9.1.tar.gz (the binary release), not the source archive.
Now, on my instance, I will download the entire file using the following command:
wget http://mirrors.fibergrid.in/apache/hadoop/common/stable/hadoop-2.9.1.tar.gz
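Optionally, you can verify the download against the checksum published on the Apache release page (shown as a sketch; compare the output manually with the published value):
# Compute the archive's SHA-256 hash for comparison with the official checksum
sha256sum hadoop-2.9.1.tar.gz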
Now that I have downloaded the archive, I will extract it and change into the new directory:
tar -xvf hadoop-2.9.1.tar.gz
cd hadoop-2.9.1/
Next, I set the Hadoop home directory. In my case, /home/cloudsigma/hadoop-2.9.1 is where I have extracted the files:
echo 'export HADOOP_HOME=/home/cloudsigma/hadoop-2.9.1' >> ~/.bashrc
Step 4: Hadoop Configurations
Now that HADOOP_HOME is set, I am going to make some configuration changes.
Firstly, I will add the fs.defaultFS property in HADOOP_HOME/etc/hadoop/core-site.xml.
The final file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
This property provides the default HDFS address for our dfs commands. If it is not specified, we would need to give the full HDFS address in every dfs command.
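For illustration (these commands will only work once HDFS is running in Step 5), the two forms below are equivalent when fs.defaultFS is set to hdfs://localhost:9000:
# Fully qualified URI, needed if fs.defaultFS were not set
hdfs dfs -ls hdfs://localhost:9000/
# Short form, resolved against fs.defaultFS
hdfs dfs -ls /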
Further, I am creating two folders, namenode and datanode, to serve as the respective storage directories.
mkdir /home/cloudsigma/namenode
mkdir /home/cloudsigma/datanode
Next, I am going to edit HADOOP_HOME/etc/hadoop/hdfs-site.xml, adding the replication factor, the namenode directory, and the datanode directory.
The final file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/cloudsigma/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/cloudsigma/datanode</value>
  </property>
</configuration>
The replication factor controls how many copies of each data block HDFS stores. I have specified 1, which means only a single copy is kept and no additional replicas are made.
dfs.namenode.name.dir determines where the DFS name node should store the name table (fsimage) on the local filesystem.
dfs.datanode.data.dir determines where the DFS data node should store its blocks on the local filesystem.
In HADOOP_HOME/etc/hadoop/hadoop-env.sh, I am hard-coding JAVA_HOME, since Hadoop sometimes fails to pick this value up from the local session. I will edit the line:
export JAVA_HOME=${JAVA_HOME}
to make it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
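If you prefer making this edit from the shell instead of a text editor, a sed one-liner like the following should work (a convenience sketch; adjust the path to your own JAVA_HOME):
# Replace the JAVA_HOME line in hadoop-env.sh in place
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/|' \
  /home/cloudsigma/hadoop-2.9.1/etc/hadoop/hadoop-env.sh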
To make the hdfs and hadoop commands easily accessible from anywhere, I am adding their directories to the PATH variable:
echo 'export PATH=$PATH:/home/cloudsigma/hadoop-2.9.1/bin' >> ~/.bashrc
echo 'export PATH=$PATH:/home/cloudsigma/hadoop-2.9.1/sbin' >> ~/.bashrc
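Reload ~/.bashrc and confirm the commands are found (a quick sanity check):
# Pick up the new PATH entries and locate the Hadoop binaries
source ~/.bashrc
which hdfs hadoop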
Since we will be starting HDFS for the first time, I am formatting the namenode:
hdfs namenode -format
However, I am getting this error while running the command:
18/09/13 14:22:39 WARN net.DNS: Unable to determine address of the host-falling back to “localhost” address
java.net.UnknownHostException: rev-xxx.xxx.xx.xxx-static.atman.pl: rev-xxx.xxx.xx.xxx-static.atman.pl: Name or service not known
To resolve this issue, add a line to /etc/hosts (with sudo permissions) in the format:
IP_Address rev-xxx.xxx.xx.xxx-static.atman.pl
xxx.xxx.xx.xxx rev-xxx.xxx.xx.xxx-static.atman.pl
This resolves the issue. Now, run the namenode format command again.
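To confirm the new entry resolves before re-running the format (a quick check; replace the placeholder with your actual hostname):
# Look up the hostname you just added to /etc/hosts
getent hosts rev-xxx.xxx.xx.xxx-static.atman.pl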
Step 5: Starting Hadoop
Firstly, from anywhere on the machine, enter the command:
start-dfs.sh
The output will be like this:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/cloudsigma/hadoop-2.9.1/logs/hadoop-cloudsigma-namenode-Hadoop5.out
localhost: starting datanode, logging to /home/cloudsigma/hadoop-2.9.1/logs/hadoop-cloudsigma-datanode-Hadoop5.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/cloudsigma/hadoop-2.9.1/logs/hadoop-cloudsigma-secondarynamenode-Hadoop5.out
Now, we will start YARN using the command:
start-yarn.sh
The output will be like this:
starting yarn daemons
starting resourcemanager, logging to /home/cloudsigma/hadoop-2.9.1/logs/yarn-cloudsigma-resourcemanager-Hadoop5.out
localhost: starting nodemanager, logging to /home/cloudsigma/hadoop-2.9.1/logs/yarn-cloudsigma-nodemanager-Hadoop5.out
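To verify that all the daemons came up, the jps tool that ships with the JDK lists the running Java processes; you should see entries such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (the exact process IDs will differ):
# List running Java processes by name
jps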
Secondly, we can check the installation using some hdfs commands:
hdfs dfs -mkdir /my-first-folder
hdfs dfs -ls /
It will give us an output like this:
Found 1 items
drwxr-xr-x   - cloudsigma supergroup          0 2018-09-21 03:59 /my-first-folder
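If you want to exercise the filesystem a little further (an optional extra, not part of the original steps), you can copy a local file into the new folder and read it back:
# Copy a local file into HDFS and print its contents
hdfs dfs -put /etc/hosts /my-first-folder/
hdfs dfs -cat /my-first-folder/hosts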
Finally, we’ve reached the end of this tutorial. Hadoop is now installed and fully operational!