Wednesday, 17 June 2015

Steps to Install Apache Hadoop Cluster 2.0 with Single Node Configuration



Apache Hadoop 2.0 is part of the Apache Software Foundation (ASF). This article provides a step-by-step setup of a single-node Apache Hadoop 2.0 cluster on an Ubuntu Virtual Machine (VM).

Step-1 [Download and install the VMware player]

Download VMware Player from the link below and install it.



Step-2 [Download an Ubuntu image and create an Ubuntu VMPlayer Instance]

Access the following link and download the Ubuntu 12.04 image:


Extract the Ubuntu VM image and open it in VMware Player. Click "Open a Virtual Machine", browse to the path where the image was extracted, select the ".vmx" file, and click "OK".




After the VM image creation completes, we can see the screen below in VMware Player. Double-click the VM entry to start the machine; it boots to the Ubuntu home screen.

The user details for the Virtual instance are:

Username : user
Password  : password

Open the Terminal to access the File System:


The first task is to run "apt-get update" to download the package lists from the repositories, so that the system has information on the newest versions of packages and their dependencies.

>> $sudo apt-get update

Then use apt-get to install JDK 6 on the server.

>> $sudo apt-get install openjdk-6-jdk

Check Java Version: 

>> $java -version

Step-3 [Download and install the Apache Hadoop 2.0 binaries]

Download the binaries to your home directory. Use the default user "user" for the installation.

In live production environments, a dedicated Hadoop user account is used to run Hadoop. This is recommended because it separates the Hadoop installation from other software applications and user accounts running on the same machine (for security, permissions, backups, etc.).

>> $wget


Unzip the files and review the package content and configuration files.

>> $tar -xvf hadoop-2.2.0.tar.gz

After unzip, Hadoop package content looks like below:
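The original post shows the package content as a screenshot; as a sketch, the top-level layout of the hadoop-2.2.0 tarball typically looks like this:

```shell
$ ls hadoop-2.2.0
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share
```

The configuration files referenced in the following steps live under etc/hadoop, and the daemon start/stop scripts under sbin.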


Step-4 [Configure Apache Hadoop 2.0 single node server]

4.1 Update “.bashrc” file for user ‘user’:

Move to the $HOME directory of user ‘user’ and edit the ‘.bashrc’ file.

Location of the .bashrc file: /home/user/.bashrc



Update the ‘.bashrc’ file to add important Apache Hadoop environment variables for user.

Change directory to home

>> $ cd

Edit the file 

>> $ vi .bashrc

Edit the .bashrc file as below, based on the directories where Hadoop and Java are installed.
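The original post shows the edited file as an image. A minimal sketch of the added lines, assuming Hadoop 2.2.0 was extracted into the home directory of user 'user' and OpenJDK 6 was installed as in Step 2:

```shell
# Hadoop and Java environment variables (paths assume the layout used in this article)
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export HADOOP_HOME=$HOME/hadoop-2.2.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

# Put the Hadoop command-line tools and daemon scripts on the PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```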


Source the .bashrc file to set the Hadoop environment variables without having to invoke a new shell:

>> $. ~/.bashrc

4.2 Configure JAVA_HOME:

Configure JAVA_HOME in ‘hadoop-env.sh’. This file specifies environment variables that affect the JDK used by Apache Hadoop 2.0 daemons started by the Hadoop start-up scripts.

>> $cd $HADOOP_CONF_DIR

>> $vi hadoop-env.sh

Update the JAVA_HOME to:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

4.3 Create NameNode and DataNode directory:

Create DataNode and NameNode directories to store HDFS data.

$ mkdir -p $HOME/hadoop2_data/hdfs/namenode
$ mkdir -p $HOME/hadoop2_data/hdfs/datanode

4.4 Configure the Default File system:

The "core-site.xml" file contains the configuration settings for Apache Hadoop Core such as I/O settings that are common to HDFS, YARN and MapReduce. 

Configure the default file system (parameter: fs.default.name) used by clients in core-site.xml.

>> $cd $HADOOP_CONF_DIR

>> $vi core-site.xml

Update the content as below:
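The original post shows the file as an image. A minimal single-node configuration would look like the following, where the hostname "localhost" and port 9000 are assumptions that should match your machine:

```xml
<configuration>
  <!-- Default file system URI used by HDFS clients; also the address
       the NameNode binds to and listens on -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```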



Here the hostname and port are those on which the NameNode daemon runs and listens. The setting also tells the NameNode which IP and port to bind to. Port 9000 is commonly used, and an IP address can be specified instead of a hostname.

4.5 Configure the HDFS:

This file contains the configuration settings for HDFS daemons:
  • Name Node 
  • Data nodes

Configure hdfs-site.xml to specify the default block replication and the NameNode and DataNode directories for HDFS. The actual number of replications can be specified when a file is created; the default applies when replication is not specified at create time.

>> $cd $HADOOP_CONF_DIR

>> $vi hdfs-site.xml

Update the content as below:
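The original post shows the file as an image. A sketch for a single-node setup, assuming the directories created in step 4.3 under /home/user, would be:

```xml
<configuration>
  <!-- Single node, so one replica per block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- NameNode metadata directory created in step 4.3 -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/user/hadoop2_data/hdfs/namenode</value>
  </property>
  <!-- DataNode block storage directory created in step 4.3 -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/user/hadoop2_data/hdfs/datanode</value>
  </property>
</configuration>
```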

4.6 Configure YARN framework:

This file contains the configuration settings for the YARN daemons (the ResourceManager and the NodeManager).

>> $cd $HADOOP_CONF_DIR

>> $vi yarn-site.xml

Update the content as below:
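The original post shows the file as an image. For Hadoop 2.2.0, a minimal yarn-site.xml that enables the shuffle service needed by MapReduce would look like:

```xml
<configuration>
  <!-- Auxiliary service the NodeManager runs for the MapReduce shuffle phase -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Implementation class backing the shuffle service -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```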

4.7 Configure Map Reduce framework:

This file contains the configuration settings for MapReduce.


>> $cd $HADOOP_CONF_DIR

We need to copy the mapred-site.xml template to create mapred-site.xml.

>> $cp mapred-site.xml.template mapred-site.xml

>> $vi mapred-site.xml

Configure content as below:
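The original post shows the file as an image. A minimal mapred-site.xml that tells MapReduce jobs to run on YARN would be:

```xml
<configuration>
  <!-- Run MapReduce jobs on the YARN framework rather than classic/local -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```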

4.8 Start the HDFS services:

The first step in starting up our Hadoop installation is formatting the Hadoop file system, which is implemented on top of the local file systems of our cluster. This is required only the first time Hadoop is installed. Do not format a running Hadoop file system; this will erase all your data.

To format the file-system, run the command:

>> $hadoop namenode -format

Now we need to start the HDFS services one by one.
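The original post lists the commands in screenshots. Assuming $HADOOP_HOME/sbin is on the PATH (as set in the .bashrc of step 4.1), the HDFS and YARN daemons can be started one by one with the hadoop-daemon.sh and yarn-daemon.sh scripts:

```shell
# Start the HDFS daemons
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode

# Start the YARN daemons
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
```

Running the `jps` command afterwards should list the started daemons (NameNode, DataNode, ResourceManager, NodeManager).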


Finally start the JobHistory server.
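The screenshot with this command is missing from the original post; in Hadoop 2.2.0 the JobHistory server is started with the mr-jobhistory-daemon.sh script from $HADOOP_HOME/sbin:

```shell
# Start the MapReduce JobHistory server
mr-jobhistory-daemon.sh start historyserver
```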



4.9 Verify the Hadoop Cluster Status:

Now let's verify the status of the Hadoop cluster at the URLs mentioned below.

Name node status (default web UI port 50070): http://localhost:50070

JobHistory status (default web UI port 19888): http://localhost:19888