Thursday, 2 July 2015

Rack Awareness and Configuration in a Hadoop Cluster



A rack is a collection of machines that are physically located together in a single place or data center and connected through a traditional network design with top-of-rack switching. In Hadoop, a rack is a physical group of slave machines (DataNodes) placed together at a single location for data storage. A single location can hold multiple racks.

When a client loads a file into the cluster, the content of the file is divided into blocks (128 MB each by default). For every block, the client consults the NameNode and gets the addresses of the DataNodes that will hold its replicas, three by default. When placing replicas on the DataNodes, the key rule is "for every block of data, two copies will exist in one rack and the third copy in a different rack". This rule is called the "Replica Placement Policy". In the default HDFS policy, when the client runs on a DataNode, the first replica goes to that node and the other two go to two different nodes in one remote rack.
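The rule above can be sketched in a few lines of Bash. This is only an illustration under stated assumptions, not how the NameNode is implemented: the function names block_count and place_replicas are made up for this example, the rack names are placeholders, and the remote rack is simply the first rack that differs from the writer's rack, whereas the real NameNode chooses nodes randomly within the same constraints.

```shell
#!/bin/bash
# Sketch of the replica placement rule: for each block, one replica stays in
# the writer's rack and two replicas go to a single remote rack.
# block_count and place_replicas are hypothetical helpers for illustration.

BLOCK_SIZE_MB=128

# block_count FILE_SIZE_MB -> number of blocks (ceiling division)
block_count() {
  echo $(( ($1 + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB ))
}

# place_replicas LOCAL_RACK RACK... -> racks for replicas 1, 2 and 3
place_replicas() {
  local local_rack=$1 remote_rack="" rack
  shift
  for rack in "$@"; do
    if [ "$rack" != "$local_rack" ]; then
      remote_rack=$rack
      break
    fi
  done
  # With a single rack there is no remote rack, so all replicas stay local.
  if [ -z "$remote_rack" ]; then
    remote_rack=$local_rack
  fi
  echo "$local_rack $remote_rack $remote_rack"
}

blocks=$(block_count 300)   # a 300 MB file needs 3 blocks
for (( i = 1; i <= blocks; i++ )); do
  echo "block $i: $(place_replicas /dc1/rack1 /dc1/rack1 /dc1/rack2)"
done
```

For a 300 MB file written from /dc1/rack1, each of the three blocks gets one replica in /dc1/rack1 and two in /dc1/rack2, so losing either rack still leaves at least one copy of every block.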

Rack topology is configured in Hadoop by providing a script that, when given a list of host names or IP addresses on the command line, prints the rack in which each machine is located, in the same order. Hadoop uses this topology script to determine the rack location of each node, and uses that information to replicate block data across racks for redundancy.
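Hadoop also has to be told where the script lives. In Hadoop 2.x this is the net.topology.script.file.name property in core-site.xml (Hadoop 1.x used topology.script.file.name). As a minimal sketch, the snippet below prints such a property entry. The helper name topology_property and the script path /etc/hadoop/conf/topology.sh are assumptions for illustration, so substitute the path where the script is actually installed.

```shell
#!/bin/bash
# Sketch: print the core-site.xml property that points Hadoop at a topology
# script. topology_property is a hypothetical helper; the path passed to it
# below is an assumed install location, not one taken from the post.

topology_property() {
  cat <<EOF
<property>
  <name>net.topology.script.file.name</name>
  <value>$1</value>
</property>
EOF
}

topology_property /etc/hadoop/conf/topology.sh
```

The printed entry belongs inside the configuration element of core-site.xml on the NameNode, and the script itself must be executable by the user that runs the NameNode.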

Here is a sample representation of rack-aware replication.



Here is a sample Bash topology script.

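#!/bin/bash
# Rack topology script: Hadoop passes one or more host names or IP addresses
# as arguments and expects one rack path per argument, in the same order.
HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  result=""
  # Each line of topology.data is "<host name or IP> <rack>".
  # If a host is listed more than once, the last matching line wins.
  while read -r host rack _ ; do
    if [ "$host" = "$nodeArg" ] ; then
      result="$rack"
    fi
  done < "${HADOOP_CONF}/topology.data"
  shift
  # Hosts missing from topology.data fall back to the default rack.
  if [ -z "$result" ] ; then
    echo -n "/default/rack "
  else
    echo -n "$result "
  fi
done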

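Here is the topology data, stored in /etc/hadoop/conf/topology.data. Each line holds a host name or IP address followed by its rack.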

hadoopdata1.ec.com     /dc1/rack1
hadoopdata1            /dc1/rack1
10.1.1.1               /dc1/rack2
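To try the lookup without touching /etc/hadoop/conf, the same logic can be wrapped in a function and pointed at a temporary copy of the sample data. This is a sketch only: resolve_racks and TOPOLOGY_FILE are hypothetical names introduced for the example, and unknown-host stands in for any machine that is not listed in the file.

```shell
#!/bin/bash
# Sketch: the lookup from the sample script, wrapped in a function so it can
# be tried against a temporary copy of the sample topology data.
# resolve_racks and TOPOLOGY_FILE are hypothetical names for illustration.

TOPOLOGY_FILE=$(mktemp)
trap 'rm -f "$TOPOLOGY_FILE"' EXIT   # remove the temporary file on exit

cat > "$TOPOLOGY_FILE" <<'EOF'
hadoopdata1.ec.com     /dc1/rack1
hadoopdata1            /dc1/rack1
10.1.1.1               /dc1/rack2
EOF

# resolve_racks HOST... -> one rack per argument, space separated, in order
resolve_racks() {
  local node host rack result
  for node in "$@"; do
    result="/default/rack"   # fallback for hosts missing from the file
    while read -r host rack _; do
      if [ "$host" = "$node" ]; then
        result=$rack
      fi
    done < "$TOPOLOGY_FILE"
    printf '%s ' "$result"
  done
}

resolve_racks hadoopdata1 10.1.1.1 unknown-host
echo
# prints: /dc1/rack1 /dc1/rack2 /default/rack
```

The racks come back in the same order as the arguments, which is what Hadoop relies on when it passes several hosts to the script in one call.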