Tuesday, 16 June 2015

HADOOP Core Components and Data Storage Architecture

Hadoop is a combination of data storage (HDFS) and data processing/resource management (YARN) components. HDFS and YARN are the two core components of the Hadoop framework. Each of these core components contains sub-components (daemons/programs) that handle all of its activities.

Daemons (programs) running under HDFS
  • Name node  - Master
  • Data Node   - Slave

Daemons running under YARN (Yet Another Resource Negotiator)
  • Resource Manager  - Master
  • Node Manager        - Slave

Hadoop follows a master/slave architecture.

The client is a program/library that is part of the Hadoop framework. It acts as an intermediary between the user and the Hadoop cluster.

Whenever we upload a file to the Hadoop cluster, the client library receives the file and divides it into multiple 128 MB blocks (block size in Hadoop 1: 64 MB; in Hadoop 2: 128 MB), with 3 copies of each block to be stored (the default replication factor is 3). Then, based on the name node's instructions, the blocks are distributed across the slave machines in the cluster.

Let's assume a user is uploading a 200 MB file into HDFS.

Sample.dat - 200MB

- The client receives this file, divides it into 2 blocks (B1 and B2), and prepares 3 copies (R1, R2, and R3) of each block.

B1 – 128 MB ( B1R1, B1R2, B1R3)
B2 – 72 MB   (B2R1, B2R2, B2R3)

** B indicates Block
** R indicates Replica

By default, it maintains 3 copies of each block.

In total, 6 block replicas are stored. These replicas are distributed across the machines in the cluster.
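The block arithmetic above can be sketched in Python (a toy illustration, not Hadoop's actual client code; the constants mirror the Hadoop 2 defaults mentioned earlier):

```python
BLOCK_SIZE_MB = 128      # Hadoop 2 default block size (Hadoop 1 used 64 MB)
REPLICATION = 3          # default replication factor

def split_into_blocks(file_size_mb):
    """Return the size of each block; only the last block may be smaller."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB
    return blocks

blocks = split_into_blocks(200)      # Sample.dat - 200 MB
print(blocks)                        # [128, 72]  -> B1 and B2
print(len(blocks) * REPLICATION)     # 6 replicas stored in the cluster
```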

The name node maintains only metadata: the directory structure and file-to-block information for everything stored in HDFS. It keeps this metadata in memory, and also persists it to disk for recovery purposes.

Name node's metadata information:
  • File Name
  • Block Details
  • Replication Details
  • Node Details
  • File and time attributes for each file
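A hypothetical in-memory representation of that metadata for one file might look like the following (the field names and node names are illustrative, not Hadoop's internal classes):

```python
# Toy picture of what the name node tracks for one file in memory.
metadata = {
    "file_name": "/user/data/Sample.dat",
    "blocks": {
        "B1": {"size_mb": 128, "replicas": ["node1", "node3", "node5"]},
        "B2": {"size_mb": 72,  "replicas": ["node2", "node4", "node6"]},
    },
    "replication_factor": 3,
    "attributes": {"owner": "hadoop", "modified": "2015-06-16T10:00:00"},
}

# The name node can answer "where are the replicas of block B1?" from memory:
print(metadata["blocks"]["B1"]["replicas"])   # ['node1', 'node3', 'node5']
```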

The name node maintains the following two file structures on disk to persist this metadata:
  1. FSImage - a snapshot of the namespace (all block details), written when the cluster is initialized/created.
  2. Edit Logs - a record of all subsequent block and transaction details made while the cluster is up and running.
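The relationship between the two can be illustrated with a toy sketch (this mimics the idea only, not Hadoop's actual file formats): on restart, the name node starts from the FSImage checkpoint and replays the edit log to rebuild the current namespace.

```python
# FSImage: checkpoint of the namespace at startup (path -> block list).
fsimage = {"/Sample.dat": ["B1", "B2"]}

# Edit log: append-only record of transactions made after the checkpoint.
edit_log = [
    ("add", "/New.dat", ["B3"]),
    ("delete", "/Sample.dat", None),
]

def recover(fsimage, edit_log):
    """Rebuild the current namespace: start from the checkpoint, replay edits."""
    namespace = dict(fsimage)
    for op, path, blocks in edit_log:
        if op == "add":
            namespace[path] = blocks
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

print(recover(fsimage, edit_log))   # {'/New.dat': ['B3']}
```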

The data node is responsible for the actual reading and writing of data: it stores the blocks and serves them to clients.
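The read path can be sketched the same way (a hypothetical simulation; node and block names are made up for illustration): the client first asks the name node for block locations, then fetches each block from one of the data nodes holding a replica.

```python
# Name node's answer: which data nodes hold each block of the file.
block_locations = {"B1": ["node1", "node3"], "B2": ["node2", "node4"]}

# Data nodes: each stores the raw bytes of the blocks assigned to it.
data_nodes = {
    "node1": {"B1": b"first 128 MB..."},
    "node2": {"B2": b"last 72 MB..."},
}

def read_file(blocks):
    """Reassemble a file by reading each block from one of its replicas."""
    data = b""
    for block in blocks:
        node = block_locations[block][0]   # pick the first listed replica
        data += data_nodes[node][block]
    return data

print(read_file(["B1", "B2"]))
```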