Saturday 15 August 2015

Popular Hadoop Distributions in the Market

List of popular HADOOP distributions are given below. Commercial distributions are providing more stable hadoop and comes up with patches for each issues.

Apache HADOOP:

  • Complex Cluster setup.
  • Manual integration and install of HADOOP ecosystem components.
  • No Commercial support. Only discussion forum.
  • Good for first try.


  • Established distribution with many referenced deployments.
  • Powerful tools for deployment, management and monitoring such as Cloudera Manager.
  • Impala is for Interactive querying and analytic.


  • It is from Yahoo.
  • Only distribution without any modification in Apache Hadoop.
  • Hcatalog for metadata.


  • Supports native Unix file system.
  • HA features such as snapshots, mirroring or stateful fail over.

Amazon Elastic Map Reduce(EMR):

  • Hosted Solution.
  • The most popular applications, such as Hive, Pig, HBase, DistCp, and Ganglia are already integrated with Amazon EMR.

And also we have IBM's BigInsights and Microsoft's HDInsights.

Thursday 13 August 2015

Data Modelling in HADOOP

We need to think differently about how we approach data modelling in a BIG Data world using HADOOP. It means that the way we designed data models for OLTP applications (using third normal form) and for data warehousing (using dimensional modelling) needs to change to take advantage of the inherent architecture and processing advantages offered by HADOOP.

In HADOOP, we create flat data models that take advantage of the "big table" nature of HADOOP to handle massive volumes of raw data. For example, Hadoop accesses data in very large blocks(default 64MB to 128MB). But in relational database block sizes are typically 32KB or less. In order to optimize this block size(default 64MB to 128MB) advantage, the BIG Data analytic requires very long flatten set of records.

HADOOP data processing prefers to "flatten" a star schema by collapsing (or) integrating the dimensional tables that surround the fact table into a single flat record in order to construct and execute more complex data queries without having to use joins.

Wednesday 12 August 2015

How to Improve PIG Data Integration performance by using Specialized Joins

PIG Latin includes below mentioned three specialized JOINS types.

1.   Replicated Join
2.   Skewed Join
3.   Merge Join

Replicated Join

Ø  It works successfully if one of the data set is smaller in size which is to be fit in memory.
Ø  It works very efficiently because smaller data set is copied into distributed cache which will be shared across to all the mappers in the cluster of machines. Also it implements join process at mapper itself in which reducer phase is avoided.
Ø  According to the Pig documentation, a relation of size up to 100 MB can be used when the process has 1 GB of memory.
Ø  A run-time error will be generated if not enough memory is available for loading the data.
Ø  Replicated Join can be used in both inner and outer join. And it also supports for joining more than two tables.


A_Big = LOAD ‘emp.dat’ USING PigStorage() AS (f1:int, f2:int, f3:int);
B_Small = LOAD ‘salary.dat’ USING PigStorage() AS (f1:int, f2:int, f3:int);
C = JOIN A_Big BY f1, B_Small BY f1 USING ‘replicated’;

Skewed Join

Ø  Usually parallel join process will be harmed, if there are lots of data for a certain key, then data will not be evenly distributed across the reducers in which one of them will be stuck in processing the majority of data. Skewed join handles this case efficiently.
Ø  Skewed join computes a histogram of the key space and it uses this data to allocate reducers for a given key.
Ø  Skewed Join can be used in both inner and outer join. And currently it only supports for joining two tables.
Ø  The pig.skwedjoin.reduce.memusage Java parameter specifies the heap fraction available to reducers in order to perform this join. Setting a low value means more reducers will be used, yet the cost of copying the data across them will increase.
Ø  Pig’s developers claim to have good performance when setting it between 0.1 - 0.4.


A_Big = LOAD ‘emp.dat’ USING PigStorage() AS (f1:int, f2:int, f3:int);
B_massive = LOAD ‘salary.dat’ USING PigStorage() AS (f1:int, f2:int, f3:int);
C = JOIN A_Big BY f1, B_massive BY f1 USING ‘skewed’;

Merge Join

Ø  It works successfully if both the data sets are sorted(ascending order) by the same join key.
Ø  It improves performance because join process takes place at mapper phase itself and it ignores two phases that are sort & Shuffle and reducer.
Ø  Pig implements the merge join algorithm by selecting the left input of the join to be the input file for the map phase, and the right input of the join to be the side file. It then samples records from the right input to build an index that contains, for each sampled record, the key(s) the filename and the offset into the file the record begins at. This sampling is done in an initial map only job. A second or actual Map Reduce job is then initiated, with the left input as its input. Each map uses the index to seek to the appropriate record in the right input and begin doing the join.
Ø  Merge join is only supported inner join.


C = JOIN A BY f1, B BY f1 USING ‘merge’;

Tuesday 11 August 2015

How to customize HADOOP default block size

HADOOP stores files across the cluster by breaking them down into coarser grained, fixed size block. Default block size is 64 MB in HADOOP-1 and 128 MB in HADOOP-2 versions. If we need to change the default block size, then we update the value of dfs.blocksize property at hdfs-default.xml or hdfs-site.xml. Hdfs-site.xml overrides the settings at hdfs-default.xml.



The default block size for new files, in bytes. We can use the following suffix (case insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.), Or provide complete size in bytes (such as 134217728 for 128 MB).

Block size can be defined at Master system(Name node), slave system(Data node) and client side hdfs-site.xml or through commands submitted by the client.

Dfs.blocksize property definition in the client side hdfs-site.xml has the highest precedence over master and slave side settings. We can also define the block size from the client while uploading the files to HDFS as below.

HDFS dfs -D dfs.block.size=67108864 -put weblog.dat /hdfs/Web-data

Among master and slave node block size settings, master gets highest precedence than slave. If we would like to force to use the slave side dfs.blocksize definition, then include final=true value as below.

Hdfs-site.xml at slave node: