Thursday, 13 August 2015

Data Modelling in HADOOP

We need to think differently about how we approach data modelling in a BIG Data world using HADOOP. It means that the way we designed data models for OLTP applications (using third normal form) and for data warehousing (using dimensional modelling) needs to change to take advantage of the inherent architecture and processing advantages offered by HADOOP.

In HADOOP, we create flat data models that take advantage of the "big table" nature of HADOOP to handle massive volumes of raw data. For example, Hadoop accesses data in very large blocks(default 64MB to 128MB). But in relational database block sizes are typically 32KB or less. In order to optimize this block size(default 64MB to 128MB) advantage, the BIG Data analytic requires very long flatten set of records.

HADOOP data processing prefers to "flatten" a star schema by collapsing (or) integrating the dimensional tables that surround the fact table into a single flat record in order to construct and execute more complex data queries without having to use joins.