Monday, 15 June 2015

What is DFS?

DFS stands for Distributed File System. Hadoop stores its data in a DFS called the Hadoop Distributed File System (HDFS).

As part of HDFS's write path, input data is divided into multiple chunks (HDFS calls them blocks), which are distributed across the machines in the cluster. Because the chunks sit on different machines, HDFS can read and write them in parallel, which increases I/O performance. It also maintains a replication factor for each chunk of data in order to achieve fault tolerance: if one machine fails, a copy of the chunk is still available on another.
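
The chunking and replication described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS itself is implemented: the function names, the node names, the 128 MB chunk size and the round-robin placement are all assumptions made for the example, and real HDFS uses its own placement policy.

```python
def split_into_chunks(total_size_mb, chunk_size_mb):
    """Return the sizes (in MB) of the chunks that a file is divided into."""
    full_chunks, remainder = divmod(total_size_mb, chunk_size_mb)
    sizes = [chunk_size_mb] * full_chunks
    if remainder:
        # The last chunk only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


def place_replicas(num_chunks, machines, replication_factor):
    """Assign each chunk to `replication_factor` distinct machines.

    Simple round-robin placement, assumed here for illustration only.
    """
    if replication_factor > len(machines):
        raise ValueError("replication factor cannot exceed the number of machines")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [
            machines[(chunk_id + r) % len(machines)]
            for r in range(replication_factor)
        ]
    return placement


# Hypothetical 300 MB file, 128 MB chunks, 4 machines, replication factor 3.
chunks = split_into_chunks(300, 128)
placement = place_replicas(len(chunks), ["node1", "node2", "node3", "node4"], 3)

for chunk_id, nodes in placement.items():
    print(f"chunk {chunk_id} ({chunks[chunk_id]} MB) -> {nodes}")
```

With a replication factor of 3, every chunk is stored on three different machines, so losing any single machine still leaves two copies of each chunk it held.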

The time taken to load the data can be estimated with the following formula:

Time taken (in minutes) = (Total input data size) / (Number of I/O channels * Speed at which each channel transfers data * 60)

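  • 60 -- Converts the result from seconds to minutes. This assumes the channel speed is given per second (for example, MB/s) and the data size is in the same unit (MB).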
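
As a worked example of the formula, here is a small Python sketch. The function name and the input figures (1,000,000 MB of data, 100 MB/s per channel) are assumptions chosen for illustration, and the result is only a rough estimate that ignores replication and network overhead.

```python
def load_time_minutes(total_size_mb, io_channels, channel_speed_mb_per_s):
    """Estimate load time in minutes using the formula from the text.

    The data size is in MB and the channel speed in MB/s, so the division
    gives seconds; the extra factor of 60 converts that to minutes.
    """
    if io_channels <= 0 or channel_speed_mb_per_s <= 0:
        raise ValueError("channel count and channel speed must be positive")
    return total_size_mb / (io_channels * channel_speed_mb_per_s * 60)


# Hypothetical figures: 1,000,000 MB of input, each channel moving 100 MB/s.
single = load_time_minutes(1_000_000, 1, 100)     # one I/O channel
parallel = load_time_minutes(1_000_000, 10, 100)  # ten channels in parallel

print(f"1 channel:   {single:.1f} minutes")
print(f"10 channels: {parallel:.1f} minutes")
```

Under these assumptions, one channel needs about 166.7 minutes, while ten channels working in parallel need about 16.7 minutes. The estimate scales down in direct proportion to the number of I/O channels.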