- HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- There are Hadoop clusters running today that store petabytes of data.
- HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
- A dataset is typically generated or copied from a source, then various analyses are performed on that dataset over time.
- It is designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters.
- HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
- Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
- Files in HDFS may be written by a single writer.
- Writes are always made at the end of the file.
- There is no support for multiple writers, or for modifications at arbitrary offsets in the file (the sketch after this list illustrates the resulting write-once, read-many usage).
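The write-once, read-many and single-writer model is visible directly in the client API. The sketch below is a minimal illustration using the Hadoop FileSystem Java API; the path, the sample record, and the assumption that fs.defaultFS points at an HDFS cluster are illustrative and not part of the original text.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath;
            // assumes fs.defaultFS points at an HDFS namenode.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/events.log");   // hypothetical path

            // Write once: a single writer creates the file and writes at the end.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("first record\n");
            }

            // Read many times: any number of readers can open the same file.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }

            // There is no call for writing at an arbitrary offset; append() is the
            // only way to add more data, and it too writes only at the end of the file.
        }
    }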
- HDFS Concepts:
- Blocks:
- Like a single-disk filesystem, HDFS has the concept of a block, but it is a much larger unit than a disk block: 64 MB by default.
- Files in HDFS are broken into block-sized chunks, which are stored as independent units.
- Having a block abstraction for a distributed filesystem brings several benefits:
- A file can be larger than any single disk in the network. Nothing requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
- Making the unit of abstraction a block rather than a file simplifies the storage subsystem. It simplifies storage management (since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and eliminates metadata concerns, because blocks are just chunks of data to store; file metadata such as permissions can be handled by a separate system.
- Blocks fit well with replication for providing fault tolerance and availability. To protect against corrupted blocks and against disk and machine failure, each block is replicated to a small number of physically separate machines (typically three).
- HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks: if the block is large enough, the time spent transferring data from disk dominates the time spent seeking to the start of the block.
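To put illustrative numbers on the seek argument (these are common back-of-the-envelope figures, not measurements from this post): if a disk seek takes about 10 ms and the transfer rate is about 100 MB/s, then keeping the seek down to roughly 1% of the transfer time means the transfer should last about 1 second, i.e. the block should be on the order of 100 MB. With kilobyte-sized blocks, reading a large file would be dominated by seeking rather than by transferring data. The blocks making up the files in HDFS can be inspected with the filesystem checking tool, for example hadoop fsck / -files -blocks (hdfs fsck in newer releases).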
The main components of HDFS are the NameNode and the DataNode. Let us talk about the roles of these two components in detail.
- NameNode
- It is the master daemon that maintains and manages the DataNodes (slave nodes)
- It records the metadata of all the blocks stored in the cluster, e.g. the locations of blocks, file sizes, permissions, the directory hierarchy, etc. (a rough estimate of the memory this metadata requires follows this list)
- It records each and every change that takes place to the file system metadata
- For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
- It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live
- It keeps a record of all the blocks in HDFS and the DataNodes on which they are stored
- It has high availability and federation features, which I will discuss in detail in the HDFS architecture post
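Because the NameNode keeps all of this file, directory, and block metadata in memory, the size of the namespace it can manage can be estimated roughly. A commonly cited rule of thumb (an approximation, not a figure from this post) is on the order of 150 bytes of heap per file, directory, or block object. For example:

    1 million files, each small enough to fit in a single block
    → roughly 2 million namespace objects (1M file entries + 1M block entries)
    → 2,000,000 × ~150 bytes ≈ 300 MB of NameNode heap

This is also why a very large number of small files puts far more pressure on the NameNode than the same volume of data stored in fewer, larger files.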
- DataNode
- It is the slave daemon which runs on each slave machine
- The actual data is stored on DataNodes
- It is responsible for serving read and write requests from the clients
- It is also responsible for creating blocks, deleting blocks, and replicating them based on the decisions taken by the NameNode
- It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds
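These behaviours map onto configuration and client calls. The 3-second heartbeat mentioned above is governed by the dfs.heartbeat.interval property in hdfs-site.xml, and the replication carried out by the DataNodes follows a per-file replication factor that the NameNode enforces. The sketch below shows a client asking for a specific replication factor; the path and the factor of 3 are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/important.log");   // hypothetical path

            // Request 3 copies of every block of this file; the NameNode decides
            // which DataNodes hold the copies and instructs them to replicate.
            fs.setReplication(file, (short) 3);
        }
    }

Increasing the replication factor trades extra storage for better fault tolerance and read availability; the DataNodes simply carry out whatever block placement the NameNode decides.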