HDFS Block Abstraction:
- The HDFS block size is typically 128 MB (64 MB in older Hadoop versions), and unlike in many other filesystems, a file smaller than a block does not occupy a full block's worth of disk space.
- The block size is kept large so that the time spent on disk seeks is small relative to the time spent transferring data. For example, with a 10 ms seek time and a 100 MB/s transfer rate, a block of about 100 MB keeps seek overhead at roughly 1% of the transfer time.
- Why do we need block abstraction:
- A file can be larger than any single disk in the cluster.
- Blocks are just fixed-size chunks of data, so file metadata (permissions, ownership, and so on) does not need to be stored with every block and can be managed separately by the NameNode.
- It simplifies storage management: because blocks are a fixed size, it is easy to work out how many of them fit on a given disk.
- Fault tolerance and replication are handled on a per-block rather than per-file basis (see the sketch after this list for a way to inspect a file's blocks).
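To make the per-block view concrete, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API that prints a file's block size, replication factor, and the DataNodes holding each block. The path /data/sample.txt is just a placeholder, and the code assumes a reachable cluster whose configuration files are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInspector {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used purely for illustration.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Block size and replication factor recorded for this file.
        System.out.println("Block size  : " + status.getBlockSize() + " bytes");
        System.out.println("Replication : " + status.getReplication());

        // One BlockLocation per block, listing the DataNodes that hold its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i + " hosts: "
                    + String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}
```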
Data Replication:
- Replication ensures the availability of the data.
- Replication means keeping copies of a piece of data; the number of copies kept is called its replication factor.
- Since HDFS stores a file as a set of blocks, Hadoop is configured to replicate each of those blocks rather than the file as a whole.
- By default, the replication factor in Hadoop is 3, and it can be changed through configuration.
- We need this replication because Hadoop runs on commodity (inexpensive) hardware, which can fail at any time; a Hadoop cluster is not built on a supercomputer.
- HDFS therefore keeps extra copies of every file block so that data survives the loss of a node; this property is known as fault tolerance.
- For most organizations the data is far more valuable than the storage it consumes, so the extra space taken up by replicas is an acceptable cost.
- You can configure the replication factor with the dfs.replication property in your hdfs-site.xml file; it can also be set per file from client code, as shown in the sketch below.
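As a minimal sketch, the replication factor can also be read and changed programmatically through the FileSystem API. The code assumes a reachable HDFS cluster and uses a placeholder path /data/sample.txt; the dfs.replication key is the same property you would set in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default replication for files created by this client
        // (same key as the dfs.replication property in hdfs-site.xml).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: change the replication factor of an existing file to 2.
        Path file = new Path("/data/sample.txt");
        boolean changed = fs.setReplication(file, (short) 2);
        System.out.println("Replication change requested: " + changed);

        // Read back the replication factor recorded for the file.
        System.out.println("Current replication: "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}
```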