HDFS Block Abstraction:
- The HDFS block size is typically 128 MB (64 MB in older Hadoop versions), and unlike in many other filesystems, a file smaller than a block does not occupy a full block's worth of disk space.
- The block size is kept large so that the time spent on disk seeks is small relative to the time spent transferring data. For example, with a 10 ms seek time and a 100 MB/s transfer rate, a block of about 100 MB keeps seek overhead at roughly 1% of the transfer time.
- Why do we need block abstraction:
- A file can be larger than any single disk in the cluster.
- Blocks are just fixed-size chunks of data, so file metadata (permissions, ownership, and so on) does not need to be stored with every block and can be managed separately by the NameNode.
- It simplifies storage management: because blocks are a fixed size, it is easy to work out how many of them fit on a given disk.
- Fault tolerance and replication are handled on a per-block rather than per-file basis (see the sketch after this list for a way to inspect a file's blocks).
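To make the per-block view concrete, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API that prints a file's block size, replication factor, and the DataNodes holding each block. The path /data/sample.txt is just a placeholder, and the code assumes a reachable cluster whose configuration files are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInspector {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used purely for illustration.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Block size and replication factor recorded for this file.
        System.out.println("Block size  : " + status.getBlockSize() + " bytes");
        System.out.println("Replication : " + status.getReplication());

        // One BlockLocation per block, listing the DataNodes that hold its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i + " hosts: "
                    + String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}
```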
Data Replication:
- Replication ensures the availability of the data.
- Replication means keeping copies of a piece of data; the number of copies kept is called its replication factor.
- Since HDFS stores a file as a set of blocks, Hadoop is configured to replicate each of those blocks rather than the file as a whole.
- By default, the replication factor in Hadoop is 3, and it can be changed through configuration.
- We need this replication because Hadoop runs on commodity (inexpensive) hardware, which can fail at any time; a Hadoop cluster is not built on a supercomputer.
- HDFS therefore keeps extra copies of every file block so that data survives the loss of a node; this property is known as fault tolerance.
- For most organizations the data is far more valuable than the storage it consumes, so the extra space taken up by replicas is an acceptable cost.
- You can configure the replication factor with the dfs.replication property in your hdfs-site.xml file; it can also be set per file from client code, as shown in the sketch below.
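As a minimal sketch, the replication factor can also be read and changed programmatically through the FileSystem API. The code assumes a reachable HDFS cluster and uses a placeholder path /data/sample.txt; the dfs.replication key is the same property you would set in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default replication for files created by this client
        // (same key as the dfs.replication property in hdfs-site.xml).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: change the replication factor of an existing file to 2.
        Path file = new Path("/data/sample.txt");
        boolean changed = fs.setReplication(file, (short) 2);
        System.out.println("Replication change requested: " + changed);

        // Read back the replication factor recorded for the file.
        System.out.println("Current replication: "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}
```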