Tuesday, 22 October 2024

Read/write Java programs for HDFS

 

FileReadFromHDFS.java

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileReadFromHDFS {

    public static void main(String[] args) throws Exception {

        // File to read in HDFS, passed as the first command-line argument
        String uri = args[0];

        Configuration conf = new Configuration();

        // Get a handle to the file system - HDFS
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;

        try {
            // Open the path mentioned in HDFS
            in = fs.open(new Path(uri));

            // Stream the file contents to standard output, 4096 bytes at a time
            IOUtils.copyBytes(in, System.out, 4096, false);

            System.out.println("End of file: HDFS file read complete");
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
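Assuming the class is compiled against the Hadoop client libraries and packaged into a jar (the jar name HDFSClient.jar and the URI below are placeholders for illustration), it can be run as:

hadoop jar HDFSClient.jar FileReadFromHDFS hdfs://localhost:9000/user/hadoop/sample.txt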

 

FileWriteToHDFS.java

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteToHDFS {

    public static void main(String[] args) throws Exception {

        // Source file in the local file system
        String localSrc = args[0];
        // Destination file in HDFS
        String dst = args[1];

        // Input stream for the local file to be written to HDFS
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        // Get the configuration of the Hadoop system
        Configuration conf = new Configuration();
        System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));

        // Create the destination file in HDFS
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));

        // Copy the file from the local file system to HDFS;
        // the 'true' flag closes both streams when the copy finishes
        IOUtils.copyBytes(in, out, 4096, true);

        System.out.println(dst + " copied to HDFS");
    }
}
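A hypothetical invocation, again with placeholder jar name and paths, passes the local source followed by the HDFS destination:

hadoop jar HDFSClient.jar FileWriteToHDFS /home/hadoop/sample.txt hdfs://localhost:9000/user/hadoop/sample.txt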

 

Monday, 21 October 2024

Read and Write Operations in HDFS

 

Write Operation In HDFS

 

[Diagram: HDFS write operation, steps referenced below]

  1. The client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file (step 1 in the diagram above).
  2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates the creation of the new file. This create call does not yet associate any blocks with the file. The NameNode verifies that the file being created does not already exist and that the client has the correct permissions to create it. If the file already exists or the client lacks sufficient permission, an IOException is thrown to the client; otherwise the operation succeeds and the NameNode creates a new record for the file.
  3. Once the new record is created in the NameNode, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS by invoking the write method (step 3 in the diagram).
  4. FSDataOutputStream wraps a DFSOutputStream object, which handles communication with the DataNodes and the NameNode. As the client writes data, DFSOutputStream packages it into packets, which are enqueued into a queue called the DataQueue.
  5. A component called the DataStreamer consumes this DataQueue. The DataStreamer also asks the NameNode to allocate new blocks, thereby picking desirable DataNodes to use for replication.
  6. The replication process now starts by forming a pipeline of DataNodes. In our case, we have chosen a replication factor of 3, so there are 3 DataNodes in the pipeline.
  7. The DataStreamer pours packets into the first DataNode in the pipeline.
  8. Every DataNode in the pipeline stores each packet it receives and forwards it to the next DataNode in the pipeline.
  9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets that are waiting for acknowledgement from the DataNodes.
  10. Once the acknowledgement for a packet is received from all DataNodes in the pipeline, the packet is removed from the Ack Queue. If any DataNode fails, packets from this queue are used to reinitiate the operation.
  11. After the client has finished writing data, it calls the close() method (step 9 in the diagram). The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgements.
  12. Once the final acknowledgement is received, the NameNode is contacted to tell it that the file write operation is complete. The client-side view of these calls is sketched after this list.
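Seen from the client, steps 1, 3, and 11 map onto a handful of HDFS API calls. The sketch below is a minimal illustration of that client-side view, not the internal implementation; the class name WriteStepsSketch and the buffer size, replication factor, and block size values are assumptions chosen for the example.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteStepsSketch {

    public static void main(String[] args) throws Exception {
        String dst = args[0]; // destination file in HDFS

        Configuration conf = new Configuration();
        // Under HDFS, this returns a DistributedFileSystem instance
        FileSystem fs = FileSystem.get(URI.create(dst), conf);

        // Step 1: create() asks the NameNode via RPC to create the file record;
        // no blocks are associated with the file yet
        FSDataOutputStream out = fs.create(
                new Path(dst),
                true,             // overwrite if the file already exists
                4096,             // I/O buffer size (illustrative)
                (short) 3,        // replication factor -> 3 DataNodes in the pipeline
                134217728L);      // block size, 128 MB (illustrative)

        // Step 3 onward: each write() is broken into packets on the DataQueue,
        // which the DataStreamer drains into the DataNode pipeline
        out.write("hello HDFS".getBytes());

        // Step 11: close() flushes the remaining packets and waits for the
        // acknowledgements before the NameNode is told the write is complete
        out.close();
    }
}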

 

Read Operation In HDFS

[Diagram: HDFS read operation, steps referenced below]

  1. The client initiates the read request by calling the 'open()' method of the FileSystem object; in HDFS this is an object of type DistributedFileSystem.
  2. This object connects to the NameNode using RPC and gets metadata such as the locations of the blocks of the file. Note that these addresses are for the first few blocks of the file.
  3. In response to this metadata request, the addresses of the DataNodes holding a copy of each block are returned.
  4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream wraps a DFSInputStream, which takes care of the interactions with the DataNodes and the NameNode. In step 4 of the diagram above, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
  5. Data is read as a stream: the client invokes the 'read()' method repeatedly. This read() process continues until the end of the block is reached.
  6. Once the end of a block is reached, DFSInputStream closes that connection and locates the next DataNode for the next block.
  7. Once the client is done reading, it calls the close() method. The sketch after this list shows these calls from the client's side.
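From the client's side, the whole read sequence reduces to open(), a read loop, and close(). The following is a minimal sketch under that framing; the class name ReadStepsSketch and the 4096-byte buffer are illustrative assumptions.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadStepsSketch {

    public static void main(String[] args) throws Exception {
        String uri = args[0]; // file to read from HDFS

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // Steps 1-3: open() returns an FSDataInputStream wrapping a DFSInputStream,
        // which fetches block locations from the NameNode via RPC
        FSDataInputStream in = fs.open(new Path(uri));

        // Steps 4-6: repeated read() calls stream data from the DataNode holding
        // the current block; DFSInputStream switches DataNodes at block boundaries
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) > 0) {
            System.out.write(buffer, 0, bytesRead);
        }

        // Step 7: close the stream once reading is finished
        in.close();
    }
}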