Wednesday, 29 March 2023

Big Data Technology Components:


1. Ingestion:

The ingestion layer is the very first step: pulling in raw data. That data can come from internal sources, relational and non-relational databases, social media, emails, phone-call records, and so on.

There are two kinds of ingestion (a short sketch contrasting them follows the list):

Batch: Large groups of data are gathered and delivered together, typically at scheduled intervals.

Streaming: A continuous flow of data, which is necessary for real-time data analytics.
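
As a rough sketch of the two modes, assuming PySpark; the file path, broker address, and topic name are placeholders, not part of any particular setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch ingestion: read a complete set of files that was delivered together.
batch_df = spark.read.json("/data/raw/orders/2023-03-29/")   # placeholder path

# Streaming ingestion: continuously consume new records as they arrive.
stream_df = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
         .option("subscribe", "orders")                      # placeholder topic
         .load()
)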

 2. Storage:

Storage is where the converted data is kept in a data lake or warehouse until it is processed. The data lake/warehouse is the most essential component of a big data ecosystem: it should contain only thorough, relevant data so that insights are as valuable as possible, and it must be efficient, with as little redundancy as possible, to allow for quicker processing.
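
For illustration, a common landing pattern is partitioned Parquet files in the lake; this sketch assumes the batch_df DataFrame from the ingestion sketch above and a placeholder lake path:

# Land the ingested data in the lake as partitioned Parquet files,
# dropping duplicate rows first to keep redundancy low.
(
    batch_df
        .dropDuplicates()
        .write
        .mode("append")
        .partitionBy("order_date")              # placeholder partition column
        .parquet("/datalake/curated/orders/")   # placeholder lake path
)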

 3. Analysis:

In the analysis layer, data gets passed through several tools, shaping it into actionable insights.

There are four types of analytics on big data (a small descriptive-analytics sketch follows this list):

  • Diagnostic: Explains why a problem is happening.
  • Descriptive: Describes the current state of a business through historical data.
  • Predictive: Projects future results based on historical data. 
  • Prescriptive: Takes predictive analytics a step further by recommending the best course of action based on those projections.
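
As a minimal example of the descriptive case, the sketch below summarises historical sales with pandas; the file name and column names are hypothetical:

import pandas as pd

# Descriptive analytics: summarise historical data to describe the current state.
sales = pd.read_csv("sales_history.csv")   # hypothetical extract from the lake

monthly = (
    sales.assign(month=pd.to_datetime(sales["order_date"]).dt.to_period("M"))
         .groupby("month")["revenue"]
         .agg(["sum", "mean", "count"])
)
print(monthly)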

 4. Consumption:

The final big data component is presenting the information in a format digestible to the end user. This can take the form of tables, advanced visualizations, or even single numbers if requested. The most important thing in this layer is making sure the intent and meaning of the output are understandable.
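
A tiny consumption-layer sketch, assuming the monthly summary from the analysis example above and matplotlib as the charting library:

import matplotlib.pyplot as plt

# Present the insight in a digestible form: a simple trend chart for the end user.
monthly["sum"].plot(kind="bar", title="Monthly revenue")
plt.ylabel("Revenue")
plt.tight_layout()
plt.savefig("monthly_revenue.png")   # export for a report or dashboard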

Big Data Architecture:

Big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.

A big data architecture typically includes the following components:

Data sources: All big data solutions start with one or more data sources. 

For example:

o Application data stores, such as relational databases.

o Static files produced by applications, such as web server log files.

o Real-time data sources, such as IoT devices.

Data storage: Data for batch processing operations is stored in a distributed file store, often called a data lake, that can hold high volumes of large files in various formats. 

Example: Azure Data Lake Store or blob containers in Azure Storage.
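
As a hedged illustration of landing a file in such a store, the snippet below uses the azure-storage-blob SDK; the connection string, container, and blob names are placeholders:

from azure.storage.blob import BlobServiceClient

# Connect to the storage account (the connection string is a placeholder).
service = BlobServiceClient.from_connection_string("<connection-string>")

# Upload a local log file into a container that backs the data lake.
blob = service.get_blob_client(container="raw-data", blob="logs/2023-03-29/web.log")
with open("web.log", "rb") as f:
    blob.upload_blob(f, overwrite=True)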

Batch processing: Because the data sets are so large, a big data solution must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
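
A minimal batch-job sketch in PySpark, assuming web-server logs already landed in the lake; the paths and column names are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-prep").getOrCreate()

# Long-running batch job: filter, aggregate, and prepare the data for analysis.
logs = spark.read.parquet("/datalake/raw/weblogs/")   # placeholder path

daily_errors = (
    logs.filter(F.col("status") >= 500)               # keep only server errors
        .groupBy("date", "service")
        .agg(F.count("*").alias("error_count"))
)

daily_errors.write.mode("overwrite").parquet("/datalake/prepared/daily_errors/")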

Real-time message ingestion: If a solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.
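
A rough capture sketch, assuming Kafka as the message broker and the kafka-python client; the topic and broker address are placeholders:

from kafka import KafkaConsumer

# Capture real-time messages so they can be stored for stream processing.
consumer = KafkaConsumer(
    "sensor-readings",                    # placeholder topic
    bootstrap_servers="broker:9092",      # placeholder broker
    auto_offset_reset="earliest",
)

for message in consumer:
    # A real solution would buffer or persist the raw message here.
    print(message.value)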

Stream processing: After capturing real-time messages, the solution must process them by filtering, aggregating, and preparing the data for analysis. The processed stream data is then written to an output sink. We can use open-source Apache streaming technologies like Storm and Spark Streaming for this.
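
A small Spark Structured Streaming sketch under the same assumptions (Kafka source, placeholder topic and broker); it aggregates the captured messages over one-minute windows and writes the result to a sink, with the console standing in for a real sink:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-prep").getOrCreate()

# Read the captured real-time messages (broker and topic are placeholders).
raw = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sensor-readings")
         .load()
)

# Aggregate the stream over one-minute windows to prepare it for analysis.
counts = (
    raw.selectExpr("CAST(value AS STRING) AS reading", "timestamp")
       .groupBy(F.window("timestamp", "1 minute"))
       .count()
)

# Write the processed stream to an output sink; the console is a stand-in
# for a real sink such as files in the lake or an analytical store.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()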

Analytical data store: Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. Example: Azure Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing.
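
As one hedged illustration of serving processed data for querying, this sketch registers the prepared batch output as a table and queries it with Spark SQL; the table and path names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serving-sketch").getOrCreate()

# Expose the prepared data in a structured, queryable form.
spark.read.parquet("/datalake/prepared/daily_errors/") \
     .createOrReplaceTempView("daily_errors")

top_services = spark.sql("""
    SELECT service, SUM(error_count) AS total_errors
    FROM daily_errors
    GROUP BY service
    ORDER BY total_errors DESC
    LIMIT 10
""")
top_services.show()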

Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modelling layer. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts.

Orchestration: Most big data solutions consist of repeated data processing operations that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report. To automate these workflows, we can use an orchestration technology such as Azure Data Factory.
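
A very small orchestration sketch in plain Python, standing in for a tool such as Azure Data Factory; the step functions are hypothetical placeholders for the jobs described above:

import logging

logging.basicConfig(level=logging.INFO)

def ingest():
    logging.info("pull raw files into the lake")             # placeholder step

def transform():
    logging.info("run the batch preparation job")            # placeholder step

def load():
    logging.info("load results into the analytical store")   # placeholder step

def publish():
    logging.info("refresh the report dataset")                # placeholder step

# Orchestration: run the repeated pipeline steps in order, stopping on the first failure.
for step in (ingest, transform, load, publish):
    try:
        step()
    except Exception:
        logging.exception("pipeline failed at step %s", step.__name__)
        break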