Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
There are numerous components in the Big Data ecosystem, and it can be tricky to make sense of them quickly. In this article, we will walk through the main components and present them in an order that makes them easier to understand.
Understanding what Big Data actually is
Big Data is simply any data that is too big to process and derive insights from using traditional tools. "Too big" does not necessarily refer to size alone. Three V's (Volume, Velocity and Variety) are what mostly qualify data as Big Data. Volume covers the terabytes and petabytes of data that are too large to be processed quickly. Velocity covers data moving at high speed; continuous streaming data, arriving at rates of thousands of messages per second, is an example of data with velocity. Variety covers the mix of structured and unstructured data. Data that is unstructured, time-sensitive or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach, called big data processing, which uses massive parallelism on readily available hardware.
Familiarise yourself with Hadoop
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. Let’s understand this piece by piece.
It is an open source framework: open source refers to software whose source code is made available for use or modification as users see fit.
It is a distributed processing framework. Rather than computing everything on one very powerful machine, Hadoop divides the work across a set of machines that collectively process the data and produce the results. This is also known as horizontal scaling.
It offers distributed storage. Instead of storing all the data on one big volume, Hadoop stores it across different machines. Retrieving large chunks of data from a single volume involves a lot of latency; with storage spread across multiple systems, read latency is reduced because data is read from different machines in parallel.
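As a toy illustration of this divide-and-combine idea (not actual Hadoop code), the Python sketch below splits a sum across several local worker processes; on a real cluster the workers would be separate machines rather than processes on one host.

```python
from multiprocessing import Pool

# Toy illustration of horizontal scaling: instead of summing one huge list on
# a single worker, split it into chunks, let several workers process their
# chunk in parallel, then combine the partial results.
def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]       # divide the work four ways
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)  # each worker handles one chunk
    print(sum(partials))                          # combine the partial results
```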
Understanding HDFS and Map-Reduce
HDFS is the part of Hadoop that deals with distributed storage. It enables us to store and read large volumes of data across distributed systems. Map-Reduce handles the distributed processing part of Hadoop. Map-Reduce breaks a large chunk of data into smaller entities (mapping) and, after processing the data, collects the results back and collates them (reducing). Mapping involves processing data on the distributed machines, and reducing involves getting the results back from the distributed nodes and collating them together.
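The following is a toy, single-process Python sketch of the map and reduce phases using the classic word-count example; it only illustrates the idea, it is not how Hadoop itself is invoked. On a real cluster each mapper would run on a different node against its own block of the input, and the framework would shuffle the intermediate pairs to the reducers.

```python
from collections import defaultdict

def mapper(line):
    # map: emit a (word, 1) pair for every word in one line of input
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    # reduce: collate the intermediate pairs and sum the counts per word
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data is big", "data moves fast"]
intermediate = [pair for line in lines for pair in mapper(line)]
print(reducer(intermediate))  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```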
Bringing Hive and Pig into the picture
Hive and Pig are data extraction and query mechanisms for Hadoop. They offer SQL-like capabilities to extract data from relational or non-relational sources on Hadoop, or directly from HDFS. Users can query just the data they require, perform ETL operations, and gain insights from their data.
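As a hedged example of the SQL-like access Hive provides, the sketch below shells out to the hive command-line client from Python; it assumes Hive is installed and that a table named web_logs (a made-up name, with made-up columns) has already been defined over data in HDFS.

```python
import subprocess

# HiveQL query over an assumed "web_logs" table partitioned by date.
query = """
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    WHERE dt = '2023-01-01'
    GROUP BY status
"""

# "hive -e" executes a query string and prints the result to stdout.
subprocess.run(["hive", "-e", query], check=True)
```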
Understanding NoSQL databases
NoSQL (commonly referred to as "Not Only SQL") represents a completely different family of databases that allows for high-performance, agile processing of information at a massive scale. In other words, it is a database infrastructure that has been very well adapted to the heavy demands of big data.
NoSQL achieves this efficiency because, unlike relational databases, which are highly structured, NoSQL databases are schema-flexible, trading off stringent consistency requirements for speed and agility. NoSQL centres around the concept of distributed databases, where unstructured data may be stored across multiple processing nodes, and often across multiple servers. This distributed architecture makes NoSQL databases horizontally scalable: as data continues to explode, you just add more hardware to keep up, without a slowdown in performance.
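As a small sketch of this schema flexibility, the example below uses MongoDB, a popular document-oriented NoSQL store, through the pymongo client; the host, database, collection and field names are made up purely for illustration.

```python
from pymongo import MongoClient

# Connect to an assumed local MongoDB instance and pick a collection.
client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Two documents with different shapes can live in the same collection;
# there is no fixed table schema to migrate.
events.insert_one({"user": "alice", "action": "click", "page": "/home"})
events.insert_one({"user": "bob", "action": "purchase",
                   "items": [{"sku": "A12", "qty": 2}], "total": 49.90})

# Query by whatever fields a document happens to have.
print(events.count_documents({"action": "click"}))
```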
Data collection using Sqoop and Flume
Analytical processing using Hadoop requires loading huge amounts of data from diverse sources into Hadoop clusters. This process of bulk-loading data into Hadoop from heterogeneous sources and then processing it comes with its own set of challenges.
Apache Sqoop (SQL-to-Hadoop) is designed to support bulk import of data into HDFS from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop is based upon a connector architecture which supports plugins to provide connectivity to new external systems.
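A minimal sketch of such a bulk import, driven from Python, assuming the sqoop CLI is installed and that a MySQL database and table with the (made-up) names shown actually exist:

```python
import subprocess

# Import the "orders" table from a MySQL database into an HDFS directory.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/shop",   # source relational database
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop-password",  # keep credentials off the command line
    "--table", "orders",                             # table to import
    "--target-dir", "/data/raw/orders",              # HDFS directory for the output
    "--num-mappers", "4",                            # parallel map tasks doing the copy
], check=True)
```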
Apache Flume is a system used for moving massive quantities of streaming data into HDFS. Collecting log data from web servers' log files and aggregating it in HDFS for analysis is one common use case for Flume.
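A minimal sketch of a Flume agent configuration for this web-log use case, written out from Python for convenience; the agent name, log path and HDFS path are illustrative assumptions.

```python
# A Flume agent "a1" that tails a web-server log (source), buffers events in
# memory (channel), and writes them into HDFS (sink).
flume_conf = """
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# tail a web-server log file as the streaming source
a1.sources.r1.type     = exec
a1.sources.r1.command  = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

# buffer events in memory between the source and the sink
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# write the aggregated events into HDFS for later analysis
a1.sinks.k1.type          = hdfs
a1.sinks.k1.hdfs.path     = hdfs://namenode:8020/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel       = c1
"""

with open("weblog.conf", "w") as f:
    f.write(flume_conf)

# The agent would then be started with something like:
#   flume-ng agent --name a1 --conf-file weblog.conf
```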
Resource management with YARN
YARN stands for "Yet Another Resource Negotiator". The role of YARN is to divide a job into multiple sub-tasks and assign them to distributed systems so that they can perform the assigned computation. It keeps track of resources, i.e. which nodes are free and so on. Apart from being a resource manager, it is also a job manager: it keeps a check on the progress of the tasks assigned to the different compute nodes.
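As a hedged illustration of asking YARN for resources, the sketch below submits a Spark job to a YARN-managed cluster and then lists the applications YARN is tracking; it assumes spark-submit, the yarn CLI and the cluster configuration files are already in place, and wordcount.py is a hypothetical application script.

```python
import subprocess

# Submit a Spark application and let YARN schedule its containers.
subprocess.run([
    "spark-submit",
    "--master", "yarn",           # use YARN as the resource manager
    "--deploy-mode", "cluster",   # the driver also runs inside the cluster
    "--num-executors", "4",       # ask YARN for four executor containers
    "--executor-memory", "2g",    # memory per container
    "wordcount.py",               # hypothetical application script
], check=True)

# YARN's job-manager side: inspect the applications it is tracking.
subprocess.run(["yarn", "application", "-list"], check=True)
```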
Data processing with Spark
Spark is a general-purpose data processing engine that is suitable for use in a wide range of circumstances. It is an open-source cluster computing framework, broadly similar to Hadoop's Map-Reduce, with the key difference that it performs its operations in memory. Spark can be seen either as a replacement for Hadoop's Map-Reduce layer or as a powerful complement to it. Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. It has an extensive set of developer libraries and APIs and supports languages such as Java, Python, R, and Scala. Its use cases, illustrated with a short PySpark sketch after the list below, include:
1. Handling streaming data and processing it
2. Machine learning over Big Data
3. ETL operations over Big Data
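A minimal PySpark sketch of the kind of in-memory processing described above; the HDFS path and the column names are illustrative assumptions, not real datasets.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Read an assumed CSV of events from HDFS, cache it in memory, and aggregate it.
spark = SparkSession.builder.appName("events-summary").getOrCreate()

events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
events.cache()  # keep the working set in memory across operations

summary = (events
           .filter(F.col("country") == "IN")        # ETL-style filtering
           .groupBy("action")
           .agg(F.count("*").alias("n")))            # aggregation per action

summary.show()
spark.stop()
```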
Messaging queues - Kafka
Apache Kafka is a fast, scalable, fault-tolerant publish-subscribe messaging system that enables communication between producers and consumers using message-based topics. It provides a platform for a new generation of distributed applications. Kafka permits a large number of permanent or ad-hoc consumers, is highly available, resilient to node failures, and supports automatic recovery. These characteristics make Kafka ideal for communication and integration between the components of large-scale, real-world data systems. Did you know that AWS offers Kafka as a managed service?
Its ability to deliver higher throughput, reliability, and replication has led Kafka to replace conventional message brokers, such as those based on JMS and AMQP.
A Kafka broker is a node in the Kafka cluster that persists and replicates the data. A Kafka producer pushes messages into a message container called a Kafka topic, and a Kafka consumer pulls messages from that topic.
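A minimal sketch of a producer and a consumer using the kafka-python client; the broker address and the topic name page-views are illustrative assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: push a message onto the "page-views" topic on an assumed local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", value=b'{"user": "alice", "page": "/home"}')
producer.flush()  # make sure the message actually reaches the broker

# Consumer: pull messages from the same topic, starting from the beginning.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read from the start of the topic
    consumer_timeout_ms=5000,       # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```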