Introduction to Hadoop

Sitharanishadini
4 min read · May 20, 2020

Have you heard of Hadoop before? It's OK if you haven't. Let's get some idea about Hadoop today.

Hadoop

Simply put, Hadoop is,

“An open-source software framework from Apache for storing and processing massive datasets at large scale, in a distributed manner, on a cluster of cheap machines.”

Hadoop was created by Doug Cutting and Mike Cafarella in 2005 to support one of their projects (the Nutch search engine), and was further developed at Yahoo.

With traditional systems, it was hard to deal with large and complex data. Further, in this fast-paced world, systems constantly need to handle lots of data generated at high speed and in different formats; sometimes that data is unstructured. Hadoop was introduced to overcome these problems and became a cost-effective, innovative solution for storing and analyzing big data.

Hadoop has a master-slave topology; in other words, it is a distributed system in which a master node coordinates the work of slave (worker) nodes.

Big Data

Why do we need Hadoop?

Mainly, we use it for managing big data: collecting, analyzing, processing, and storing the large amounts of data generated. Further, Hadoop is used by different companies as a research tool to find answers to their business questions. Hadoop has the ability to accept different kinds of data from different sources.

Hadoop is easy to use as it is written in Java, the language with one of the biggest developer communities. The framework takes care of how the data gets stored and processed in a distributed manner, so we mostly just need to worry about our driver programs.

Following are some occasions where Hadoop is used.

  1. Financial Trading and Forecasting.
  2. Understanding and Optimizing Business Processes.
  3. Improving Healthcare and Public Health.
  4. Understanding Customer Requirements.

So now I believe you have an overview of Hadoop and why we need it. Let's go into some deeper details.

Modules in Hadoop

Hadoop contains the following modules in its ecosystem.

Hadoop Ecosystem
  1. Hadoop Distributed File System (HDFS) : A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS creates multiple replicas of data blocks, distributes them across compute nodes in the cluster, and can access the distributed data in parallel. This distribution enables reliable and extremely rapid computations.
  2. Hadoop MapReduce: A computational model and software framework for writing applications that run on Hadoop. It splits a job into small tasks, assigns those tasks to many computers joined over the network, and assembles all the results to form the final output data set. It has the ability to process large amounts of data (a minimal word-count sketch is given a little further below).
  3. Hadoop YARN: The resource manager. It monitors and manages cluster resources, allocates them to the running applications, and schedules the incoming jobs for execution.
  4. Hadoop Common: The Java libraries used to start Hadoop and the utilities needed by the other Hadoop modules.
Hadoop Modules
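
To make the MapReduce model a bit more concrete, here is a minimal word-count sketch in Java against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce). Treat it as a sketch rather than a production job; the input and output paths are just whatever you pass on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper receives one line of text at a time and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reducer receives every count emitted for one word and sums them up.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // The "driver program": configures the job and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The main method here is the kind of "driver program" mentioned earlier: the framework handles the distribution, and the mapper and reducer instances run on the worker nodes that hold the data blocks.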

To know how these modules work, you can refer to the link here.

Apart from the above modules, Apache Hadoop also includes some other tools: Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.

There are four types of nodes in a distributed cluster: Name node, Data node, Master node, and Slave node. In Hadoop, scaling, storage, and processing are all done in parallel.
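
As a rough illustration of how a client talks to such a cluster, here is a small Java sketch using Hadoop's FileSystem API to write and read a file in HDFS. The Name node is consulted for metadata and block locations, while the actual bytes flow to and from the Data nodes. The hdfs:// address and file path below are placeholder values; in a real cluster they would come from core-site.xml and your own layout.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder Name node address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/hello.txt");

    // Write a small file; HDFS splits it into blocks and replicates them on Data nodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, Hadoop!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back; the client asks the Name node where the blocks live,
    // then streams the data directly from the Data nodes.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}
```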

Features of Hadoop

  1. Fault tolerance : The Hadoop framework has been designed to detect and handle failures at the application layer. Further, the Hadoop file system replicates, or copies, each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others. So when one node goes down, the process does not fail, because the data is always available on one or two other nodes (a small replication sketch is given after this list).
  2. High availability : In a Hadoop cluster, when the active Name node goes down, there is always a standby Name node ready to take over. Hadoop 3.0 supports more than one standby Name node.
  3. Recoverability : The application master restarts a worker on a different machine when that worker fails, and YARN restarts the application master if it fails itself.
  4. Consistency : This point mainly concerns object stores used with Hadoop (such as cloud storage) rather than HDFS itself. Object stores are generally eventually consistent: it can take time for changes to objects (creation, deletion, and updates) to become visible to all callers, and there is no guarantee a change is immediately visible even to the client that just made it.
  5. Scalability : Hadoop clusters can easily be scaled out by adding additional cluster nodes, and this does not require any modification to the application logic. This is known as horizontal scalability.
  6. Performance : Hadoop divides the input data file into a number of blocks and stores those blocks over several nodes. It also divides the task that the user submits into various sub-tasks, assigns them to the worker nodes containing the required data, and runs those sub-tasks in parallel. Hence it can achieve high performance.
  7. Security : There are many security issues related to Hadoop. At the beginning, the main focus was managing big data, not security. With newer versions, security has been improved to a certain extent, but it can seem inconsistent across versions because of the different functionalities added to Hadoop. Data can be encrypted while in transit as well as at rest.
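
As a small illustration of the replication behind the fault tolerance point above, the sketch below sets and inspects a file's replication factor with the same FileSystem API. This is only a sketch; the path is a placeholder, and in practice the default replication factor usually comes from the dfs.replication setting in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/hello.txt"); // placeholder path

    // Ask HDFS to keep 3 copies of this file's blocks on different nodes,
    // so losing a single Data node does not lose the data.
    fs.setReplication(path, (short) 3);

    FileStatus status = fs.getFileStatus(path);
    System.out.println("Replication factor: " + status.getReplication());
  }
}
```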

Big data handling is one of the trending things in the IT field. As there are not many specialists, the opportunities are huge. Start learning.
