
Apache Hadoop Introduction

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes so that they can process the data in parallel.

This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
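
As a concrete illustration of the MapReduce model described above, here is a minimal word-count job written against the org.apache.hadoop.mapreduce API. It is essentially the standard introductory example for Hadoop MapReduce; the input and output arguments are assumed to be HDFS directories supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in this mapper's input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: after the shuffle/sort, sum the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are HDFS paths supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Each mapper runs against one input block, ideally on the node that stores it, and the framework shuffles and sorts the intermediate (word, 1) pairs before handing them to the reducers, which sum the counts.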

For the installation of Hadoop, see:
Setup And Configure Cluster Node Hadoop Installation

Advantages of Hadoop

Apache Hadoop was born out of a need to process an avalanche of big data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. In order to cope, Google invented a new style of distributed data storage and processing on clusters of commodity computers, known as GFS and MapReduce.

A year after Google published a white paper describing their framework, Doug Cutting and Mike Cafarella, inspired by it, created Hadoop (HDFS and MapReduce), applying these concepts in an open-source software framework to support distribution for the Nutch search engine project. Given that original use case, HDFS was designed with a simple write-once storage infrastructure. (Quoted from MapR.)

Apache Hadoop controls costs by storing data more affordably per terabyte than other platforms. Instead of thousands to tens of thousands of dollars per terabyte, Hadoop delivers compute and storage for hundreds of dollars per terabyte.

Fault tolerance is one of the most important advantages of using Hadoop. Even if individual nodes experience high rates of failure when running jobs on a large cluster, data is replicated across the cluster so that it can be recovered easily in the face of disk, node, or rack failures.

Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a huge variety of tasks that all share the common theme of high variety, volume, and velocity of data – both structured and unstructured.

Apache Hadoop Modules

The Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN – (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
  • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
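
To give a feel for how an application talks to the HDFS module, the sketch below writes and reads a small file through the org.apache.hadoop.fs.FileSystem client API. The NameNode address hdfs://localhost:9000 and the /tmp/hello.txt path are only assumptions for illustration, not values the article prescribes.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsHello {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; on a real cluster this normally comes
        // from core-site.xml instead of being hard-coded.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
          Path file = new Path("/tmp/hello.txt");  // illustrative path

          // Write a small file; larger files are split into blocks and
          // each block is replicated across several DataNodes.
          try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
          }

          // Read it back and copy the contents to stdout.
          try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
          }
        }
      }
    }

The client only contacts the NameNode for metadata such as block locations; the actual bytes stream directly to and from the DataNodes that hold the blocks.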

A single-node cluster means only one DataNode is running, with the NameNode, DataNode, ResourceManager, and NodeManager all set up on a single machine. This makes it easy and efficient to run a workflow sequentially in a smaller environment, as compared to large environments that contain terabytes of data distributed across hundreds of machines.

A multi-node cluster means more than one DataNode is running, with each DataNode on a different machine. Multi-node clusters are what organizations practically use for analyzing Big Data. In real deployments that deal with petabytes of data, the data needs to be distributed across thousands of machines to be processed.

Why is Hadoop important?

Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.

Computing power – Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

Fault tolerance – Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

Flexibility – Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

Low cost – The open-source framework is free and uses commodity hardware to store large quantities of data.

Scalability – You can easily grow your system to handle more data simply by adding nodes.

What are the challenges of using Hadoop?

MapReduce programming is not a good match for all problems – It’s good for simple information requests and problems that can be divided into independent units, but it’s not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.
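
As a rough sketch of why iteration is awkward, the driver below chains one MapReduce job per pass, with each pass reading the previous pass’s output directory from HDFS and writing a new one. This is not any specific algorithm, just the job-chaining pattern: the identity Mapper and Reducer classes stand in for real per-iteration logic, and the /data/iter_* paths and fixed iteration count are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = "/data/iter_0";  // illustrative starting data (tab-separated key/value lines)
        int iterations = 5;             // a fixed number of passes, for simplicity

        for (int i = 1; i <= iterations; i++) {
          String output = "/data/iter_" + i;

          Job job = Job.getInstance(conf, "iteration " + i);
          job.setJarByClass(IterativeDriver.class);
          job.setInputFormatClass(KeyValueTextInputFormat.class);
          // Identity map/reduce stand-ins; a real algorithm (PageRank, k-means, ...)
          // would plug its own per-iteration logic in here.
          job.setMapperClass(Mapper.class);
          job.setReducerClass(Reducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);

          FileInputFormat.addInputPath(job, new Path(input));
          FileOutputFormat.setOutputPath(job, new Path(output));

          // Each pass is a full map -> shuffle/sort -> reduce cycle that must
          // materialize its results as files on HDFS before the next pass
          // can start reading them.
          if (!job.waitForCompletion(true)) {
            System.exit(1);
          }
          input = output;  // the next iteration consumes this pass's output
        }
      }
    }

Every waitForCompletion call blocks until that pass’s output is fully written to HDFS, which is exactly the per-iteration file overhead described above.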

There’s a widely acknowledged talent gap – It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That’s one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop; it is much easier to find programmers with SQL skills than MapReduce skills. And Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware, and Hadoop kernel settings.

Data security – Another challenge centers around fragmented data security issues, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.

Full-fledged data management and governance – Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance, and metadata. Especially lacking are tools for data quality and standardization.

Conclusion

The main issue with big data sets is that as they grow bigger, the time for a given computation increases non-linearly with the size of the data set if you use conventional techniques. Because the computation is sequential, you cannot simply add more servers to cope with that growth.

This is where parallel computing comes in. Hadoop takes a given query or computation apart into several smaller queries and distributes them over a number of computing resources. It does this using MapReduce, which splits a computation into smaller parts, groups and sorts the intermediate results, and then aggregates the results of the smaller computations once they are done.

 
