
26 Jan 2014

Apache Hadoop

Introduction to Hadoop

    Apache Hadoop is an open source software framework developed by Doug Cutting and Mike Cafarella in 2005. It provides storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is a top-level Apache project written in Java, built and used by a global community of users and contributors. It is licensed under the Apache License 2.0. The latest stable version, 2.2, was released on 15th October, 2013.

Elaborating in Detail

What is Apache Hadoop?

    Apache Hadoop is an open source, cross-platform software framework used to store and process large amounts of data. It provides scalable, distributed data storage and processing and runs on standard commodity hardware. Apache Hadoop includes a distributed file system for storing the data it processes.
    In a traditional non-distributed architecture, data is stored on one server and any client program accesses this central server to retrieve it. The non-distributed model has a few fundamental problems. You can mostly scale only vertically, by adding more CPU, more storage, and so on. It is also not reliable: if the main server fails, you have to restore the data from a backup. From a performance point of view, it will not return results quickly when you run a query against a huge data set. In a Hadoop distributed architecture, both the data and the processing are distributed across multiple servers.


History of Hadoop

    The Hadoop framework was developed by Cutting and Cafarella. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Google had pioneered the distributed computing model based on GFS (the Google File System) and MapReduce, and Nutch (open source web search software) was later rewritten to use MapReduce. Hadoop was then branched out of Nutch as a separate project. Today Hadoop is a top-level Apache project that has gained tremendous momentum and popularity in recent years.

The Key Features of Hadoop:

Local storage.
    When you run a query against a large data set, every server in this distributed architecture executes the query on its local machine against its local portion of the data. Finally, the result sets from all these local servers are consolidated.
Faster results.
    In simple terms, instead of running a query on a single server, the query is split across multiple servers and the results are consolidated. This means that the results of a query on a large dataset are returned faster.
Less expensive servers.
    You don't need one powerful server. Instead, several less expensive commodity servers are used as individual Hadoop nodes.
High fault tolerance.
    If any node in the Hadoop environment fails, the dataset is still returned correctly, because Hadoop takes care of replicating and distributing the data efficiently across multiple nodes.
Supports a large number of servers.
    A simple Hadoop implementation can use just two servers, but you can scale up to several thousand servers without any additional effort.
Robust and platform independent.
    Hadoop is written in Java, so it can run on any platform.
Used for unstructured data.
    Hadoop is well suited to unstructured data, where the user does not know the structure of the data or the shape of the query results in advance. Hadoop is not a replacement for an RDBMS.
Distributed, scalable and portable file system.
    HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines; a minimal sketch of writing such a file is shown after this list.
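
The following is a minimal sketch, using the standard HDFS Java client API, of how a file can be written into HDFS with an explicit replication factor and block size. The NameNode address, file path and property values are assumptions chosen for illustration and would differ on a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file into HDFS with an explicit replication
// factor and block size. The fs.defaultFS value is hypothetical and must
// match the NameNode address of your own cluster.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode URI
        conf.set("dfs.replication", "3");                 // replicate each block to 3 nodes
        conf.setLong("dfs.blocksize", 128 * 1024 * 1024); // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/data/sample.txt"));
        out.writeUTF("Hello, HDFS");
        out.close();
        fs.close();
    }
}
```

Because the replication factor is applied per file at write time, Hadoop keeps three copies of every 128 MB block of this file on different Data Nodes, which is what provides the fault tolerance described above.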

Components of Hadoop:

The Apache Hadoop framework is composed of the following modules:
  • Hadoop Common - contains the libraries and utilities needed by the other Hadoop modules.
  • Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
  • Hadoop MapReduce - a programming model for large scale data processing.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers.
Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well – Apache Pig, Apache Hive, Apache HBase, and others.
For end users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig and Apache Hive, among other related projects, expose higher-level user interfaces, namely Pig Latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.





HDFS

HDFS stands for Hadoop Distributed File System, which is the storage system used by Hadoop. Its high-level architecture is described below.


The following are some of the key points to remember about the HDFS:
  • The architecture consists of one Name Node and multiple Data Nodes (servers); each file is split into data blocks that are spread across the Data Nodes.
  • When you dump a file (or data) into HDFS, it is stored in blocks on the various nodes of the Hadoop cluster.
  • HDFS creates several replicas of each data block and distributes them across the cluster in a way that is reliable and allows the data to be retrieved quickly.
  • A typical HDFS block size is 128 MB. Every data block is replicated to multiple nodes across the cluster, and Hadoop internally makes sure that a node failure never results in data loss.
  • There is one Name Node that manages the file system metadata, and multiple Data Nodes (inexpensive commodity servers) that store the data blocks. When a client executes a query, it first contacts the Name Node to get the file metadata, and then contacts the Data Nodes to read the actual data blocks (see the sketch after this list).
  • Hadoop provides a command line interface for administrators to work on HDFS.
  • The Name Node comes with a built-in web server from which you can browse the HDFS file system and view some basic cluster statistics.
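
As an illustration of the metadata/data split described above, the sketch below uses the HDFS Java API to ask the Name Node for a file's metadata and block locations, and prints which Data Nodes hold each block. The cluster address and file path are assumptions for illustration.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: fetch a file's metadata from the Name Node and list the
// Data Nodes that hold each of its blocks. Path and address are hypothetical.
public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");

        // File length, replication factor and block size come from the Name Node.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Length: " + status.getLen()
                + ", replication: " + status.getReplication()
                + ", block size: " + status.getBlockSize());

        // Each block reports the Data Nodes that store a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```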

MapReduce

MapReduce is a parallel programming model that is used to process and retrieve data from the Hadoop cluster.
  • In this model, the library handles many of the messy details that programmers do not need to worry about: parallelization, fault tolerance, data distribution, load balancing, and so on.
  • It splits the tasks and executes them on the various nodes in parallel, thus speeding up the computation and retrieving the required data from a huge dataset quickly.
  • It provides a clear abstraction for programmers: they only have to implement (or use) two functions, map and reduce.
  • The data is fed into the map function as key/value pairs, which produce intermediate key/value pairs.
  • Once the mapping is done, all the intermediate results from the various nodes are reduced to create the final output (the word-count sketch after this list shows both steps).
  • The JobTracker keeps track of all the MapReduce jobs running on the various nodes. It schedules the jobs and tracks all the map and reduce tasks across the nodes; if any of those tasks fails, it reallocates the task to another node.
  • In simple terms, the JobTracker is responsible for making sure that a query on a huge dataset runs successfully and that the data is returned to the client reliably. The TaskTracker performs the map and reduce tasks assigned by the JobTracker, and it constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether to delegate a new task to that particular node.
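
To make the map and reduce steps concrete, here is a version of the classic word-count job, close to the example shipped with the Hadoop documentation: the map function emits an intermediate (word, 1) pair for every word of the input, and the reduce function sums the counts for each word. The input and output HDFS paths are passed in as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count job: map emits (word, 1) pairs, reduce sums the counts.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate key/value pair
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // final (word, count) pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the same reducer class is reused as a combiner, so partial sums are already computed on each node before the intermediate results are shuffled to the reducers.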

Supporting Evidence/Examples

Uses of Hadoop Today:

Yahoo!

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is used in every Yahoo! Web search query.
There are multiple Hadoop clusters at Yahoo! and no HDFS file systems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.
On June 10, 2009, Yahoo! made the source code of the version of Hadoop it runs in production available to the public. Yahoo! contributes all the work it does on Hadoop to the open-source community. The company's developers also fix bugs, provide stability improvements internally and release this patched source code so that other users may benefit from their effort.

Facebook

In 2010 Facebook claimed that it had the largest Hadoop cluster in the world, with 21 PB of storage. On June 13, 2012 it announced that the data had grown to 100 PB, and on November 8, 2012 it announced that the warehouse was growing by roughly half a PB per day.

Other users

As of 2013, Hadoop adoption is widespread. For example, more than half of the Fortune 50 use Hadoop.

Contradictory Examples

  • The only benefit of using Hadoop is scaling. If you have a single table containing many terabytes of data, Hadoop might be a good option for running full table scans on it. If you don't have such a table, avoid Hadoop: it isn't worth the hassle, and you'll get results with less effort and in less time if you stick to traditional methods.
  • Hadoop has no conception of indexing; it only supports full table scans. Hadoop is full of leaky abstractions.
  • Lacking indexes, Hadoop finds data more slowly than SQL databases such as PostgreSQL. It is only fast and useful when the data is unstructured or stored in a single table.

Dispels the Contradictory Examples

Hadapt believed differently, and for the last four years it has been espousing a contradictory vision. Instead of viewing Hadoop and database systems as complementary, Hadapt has viewed them as competitive, and has championed the idea of bringing high performance SQL to Hadoop in order to create a single system that can handle both structured and unstructured data processing. In 2008 the team started building a system called HadoopDB that does exactly this, and by March 2009 the initial prototype was complete and the work was submitted to VLDB. The work was accepted and published at VLDB, and Hadapt was founded shortly afterwards (in 2010) to productize this defiant vision.

Reaffirming the Trends

Hadoop is making its way into the enterprise, as organizations look to extract valuable information and intelligence from the mountains of data in their storage environments. The way in which this data is analyzed and stored is changing, and Hadoop has become a critical part of this transformation. 

Conclusion

  • Hadoop started out as an open source effort to replicate the system described in the MapReduce research paper that was published by Google in 2004.
  • It started gaining steam in 2006 and finally got adopted by several major Web enterprises for use in production in 2008.
  • By 2009 it became clear that Hadoop was going to be a major force to be reckoned with for processing unstructured data.
  • Between then and now, just about everybody in the industry has come to agree that Hadoop and database systems are perfectly complementary.
  • Hadoop can be used for processing unstructured data, ETL-style transformations, and one-off data processing jobs, while database systems can be used for fast SQL access to structured data.
  • Data can be shipped between Hadoop and relational database systems over a connector. For example, a Hadoop job can be run to structure the data, after which it is sent to a relational database system where it can be queried using SQL (a simplified sketch follows).
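
As a simplified illustration of the last point (a dedicated connector such as Apache Sqoop would normally be used instead), the sketch below reads word-count output from HDFS and loads it into a relational table over JDBC so it can be queried with SQL. The JDBC URL, credentials, table name and file path are assumptions for illustration only.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified sketch: copy word-count output (word<TAB>count lines on HDFS)
// into a relational table so it can be queried with SQL. The JDBC URL,
// credentials, table name and file path are hypothetical.
public class HdfsToRdbms {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/output/part-r-00000"))));
             Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/warehouse", "user", "password");
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO word_counts (word, total) VALUES (?, ?)")) {

            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");   // MapReduce's default key/value separator
                insert.setString(1, fields[0]);
                insert.setInt(2, Integer.parseInt(fields[1]));
                insert.executeUpdate();
            }
        }
        fs.close();
    }
}
```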
