Applying Thoughts

"Sometimes I Win, Other times I learn. but I never lose."

August 26, 2013

What is Hadoop?

Hadoop is a software library framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so it delivers a highly available service on top of machines that are expected to fail.


Hadoop provides a framework to process data of any size using a computing cluster made from normal, commodity hardware. There are two major components to Hadoop:
1. The file system (HDFS), a distributed file system that splits large files into blocks and stores them across multiple computers (sketched in code just below this list), and
2. The MapReduce framework, an application framework used to process large data sets stored on the file system.
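To make the first component concrete, here is a minimal sketch (in Java, since Hadoop itself is written in Java) of a client copying a local file into the distributed file system. The NameNode address and the file paths are assumptions made up for illustration, not part of any real setup; only the FileSystem API calls are standard Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this normally comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");

        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a local file pushed into the cluster's file system.
        // HDFS transparently splits the file into blocks and spreads them over the DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/weblogs.csv"),
                             new Path("/data/weblogs.csv"));

        fs.close();
    }
}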

But what is particularly notable about Hadoop (and Google's MapReduce) is the built-in fault tolerance. It is designed to run on commodity hardware, and therefore it assumes that computers will break frequently. The underlying file system is highly redundant (blocks of data are replicated across multiple computers), and the MapReduce processing framework automatically handles computer failures that occur during a processing job by reassigning the work to another computer in the cluster.


When you deal with a huge quantity of data and your analytics are deep and computationally intensive, and your traditional database fails or has huge limitations, Hadoop provides the solution. It is an open-source software framework developed in Java that supports data-intensive distributed applications. It is driven by the 'Google File System' paper published in 2003 and the 'MapReduce' paper published in 2004 by Google's R&D team. It is often treated as a kind of NoSQL data store, and many organizations are now developing query tools to interact with Hadoop file systems using the Hadoop APIs.

Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, the software breaks that data into pieces and spreads them across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically re-replicated from a known good copy.
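As a rough illustration of how Hadoop "keeps track of where the data resides", the sketch below asks HDFS for the block size, replication factor and block locations of a file. The file path is a made-up example; the calls used (getFileStatus, getFileBlockLocations) are standard parts of the Hadoop FileSystem API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/weblogs.csv");   // hypothetical file loaded earlier

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size: " + status.getBlockSize());
        System.out.println("replicas:   " + status.getReplication());

        // Each block is stored on several DataNodes; if one dies, the NameNode
        // re-replicates that block from one of the surviving copies.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " lives on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}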

In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.  
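To show what "map the operation out, then reduce the results back" looks like in code, here is a hedged sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class names and the input/output paths are made up for the example; Mapper, Reducer and Job are the standard classes from org.apache.hadoop.mapreduce.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on every server, each one reading its own slice of the input.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together and are summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local pre-aggregation on each mapper
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/weblogs.csv"));      // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-out"));  // hypothetical output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}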

Hadoop has advantages over traditional database management systems, especially the ability to handle both structured data, like that found in relational databases, and unstructured information such as video, audio and text. The system can also scale up with a minimum of fuss, and you can run lots of different jobs of different types on the same hardware.

Customers look for enterprise-grade software when deploying mission-critical applications in their data centers. While Hadoop started as an open-source technology, it is quickly becoming a mission-critical enterprise technology that companies view as a competitive advantage and want to deploy, manage and scale quickly. Hadoop is also available in virtual and cloud environments; Amazon, for example, offers a hosted Hadoop framework as a web service (Elastic MapReduce).
 


Some Trivia on Hadoop
 

Volume
1. Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
2. Convert 350 billion annual meter readings to better predict power consumption

Velocity
1. Scrutinize 5 million trade events created each day to identify potential fraud
2. Analyze 500 million daily call detail records in real-time to predict customer churn faster

Variety
1. Structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more
2. Monitor hundreds of live video feeds from surveillance cameras to target points of interest

Logo: a baby elephant
Hadoop Adoption by: AOL, Adobe, Amazon, AT&T, Bank of America, GE, Yahoo and more...
 


Google didn’t stop with MapReduce; it went on to develop other approaches for applications where MapReduce wasn’t a good fit.

Following are a few of these frameworks beyond MapReduce:
 

1. Percolator: Handling individual updates
2. Pregel: Scalable graph computing
3. Dremel: Interactive, ad-hoc analysis of large datasets
 

Open source projects have picked up on these more recent ideas and papers by Google. For example, Apache Drill is reimplementing the Dremel framework, while projects like Apache Giraph and Stanford’s GPS are inspired by Pregel.
