Joomla Jumpstart


What is Hadoop?


Hadoop is a massively scalable, distributed, open source MapReduce system. It's one of the most popular NoSQL database alternatives. It was sponsored by Yahoo, which has one of the largest Hadoop clusters in the world (tens of thousands of nodes).

Hadoop uses MapReduce to process large amounts of data across distributed machines. It really forms the core infrastructure on top of which other services are built. Hadoop also provides a fault-tolerant distributed file system known as HDFS, which stands for (you guessed it) Hadoop Distributed File System.

Hadoop is a lot easier to understand with a concrete example. Imagine you've created a search engine spider. It crawls millions of web sites and pulls the HTML text from each page. How do you effectively store all this text? You could use one gigantic hard drive, but that would be very expensive. How about a distributed file system? You run Hadoop on multiple commodity machines and write the data to HDFS, which automatically distributes it across the machines and keeps replicated copies.
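To make the storage idea concrete, here is a toy sketch of splitting data into blocks and spreading replicated copies over several machines. It is not how HDFS actually chooses where to put blocks; the function name `place_blocks`, the node names, the tiny block size, and the simple round-robin placement are all assumptions made up for illustration.

```python
from typing import Dict, List


def place_blocks(data: bytes, nodes: List[str],
                 block_size: int = 8, replicas: int = 3) -> Dict[int, List[str]]:
    """Split data into fixed-size blocks and pick `replicas` distinct nodes per block."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # Ceiling division: the last block may be only partly full.
    num_blocks = (len(data) + block_size - 1) // block_size
    placement = {}
    for block_id in range(num_blocks):
        # Round-robin (an illustrative assumption): each block starts one node
        # further along, so copies end up spread over the whole cluster.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replicas)
        ]
    return placement


# Hypothetical four-machine cluster and a stand-in for crawled page text.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(b"<html>crawled page text</html>", nodes)
for block_id, holders in placement.items():
    print(block_id, holders)
```

Because every block lives on more than one machine, losing a single machine does not lose any data in this model.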

Now you want to process all this data. It's already stored on a large number of machines, each of which has its own processor, memory, and hard disk space. So instead of writing one big program to process all the data in one place (as a database server such as MySQL would), you write a little program that processes the data exactly how you want it for this task (say, counting the number of links per page). Hadoop can then distribute this little program across the machines. The little program processes the data on the server where it is running and outputs the results, usually a summary of the relevant information.
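As a rough sketch, the "little program" for counting links per page might look like the following. On a real cluster this would typically be a Java Mapper class or a Hadoop Streaming script; the plain Python function here only shows the shape of the idea. The names `LinkCounter` and `map_page` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def map_page(url, html):
    """The 'little program': take one page, emit one (url, link_count) pair."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    yield url, parser.count


# Hypothetical page that happens to be stored on the machine running this code.
page = '<p><a href="/one">one</a> and <A HREF="/two">two</A> <a name="top"></a></p>'
for key, value in map_page("http://example.com/", page):
    print(key, value)
```

Note that the function only ever looks at the one page it is handed, which is what lets many copies of it run on many machines at once.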

Hadoop then collects the small pieces of data output by all of the little programs running on all the different machines and compiles them into a final result set. If one of the little programs fails to execute properly, Hadoop tracks the problem and has another server run the process.
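The collect-and-retry behavior can be imitated in a few lines. This is a single-process simulation, not Hadoop's actual scheduler: the "workers" are ordinary functions, one of which always fails, and every name and number below is made up for illustration.

```python
from collections import defaultdict


def count_links(chunk):
    """Little program: sum link counts per site for one chunk of (site, links) pairs."""
    totals = defaultdict(int)
    for site, links in chunk:
        totals[site] += links
    return dict(totals)


def healthy_worker(task, chunk):
    return task(chunk)


def broken_worker(task, chunk):
    raise RuntimeError("simulated machine failure")


def run_with_retry(task, chunk, workers):
    """Try each worker in turn until one of them finishes the task."""
    for worker in workers:
        try:
            return worker(task, chunk)
        except RuntimeError:
            continue  # this worker failed; give the same chunk to the next one
    raise RuntimeError("all workers failed")


def combine(partials):
    """Merge the small per-chunk summaries into one final result set."""
    final = defaultdict(int)
    for partial in partials:
        for site, links in partial.items():
            final[site] += links
    return dict(final)


chunks = [
    [("example.com", 3), ("example.org", 1)],
    [("example.com", 2)],
]
workers = [broken_worker, healthy_worker]
partials = [run_with_retry(count_links, chunk, workers) for chunk in chunks]
result = combine(partials)
print(result)
```

Even though the first worker fails on every chunk, the final totals come out complete because each failed chunk is simply run again elsewhere.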

As you can see, a NoSQL system like Hadoop is not suited to every database task; it is specialized for processing large data sets such as log files.

To make things more efficient, other programs run on top of Hadoop. Hive, for example, accepts more traditional SQL-like queries and then writes the little MapReduce programs that are distributed across the system to process the data. HBase, by contrast, is a database that stores its data in HDFS.
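To get a feel for what turning a query into little MapReduce programs means, consider a hypothetical query that counts log entries per status code. The sketch below hand-writes the map and reduce steps such a query might boil down to. It is not the code Hive actually generates; the table name `access_log`, the sample rows, and the function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical query being imitated:
#   SELECT status, COUNT(*) FROM access_log GROUP BY status


def map_row(row):
    """Map step: emit (group key, 1) for each log row."""
    yield row["status"], 1


def reduce_group(key, values):
    """Reduce step: COUNT(*) is the sum of the 1s emitted for that key."""
    return key, sum(values)


def run_query(rows):
    # Gather every emitted value under its key (the "shuffle"),
    # then run the reduce step once per key.
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return dict(reduce_group(key, values) for key, values in groups.items())


# Made-up log rows standing in for a large log file.
access_log = [
    {"path": "/", "status": 200},
    {"path": "/missing", "status": 404},
    {"path": "/about", "status": 200},
]
counts = run_query(access_log)
print(counts)
```

The person writing the query only states what they want; the map and reduce functions are the part a tool would write for them.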

 

 

 


Coffee and Cream Publishing