Big Data

The 2014 Hadoop Summit – The Elephant is no longer a Bay Area phenomenon

At the Hadoop Summit in San Jose my most immediate observation was exactly this: companies from mainstream industries walking the summit. That was a sign – a sign that Hadoop and its ecosystem are no longer a Bay Area phenomenon. A few quick conversations made it obvious that Hadoop has gained adoption. Adopters are not necessarily on the bleeding edge (no, not everybody is a Twitter, Facebook or LinkedIn), but that's in fact a good thing. People came with stories about use cases where they believed the Hadoop stack would help them. The growth in adoption – and a probably correlated event, the growth in attendance at the summit – seems nothing short of phenomenal. As Doug Cutting put it, it was way beyond what he had imagined.

These are, in my view, some of the key highlights and sound bites of the summit:

1) Apache YARN – as the Data Operating System. YARN brings to Hadoop the ability to run multiple, concurrent managed applications, of which MapReduce is just one. I think Apache YARN as a Data OS is a little bit of a stretch. My own personal view is one of Hadoop and YARN being more of a Big Data OS foundation. But there's no denying that YARN is a key turning point in application and resource management.

2) Data Lake – I am not sure about this one :-). I lean more towards Merv Adrian’s Data “Reservoir” – but like Merv hinted, that’s probably not a battle that is going to be won.

3) Interactive/Low Latency: Clearly this is what is being talked about as "Big Data 2.0". This is more of the "bleeding edge". This is taking Hadoop outside its batch-processing origins.

4) SQL on Hadoop (Analytics on Hadoop): This really goes with 3), but it deserves a spotlight of its own. It's obvious that there is high market demand to run analytics with a tested and popular language such as SQL.

5) In-Memory – Again, this goes hand in hand with 3), but given the sheer number of products and technologies, it deserves a spotlight of its own.

6) Security – Two acquisitions in this space, by Hortonworks and Cloudera, made security a key topic.

7) HBase as the “default” NoSQL – I was really surprised at how much HBase has grown in terms of performance and features (specifically Phoenix). All vendors seem to be actively supporting and pushing HBase.

My favorite sessions:

1) By far my favorite session was "Hadoop Puzzlers" by the Cloudera team – very engaging, entertaining and very insightful – and judging by the "standing room only" audience, this may have been a crowd favorite.

2) The RocketFuel folks did a great job walking through best practices for deploying Hadoop in their Had"ops" or Had"oops" talk.

3) The Yahoo team did a great job on their Data Discovery talk giving some very insightful information on HCatalog. I definitely came away thinking I needed to explore that more.

4) I also had the opportunity to sit down with Arun Murthy to discuss his long-term vision for YARN – all I can say is keep an eye out for more YARN features (hint: even less justification for Hadoop in a virtualized environment?).

The Parquet session gets some high points – I think this is a powerful capability that will have a significant impact in the analytics world.

Overall a great conference; my only complaint is that getting the session length right seems to be a real art – it never comes out quite right. 40-minute sessions seem like the right midpoint between too short and too long (1 hour). However, most of these were technical sessions, and I definitely think technical sessions need to be closer to an hour. You may have to reduce the number of sessions, but the quality would go up. There were a lot of assumptions in most talks, which I am guessing is OK for the Silicon Valley crowd, but probably not for others.

There was a general sense that the technology disruption is too fast, too furious – everybody realizes that playing catch-up is almost impossible.

In summary, Big Data 2.0 has arrived, the Elephant roams freely outside the Bay Area and – dare I say this – Hadoop has maybe become approachable?


History of (Real) World Data – Why NoSQL Matters

In a world where SQL is perceived to have overcome many challenges, NoSQL battles on. Relational models and SQL have certainly withstood the test of time, but #NoSQL has its place.

I have seen my fair share of skeptics. I have myself been a skeptic. But then I educated myself.

As an aside, I think a big part of the problem here is the name – "NoSQL" (dear Shakespeare, I am sorry, but apparently there is something in a name). It's provocative to say the least, and it was misinterpreted from day 1. Anyway, let's put that aside for a moment and talk "relational".

History Lesson

To truly get a better perspective, we need a trip down memory lane, and this needs to start with some history lessons, especially around these concepts: logical database, physical database, relational model, SQL, normalization and ACID – and oh, be prepared for some myth busting.

We all know relational databases through the notion of collections of tables or relations. But many may not be familiar with the terms “logical” vs “physical” databases. These are in fact not two separate things, but two different views or perceptions. The logical view of relations or tables is one as perceived by the “end user” while the physical database is the view of the data from the relational DBMS perspective (Data management software). This separation is significant because this allows the implementation of how the data is physically stored to be independent of how the data is viewed and retrieved by the end user. The logical interface to the end user is based on an implementation of the relational model.

Now, the term relational database comes from E.F. Codd's seminal paper "A Relational Model of Data for Large Shared Data Banks". What Codd did was show the world that data could be stored electronically in a data "bank" or database and retrieved repeatedly and predictably by users without knowing how the data was stored. This was a ground-breaking concept at the time, because mainframes were the electronic banks of data then, and retrieving the data required specialists who knew how it was stored.

What exactly is a relational model? Per CJ Date, “It is a formal and rigorous definition of what the data in a database should look like to the user” – in short, it should look relational (with rows and columns) and relational operators should be available for operating on this data.

And if you are going to be programming against databases, you need a language to program them with, right? That's where SQL comes in – SQL is the standard query language for relational databases.

Of course, Codd was a scientist and his work was grounded heavily in mathematical concepts – the notion of relations based on the underlying set theory. This was really the first time databases and database operations were describable as a mathematical system. However, (here’s myth buster # 1):

There is no relational database that is a complete and rigorous implementation of Codd's relational theory. In fact, Codd's famous 12 rules were developed to try and prevent the dilution of his original relational theory, as he believed that the products being brought to market were making some convenient compromises.

Accompanying that is myth buster number 2:

The SQL language is considered to be the primary culprit in allowing the original relational theory to dilute. A real relational language that enforced all of the requisites would have taken many more years (opinion), so a quick-and-dirty but user-friendly language, SQL, was formulated that did not adhere to all the prerequisites of Codd's relational theory (that's a fact). Of course, it was still a success – you can't argue with a 40-billion-dollar industry on that, can you?

Now let's talk about normalization. Normalization is popularly used to "minimize" or "remove" redundancy of data. But that's only half true. Normalization is more importantly used to prevent update, insert and delete inconsistencies (or "modification anomalies") in database design. The net result is that information that needs to be stored goes through stages of normalization and is represented in various tables with integrity constraints. Normalization plays a key role in achieving one of the ACID properties in the context of a transaction – namely "Consistency" – all database operations must leave the data in a consistent state.

ACID was defined by Jim Gray for reliable transaction processing, the characteristics of which are a) Atomicity (transactions are all or nothing), b) Consistency (transactions leave data in a consistent state, or take it from one consistent state to another), c) Isolation (concurrent execution results in the same state as if the transactions were executed individually) and d) Durability (once committed, a transaction persists as the final state, even through a power loss).

Great, you say, so what?

Ok, I am getting there. Let’s summarize the history lesson:

1) The relational model provided a way for a mathematical system to represent data and operations on that data

2) No RDBMS implements all the rigid requisites of the relational model

3) SQL is great but is actually not very compliant with the relational model

4) In the end, the combination of SQL and the relational model provided a way to separate physical storage from the end user's way of interacting with the data – something mainframes couldn't provide

BUT, here's the catch (or myth buster #3) – nothing and no one can vouch that the relational model is the "right" way of representing real-world data. Or better put – real-world data characteristics often translate poorly to relational databases.

But I don't care, because we can't do without ACID, can we?

Yes, ACID is great (#awesome in fact) and is easily achievable "if", and that's a big IF, you keep your data managed as a single system. The fact is, today, the sources of data that many organizations want to use are so many, and the data volumes themselves so large, that processing that data cannot be done using the monolithic architectures of yesterday. The moment you want to deal with data fire hoses that go beyond the capabilities of a single system (aka distributed), ACID gets harder and harder to comply with.

The Real World

Electronically representing data from the real world is an exercise in translation.  The greater the diversity of the data you want to capture electronically, the more complicated and bungled this translation gets. In the early days, there was an emphasis on the type of data that was important to capture – mainly transactions. In its origins, transaction processing was on mainframes. Then came the power of client server architecture and distributed web applications. What changed along the way was the need to capture things beyond just transactions – businesses wanted to capture and store more in-depth data on people, devices, products, models, maps and many other things from the real world.  A single data format never could and can never store the variety and variations of the real world.

Let's just take a look at a few examples that help illustrate these points:

1) We are all familiar with ERP systems and CRM systems. Did you know that ERP systems like SAP today can contain close to 100,000 tables to manage an organization's enterprise resources? Less than half of that is actual attribute data about the resources in the enterprise – and many of those tables are likely the result of normalization. The rest of it is "hook up" data (cross-reference tables, lookup tables, etc.). Really? 100,000 tables. Does it fit all the design consistencies and normal forms and maintain referential integrity? Sure – but this is a failure in data management (#EpicFail).

2) Let's take a simpler but poignant example. Almost everyone is familiar with the notion of storing customer names and addresses. How do you decide to create attributes such as last name and first name? What if someone has more than 4 name attributes? By going with the ubiquitous first name and last name, are you choosing to just model part of that attribute? Is that right? Is the "middle initial" a compromise? Well, it certainly doesn't help the person with a 4-part name. What about the address field? Address Line 1, Address Line 2? Who decides where the break is if the address has 6 lines? (A small sketch after this example makes the point concrete.)
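
To make this concrete, here is a minimal, purely illustrative sketch in plain Java (the class name and values are made up) of a document-style customer record where name parts and address lines are open-ended lists rather than fixed columns:

import java.util.List;
import java.util.Map;

public class FlexibleCustomerSketch {
    public static void main(String[] args) {
        // Hypothetical document-style record: name parts and address lines are
        // open-ended lists, so a 5-part name or a 6-line address needs no schema change.
        Map<String, List<String>> customer = Map.of(
                "nameParts", List.of("Maria", "del", "Carmen", "Gonzalez", "Lopez"),
                "addressLines", List.of("Flat 3B", "Rose Court", "12 Hill Road",
                        "Fort", "Mumbai 400001", "India"));

        // A fixed first_name / last_name / address_line_1 / address_line_2 schema
        // forces an up-front decision on where values like these get split or truncated.
        System.out.println(customer);
    }
}

Document and key-value stores treat this kind of record as a first-class citizen; squeezing it into a rigid relational schema means either truncating it or spawning extra child tables.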

The data in the real world is less rigid (malleable perhaps) and very contextual.

You don't believe me? Lift your head up and look around – this is the real world:

[Image: Name – Real World]

It is precisely these kinds of problems that companies like Facebook, Twitter, Google and others want to avoid. Why and how would you represent a social network using a relational model? When your social network is a network of one billion unique people (about a seventh of the world's population) from all over the world communicating and interacting in real time, is a pre-defined, forced schema going to work? The answer is no – it's not. In fact, is a single type of data model going to work? The answer is no again.

And that’s the point – One size does not fit all and in fact today, one size fits less and less for representing the real world. So get over it – relational is not the answer to everything and anything.

If you want more thought-leadership viewpoints from the real world, I suggest that you read the following two articles:

Your Coffee Shop doesn’t use Two-Phase Commit

Why Banks are not ACID – Availability is Revenue

So NoSQL has its place – in many cases as a better choice compared to SQL databases or often to co-exist with them.

Pulling off the training wheels … A first MapReduce Job

Remember that feeling as a kid before your first bike ride without training wheels? I do. It was a tale of mixed emotions – initially excited and enthralled, a feeling of independence. Then, slowly, there are some flashes of anxiety – a picture of a few falls and physical bruises (and mental ones, as others watched you fall). But then you see someone whiz by you – and you regain your resolve and know it's all worth it!

So I decided to take my Hadoop training wheels off and roll up my sleeves for my first MapReduce application. It's been a while since my last post on setting up a Hadoop cluster from scratch. As a refresher, here's the link to that. I will be using that cluster for this job.

The example is straight out of Tom White's book: "Hadoop: The Definitive Guide".

The Data Set

The data set is a weather data set from the National Climatic Data Center – weather information stored in a fixed-width ASCII format. NCDC has weather data collected from weather sensors going all the way back to 1901. I had some difficulty downloading all the files, and after several attempts I decided to deal with the data from 1901 to 1960. This still amounted to about 36 gigabytes of data, something that could be equated to a reasonably sized database. The goal of the MR job is to find the highest recorded temperature for each year.

All files are gzipped by year and organized by weather station; within each file, readings are organized by date. For each year, there is a file per weather station. Here is an example for 1990.

010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz

Each file has entries that look like this:

0029029070999991905010106004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-01391+99999102641ADDGF102991999999999999999999

Not very readable, right? You have to know where the field delimiters are and that definition is available from NCDC.

For the purpose of this example, let's focus on a few of the fields. The first (029070) is the USAF weather station identifier. The next (19050101) represents the observation date. The next item of interest (-0139) represents the air temperature in tenths of a degree Celsius, so the reading of -0139 equates to -13.9 degrees Celsius. The single digit that follows the temperature is a reading quality code. Lastly, a value of 99999 at a point where an entry should exist signifies a missing value.
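
As a rough, illustrative sketch (the character offsets below are the ones used in the book's example and should be double-checked against the NCDC format definition), pulling these fields out of a record is plain fixed-width substring work:

public class NcdcRecordSketch {
    // Illustrative only: offsets follow the book's example; verify them against the NCDC format definition.
    public static void parse(String line) {
        String stationId = line.substring(4, 10);    // USAF weather station identifier
        String date = line.substring(15, 23);        // observation date, yyyyMMdd
        // Air temperature is signed and recorded in tenths of a degree Celsius
        int airTemperature = (line.charAt(87) == '+')
                ? Integer.parseInt(line.substring(88, 92))
                : Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);     // reading quality code
        System.out.printf("%s %s %.1f C (quality %s)%n",
                stationId, date, airTemperature / 10.0, quality);
    }

    public static void main(String[] args) {
        // The sample record shown above prints: 029070 19050101 -13.9 C (quality 1)
        parse("0029029070999991905010106004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-01391+99999102641ADDGF102991999999999999999999");
    }
}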

Processing the data set

Knowing this much, we can talk about the processing of this data, which is quite straightforward. We extract two fields: the air temperature and the quality code. We make sure the quality code is good and that the reading is not "99999". If it passes and is higher than anything seen so far for that year, we record it in a variable as the highest temperature so far. At the end of the year's processing, the last recorded value in that variable is the highest temperature. We spit out the year and the value divided by 10, along with the time taken. The output looks like this:

1901 12.2

1902 11.5

1905 13.9

and so on

Time : 1395 seconds

Processing the Old-fashioned Way

There are several approaches you can take here. I took Tom’s awk script, modified it slightly for my own environment and needs and ran the script against the data set. This is the simplest and probably the most common approach you would apply to such a data set.

If you are bold enough and have the time to try, you could build a relational data model out of this and write your favorite application (or stored procedure) to perform the logic. I don't know why, but you could :-).

Now remember, this is 36 GB of raw data. I ran the script on the beefiest machine in my cluster. The entire crunching with an awk script took a little over 23 minutes.

Processing with Hadoop

Before describing the processing with Hadoop, I want to point out a few things:

  • I did not really tune the cluster itself – obviously it's a bare-bones cluster running on 3 laptops on a shared network. You can't really do too much with that.
  • It's important to keep in mind that Hadoop works well with large files. Therefore, to take advantage of that, again per Tom's book, I created one large file for every year and labeled them "1901.all", "1902.all" and so on. I therefore now have weather files that are gigabytes in size.

For my Hadoop cluster, the volume that Hadoop recognizes is set to /home. The default settings in Cloudera Manager are set to /. This is important to know because in many default Linux installations the root partition is small, around 50 GB, and the rest of the disk is under a single partition. So I have my namenode directory under /home/dfs/nn, my data node directory under /dfs/dn1, etc. I also set the log configurations to "ERROR" instead of "INFO" to save space.

I then distributed these files into my Hadoop cluster.

sudo -u hdfs hadoop fs -mkdir -p /examples/input
sudo -u hdfs hadoop fs -mkdir -p /examples/output1   (and output2, output3, etc. for subsequent runs)
sudo -u hdfs hadoop fs -put /home/rnair/WeatherFiles/*.all /examples/input

The java code to run this is also quite bare bones but it has a Mapper and a Reducer – just what we need!

For the mapper, the input split is a line from the file, with the byte offset as the key and the text of the line as the value. The offset is meaningless for us; we only care about the line itself. However, unlike the awk job, the map tasks now run in parallel across the data nodes. The output/emit of the map tasks is a key-value pair of (year, temperature), as in:

(1905, 11.3), (1905, 13.9), (1905, 12.2)

(1906, 10.3), (1906, 10.9), (1906, 12.5)

The shuffle step then passes the following to the reduce task:

(1905, [11.3, 13.9, 12.2]), (1906, [10.3, 10.9, 12.5])

After all the map tasks are complete, the Reducer task is called, and in our case all the reducer does is find the max reading for each key (which is the year in our case).
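
For reference, here is a condensed sketch of what that Mapper and Reducer look like – essentially the book's MaxTemperature example written against the org.apache.hadoop.mapreduce API (treat the missing-value constant and quality-code check as details to verify against your copy of the book):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: the key is the byte offset of the line (ignored), the value is the line itself.
// Emits (year, temperature in tenths of a degree) for readings that pass the quality check.
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature = (line.charAt(87) == '+')
                ? Integer.parseInt(line.substring(88, 92))
                : Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

// Reducer: receives (year, [temp, temp, ...]) after the shuffle and keeps the maximum.
class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}

A small driver class then wires the two together with the input and output paths and submits the job; dividing the reducer's output by 10 gives the temperatures in degrees Celsius shown above.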

That’s it. I didn’t have to put in any code to record timing, because the JobTracker UI has all that information.

And So..?

So how fast was it on a poor man’s cluster of 3 laptops on a regular shared network? It took a little over 6 minutes. 

The result is compelling for at least one reason: it showed performance boosts for a relatively small "volume" of data. 36 GB is no small measure, but it's certainly not the terabytes or petabytes that Hadoop is normally associated with. I got at least a 3x performance boost with this data set (23 minutes down to just over 6). This means that when you go into terabytes, the results should look even better, because processing on a monolithic machine will reach a point of diminishing returns at a certain scale.

I would venture to say that it's not too far from the kind of processing we do with our data – many of us deal with a lot of ETL/data-integration processes that can benefit from applying Hadoop's power to them.

I am glad I pulled off those training wheels – I did hit a few bumps, but now I am exhilarated.

A (Data) Platform storm is brewing..

With all the movement around Big Data and its related technologies, one challenge companies either face today or will face soon is how to manage their requirements with so much disruption and evolution in the technology space.

This is where the power of the platform comes into play. Platforms allow you to focus on your business needs and hide the complexity of technology challenges. This is a very important consideration.

To demonstrate that this movement is getting legs, I am going to highlight three insights using a three-year timeline starting in 2011.

In 2011, Accenture delivered a research project called Technology Vision 2011, where they found that data is taking its rightful place as a platform inside enterprises.

In 2012, enterprises started acting on this; a very telling example is Disney, which you can read about in this article.

Fast forward to today in 2013: at Big Data convention after convention, Doug Cutting makes it clear that Hadoop and its ecosystem are becoming more and more a Big Data OS.

This article about the Chicago Big Data Conference shows Big Data as a platform to be one of the top threads.

With so many new tools and technologies in this space, you don't want to be caught in the trap of adopting and managing a disparate set of tool sets. A platform allows you to focus more on service and product needs without being exposed to technological churn.


Setting up a Hadoop cluster on CentOS – Oppum Cloudera Manager style!!

OK, so it's a cheesy title, but hey, it grabs attention, right? Hadoop clusters are serious business, so we need to inject a little bit of fun! Well, the first fun part is that this is a very cheap, not-for-production type of cluster. I don't have a couple of racks to try out the full power of Hadoop. My intention was to go through a complete "soup to nuts" experience, covering hardware, OS and the Hadoop stack. But I wanted to experience setting up a cluster with the resources at hand – which was a bunch of laptops that my friends in IT procured for me.

My end result? Three laptops (two old Dells and one IBM ThinkPad) with the Cloudera Hadoop distribution. I am posting all the steps I went through for those interested in going through a similar experience. This is a pretty detailed post (18 steps), but at the end you should have your own complete Hadoop stack.

The Cloudera Manager installation is "generally" all UI-based and click-through. However, it's very sensitive to your machines' networking.

You have to get a few things absolutely right (Goldilocks style) – especially relating to your hostnames, DNS, ports etc.

To begin with, my machine configurations were as follows:

1st machine (IBM ThinkPad): Intel quad core, 16 GB RAM, 500 GB disk

2nd machine: Intel dual core, 8 GB RAM, 80 GB disk

3rd machine: Intel core 2 duo, 2 GB RAM, 80 GB disk

What? You didn’t think cheap meant that cheap? Yup – but remember – this is a purely prototyping instance.

I loaded these laptops with CentOS v6.2. I am going to start with the CentOS installation process:

1) I used NetInstall (which means you need an Ethernet or wireless connection), where I created a boot CD and the media was downloaded via a URL: http://mirror.centos.org/centos/6/os/x86_64

2) NOTE 1: When you are asked to provide a hostname, make sure to provide a fully qualified real hostname instead of localhost.localdomain. You can change it later, but it's better to get it done here.

3) Click through most of your install

4) Where you are asked to provide a root password, I recommend using the same root password on all the machines that will be part of the cluster – to keep it simple – unless you want to generate RSA public and private keys for SSH.

5) You now need to make sure you have the Oracle Java SDK installed on your machine (for CDH 4, version 1.6.0_31 is the certified version, so I recommend using that or anything after it).

  • First check if you have other versions of Java installed (most likely OpenJDK 1.6 or 1.7)
  • If you have OpenJDK, then do yum remove java-1.6.0-openjdk (if you have another version of Java, substitute that version)
  • If you have some other Java, do what you need to uninstall it
  • Download Oracle Java from the link below (this is version 6; browse to your appropriate version from this site)
  • http://www.oracle.com/technetwork/java/javase/downloads/jdk6u35-downloads-1836443.html
  • chmod +x jdk-6u35-linux-x64-rpm.bin
  • ./jdk-6u35-linux-x64-rpm.bin
  • Set the alternatives if you want to:

# alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_35/jre/bin/java 20000

# alternatives --install /usr/bin/jar jar /usr/java/jdk1.6.0_35/bin/jar 20000

# alternatives --install /usr/bin/javac javac /usr/java/jdk1.6.0_35/bin/javac 20000

# alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.6.0_35/jre/bin/javaws 20000

# alternatives --set java /usr/java/jdk1.6.0_35/jre/bin/java

# alternatives --set javaws /usr/java/jdk1.6.0_35/jre/bin/javaws

# alternatives --set javac /usr/java/jdk1.6.0_35/bin/javac

# alternatives --set jar /usr/java/jdk1.6.0_35/bin/jar

6) Open a terminal window and type uname -a or hostname (this should show the hostname that you specified during installation)

7) On each machine, first disable SELinux via /etc/selinux/config

8) Open /etc/hosts (you will need to sudo or su). Yours may look like this:

127.0.0.1              localhost.localdomain    localhost

::1                           localhost.localdomain localhost localhost4

Delete the ::1 line, and add your own host entry if it's not already there. Then proceed to add the other machines that will be part of the cluster.

E.g.: 192.168.x.x    Hadoop-main.yourdomain.com    Hadoop-main

192.168.m.n    Hadoop-node2.yourdomain.com    Hadoop-node2

Also open /etc/resolv.conf and, for the "search" entry, add domain.local. E.g.:

domain    yourdomain.com

search    yourdomain.com    domain.local

Repeat these entries on the other nodes in the cluster so the IPs/names are resolved properly.

9) You will most likely need to make changes to your firewall to allow access to the Cloudera Manager ports 7180 and 7182. The best way is to restrict these ports to hosts in the cluster. On all the machines that Cloudera Manager will connect to, you can do the following:

iptables -A INPUT -p tcp -s 192.168.x.x/24 --dport 22 -j ACCEPT

iptables -A INPUT -p tcp -s 192.168.x.x/24 --dport 7180 -j ACCEPT

iptables -A INPUT -p tcp -s 192.168.x.x/24 --dport 7182 -j ACCEPT

10) Save it: service iptables save

11) NOT RECOMMENDED but useful for testing: If you have the desktop, you can go into System -> Administration -> Firewall and disable the firewall (hit Apply).

After steps 6-9, reboot.

12) Open a terminal window and get the Cloudera Manager Free Edition:

wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin

13) Make it executable: chmod +x ./cloudera-manager-installer.bin

14) ./cloudera-manager-installer.bin – this should launch the installer, and the rest is history :-)

15) You should now automatically be launched into the admin console of CM.

16) If you want to use your browser from a Windows laptop, you will either have to use the IP address with port 7180 (http://<ip address>:7180) or edit your Windows etc/hosts file to add the CM host. I suggest you do the latter, because the URL references are all based on the CM hostname and you would otherwise need to edit the URL with the IP for every reference.

17) To add your Linux machine alias to the Windows etc/hosts file, do the following:

  • Search for Notepad in your start menu – right click and do "Run as administrator" for Notepad. This step is important because Windows will not let you edit the etc/hosts file otherwise
  • Browse to C:\Windows\system32\drivers\etc and open the hosts file
  • Add the IP and hostname of the Linux machine running CM

18) Remember where you may have disabled the entire firewall? If everything works after this, then you can reconfigure the firewall with the right ports enabled. If you have the cluster manager running as a web app on one of the machines, you may have to open the web server port so you can access it from your Windows laptop.

[Image: Cloudera-Mgr-Laptop]

That's it! Your own Hadoop cluster on a budget!! In my next post I will attempt to run the weather processing example from Tom White's book "Hadoop: The Definitive Guide".

Is it finally time to let all data speak?

Lately it appears that data has finally become a first-class citizen. The first-class citizen, however, had to take on a new image and a new name – "Big Data". Which is fine – fame always comes with a price, right? By now it's clear: it's all about a new paradigm and a new set of tools and technologies to take advantage of all this data. It's like the old gold rush – in this case, people trying to find those gold nuggets of information that could potentially be transformational to businesses.

Intent on getting better informed myself, I attended my first Big Data conference – BigDataTechCon, held in San Francisco.

[Image: BigDataTechCon]

If nothing else, I got the opportunity to meet Doug Cutting, the creator of Apache Lucene and, of course, Hadoop. He gave wings to this Big Data movement.

[Image: Doug Cutting]

So what did I get out of the conference? Here are my three takeaways:

  1. There's evolution and disruption happening at a frantic pace: The open source community is on steroids with all this adoption. In fact, so are the vendors building on top of the Hadoop stack or helping to integrate with it better. As a result, the incubation pipeline is rich and healthy, and there's a steady stream hatching out of the incubator as well. My personal view is that there's some overlap, and it's not always easy to pick a tool or technology for a general use case. Many of these codelines are still at version 1 or below.
  2. Production and widely adopted use cases are still the typical ones around batch: Hadoop excels at batch processing and storage. This theme was pretty solid throughout the conference and there were many talks around the deployment of Hadoop for ETL processing.
  3. Hadoop is awesome but you will see less of it in the future: According to Doug Cutting, the future of Hadoop is to be sort of the "Big Data OS". Hadoop and its stack have evolved to become a platform for distributed computing. While MapReduce and HDFS are powerful, many consider the interfaces low-level and not too developer-friendly. The future therefore appears to be hiding the low-level programming interfaces and exposing easier-to-use interfaces and libraries.

OK, so while the future of Hadoop holds less (visibility) of it, now is still the time to get to know Hadoop better. So it's time for some hands-on work. I hope to have a completely self-built Hadoop cluster by my next post.