The 2014 Hadoop Summit – The Elephant is no longer a Bay Area phenomenon

At the Hadoop Summit in San Jose, my most immediate observation was exactly this: companies from mainstream industries were walking the summit floor. That was a sign that Hadoop and its ecosystem are no longer a Bay Area phenomenon. A few quick conversations made it obvious that Hadoop has gained adoption. Adopters are not necessarily on the bleeding edge (yeah, not everybody is a Twitter, Facebook or LinkedIn), but that’s in fact a good thing. People came with stories about use cases where they believed the Hadoop stack would help them. The growth in adoption, and the probably correlated growth in attendance at the summit, seems nothing short of phenomenal. As Doug Cutting put it, it was way beyond what he had imagined.

These are, in my view, some of the key highlights and sound bites of the summit:

1) Apache YARN – as the Data Operating System. YARN brings to Hadoop the ability to run multiple, concurrent managed applications, of which MapReduce is just one. I think Apache YARN as a Data OS is a little bit of a stretch. My own view is that Hadoop and YARN together are more of a Big Data OS foundation. But there’s no denying that YARN is a key turning point in application and resource management.

2) Data Lake – I am not sure about this one :-). I lean more towards Merv Adrian’s Data “Reservoir” – but like Merv hinted, that’s probably not a battle that is going to be won.

3) Interactive/Low Latency: Clearly this is what is being talked about as “Big Data 2.0”. This is more of the “bleeding edge”. This is taking Hadoop outside its batch processing origins.

4) SQL on Hadoop (Analytics on Hadoop): This really goes with 3), but it deserves a spotlight of its own. It’s obvious that there is high market demand to run analytics with a tested and popular language such as SQL.

5) In-Memory – Again, this goes hand in hand with 3), but given the sheer number of products and technologies, it deserves a spotlight of its own.

6) Security – Two acquisitions in this space, by Hortonworks and Cloudera, made security a key topic.

7) HBase as the “default” NoSQL – I was really surprised at how much HBase has grown in terms of performance and features (specifically Phoenix). All vendors seem to be actively supporting and pushing HBase.

My favorite sessions:

1) By far my favorite session was the “Hadoop Puzzlers” by the Cloudera team – very engaging, entertaining and insightful – and judging by the “standing room only” audience, this may have been a crowd favorite.

2) The RocketFuel folks did a great job walking through best practices of deploying Hadoop in their Had“ops” or Had“oops” talk.

3) The Yahoo team did a great job on their Data Discovery talk, giving some very insightful information on HCatalog. I definitely came away thinking I needed to explore that more.

4) I also had the opportunity to sit with Arun Murthy to hear his long-term vision around YARN – all I can say is keep an eye out for more YARN features (hint: even less justification for Hadoop in a virtualized environment?)

The Parquet session gets some high points – I think this is a powerful capability that will have a significant impact in the analytics world.

Overall a great conference. My only complaint is that getting the session length right seems to be a real art, and it never comes out quite right. 40-minute sessions seem to be the right midway point between too short and too long (1 hr). However, most of these were technical sessions, and I definitely think technical sessions need to be closer to an hour. You may have to reduce the number of sessions, but the quality would go up. There were a lot of assumptions in most talks, which I am guessing is OK for the Silicon Valley crowd, but probably not for others.

There was a general sense that the technology disruption is too fast, too furious – everybody realizes that playing catch-up is almost impossible.

In summary, Big Data 2.0 has arrived, the Elephant roams freely outside the Bay Area and, dare I say it, Hadoop has maybe become approachable?



A (Data) Platform storm is brewing..

With all the movement around Big Data and its related technologies, one challenge companies either face today or will face soon is how to manage their requirements amid so much disruption and evolution in the technology space.

This is where the power of the platform comes into play. Platforms allow you to focus on your business needs and hide the complexity of technology challenges. This is a very important consideration.

To demonstrate that this movement is getting legs, I am going to highlight three insights along a three-year timeline starting in 2011.

In 2011, Accenture delivered a research project called Technology Vision 2011, where they found that data is taking its rightful place as a platform inside enterprises.

In 2012, enterprises started acting on this; a very telling example is Disney, which you can read about in this article.

Fast forward to today in 2013: in Big Data convention after convention, Doug Cutting makes it clear that Hadoop and its ecosystem are becoming more and more a Big Data OS.

This article about the Chicago Big Data Conference shows that Big Data as a platform was one of the top threads.

With so many new tools and technologies in this space, you don’t want to be caught in the trap of adopting and managing a disparate collection of tool sets. A platform allows you to focus more on your service and product needs without being exposed to technological churn.