At the Hadoop Summit in San Jose, my most immediate observation was exactly this: companies from mainstream industries were walking the summit floor. That was a sign – a sign that Hadoop and its ecosystem are not just a Bay Area phenomenon. A few quick conversations made it obvious that Hadoop has gained adoption. Adopters are not necessarily on the bleeding edge (yeah, not everybody is a Twitter, Facebook or LinkedIn), but that’s in fact a good thing. For sure, people came with stories about use cases where they believed the Hadoop stack would help them. The growth in adoption and, probably correlated with it, the growth in attendance at the summit seem to be nothing short of phenomenal. As Doug Cutting put it, it was way beyond what he had imagined.
These are, in my view, some of the key highlights and sound bites of the summit:
1) Apache YARN – as the Data Operating System. YARN brings to Hadoop the ability to run multiple, concurrent managed applications, of which MapReduce is just one. I think Apache YARN as a Data OS is a little bit of a stretch. My own personal view is one of Hadoop and YARN being more of a Big Data OS foundation. But there’s no denying that YARN is a key turning point in application and resource management.
2) Data Lake – I am not sure about this one :-). I lean more towards Merv Adrian’s Data “Reservoir” – but as Merv hinted, that’s probably not a battle that is going to be won.
3) Interactive/Low Latency: Clearly this is what is being talked about as “Big Data 2.0”. This is more of the “bleeding edge”. This is taking Hadoop beyond its batch processing origins.
4) SQL On Hadoop (Analytics on Hadoop): This really goes with 3), but it deserves a spotlight of its own. It’s obvious that there is high market demand to run analytics with a tested and popular language such as SQL.
5) In-Memory – Again, this goes hand in hand with 3), but given the sheer number of products and technologies, it deserves a spotlight of its own.
6) Security – Two acquisitions in this space, by Hortonworks and Cloudera, made security a key topic.
7) HBase as the “default” NoSQL – I was really surprised at how much HBase has grown in terms of performance and features (specifically Phoenix). All vendors seem to be actively supporting and pushing HBase.
My favorite sessions:
1) By far my favorite session was the “Hadoop Puzzlers” by the Cloudera team – very engaging, entertaining and insightful – and judging by the “standing room only” audience, this may have been a crowd favorite.
2) The RocketFuel folks did a great job walking through best practices of deploying Hadoop in their Had”ops” or Had”oops” talk.
3) The Yahoo team did a great job on their Data Discovery talk giving some very insightful information on HCatalog. I definitely came away thinking I needed to explore that more.
4) I also had the opportunity to sit with Arun Murthy on his long-term vision around YARN – all I can say is keep an eye out for more YARN features (hint: even less justification for Hadoop in a virtualized environment?)
The Parquet session gets some high points – I think this is a powerful capability that will have a significant impact in the analytics world.
Overall a great conference. My only complaint is that getting the session length right seems to be a real art; nobody ever gets it quite right. 40-minute sessions seem to be the right midpoint between too short and too long (1 hr). However, most of these were technical sessions, and I definitely think technical sessions need to be closer to an hour. You may have to reduce the number of sessions, but the quality would go up. There were a lot of assumptions in most talks, which I am guessing is OK for the Silicon Valley crowd, but probably not for others.
There was a general sense that the technology disruption is too fast, too furious – everybody realizes that playing catch-up is almost impossible.
In summary, Big Data 2.0 has arrived, the Elephant roams freely outside the Bay Area and, dare I say it, Hadoop has maybe become approachable?