Big Data aka Data OS

History of (Real) World Data – Why NoSQL Matters

In a world where SQL is perceived to have overcome many challenges, NoSQL battles on. Relational models and SQL have certainly withstood the test of time, but #NoSQL has its place.

I have seen my fair share of skeptics. I have myself been a skeptic. But then I educated myself.

As an aside, I think a big part of the problem here is the name – “NoSQL” (Dear Shakespeare, I am sorry, but apparently there is something in a name). It’s provocative, to say the least, and has been misinterpreted from day one. Anyway, let’s put that aside for a moment and talk “relational”.

History Lesson

To truly get a better perspective, we need a trip down memory lane, starting with some history lessons around these concepts: logical database, physical database, relational model, SQL, normalization, and ACID – and oh, be prepared for some myth busting.

We all know relational databases through the notion of collections of tables or relations. But many may not be familiar with the terms “logical” vs “physical” databases. These are in fact not two separate things, but two different views or perceptions. The logical view of relations or tables is the one perceived by the end user, while the physical database is the view of the data from the perspective of the relational DBMS (the data management software). This separation is significant because it allows how the data is physically stored to be independent of how the data is viewed and retrieved by the end user. The logical interface to the end user is based on an implementation of the relational model.

Now, the term relational database traces back to E.F. Codd and his seminal paper “A Relational Model of Data for Large Shared Data Banks”. What Codd did was show the world that data could be stored electronically in a data “bank” or database and retrieved repeatedly and predictably by users without their knowing how the data was stored. This was a ground-breaking concept at the time, because mainframes were the electronic banks of data then, and retrieving the data required specialists who knew how it was stored.

What exactly is a relational model? Per CJ Date, “It is a formal and rigorous definition of what the data in a database should look like to the user” – in short, it should look relational (with rows and columns) and relational operators should be available for operating on this data.

And if you are going to be programming against databases, you need a language to program it with, right? That’s where SQL comes in –  SQL is the standard query language for relational databases.

Of course, Codd was a scientist and his work was grounded heavily in mathematical concepts – the notion of relations based on underlying set theory. This was really the first time databases and database operations were describable as a mathematical system. However (and here’s myth buster #1):

There is no relational database that is a complete and rigorous implementation of Codd’s relational theory. In fact, Codd’s famous 12 rules were developed to try to prevent the dilution of his original relational theory, as he believed that the products being brought to market were making some convenient compromises.

Accompanying that is myth buster number 2:

The SQL language is considered the primary culprit in allowing the original relational theory to dilute. A real relational language that enforced all of the requisites would have taken many more years (opinion), so a quick-and-dirty but user-friendly language, SQL, was formulated that did not adhere to all the prerequisites of Codd’s relational theory (that’s a fact). Of course, it was still a success – you can’t argue with a 40-billion-dollar industry, can you?

Now let’s talk about normalization. Normalization is popularly used to “minimize” or “remove” redundancy of data. But that’s only half true. More importantly, normalization is used to prevent update, insert, and delete inconsistencies (or “modification anomalies”) in database design. The net result is that information to be stored goes through stages of normalization and is represented in various tables with integrity constraints. Normalization plays a key role in achieving one of the ACID properties in the context of a transaction – namely “Consistency”: all database operations must leave the data in a consistent state.
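To make the “modification anomaly” point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The schema and data are entirely hypothetical; the point is that in a denormalized table a customer’s address is repeated on every order row (so an update can miss rows), while in the normalized design the address lives in exactly one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: the address is repeated on every order row.
# Updating it means touching N rows -- miss one and the data is inconsistent.
cur.execute("CREATE TABLE orders_flat (order_id INTEGER, customer TEXT, address TEXT)")
cur.executemany("INSERT INTO orders_flat VALUES (?, ?, ?)",
                [(1, "Ada", "1 Elm St"), (2, "Ada", "1 Elm St")])

# Normalized: the address is stored once and referenced by key.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customers(id))")
cur.execute("INSERT INTO customers VALUES (1, 'Ada', '1 Elm St')")
cur.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (2, 1)])

# A move is now a single-row update; every order sees the new address via a join.
cur.execute("UPDATE customers SET address = '9 Oak Ave' WHERE id = 1")
rows = cur.execute("""SELECT o.order_id, c.address
                      FROM orders o JOIN customers c ON o.customer_id = c.id
                      ORDER BY o.order_id""").fetchall()
print(rows)
```

The same single-row update against `orders_flat` would have to touch every duplicated row, which is exactly the update anomaly normalization is designed to prevent.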

ACID was defined by Jim Gray for reliable transaction processing, the characteristics of which are a) Atomicity (transactions are all or nothing), b) Consistency (transactions leave data in a consistent state, or take it from one consistent state to another), c) Isolation (concurrent execution results in the same state as if the transactions were executed individually) and d) Durability (once committed, a transaction persists as the final state, even through a power loss).
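The atomicity and consistency properties are easy to see in code on a single system. Below is a minimal sketch using sqlite3 (accounts, amounts, and the “no negative balance” rule are all hypothetical): a transfer either commits in full or rolls back in full, so a failed transfer leaves no partial debit behind.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can issue
# explicit BEGIN / COMMIT / ROLLBACK statements ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
cur.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

def transfer(amount):
    cur.execute("BEGIN")
    try:
        cur.execute("UPDATE accounts SET balance = balance - ? WHERE name='alice'",
                    (amount,))
        # Consistency rule for this toy schema: no negative balances.
        if cur.execute("SELECT balance FROM accounts "
                       "WHERE name='alice'").fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        cur.execute("UPDATE accounts SET balance = balance + ? WHERE name='bob'",
                    (amount,))
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")  # all or nothing: the partial debit is undone

transfer(30)   # succeeds and commits
transfer(100)  # would overdraw alice -> rolled back, balances untouched
balances = dict(cur.execute("SELECT name, balance FROM accounts"))
print(balances)
```

Note how easy this is when one process owns all the data; the article’s later point is that the same guarantee gets much harder once the accounts live on different machines.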

Great, you say, so what?

Ok, I am getting there. Let’s summarize the history lesson:

1) The relational model provided a way to represent data, and operations on that data, as a mathematical system

2) No RDBMS implements all the rigid requisites of the relational model

3) SQL is great, but is actually not very compliant with the relational model

4) In the end, the combination of SQL and the relational model provided a way to separate physical storage from the end user’s way of interacting with the data – something mainframes couldn’t provide

BUT, here’s the catch (or myth buster #3) – nothing and no one can vouch that the relational model is the “right” way of representing real-world data. Or, better put: real-world data characteristics often translate poorly to relational databases.

But I don’t care – we can’t do without ACID, can we?

Yes, ACID is great (#awesome, in fact) and is easily achievable “if” – and that’s a big IF – you keep your data managed as a single system. The fact is, today the sources of data that many organizations want to use are so numerous, and the data volumes so large, that processing that data cannot be done with the monolithic architectures of yesterday. The moment you want to deal with data fire hoses that go beyond the capabilities of a single system (aka distributed), ACID gets harder and harder to comply with.

The Real World

Electronically representing data from the real world is an exercise in translation. The greater the diversity of the data you want to capture electronically, the more complicated and bungled this translation gets. In the early days, the emphasis was on the type of data that was important to capture – mainly transactions. In its origins, transaction processing was on mainframes. Then came the power of client-server architecture and distributed web applications. What changed along the way was the need to capture things beyond just transactions – businesses wanted to capture and store more in-depth data on people, devices, products, models, maps and many other things from the real world. A single data format could never store the variety and variations of the real world.

Let’s just take a look at a few examples that help illustrate these points:

1) We are all familiar with ERP and CRM systems. Did you know that ERP systems like SAP today can contain close to 100,000 tables to manage an organization’s enterprise resources? Less than half of that is actual attribute data about the resources in the enterprise – and many of those tables are likely the result of normalization. The rest is “hook-up” data (cross-reference tables, lookup tables, etc.). Really? 100,000 tables. Does it fit all the design consistencies and normal forms and maintain referential integrity? Sure – but this is a failure in data management (#EpicFail).

2) Let’s take a simpler but poignant example. Almost everyone is familiar with the notion of storing customer names and addresses. How do you decide to create attributes such as last name and first name? What if someone has more than four name parts? By going with the ubiquitous first name and last name, are you choosing to model just part of that attribute? Is that right? Is the “middle initial” a compromise? Well, it certainly doesn’t help the person with a four-part name. What about the address field? Address Line 1, Address Line 2? Who decides where the break is if the address has six lines?
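One way the example above plays out in a schema-flexible (document-style) model: each record simply carries however many name parts and address lines it actually has. This is an illustrative sketch – the field names and sample people are made up – but it shows there is no forced first/last split and no Address Line 1/Line 2 compromise.

```python
# Document-style records: each customer is self-describing.
customers = [
    {"name": {"given": "Uma", "family": "Thurman"},
     "address": ["13 Elm St", "Springfield"]},                      # 2 address lines
    {"name": {"parts": ["Gabriel", "José", "García", "Márquez"]},   # 4-part name, no lossy split
     "address": ["Calle 57 No. 4-09", "Barrio La Candelaria",
                 "Bogotá", "Cundinamarca", "Colombia", "110311"]},  # 6 address lines
]

# Consumers read whatever shape is present instead of mapping everything
# onto fixed columns.
rendered = []
for c in customers:
    n = c["name"]
    full = " ".join(n["parts"]) if "parts" in n else f'{n["given"]} {n["family"]}'
    rendered.append(full)
    print(full, "|", ", ".join(c["address"]))
```

A relational schema would have had to decide up front how many name and address slots everyone gets; here that decision is deferred to each record.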

The data in the real world is less rigid (malleable perhaps) and very contextual.

You don’t believe me? Lift your head up and look around – this is the real world:

[Image: “Name – Real World”]

It is precisely these kinds of problems that companies like Facebook, Twitter, Google and others want to avoid. Why, and how, would you represent a social network using a relational model? When your social network is a network of one billion unique people (one-seventh of the world population) from all over the world communicating and interacting in real time, is a pre-defined, forced schema going to work? The answer is no – it’s not. In fact, is a single type of data model going to work? The answer is no again.
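To see why a graph-shaped model fits a social network more naturally than join tables, here is a minimal sketch (the users and the adjacency-list representation are hypothetical): a “friends of friends” query is a direct traversal of the structure, rather than a self-join of a followers table for every hop.

```python
# A social graph as an adjacency list: who each user follows.
follows = {
    "alice": {"bob", "carol"},
    "bob":   {"dave"},
    "carol": {"dave", "erin"},
    "dave":  set(),
    "erin":  set(),
}

def friends_of_friends(user):
    """People reachable in exactly two hops, excluding direct friends and self."""
    direct = follows[user]
    return {fof for f in direct for fof in follows[f]} - direct - {user}

print(sorted(friends_of_friends("alice")))  # -> ['dave', 'erin']
```

Each additional hop in SQL means another self-join; in a graph representation it is just one more step of traversal, which is the intuition behind graph databases and why “one size fits all” breaks down here.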

And that’s the point – one size does not fit all, and in fact today, one size fits less and less for representing the real world. So get over it – relational is not the answer to anything and everything.

If you want more thought-leadership viewpoints from the real world, I suggest you read the following two articles:

Your Coffee Shop Doesn’t Use Two-Phase Commit

Why Banks Are Not ACID – Availability Is Revenue

So NoSQL has its place – in many cases as a better choice than SQL databases, and often co-existing alongside them.

A (Data) Platform storm is brewing…

With all the movement around Big Data and its related technologies, one challenge companies either face today, or will soon, is how to manage their requirements amid so much disruption and evolution in the technology space.

This is where the power of the platform comes into play. Platforms allow you to focus on your business needs and hide the complexity of technology challenges. This is a very important consideration.

To demonstrate that this movement is gaining legs, I am going to highlight three insights along a three-year timeline starting in 2011.

In 2011, Accenture delivered a research project called Technology Vision 2011, which found that data is taking its rightful place as a platform inside enterprises.

In 2012, enterprises started acting on this; a very telling example is Disney, which you can read about in this article.

Fast forward to today, 2013: at Big Data convention after convention, Doug Cutting makes it clear that Hadoop and its ecosystem are becoming more and more a Big Data OS.

This article about the Chicago Big Data Conference shows that Big Data as a platform was one of the top threads.

With so many new tools and technologies in this space, you don’t want to be caught in the trap of adopting and managing a disparate collection of toolsets. A platform lets you focus on your service and product needs without exposing you to the technological churn underneath.