Back in March, I wrote a post introducing Gartner’s hype cycle, a graphical representation of the buzzword fatigue that we in the industry feel so acutely. If you look back over the last few years, you won’t have a hard time thinking of examples. In that earlier post, I talked about the term cloud. But terms like internet of things, microservices, and big data might also come to mind.
Today, I want to talk about this latter term, big data. Unlike some of the other terms that wind their way along Gartner’s hype cycle, this one has maddening uncertainty cooked right into its name. The word big offers so much vagueness that it becomes borderline meaningless. And so we’re left to wonder just how big our data must get to qualify for this designation. And, if we’re feeling pedantic about the matter, wouldn’t we talk about data in terms of many rather than big? But, perhaps I digress.
While the term may confuse and get far too much play in marketing brochures, it represents an extremely important concept for our industry. So let’s go back to first principles and unpack the meaning beneath the hype.
A Brief History of Big Data’s Etymology
In a very real sense, the problem goes back as far as humans have tried to quantify things and thought, “wow, I don’t know how to represent that.” For instance, even in the 19th or early 20th century, imagine trying to build a library that contained a copy of every book in the world. This would have required an unimaginable feat of engineering. The information was simply too big to manage by conventional means.
With the development of computers, the problem naturally compounded itself. Suddenly we had significantly more efficient means of computation, recording, and storage with which to manage information. We could now generate data at a pace never previously seen.
To the best of my knowledge, the term “big data” originated a surprisingly long time ago. In a paper written back in 1997, Michael Cox and David Ellsworth described the problem of computers processing information too large for easy storage.
Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data.
I wonder if they had any inkling of the extent to which their turn of phrase would catch on. In any case, the historical problem had earned itself a name.
Big Data as Big Business
At some point, the term made its way out of academia and into the broader public and commercial realm. This happened in fits and starts and probably gradually enough that I won’t pinpoint anything exact. But the middle of the 2000s saw the rise of another term that worked its way through Gartner’s cycle: web 2.0.
If you’ve spent enough time in IT to remember that buzzword, you might recall that it referred to the trend where web content moved from static HTML and CSS to dynamically served web applications. Instead of passively viewing markup for the sake of information, users began to interact with these applications to do things that most notably included commerce.
Part and parcel with this development, capturing data about websites went from librarian-like curiosity to competitive advantage with billions of dollars at stake. If you could successfully capture, mine, and act on users’ purchasing habits, you could write your ticket.
Big Data in IT Shops: The View from the Ground
Imagine how this played out in IT shops around the world 10 years ago. These folks found themselves building enterprise Java and .NET Webforms applications and backing them with relational databases. In these databases, they stored user credentials, personal information, product catalogs and the general stuff of e-commerce.
But then, the business started to want competitive advantages. So they came to the IT side of the house and said things like, “hey, can you store a bunch of information about every page view of every user of our system? We want to study that.” And to that, the dev group’s team lead said something like, “sure, if you drive a truck full of servers with giant disks up here and drop them off.”
I exaggerate a bit here for comedic effect, but the idea remains the same. The business had demand for storing and accessing amounts of data that the shops simply couldn’t handle the way they’d been doing things. And so the e-commerce oriented web, via a huge profit motive, started forcing software shops to grapple with the previously academic problem of big data.
This resulted in an explosion of tools, processes, approaches, and cloud-based infrastructure over the following years. And, yes, it also resulted in the proliferation of the term big data to an extent that has probably driven you nuts at some point.
Defining Big Data in Today’s World
Now that the world has gotten used to dealing with this issue, we can talk about it a bit differently. In the history I’ve described so far, the term described a pretty wide open problem space. In other words, people said “big data” when they meant “so much that we have no idea what to do.” But in the last decade, we’ve generally figured out what to do. So now we deal with big data more operationally than conceptually.
Amazon has a great primer on this. But, to summarize, a lot of folks who specialize in big data talk about something called the three V’s: volume, variety, and velocity. Roughly speaking, it means you have data that’s too voluminous to stuff reasonably into MySQL, coming at you from a disparate array of sources, and needing to be processed very quickly.
You can probably imagine all sorts of applications for this type of thing. But real-time monitoring of production web apps should certainly come to mind. The DevOps world has a great deal of tie-in with big data.
The Mechanics of Handling Big Data
With the three V’s heuristic in mind, let’s talk nuts and bolts. What goes into dealing with big data in an acceptable way? What problems do you need to wrestle with? We can break these up into roughly four buckets: collection, storage, processing, and presentation.
First, you need to collect this information. Of the three Vs, this speaks to the variety angle. This data might come from logs, feeds, device output — you name it. You need a funnel of sorts to capture it.
Then, you need to store it somewhere, which presents its own form of challenge. As disk space gets cheaper and distributed computing more refined, organizations have more options, but keep in mind that the data volume continues to grow too.
Next, we get into velocity concerns. With all of the raw data in place, it doesn’t do anyone any good until someone processes it. You need to take this raw data, extract signal from noise, and gather it into meaningful information.
And then, finally, you also need to present it somehow. Without presentation, all of it goes for naught, since no one actually consumes any of the work done up to this point.
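To make those four buckets concrete, here’s a minimal sketch of the pipeline in Python. Everything in it is hypothetical: the event shapes, the in-memory list standing in for a real datastore, and the names of the functions are all illustrative stand-ins for the dedicated tooling (message queues, distributed storage, stream processors, dashboards) that a real shop would use at each stage.

```python
from collections import Counter, defaultdict

# Hypothetical raw events from several sources -- the "variety" of the three V's.
# In a real pipeline these would stream in from logs, feeds, or device output.
raw_events = [
    {"source": "web", "user": "alice", "action": "page_view", "page": "/catalog"},
    {"source": "web", "user": "bob", "action": "page_view", "page": "/catalog"},
    {"source": "mobile", "user": "alice", "action": "page_view", "page": "/checkout"},
    {"source": "feed", "user": "alice", "action": "purchase", "page": "/checkout"},
]

def collect(events):
    """Collection: funnel heterogeneous events into one uniform shape."""
    for e in events:
        yield {"user": e["user"], "action": e["action"], "page": e["page"]}

def store(events, datastore):
    """Storage: append normalized events (a list stands in for a real datastore)."""
    for e in events:
        datastore.append(e)
    return datastore

def process(datastore):
    """Processing: reduce raw events to meaningful aggregates (signal from noise)."""
    views_per_page = Counter(e["page"] for e in datastore if e["action"] == "page_view")
    actions_per_user = defaultdict(int)
    for e in datastore:
        actions_per_user[e["user"]] += 1
    return views_per_page, dict(actions_per_user)

def present(views_per_page, actions_per_user):
    """Presentation: render the aggregates for human consumption."""
    lines = ["Page views:"]
    lines += [f"  {page}: {n}" for page, n in views_per_page.most_common()]
    lines.append("Actions per user:")
    lines += [f"  {user}: {n}" for user, n in sorted(actions_per_user.items())]
    return "\n".join(lines)

datastore = store(collect(raw_events), [])
views, per_user = process(datastore)
print(present(views, per_user))
```

The point of the sketch is the shape, not the scale: each stage hands a cleaner artifact to the next, and the big data problem is what happens when any one of those hand-offs outgrows a single machine.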
Big Data Going Forward
When you de-hype the term and actually look at the underlying concepts, you can see something core to today’s internet and to today’s world. We have become increasingly data-driven, and that shows no signs of slowing.
And take a look at currently emerging technologies. We’re putting chips and awareness into more and more sorts of devices. We wear trackers that tell us how many steps we take, how we sleep, and how often our heart beats. This thirst for information keeps pressing at the outward boundary of what we can currently capture, store and consume.
So, in the end, when people talk about big data, they talk about the outer limits of our ability to capture and synthesize information. And that has important implications for the past, present, and future.