Big Data 2012: The “trough of disillusionment” and how to get past it

I spent some time at the Hadoop Summit this week, and lots of time in the prior weeks with entrepreneurs, practitioners and VCs in this space.  My prediction: we are entering what Gartner would call the “trough of disillusionment” right about now.  The hype has left reality behind.

This is not a special insight of mine.  All emerging technologies go through stages of hype, just as the Big Data movement is doing now.  Rather, I’d like to focus on why people will become disillusioned, and how to get past it.

Your data sucks

Big Data can deal with less-than-perfect data in many cases.  On collection, Hadoop doesn’t require parsing data into a schema, so you can leave it unparsed and de-normalized at first.

On analysis, lots of Big Data use cases are based on non-financial data.  So there’s tolerance for approximations (or, “confidence intervals” if you’re into stats).  For example, can I deal with a predictive model that says a user is 95% likely to churn?  You bet.
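To make “approximation” concrete, here is a minimal sketch of putting a range around an observed churn rate.  The numbers are made up for illustration, and the Wilson score interval is my choice of method here, not something any particular platform prescribes.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return center - half_width, center + half_width

# Hypothetical sample: 120 of 1,000 users churned last quarter.
low, high = wilson_interval(120, 1000)
print(f"Observed churn 12.0%, plausible range {low:.1%} to {high:.1%}")
```

The point is that the answer comes back as a range rather than a single number, and for many non-financial decisions a range is good enough to act on.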

But the old adage “garbage in, garbage out” still applies.  For example, I see lots of cases where joining data sets remains a challenge because of a lack of serialized keys.  Take a simple question: is a visitor to your web site the same one who went to your community forums for help?  By the way, you’d better avoid using IP address as your key because it’s Personally Identifiable Information in many jurisdictions.  So it remains tough to develop a single view of the customer/user, especially when web-based touchpoints are everywhere, including in Enterprise business models.
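As a rough sketch of what a shared key can look like, suppose both the web site and the forum capture a login email.  That is an assumption on my part, and anonymous visitors would still go unmatched.  The records, the `SECRET` value and the function names below are all hypothetical, and whether a keyed hash satisfies the privacy rules in your jurisdiction is a separate question for your legal team.

```python
import hashlib
import hmac

# Hypothetical secret, kept outside the analytics data store.
SECRET = b"replace-with-a-real-secret"

def join_key(email):
    """Derive a stable pseudonymous key from a normalized login email."""
    normalized = email.strip().lower()
    return hmac.new(SECRET, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical records from two separate systems, with inconsistent formatting.
web_visits = [
    {"email": "Ann@example.com", "page": "/pricing"},
    {"email": "bob@example.com", "page": "/docs"},
]
forum_posts = [
    {"email": " ann@example.com", "topic": "install help"},
]

def single_view(visits, posts):
    """Merge both sources into one record per pseudonymous key."""
    view = {}
    for visit in visits:
        record = view.setdefault(join_key(visit["email"]), {"pages": [], "topics": []})
        record["pages"].append(visit["page"])
    for post in posts:
        record = view.setdefault(join_key(post["email"]), {"pages": [], "topics": []})
        record["topics"].append(post["topic"])
    return view

customers = single_view(web_visits, forum_posts)
```

Normalizing before hashing is what lets “Ann@example.com” on the web site line up with “ ann@example.com” in the forum.  Skip that step and the join silently misses matches, which is exactly the garbage-in problem.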

Organizations also get hung up on the approximation game.  The qualifier “likely” sends a chill through some people when they face important business decisions based on analytic insights.  Yet this is inherent when dealing with imperfect data.  So people wait for perfect data to arrive in their analytics systems.  And wait.  And wait.

Your analytics platform requires programmers to operate

This seems innocuous enough.  Aren’t programmers available to hire?

Let’s draw a comparison.  Legacy analytics platforms, namely those built on SQL databases and BI tools, need far less programming today than when they first emerged, thanks to the maturity of the platforms.  An Oracle DBA has lots of tools to configure and manage that database, and doesn’t need to do command-line programming as a result.

Compare that to Hadoop platforms, where programming is often required to even extract data from the data store.  Until the platform matures and these tasks are abstracted away by good tools, you’re faced with the prospect of hiring programmers.  The people that know how to do this are commanding huge salaries and multiple job offers.  Paying these market rates is not an easy conversation with your boss.

Your data requires statisticians to make sense of it

Statisticians, like the programmers above, are in scarce supply, and are commanding their own big salaries and lots of offers.

But to be fair, part of what makes Big Data exciting is the use of statistical analysis methods on business data as a mainstream discipline.  As hard as the work is today, lots of business insights are coming from new ways of looking at data.  So let’s not “throw the baby out with the bathwater”.

The bad news in sum: it’s going to take longer – and more money – to capture the promise of Big Data

What to do?

First, and most important, paint a vision for Big Data that is compelling, and creates unwavering executive support.  This is a marathon, not a sprint.  You will need executive support for a long time.

Second, make the support conditional on interim results.  Chunk up the journey into phases, where you can declare victory against interim milestones.  No executive likes to take risks that will take years to prove.  So make sure the phased plan delivers good news along the way, and early warning when things go awry.

Speaking of continuous wins, don’t forget to visualize the results.  Pictures are vital in getting the results across and sustaining the excitement over long periods of time.  If you haven’t thought about hiring a visualization specialist for your team, do it.

Get the job done without experienced programmers and statisticians.  This comes down to finding talent absent experience: the kind of talent that can learn these tools and methods in a self-directed way.  Someone with a good computer science background can learn Hadoop provided they have the curiosity and the will.  Just be a little patient while they get up to speed, and link your phased deployment to their ramp-up so nothing crashes and burns along the way.

The same could be said for stats skills.  I recently hired a master’s graduate in marketing who had a basic command of math, and she learned the statistical tools on the fly to get the job done.  You can test for math aptitude by giving assignments during the interview stage, or even to existing employees.  This stuff can be learned.  As with the programmer role above, stage your initiative according to the learning journey of your analysts, so that the tasks don’t outstrip their developing capabilities.

All is not lost

Getting the maximum value out of Big Data is hard, and it’s a long journey.  Data quality is never a quick fix.  Nor is it quick or easy to hire the specialist skills presently required.

It will get easier.  Eventually, the vendor community will deliver point-and-click capability that abstracts away much of the coding required today of Hadoop admins and statisticians.

In parallel, sell the vision.  Deliver interim results.  Pay attention to visualization.  And look for latent talent.  Do these and you’ll be a Big Data hero.

One thought on “Big Data 2012: The “trough of disillusionment” and how to get past it”

  1. Simple statistics isn’t that difficult. It’s unfortunate that more marketers, reporters, and managers either ignore or don’t grasp the concept of statistical significance. The issue in marketing is that if you do test a number of the studies you farm out, it turns out that the differences in responses as to the rankings of a through j, for example, are statistically insignificant. That doesn’t help when you’re providing engineering with the top three new features customers would like, or you only have room for 5 features on a data sheet, but there’s no difference in how customers or sales rank the 10. Or if you’re presenting to analysts or bankers, for that matter!
