There’s an elephant in the Big Data room, and it ain’t Hadoop

The Strata conference is this week.  It’s the seminal conference on all things Big Data.

What’s notably missing?  Any talk on data quality and ways to deal with it.

I’m shocked, given my past and current experiences and the widely circulated anecdote that “80% of an analyst’s / data scientist’s time is spent preparing data to be analyzed”.  In other words, dealing with inbound data quality.

One explanation could be that the Big Data world is still focused on single-source click-stream data.  This is the cleanest data available.

But many of the best insights come from fusing many data sets together to paint a more comprehensive picture of a subject, such as a user or customer.  And this is when it gets messy.

How do you link multiple data sets together so you know it’s the same user or customer across the various sources?  How do you deal with CRM and transactional data, which are rife with duplicate records, incorrect categorizations, missing values, etc.?
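To make the linking problem concrete, here is a minimal sketch in Python.  It assumes, purely for illustration, that every source carries an email field that can serve as the join key once it is normalized.  The function names, field names and sample records are all hypothetical; real CRM data rarely offers a key this clean, which is exactly the point.

```python
import re
from collections import defaultdict


def normalize_email(raw):
    """Trim and lower-case an email; return None if it is missing or malformed."""
    if not raw:
        return None
    email = raw.strip().lower()
    # Deliberately loose check: something@something.something, no whitespace.
    return email if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) else None


def link_records(sources):
    """Group records from several sources by normalized email.

    `sources` maps a source name to a list of dict records (hypothetical shape).
    Returns (linked, unmatched): linked maps each key to its records,
    unmatched lists records that had no usable key.
    """
    linked = defaultdict(list)
    unmatched = []
    for source_name, records in sources.items():
        for record in records:
            key = normalize_email(record.get("email"))
            if key is None:
                unmatched.append((source_name, record))
            else:
                linked[key].append((source_name, record))
    return dict(linked), unmatched


# Made-up sample data: a duplicate CRM row, a missing value, and one order.
crm = [
    {"name": "Ann Lee", "email": "Ann.Lee@Example.com "},
    {"name": "Ann  Lee", "email": "ann.lee@example.com"},
    {"name": "Bob Roy", "email": None},
]
orders = [{"order_id": 17, "email": "ann.lee@example.com"}]

linked, unmatched = link_records({"crm": crm, "orders": orders})
```

Even in this toy case, the two CRM rows for the same person only collapse onto one key after normalization, and the record with no email cannot be linked at all.  Matching on names, addresses or other fuzzy attributes is where the real work starts.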

If we’re to take the next step in generating value from the Big Data ecosystem, the old problems still need solving.  Hopefully Strata 2014 will be a different story.
