There’s an elephant in the Big Data room, and it ain’t Hadoop

The Strata conference is this week.  It’s the seminal conference on all things Big Data.

What’s notably missing?  Any talk on data quality and ways to deal with it.

I’m shocked, given my past and current experiences and the widely circulated anecdote that “80% of an analyst’s / data scientist’s time is spent preparing data to be analyzed”.  In other words, dealing with inbound data quality.

One explanation could be that the Big Data world is still focused on single-source click-stream data.  This is the cleanest data available.

But many of the best insights come from fusing many data sets together to paint a more comprehensive picture of a subject, such as a user or customer.  And this is when it gets messy.

How do you link multiple data sets together to know it’s the same user or customer across the various sources?  How do you deal with CRM and transactional data, which is rife with duplicate records, incorrect categorizations, missing values, etc.?

If we’re to take the next step in generating value from the Big Data ecosystem, the old problems still need solving.  Hopefully Strata 2014 will be a different story.

Where are the women in tech? Updated.

Having recently co-founded my own company, I get a big role in defining its culture.  And one of the things my partner and I agree on is the need for diversity in our team across cultures, genders and everything else.

Why?  Because it makes for a more inclusive culture.  And because bringing many different points of view to bear on important decisions yields….better decisions.

To that end, I started asking friends of mine the following question: “where can we find communities of women in technology?”.  This was for the purpose of including such groups in our recruiting outreach.

The results thus far have been, to put it mildly, underwhelming.  Mostly in the form of non-responses.  And a suggestion to search on meetup.com, which is like starting mostly cold.

What’s going on?

I know there aren’t many women in engineering roles, especially in proportion to the percent of men.  But they do exist.  And they exist in even larger numbers in roles like user experience design, another role we’re looking for.

(We interrupt this post to note that as I’m writing this, in the lobby of a hotel, James Brown is singing “It’s a Man’s World” in the background music.  You can’t get more ironic than that!)

But despite the size of the community, why can’t I tap into this network of professional women the way I have done with so many other communities of interest?

Rather than put forward my hypotheses, I’m interested in yours.  And any connections you could make.  The journey of a thousand miles – gender equality in tech – begins with but a single step.

UPDATE

I’m pleased to report that the first two hires we’ve made are women.  Not because we went looking for them in women-specific networks; it just happened.  But it’s a big step toward preventing our early culture from being defined by a homogeneously (young) male team that mirrors the tech workforce as a whole.

UPDATE TWO

One of our woman hires resigned just weeks after joining.  Balancing a commute to work, parenting duties such as pickup from child care, and the demands of a startup was just too much for her.  It certainly supports the argument that it’s tough to “have it all”, at least at certain stages in one’s career.

Breaking my work silence

Since leaving AVG and Prague last year, I’ve been pretty quiet on the blogging front.  Which makes sense given I was writing about living in Prague and working in the Freemium consumer software world.

In the meantime, I researched and ultimately co-founded a new company in San Francisco called Bluenose Analytics.  And attracted a kick-ass venture investor as partner.  Details on all of this will come with the company and product launch later this year.

A few hints: we’re building an analytics application in the cloud using a Big Data stack.  The application will help companies with subscription business models keep their customers longer and earn more recurring revenue.

This is a huge market opportunity.  I repeat, huge.

In the meantime, we’re hiring.  User experience designers, Java developers and Big Data stack developers.  If you’re interested to know more, contact me.  Or, share this with some friends.

My contact details are on my “about me” page and all over social media platforms.

Pandora, our divorce is pending….

I love Pandora.  I use it on my iPhone while driving around the San Francisco area.

I’ve tried almost all of the others.  But Pandora’s music matching algorithms have exposed me to lots of new & cool artists from genres I already like; better than the other services.

So why are we getting divorced (maybe)?  Because the streaming of a song in progress is often interrupted by the start of another song.  Or an ad.  Both interruptions are a huge bummer (Ads are ok between songs.  I use the free version, after all).

I tweeted Pandora’s CTO pleading for help in fixing their app.  And he was incredibly responsive.  But I ultimately got put into a process designed to make the user go through all of the hoops.  The latest email I got after previously being directed to uninstall & re-install the app and re-boot my phone:

Sorry for the continued trouble, but thanks for giving those steps a shot. I noticed that you aren’t running the latest version of iOS on your iPhone, which helps address bugs and provides you with new features. 

You can install the free update by connecting your iPhone to your computer. Now, click ‘Update’ on the main iPhone screen in iTunes. You can also update your phone by installing the update directly by going to Settings -» General -» Software Update (it’s recommended that you plug your phone in during the update). 

If the issue still continues, then network congestion or a signal strength issue is the most likely cause.

If you’re having trouble when using a 3G or EDGE connection (in other words, not Wi-Fi), you can often get better performance with the “higher quality audio” option turned off. (Tap the arrow in the upper left of the Pandora “Now Playing” screen to reach the Station List page, then tap “Settings” -» “Advanced” -» “Higher quality audio” -» Off). This will ensure the minimum bandwidth is used to stream music when using a cellular-data connection (Wi-Fi connections are always automatically streamed in “higher quality audio” whether this option is on or off. For best results, use Wi-Fi whenever possible — e.g. at home, work, a coffee shop or a friend’s house).

If you are still having issues with “higher quality audio” turned off, then this is almost always due to poor cell reception. Note that the “number of bars” is often not an accurate measure of bandwidth. You can test your actual iPhone bandwidth by visiting http://www.testmyiphone.com on your Safari web browser. It will tell you your upload and download speeds. A consistent download speed of over 80kbps is generally required to stream Pandora smoothly. If you have less bandwidth than this, please change location — even a few feet can sometimes make a difference — or wait and try Pandora again later.

The user – me – isn’t the issue.  Half of these steps don’t even address the fact I’m using 3G networks.  And the problem has been occurring for months on a state-of-the-art phone over 3G networks around San Francisco that are amongst the densest in the world.

The issue is something much more technical and out of the user’s control.  It’s probably rooted in the caching and compression algorithms that deliver data to my phone.

Anyone try Skype 10 years ago?  Remember the crappy sound and video quality?  Skype has since spent tons of time and money to write great algorithms that now deliver a wonderful service that overcomes lots of network problems.

Pandora has yet to, based on my experience.

I’m the kind of person that probably would have installed diagnostic software on my phone to give Pandora a hand.  Instead, I get asked to do a bunch of stuff that skirts the real issue and ignores the actual usage scenarios.

Is Pandora alone in delivering technical support this way?  Certainly not.  I’ve seen it in companies I worked at too.  But it doesn’t make it right.  Vendors delivering support should take an active role in troubleshooting instead of exhausting the users’ efforts (and loyalty) before owning the problem.

A proud day: shipping a net-new product that just works

My former team at AVG just launched AVG CloudCare.

CloudCare is a managed services platform that enables local IT resellers to become managed services providers to their small business customers.  This article did the best job of describing the great value proposition for security and IT support, for resellers and customers alike.

A lot went right in getting there.  The product was rooted in a clear and well-thought-out strategy.  We made a directed investment in building it from scratch.  We focused on a customer segment that was under-served by existing solutions.  A lot of research and interviews went into understanding both the customer and resellers’ needs.  And it got tested and refined in partnership with our users along the way.

Congrats in particular to Mirek, Vikas, Alan, Darren and David.  You pulled off something special.

Connecting social media to business results

My former colleague Jill Hunley and I presented at a social media analytics conference a few weeks back.  We were talking about ways in which social media efforts can be linked to corporate results using Big Data analytics.

It’s nice to have others recognize the importance of this.  One of the audience members, Paul Costanza,  had this to say:

“A few outstanding presentations at the Social Media Intelligence Summit highlighted and validated this. The first was a session in which Jill Hunley of AVG and her former colleague Don MacLennan demonstrated that by integrating a set of social data inputs from online product ratings into its customer data analysis, AVG significantly redirected the product development map of one of its star products, AVG Mobilation.

Specifically, AVG determined that more than 90 percent of the product’s negative sentiment was due to just six product attributes. The most interesting aspect of these findings is that not one of these attributes was being addressed for correction by the existing product roadmap! This is the type of insight that marketers dream of providing to product development efforts.”

You can read Paul’s full blog entry here.

Big Data 2012: The “trough of disillusionment” and how to get past it

I spent some time at the Hadoop Summit this week, and spent lots of time in the prior weeks with entrepreneurs, practitioners and VCs in this space.  My prediction: we are entering what Gartner would call the “trough of disillusionment” right about now.  The hype has left reality behind.

This is not a special insight of mine.  All emerging technologies go through stages of hype just as the Big Data movement is now.  Rather, I’d like to focus on why people will become disillusioned, and how to get past it.

Your data sucks

Big Data can deal with less-than-perfect data in many cases.  On collection, Hadoop doesn’t require parsing data into a schema, so you can leave it unparsed and de-normalized at first.

On analysis, lots of Big Data use cases are based on non-financial data.  So there’s tolerance for approximations (or, “confidence intervals” if you’re into stats).  For example, can I deal with a predictive model that says a user is 95% likely to churn?  You bet.

But the old adage “garbage in, garbage out” still applies.  I see lots of cases where joining data sets remains a challenge because the sources lack a common key.  For example, is a visitor to your web site the same one who went to your community forums for help?  By the way, you’d better avoid using IP address as your key because it’s Personally Identifiable Information in many jurisdictions.  So it remains tough to develop a single view of the customer/user, especially when web-based touchpoints are everywhere including in Enterprise business models.
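To make the joining problem concrete, here is a minimal sketch in Python.  It assumes both sources happen to capture an email address, which is my assumption for illustration; the field names and records are hypothetical.  The point is only that a key has to be normalized before two data sets will line up, and that anyone without a match simply falls out of the combined view.

```python
def normalize_email(raw):
    """Lowercase and trim so trivially different spellings still match."""
    return raw.strip().lower()


# Hypothetical records from two separate systems.
web_visits = [
    {"email": "Ana@Example.com ", "pages_viewed": 12},
    {"email": "bo@example.com", "pages_viewed": 3},
]
forum_posts = [
    {"email": "ana@example.com", "posts": 4},
    {"email": "cy@example.com", "posts": 1},
]


def link(left, right, key="email"):
    """Inner-join two lists of dicts on a normalized key."""
    index = {normalize_email(row[key]): row for row in right}
    linked = []
    for row in left:
        k = normalize_email(row[key])
        if k in index:
            # Merge both records and keep the normalized key value.
            linked.append({**row, **index[k], key: k})
    return linked


matches = link(web_visits, forum_posts)
print(matches)
```

Only one of the three people shows up in both sources here.  Real data is messier still: typos, multiple addresses per person, and sources that share no field at all.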

Organizations also get hung up with the approximation game.  Use of the qualifier “Likely” sends a chill through some people when faced with making important business decisions based on analytic insights.  Yet this is inherent when dealing with imperfect data.  So people wait for perfect data to arrive into their analytics systems.  And wait.  And wait.

Your analytics platform requires programmers to operate

This seems innocuous enough.  Aren’t programmers available to hire?

Let’s draw a comparison.  Legacy analytics platforms, namely those built on SQL databases and BI tools, don’t need much programming compared to when they emerged given the maturity of the platforms.  An Oracle DBA has lots of tools to configure and manage that database, and doesn’t need to do command-line programming thanks to the toolsets.

Compare that to Hadoop platforms, where programming is often required to even extract data from the data store.  Until the platform matures and these tasks are abstracted away by good tools, you’re faced with the prospect of hiring programmers.  The people that know how to do this are commanding huge salaries and multiple job offers.  Paying these market rates is not an easy conversation with your boss.

Your data requires statisticians to make sense of it

Statisticians, like the programmers above, are in scarce supply.  And are commanding their own big salaries and lots of offers.

But to be fair, part of what makes Big Data exciting is the use of statistical analysis methods on business data as a mainstream discipline.  As hard as the work is today, lots of business insights are coming from new ways of looking at data.  So let’s not “throw the baby out with the bathwater”.

The bad news in sum: it’s going to take longer – and more money – to capture the promise of Big Data.

What to do?

First, and most important, paint a vision for Big Data that is compelling, and creates unwavering executive support.  This is a marathon, not a sprint.  You will need executive support for a long time.

Second, make the support conditional on interim results.  Chunk up the journey into phases, where you can declare victory against interim  milestones.  No executive likes to take risks that will take years to prove.  So make sure the phased plan delivers good news along the way, and early detection of things when they go awry.

Speaking of continuous wins, don’t forget to visualize the results.  Pictures are vital in getting the results across and sustaining the excitement over long periods of time.  If you haven’t thought about hiring a visualization specialist for your team, do it.

Get the job done without experienced programmers and statisticians.  This gets to finding talent absent experience; the kind of talent that can learn these tools and methods in a self-directed way.  Someone with a good computer science background can learn Hadoop provided they have the curiosity and the will.  Just be a little patient while they get up to speed, and link your phased deployment to their ramp-up so nothing crashes and burns along the way.

The same could be said for stats skills.  I recently hired a masters graduate in marketing, who had a basic command of math.  But she learned the statistical tools on the fly to get the job done.  You can test for math aptitude by giving assignments during the interview stage, or even to existing employees.  This stuff can be learned.  Like the programmer role above, stage your initiative according to the learning journey of your analysts, so that the tasks don’t outstrip their developing capabilities.

All is not lost

Getting the maximum value out of Big Data is hard, and it’s a long journey.  Data quality is never a quick fix.  Nor is it quick or easy to hire the specialist skills presently required.

It will get easier.  Eventually, the vendor community will deliver point and click capability that abstracts away much of the coding required today by Hadoop admins and statisticians.

In parallel, sell the vision.  Deliver interim results.  Pay attention to visualization.  And look for latent talent.  Do these and you’ll be a Big Data hero.

Don’t get caught using averages (part 2)

Pareto/Power Law distributions: the needle in the haystack

I wrote previously about the prevalence of Pareto/Power Law distributions in product users’ behavior here.  Wow, that’s a lot of alliteration.

But the discussion stopped at only one dimension of data.  For example, a single dimension like Free versus Paid users of a Freemium product such as online backup.

The story gets really interesting when you consider multiple dimensions (aka variables) of data at once, each with its own Pareto characteristics.  The outcome can lead you to some very interesting places.

In the first scenario, a small set of users in Dimension One (let’s say, Paid product users) also represents a small set of users in Dimension Two (let’s say, country of user origin).  This can mean that a tiny percentage (sometimes less than 1 percent!) of an entire user base represents almost all of the revenue or commercial value.

When this happens, it’s incredibly important to know who these users are; you’ll need to hang onto them for dear life to protect your revenue stream.  For example, you might cater to the specific needs of users from their country of origin.  Do you think users in China have different product needs than in France?  Probably.

In this scenario, you’ll also need to consider a revenue diversification strategy to protect your risks of relying on such a small segment.

Another scenario is that users in Dimension One (again, Paid users) don’t belong to the majority (or, “head”) of the distribution within Dimension Two (again, country of origin).  In which case, the implication is that country doesn’t matter in targeting your best (e.g. paying) users.

You can go astray in this scenario by looking at country of origin in isolation.  Maybe you have a huge pool of users from Germany.  The temptation would be to conclude “Germany is my most important market”.  Unless you knew that paid users didn’t cluster around a single country and that your German user base consisted mostly of free users.
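Here is a small sketch of that trap in Python.  The countries, plans and counts are invented for illustration; nothing here is real data.  It counts users per country, then looks at the same users through the second dimension (free versus paid).

```python
from collections import Counter

# Hypothetical user records as (country, plan) pairs; counts are made up.
users = (
    [("DE", "free")] * 600 + [("DE", "paid")] * 6
    + [("FR", "free")] * 150 + [("FR", "paid")] * 12
    + [("US", "free")] * 200 + [("US", "paid")] * 32
)

# Dimension Two alone: how many users per country?
totals = Counter(country for country, _ in users)

# Both dimensions together: how many paid users per country?
paid = Counter(country for country, plan in users if plan == "paid")


def paid_rate(country):
    """Fraction of a country's users who are on the paid plan."""
    return paid[country] / totals[country]


largest_by_users = totals.most_common(1)[0][0]
largest_by_paid = paid.most_common(1)[0][0]

print("Most users:", largest_by_users)
print("Most paid users:", largest_by_paid)
for country in totals:
    print(country, f"{paid_rate(country):.1%} paid")
```

In this made-up sample Germany has by far the most users, yet the smallest share of paying ones.  Looking at country alone would point you at the wrong market.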

What to conclude?

One: make sure you know if your most valuable user segment is much smaller than a Normal distribution would imply.  Most people think that their most important user segment is something like 10-20% of their base.  If 1% of your users drive the business, know who they are, find more like them, and don’t lose them.

Two: don’t let any one dimension of data drive your definition of user segments and internal decision-making.  If you hear sound bites inside your company like “German users are our most important”, that’s being too imprecise.  It generally takes 2-4 dimensions/variables to be precise about a user segment and to know how to best treat them (“Paid users with broadband PC connections in Germany are most important”).

Three: if you truly have 1% of your users driving the business, consider diversification strategies.  You’re carrying a lot of risk, but you also have 99% of your users from which another valuable segment can be found and served.

Last: as I argued in the prior post, it’s easy to dismiss the Pareto effect as only applying to obvious examples like Freemium for online consumers.  I’ve found the same patterns in other businesses.  In which case the gap between reality and perception is even wider!  Spend some time hunting down these patterns inside your company.  I promise you will be rewarded with new insights.

Don’t get caught using averages (part 1)

Our brains are wired somehow to think of everything in terms of a Normal Distribution, aka the “Bell Curve”.  It’s a trap that can kill a tech company.

The shape of the curve means that we think of populations of data (such as users) as being a somewhat homogeneous group if only we could compute the average.  For example, how many minutes per day “on average” a user spends on a website.   Or, the percentage of people “on average” who actively post on a social media platform.

The problem is that populations of people almost never behave in a normal distribution when online or using software products. Instead, the more prevalent pattern of behavior is a Power Law, or Pareto Distribution.

The Pareto distribution is also known as the “80/20 rule”.  Except that in online worlds, the ratio can be even closer to “95/5”.

Think of Freemium business models.  Generally, 2-8% of users consume a paid offering.  The rest use the free version.  Power Law/Pareto distribution, not Normal.

Think of participation in social media.  1% are active contributors, 9% are intermittent contributors and 90% consume but never post.  Power Law/Pareto distribution, not Normal.
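A quick way to see how an average hides this is to compare the mean with the median and with the share of activity coming from the top few percent of users.  The sketch below uses invented numbers (five heavy users and ninety-five light ones) purely for illustration; `top_share` is a helper name I made up for this example.

```python
from statistics import mean, median


def top_share(values, fraction):
    """Share of the total contributed by the top `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical minutes per day for 100 users: 5 heavy users, 95 light ones.
minutes = [200] * 5 + [2] * 95

avg = mean(minutes)
typical = median(minutes)
share = top_share(minutes, 0.05)

print(f"mean={avg:.1f}  median={typical}  top 5% share={share:.0%}")
```

The mean comes out near 12 minutes, a figure that describes nobody in the sample: the typical user spends 2 minutes, while 5% of users account for most of the total.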

These steep Pareto curves have profound implications for the choices you make in running a technology company.

If you operate a Freemium business but don’t know which users are the 5% most likely to upgrade to the paid version, then you risk catering to the needs of the Bell Curve: a population of users that looks more like 50-60% of the whole.  Who don’t necessarily pay or monetize.

This is the trap. Chris Anderson touched on this in his book “Free”, by illustrating how the Power Law distribution drives monetization in Freemium business models.

There are other traps by thinking in Normal terms.  Beyond Freemium, the Power Law distribution of behavior still applies.

Take Enterprise business models.  Every user is a payor, of approximately the same fee.  Yet 2-10% of a user population is massively active versus the rest.   And with that 10% of users comes maybe 10-20% of the revenue.

Which is your most important segment? Are you trying to solve the problems of those 10% “power users”?  Or the needs of the rest?

An example: I managed a product that enabled monitoring of corporate networks and systems for the sake of spotting anomalies.  Anomalies which could indicate a security breach in progress, or the risk of one.

Some users spent a large percentage of their day performing the monitoring function for the company.  They were specialists who used the product intensively throughout the day.  These power users had distinct needs, such as the ability to mine and explore data in depth to spot anomalies for themselves.

The rest of the users were different.  They weren’t monitoring specialists.  The monitoring role was only one of many roles they played for their companies.  Thus, they wanted to spend the least amount of time possible in my product.  Instead, they expected the system to alert them automatically, and offer specific actions to take.

Two user populations.  Two very different sets of needs.   One “market”.

Knowing who your core audience is, and the nature of the Power Law distributions, is essential in setting priorities on which segments to serve.  And those that can trap you.

In this post, I’ve only been discussing Power Law in one dimension of meaning (free vs. paid, automated alerting vs. manual trend-spotting).  Some of the most interesting Big Data analytics findings come from combining multiple dimensions of meaning, each with its respective Power Law behavior (a simple example: free/paid combined with locale).  I’ll tackle that one in a future post….

The Czech glass ceiling is extra thick

I deliberated over writing this post at the risk of being seen as, ahem, “patrician”.  But I have been moved by some young women to do it anyway.

Recently I hired a woman who just graduated from her Master’s program in business & marketing. During the interview process, she distinguished herself as having great potential. And everything that she has done since arriving has reinforced my impressions.

I began to reflect on where her talent might take her in the Czech workplace in the years ahead. As I looked around at the women in my company, and other companies I have been exposed to in the Czech Republic, I saw the dearth of women as managers. And there are still fewer female executives. Most women are individual contributors and many of those are performing administrative assistant functions.

Americans decry the lack of women in high positions, but the situation is worse still in the Czech Republic.

This is a country that is growing in large part thanks to its “knowledge economy”, where the technology and business process outsourcing sectors are the engine. What a shame if a big part of the workforce is excluded from participating in that opportunity.

As an American, one must be careful not to judge other cultures that one doesn’t fully understand. Perhaps women drop out of the workforce once they have a family due to choice of priorities. Or is it because they have no incentive to remain in the workforce?

But for those women who do want a career, it’s going to be a long hard slog. What to do?

First, the challenge will be greatest for women who have been in the workplace for 15 or more years. They are now labeled by the role they currently play and the money they now earn. If they haven’t succeeded in defying the odds somehow and become high earners, managers and executives, then the system won’t change in time to remove the obstacles for them.

Second, for those in the workplace for 5-15 years, the non-managerial roles are probably within easier reach. There can be a career growth path that rewards expertise as an individual contributor and avoids the strongest bias, which is against placing women in leadership roles. Maybe the government should step in and provide mid-career assistance in training and education that enables individual contributors to ascend to a level of expert? Certainly, technical disciplines like high-tech and manufacturing can support such a career ladder.

Third, the youngest of the workforce stand the greatest chance of unconstrained growth. There are tremendously smart, ambitious women available as recent graduates. And the wages they command are modest to say the least. Can they be fast-tracked somehow? Such as pairing them up in apprenticeship-style roles doing the work of a more senior person or even a manager? Companies can afford to carry these costs if they see the value in finding early stars and grooming them.

Last, time above all will enable change. The issue of women in the workplace is a global one, and no country stands out as having solved it. However, the Czech economy is increasingly global; with it comes exposure to other business cultures where women play a more prominent role.

I suppose the greater question is whether Czech society wants this change for itself.  I certainly hope so.  I’ve seen the bright young faces and the hope they have for their careers. They deserve the chance.