Big Data, Big Potential for Big Problems

Brian Wood Blog

Data is not information — it is necessary but not sufficient.

Scale that up by a few orders of magnitude (i.e., big data) and it’s easy to reach the wrong conclusions, and to reach them quickly.

Original article by Bob Lewis in InfoWorld.

Emphasis in red added by me.

Brian Wood, VP Marketing


Big data’s pitfall: Answers that are clear, compelling, and wrong

Doing big data right takes sophisticated techniques to ensure ad hoc results are reliable

The cloud’s dark secret is integration — most implementations don’t include it, whether IaaS, PaaS, or SaaS. Big data has its own dark secret — that, as Mark Twain once pointed out, it ain’t what you don’t know that gets you into trouble. It’s what you do know that ain’t so.

Big data will make decision makers more certain. Making them more right? That’s more doubtful.

There are three pieces to this puzzle, even beyond the big data cultural challenges mentioned last week: quality assurance, spurious correlations, and the well-known but often-ignored challenges associated with statistically analyzing data that weren’t collected with the analyses in mind. Each presents grave dangers to a company’s decision-making health.

Big data danger No. 1: Quality assurance
Just because someone has access to a database and a bunch of BI tools with which to mine it doesn’t mean they know what they’re doing. Even with all of the right technical skills, it’s easy to generate convincing-looking statistics that are wrong.

Start with how easy it is to misunderstand what kind of data you have. Part of the point of big data technologies is that they’re more flexible and take less up-front planning and analysis than old-school IT reports or even not-quite-so-old-school data warehouses.

In the pre-big-data era, IT delivered carefully constructed data views to users and user analysts, reducing the risk of misunderstanding the data being analyzed. With big data, this responsibility is shifting away from IT at the same time data structuring is becoming more ad hoc and therefore easier to get wrong. Get this wrong and analysts will start their work with data sets that are, in one way or another, problematic, for example by carrying an invisible systematic bias. No amount of downstream analysis will fix this.

A second issue is data quality. Statisticians apply a number of tests to their data sets to make sure they’re suitable for the intended analyses. If no one in your company knows how to use words like “heteroskedasticity” and “stationarity,” you should probably hire someone who does before you use big data to make any big decisions.

Simple example: Some time-series data include both a cyclical component and a linear trend component. Sales data, for example, might include both seasonality and overall growth. Fit a standard linear regression to this kind of data without accounting for the cycle, and the technical term for the result is “wrong.” (In case you care: This kind of data calls for a time-series method such as Box-Jenkins analysis.)

Knowing to test — and how to test — for data quality is a key reason smart companies investing in big data are including the cost of professional statisticians and “data scientists” (I’m sure there’s a difference; I’m just not sure what it is) in their big data budgets.

Then there’s the most obvious challenge: Making sure the analyses do what the analyst thinks they’re doing. Traditional reports come from data extracted from carefully designed data stores by professionals who understand how it’s all structured. They’re programmed by IT and distributed to users. The most important ones are subjected to independent audits to make sure they do what they’re supposed to.

The question of how to perform quality assurance on big data analytics hasn’t yet been answered. While there are ways to make sure the results are reliable, making use of them is labor-intensive and time-consuming.

One alternative, to illustrate: Extract a random sample of data, small enough to analyze one row at a time if need be and big enough to give statistically meaningful results. Load it into Excel. Save the query. Perform your analysis semi-manually using Excel. Triple-check the answer, and have a friend check it too. This is just like programming: It’s easy to miss your own mistakes. Spotting the mistakes made by others is much easier.

Next: With the same query, run the same analysis using your BI tool. If it gives you the same answer, you just might have put it together right.
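The same cross-check can be sketched in code. In this minimal Python version the row data, the sample size of 200, and the choice of a mean as the metric are all assumptions for illustration; a library call stands in for the BI tool.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed: the sample can be re-drawn exactly

# Hypothetical stand-in for the saved query's result set: (order_id, amount).
rows = [(order_id, round(rng.uniform(5, 500), 2)) for order_id in range(10_000)]

# Step 1: draw a random sample small enough to inspect row by row.
sample = rng.sample(rows, 200)

# Step 2: the slow, row-at-a-time calculation (the "Excel" pass).
total = 0.0
count = 0
for _order_id, amount in sample:
    total += amount
    count += 1
manual_mean = total / count

# Step 3: the same metric from the tool under test
# (statistics.fmean is only a placeholder for the BI tool here).
tool_mean = statistics.fmean(amount for _order_id, amount in sample)

# Step 4: compare with a tolerance, since floating-point sums can differ slightly.
results_agree = math.isclose(manual_mean, tool_mean, rel_tol=1e-9)
print("match" if results_agree else "MISMATCH: recheck the query and the analysis")
```

Fixing the seed matters more than it looks: if the two passes run on different samples, a disagreement tells you nothing about which calculation is wrong.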

This isn’t, by the way, purely theoretical. Not that many years ago a client relied on a BI report to determine if a new process it was piloting was an improvement over the old one.

According to the report, it wasn’t, leading the team to hypothesize a number of different root causes. After two months of chasing their tails, it turned out the initial database query that provided the data being analyzed was inappropriate for the use to which it was being put, resulting in systematically wrong results. The new process was fine. The process metrics were faulty.

Big data danger No. 2: Spurious correlations
Logicians know that correlation doesn’t prove causation. Statisticians know that when you test variables that have no real relationship, about one test in 20 will still come out significant at the 0.05 level through random chance alone. They also know that the more analyses you run on the same data, the more of these chance findings you should expect. (So I’m told; I’m not a professional statistician. If you have analysts mining big data who don’t understand this principle any better than I do, it’s time to send them to statistics school.)

Back in college, when I took Psych 101, I learned there’s a statistically significant correlation between the length of a person’s big toe and their intelligence quotient. While there’s a thin possibility a causal relationship exists (perhaps genetic pleiotropy), most cognitive psychologists ignore this correlation as an unimportant anomaly.

The question is this: After mining your company’s data, would your executives be wise enough not to include toe measurements in the company’s applicant screening program, right alongside drug testing? If so, could they do this without descending into gut-trusting? That’s a fine line to tread, but fail to tread it properly and the possibility of very bad “data-driven decisions” is significant.

Big data danger No. 3: This isn’t why the data was collected
A big reason companies are collecting big data is to discover previously unrecognized correlations — patterns, trends, associations and so on — they can turn to their advantage.

Or to their disadvantage — professional statisticians know the importance of designing data collection so as to provide a sample that can be subjected to legitimate analysis. Analyze data not collected for that purpose and there are any number of ways to get wrong results out of it.

Real-world example: Back in my deep, dark past, I was responsible for analyzing the performance of the six presses owned by the daily newspaper I worked for. I dug into the database and discovered, to my surprise, that the press everyone thought was the best performer — it was 20 years younger and built on more modern technology — was in fact delivering the worst results.

Fortunately, I was just barely smart enough to talk over my findings with a frontline manager before presenting them to upper management. I say “fortunately” because the press manager I talked to explained the facts to me: The press in question ran all the most difficult jobs. As the database I was analyzing didn’t have job_difficulty_rating as a field, my analysis missed this subtlety.
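The effect is easy to reproduce with made-up numbers. In the Python sketch below, two presses stand in for the six, defect rate stands in for whatever performance measure was actually used, and every figure is invented for illustration. The point is only the shape of the result: the pooled numbers and the like-for-like numbers disagree.

```python
# Hypothetical job log: (press, job_difficulty, jobs_run, jobs_with_defects).
# The real database had no difficulty field; all numbers here are invented.
job_log = [
    ("old_press", "easy", 90, 9),
    ("old_press", "hard", 10, 4),
    ("new_press", "easy", 20, 1),
    ("new_press", "hard", 80, 24),
]


def defect_rate(rows):
    """Share of jobs with defects across the given rows."""
    jobs = sum(jobs_run for _, _, jobs_run, _ in rows)
    defects = sum(bad for _, _, _, bad in rows)
    return defects / jobs


presses = ("old_press", "new_press")

# Naive view: pool every job, ignoring difficulty.
overall = {p: defect_rate([r for r in job_log if r[0] == p]) for p in presses}

# Stratified view: compare like with like.
by_difficulty = {
    (p, d): defect_rate([r for r in job_log if r[0] == p and r[1] == d])
    for p in presses
    for d in ("easy", "hard")
}

for p in presses:
    print(
        p,
        f"overall={overall[p]:.0%}",
        f"easy={by_difficulty[(p, 'easy')]:.0%}",
        f"hard={by_difficulty[(p, 'hard')]:.0%}",
    )
```

Pooled, the newer press looks worse. Split by difficulty, it is better on both easy and hard jobs; it simply runs far more of the hard ones. Without the difficulty column, no amount of querying would have revealed that.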

Beware big data dangers — but don’t be scared off by them
None of this is a reason to ignore big data. Its potential value is, in many situations, significant and possibly transformational.

Doing it well is far from trivial, though. So when you head down the big data path, make sure your implementation minimizes the chances of getting answers that are clear, compelling, and wrong.