Big Data and the challenge of unstructured data
By 2022, 93% of all data in the digital universe will be unstructured.
There's a lot of buzz about Big Data and the privacy issues inherent in collecting and storing so much personal information. While that's a legitimate concern for both consumers and those who work in data security, industry insiders are facing another challenge: how to handle unstructured data.

What is unstructured data?

Unstructured data doesn't fit neatly into databases organised by fixed categories like name, address, social security number, etc. Unstructured data is the freeform information that is mined from things like social media posts, notes made by a call centre agent, email, or Twitter conversations with customers. Unstructured data can be an extremely rich source of relevant information, but it doesn't easily lend itself to older models of data storage and analysis.

What are some of the challenges?

The challenges of unstructured data run the gamut from gathering it, to storing it, to using it to make decisions:

1. Relevance

One way in which relevance comes into play is the lack of insight into the "backstory" of certain pieces of data. For instance, a student might search for a particular topic or product to gather information for a school paper, and then never search for those keywords again. That search would be irrelevant to any subsequent consumer behaviour, but the computers doing Big Data analysis wouldn't know that. The system assumes a relationship that simply isn't there.

Another big challenge in working with unstructured data comes into play with machine learning and highlights the importance of knowing which factors actually drive consumer behaviour. It's the classic "correlation or causation" dilemma on steroids. An analytic model could give too much weight to factors that are merely correlated, and, thanks to machine learning, the more the correlation is noted, the more weight it's given. But, since there is no actual causation, the conclusions are inaccurate, and they become more so as time goes on.
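The feedback loop described above can be illustrated with a toy sketch. It assumes a deliberately naive update rule, invented here for illustration and not taken from any real machine learning library: every time a factor is seen alongside a purchase, its weight moves a step closer to 1. The names reinforce_weight and learning_rate are hypothetical.

```python
def reinforce_weight(observations, weight=0.1, learning_rate=0.2):
    """Return the weight history for one factor under a naive update rule.

    Hypothetical rule (an assumption for illustration): each time the factor
    co-occurs with a purchase, the weight moves a fraction of the way to 1.
    The rule never asks whether the factor caused the purchase.
    """
    history = [weight]
    for factor_present, purchased in observations:
        if factor_present and purchased:
            weight += learning_rate * (1 - weight)
        history.append(weight)
    return history


# Three coincidental co-occurrences, e.g. shoppers who happened to search
# for a term before buying something unrelated.
coincidences = [(True, True), (True, True), (True, True)]
history = reinforce_weight(coincidences)
print([round(w, 4) for w in history])
```

In this sketch the weight climbs from 0.1 to roughly 0.54 after only three coincidences, even though nothing in the data shows that the factor caused anything. A real model is far more sophisticated, but the risk has the same shape.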

2. Volume

The volume of unstructured data is growing at a rate of 62% per year.

For many businesses, that's more than they can keep up with, and they may be collecting information they're not even aware of. That presents challenges for both using and securing the data. The lack of awareness makes enterprises more likely to run afoul of the increasing number of regulations addressing data privacy. Such a large volume of data also requires infrastructure that many businesses don't currently have, or haven't budgeted for.
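To put the 62% figure in perspective, here is a rough compound-growth calculation. It is a sketch that assumes the rate stays constant from year to year, which the article does not claim; the function names and the 100 TB starting point are made up for illustration.

```python
import math

ANNUAL_GROWTH = 0.62  # the 62% per year figure quoted above


def projected_volume(start_volume, years, rate=ANNUAL_GROWTH):
    """Volume after `years` of compound growth at a constant `rate`."""
    return start_volume * (1 + rate) ** years


def doubling_time_years(rate=ANNUAL_GROWTH):
    """Years needed for the volume to double at a constant `rate`."""
    return math.log(2) / math.log(1 + rate)


# Hypothetical starting point: 100 TB of unstructured data today.
print(f"After 5 years: {projected_volume(100, 5):.0f} TB")
print(f"Doubling time: {doubling_time_years():.2f} years")
```

Under that assumption, 100 TB grows to roughly 1,116 TB in five years, and the volume doubles about every 1.44 years. That is the kind of curve an infrastructure budget would have to follow.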

3. Quality

By nature, much unstructured data is unverified. There are plenty of jokes about "Facebook lives," in which a person's Facebook updates are more fantasy than reality. One effect of growing privacy concerns is that people tend to make up details for their profiles, so even the hard "facts" – like marital status and hometown – can be completely false. This presents serious challenges for both consumers and enterprises. On a consumer level, people could be harmed by companies that make decisions based on flimsy unstructured data, such as using a person's social media posts to help determine insurance rates. On an enterprise level, making business decisions based on inaccurate data could be extremely costly.

4. Usability

For unstructured data to be usable, businesses need a way to locate, extract, organise, and store it. That means developing an entirely new type of database for information that doesn't fit the mould.
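As a small illustration of the "extract" and "organise" steps, the sketch below pulls a few structured fields out of a freeform call centre note. The note, the ExtractedRecord type, and the patterns are all assumptions made up for this example; real-world text is messier and usually needs more than regular expressions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExtractedRecord:
    """Structured fields recovered from one freeform note (hypothetical schema)."""
    raw_text: str
    emails: list = field(default_factory=list)
    order_ids: list = field(default_factory=list)


# Simplified patterns, good enough for a sketch but not for production use.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ORDER_PATTERN = re.compile(r"\border\s+#?(\d{4,})\b", re.IGNORECASE)


def extract_record(note):
    """Locate email addresses and order numbers inside a freeform note."""
    return ExtractedRecord(
        raw_text=note,
        emails=EMAIL_PATTERN.findall(note),
        order_ids=ORDER_PATTERN.findall(note),
    )


note = "Customer jane.doe@example.com called about order #48213, wants a refund."
record = extract_record(note)
print(record.emails, record.order_ids)
```

The raw text is kept alongside the extracted fields, so nothing is lost if the patterns miss something. Deciding where and how to store records like this is the harder problem the section describes.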

Unstructured Big Data isn't going away. And that's a good thing, because it holds the opportunity for greatly enhanced planning and decision-making.