
Open Information vs. Big Data

Freedom of speech and freedom of information are enormously important things to debate and defend. It is clear that papers published in the course of government-funded scientific research should be available to the public by default, for example, or that crime and school statistics should be accessible. (You can read some of the arguments for and against this viewpoint on the Guardian’s FoI website.)

Having established that freedom of information is a Good Thing, what about freedom of data?

Maybe I should backtrack here, and ask: what is data, as opposed to information? (*ahem* More properly, what are data – the word comes from the Latin datum, ‘thing that has been given’, plural data – which is why many people use the plural in English. Back to the point…)

Data are recorded results, the outcome of some measurement or calculation. Each point on a graph is a datum, but one point doesn’t really tell you much about the thing you’re recording. Even a whole bunch of points aren’t particularly helpful. But if you take the scattered points and use them to show a relationship between age and height, say – if you do some work on the data to make sense of them – you can turn them into information.

Information is interpreted data. It is communicable and ideally useful knowledge, an insight into the behaviour of a system.
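
To make that data-to-information step concrete, here is a minimal sketch in Python (the age/height numbers are invented purely for illustration): the scattered points only become information once some work – here a simple line fit – extracts the relationship.

```python
# Minimal sketch of turning data into information.
# The (age, height) pairs are invented for illustration.
import numpy as np

ages = np.array([2, 4, 6, 8, 10, 12])              # raw measurements (years)
heights = np.array([86, 102, 115, 128, 139, 149])  # raw measurements (cm)

# Fitting a straight line is the "work done on the data":
slope, intercept = np.polyfit(ages, heights, 1)

# The fitted relationship is the information: a communicable insight.
print(f"Height increases by roughly {slope:.1f} cm per year over this sample.")
```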

Since the 1980s, more and more data collection has been automated, and data sets have grown exponentially – we’re not talking megabytes here, but terabytes, exabytes, zettabytes – sizes that are hard to deal with by conventional database means. These data come from all kinds of things – scientific experiments like those at CERN, simulations, Tesco Clubcards, iPhone logs, your web browser – almost anything using technology more sophisticated than a microwave, and in a few years, probably that too.

According to Wikipedia, we’re collectively creating about 2.5 quintillion bytes of data every day (1 quintillion is a billion billion, or 10^18 – so that’s roughly 2.5 exabytes), and by 2014 that will have doubled. We’re in the realm of ‘big data’ here – data sets so big that even simple operations take too long using conventional software. They are generally hard to process, and even harder to visualise (despite a few great examples of data visualisation).
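
For a sense of scale, here is that figure as a quick sketch (only the 2.5-quintillion-bytes-per-day number comes from the paragraph above; the rest is straightforward unit conversion):

```python
# Rough scale of the figure quoted above: 2.5 quintillion bytes per day.
DAILY_BYTES = 2.5e18   # 2.5 x 10^18 bytes = 2.5 exabytes

EXABYTE = 1e18
ZETTABYTE = 1e21

print(f"Per day:  {DAILY_BYTES / EXABYTE:.1f} EB")
print(f"Per year: {DAILY_BYTES * 365 / ZETTABYTE:.2f} ZB")
# About 2.5 EB per day, or roughly 0.9 ZB per year - well beyond what
# a single conventional database is designed to cope with.
```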

How can we reconcile this fact of cumbersome and ever-growing data sets with the principle of promoting and defending freedom of information? Should freedom of information apply to the data, the information or both?

Outside the USA and Canada, Facebook users are registered with Facebook Ireland Ltd. and Irish law provides for the right to access your own personal information that any online business has been collecting on you. In 2011 an Austrian student named Max Schrems submitted such a request and, after a lengthy bureaucratic process, received a document 1200 pages long. 1200!

Whether or not the 1200 pages on Max Schrems’s Facebook activity were all strictly necessary, and whether or not Facebook should have been keeping all those supposedly deleted messages and lists of de-friended people, I strongly doubt that a cumbersome 100 MB PDF was in a form that Schrems could easily process into useful information. On the contrary, this sounds like a data deluge: a wall of pure, near-raw data whose volume makes it all but impenetrable.

Part of the reason for the relentless detail is legal: by handing over everything, Facebook could not then be accused of under-complying. But if the unwieldy format is specified by law, the laws need to keep up with what’s in the public interest. The usability of data is a huge issue here, and it is only getting more important as data sets grow.

The basic problem with Freedom of Information requests is as follows:

  1. Citizen files a Freedom of Information request with an organisation (public or private), possibly with a specific query, but relating to a large data set.
  2. If at all possible, the organisation hides the information relevant to the request inside a massive data set, thereby complying with the letter of the law.
  3. Citizen cannot process the data set and likely gives up; the information remains useless.

Drowning in data?

To me it seems like it should be in many companies’ interests to be as open and transparent as they can be when dealing with requests for personal data. I don’t care what Google does with the data I continually feed it, as a rule, but that’s partly because I trust it to be honest and tell me what it’s using it all for. I also trust it (perhaps in error) to lay off using my data if I ask it to.

Facebook, on the other hand, has followed a different path, but maybe the difference comes down to how captive the customer base is: you can’t leave Facebook unless all your friends leave too, whereas stopping using Google is a relatively easy individual choice – on the face of it, anyway. Regardless, Google hasn’t made a habit of pissing off its users the way Facebook can.

So in some cases it is in the interests of the data user (e.g. Google) to be transparent about how they use data. Those cases will no doubt take care of themselves – though they may need some extra encouragement to be accountable to the public interest, alongside what is essentially also a branding exercise.

What about the cases where transparency is in the public interest, but not in the interest of those keeping and using the data? The resistance to transparency could be nefarious, or it could simply be laziness. In those cases, there is an increasing temptation – and ability – to use the realities of Big Data to obscure useful information.

For example, as Fred Pearce asks, should scientists always disclose their raw data along with their findings? Arguably yes, especially if they’re publicly funded – but how far should those scientists go to help requesters actually deal with it? Is a 5 terabyte data set of climate records worth anything without knowing how to interpret it? There is much about the processing methodology that scientists would prefer to keep confidential, at least until publication. Having released the data, these scientists will no doubt be hassled with hundreds of requests for help interpreting their monster data sets. How much time and effort is all this really worth?

There really is a problem here. Big Data means that organisations have the ability and the incentive to collect and use vast amounts of data on people. They can use this, with a sprinkling of multivariate statistical inference, to work out what sort of cheese you might buy next time you go to Tesco, whether you’re likely to click on an ad about breast enlargements, and how likely you are to contract lung cancer before you’re 50. Yet when you ask what they have on you, they only have to tell you about the data, which may run to 1200 pages or more and be completely indigestible. The actual information is still proprietary and you’re none the wiser.
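
To give a flavour of what that ‘sprinkling of multivariate statistical inference’ might look like, here is a toy sketch – a classifier trained on a few invented behavioural features (none of this reflects any company’s actual model or data) – showing how logged behaviour becomes a prediction, the kind of information that never shows up in a 1200-page data dump.

```python
# Toy illustration only: invented features, labels and model -
# not any company's real system. Logged behaviour (data) in,
# a proprietary prediction (information) out.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [weekly shop visits, cheeses bought last month, ads clicked]
X = np.array([
    [1, 0, 0],
    [2, 1, 3],
    [3, 2, 1],
    [4, 4, 5],
    [5, 3, 2],
    [6, 5, 6],
])
# Label: bought speciality cheese on the next visit (invented)
y = np.array([0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

new_customer = np.array([[3, 1, 2]])
probability = model.predict_proba(new_customer)[0, 1]
print(f"Estimated chance of a cheese purchase next visit: {probability:.0%}")
```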

What can someone without a degree in statistics and computer science do?