What the biggest science experiment on earth can teach us about handling and making sense of big data
There is an experimental platform imagined in the classic 80s film Tron, in which all forms of research can be carried out at unparalleled speed. It was called The Grid, and 30 years on, this futuristic machine has become a reality.
Down in a series of vast underground caverns and tunnels beneath France and Switzerland, the Large Hadron Collider is about to switch on again after two years of fine-tuning and upgrades. When it does, it will be operating at twice the energy and producing an enormous, mind-blowing quantity of data. And all of that data will be processed by the Grid.
The numbers involved are huge.
When the switch is flicked in the new year the Grid will start to consume 160,000 gigabytes of information every day. That works out at nearly 60 petabytes a year. To give you an idea of size, one petabyte is big enough to store the entire DNA material of everyone in the USA.
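The conversion is a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the scale described above.
gb_per_day = 160_000                     # gigabytes consumed per day
gb_per_year = gb_per_day * 365           # gigabytes over a full year
pb_per_year = gb_per_year / 1_000_000    # 1 petabyte = 1,000,000 gigabytes

print(f"{pb_per_year:.1f} petabytes per year")  # roughly 58 PB a year
```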
But how does the Grid cope with such a large amount of data?
The Grid is a global network of computers, originating at CERN, that quickly processes and shares the big data generated by the trillions of collisions in the Large Hadron Collider. Like its futuristic forebear in Tron, the Grid is incredibly quick. It is composed of hundreds of thousands of processors working in parallel, and it is organised into a series of tiers that pass the data along the line, refining it as it goes. This enables information from the experiment to be shared almost in real time with scientists anywhere around the globe.
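The tiered flow can be pictured as a chain of stages, each receiving data from the one above, refining it, and passing a smaller, cleaner set along. This is a toy sketch of that idea only, not CERN's actual software; the event fields and energy cut are invented for illustration:

```python
# Hypothetical sketch of a tiered pipeline: each tier refines the data
# it receives and hands a smaller, cleaner set to the next tier.

def tier0(raw_events):
    # Tier 0: capture at the source and first-pass filtering.
    return [e for e in raw_events if e["energy"] > 10.0]

def tier1(events):
    # Tier 1: reconstruction and refinement at regional centres.
    return [{**e, "reconstructed": True} for e in events]

def tier2(events):
    # Tier 2: analysis-ready subsets for physicists around the globe.
    return [e for e in events if e.get("reconstructed")]

raw = [{"energy": e} for e in (2.0, 15.0, 50.0, 7.5, 120.0)]
analysis_ready = tier2(tier1(tier0(raw)))
print(len(raw), "raw events ->", len(analysis_ready), "analysis-ready")
```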
Handling the data
But the key to its success is not archiving all the data, but reducing it to a manageable level.
CERN’s data expert Pierre Vande Vyvre has spoken recently about the future of big data processing. He says the emphasis will shift to standard algorithms that do the heavy lifting first, reducing the data to a more manageable size before it gets anywhere near human analysis.
“The big science workflows are mainly data reduction. Currently just 1% of data from collision events [at the Large Hadron Collider] are selected for analysis. The archiving of raw data is not the standard anymore.”
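The selection Vande Vyvre describes can be sketched in a few lines. This is a toy illustration, not the real LHC trigger: each simulated event gets a stand-in "interest" score, and only the top 1% survive for analysis rather than being archived:

```python
import random

# Toy data-reduction sketch (not the real LHC trigger): score simulated
# collision events and keep only the most "interesting" 1% for analysis.
random.seed(42)
events = [random.random() for _ in range(100_000)]  # stand-in interest scores

# The 99th-percentile score becomes the selection threshold.
threshold = sorted(events)[int(len(events) * 0.99)]
selected = [e for e in events if e >= threshold]

print(f"kept {len(selected)} of {len(events)} events "
      f"({100 * len(selected) / len(events):.2f}%)")
```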
What can big science teach us about big data?
CERN is well prepared to handle its data. Like the best big science projects, it has a clear theory to test against the data, it has set up clear parameters that narrow the data down, and it has the in-house skills to analyse what it finds.
It's an important framework for dealing with big data, and one that companies and governments could learn from as they set up their own big data initiatives to help them make predictions about the future.
The Obama administration recently launched a $200 million R&D initiative to 'improve the government's ability to extract insights from digital data', Google has been trying to map flu trends by studying search queries, and Amazon has been using customer data to try and automate our choices for years. All of them are chasing the big data dream.
But it's when you start to use big data to try and predict future behaviour that you get stuck.
The data comes from social systems, which are fraught with unpredictable change. Unless you understand all the future variables, big data tells us more about the past and the present than about the future. As Juan Mateos-Garcia argues, “Big data is lurking with biases, mirages and self-fulfilling prophecies. Avoiding them requires access to the right skills and organisation”.
Big data needs big judgement
Operations like the Grid at CERN point the way to better management and analysis of huge data sets. Crucially, they not only have the skills to build the sensors and the computing architecture to deal with the massive flow of data, they also have the judgement to reduce the data and to analyse it afterwards.
Big data is still a judgement call. Despite companies investing heavily in big data processes, most do not have the skills to analyse what is coming in. It's a problem, and an opportunity, that we've set out in our report Model Workers.
The end of theory
Chris Anderson, the former editor-in-chief of Wired, made the bold call that big data would spell the end of all theory. Like the Master Control Program in Tron, big data algorithms will work many times faster than a human and will be able to answer our global problems before they arise.
The claim is misplaced because it is naïve to assume that machines can decide what data to collect and how to interpret the patterns in that data as information for other machines to use. It echoes the techno-utopian belief that brought world markets to their knees in 2008, when the idea that computers could measure, control and self-stabilise societies was proved dramatically wrong: the complex algorithms that ran the financial markets failed to predict the giant debt bubble.
Adam Curtis carefully dismantled this dream of computers running things in his brilliant TV series ‘All Watched Over by Machines of Loving Grace’ (2011). The danger now is that the predictive power of big data is feeding the same fantasy, leading to the same distortion and simplification of the world around us.
Keeping big data human
When it is switched back on, the data processed by the Grid will make the Large Hadron Collider the biggest of the world's Big Science experiments.
But it won’t be the machines peering into the underlying structure of the universe.
That’s up to us.
Image: The Grid's Tier 0 data center courtesy of CERN.