Big data in social research: Access and replication
In the first part of this two-part post, I argued that big data has the potential to open up exciting new avenues in social research. But much of the world’s data is commercial, and private. Access, and in particular access for the purpose of replicating published results, remains difficult. In this second part, I illustrate the problem, and suggest the direction possible solutions will take.
The increasing use of big data in social research is exciting. But this trend is on a collision course with another emerging research trend: full and open publication of data and code.
In the early days of academic publishing, it was common to include full data tables in articles. These were often no more than a page long, and few other means of dissemination existed. Then, as datasets grew larger and journal pages more crowded, convention shifted to 'data available on request from the author', an extra step that discouraged careful review.
This is still too often the case, and it leads to persistent errors. Recently, a paper by economists Carmen Reinhart and Kenneth Rogoff, which had been highly influential in the stimulus-vs-austerity debate, was found to contain just such an error. (The mistake came to light only after a graduate student obtained the original spreadsheet calculations used by Reinhart and Rogoff. There's no doubt it would have been detected earlier had the spreadsheet been published as a matter of course.)
Thankfully, the advent of the web has made it once again possible for all data and calculations supporting academic papers to be published, and leading journals have required this for some time now.
Unfortunately, datasets from internet giants like Facebook, LinkedIn and Google do not fit this paradigm. Not all such companies cooperate with academic researchers. But even when they do, detailed supporting data are almost never made public, which prevents careful review and replication by other researchers.
Many companies offer public APIs to access selected data, but these are usually designed with the commercial, rather than research, ecosystem in mind, and they are often not useful for replication.
Three barriers prevent publication of more data. The first is sheer size: Facebook's data is measured in tens of petabytes. The second is commercial confidentiality: for many of these businesses, data is a key competitive advantage. The third is user privacy: users would not stand for all their data being made available to the (research) world at large, even with the best of intentions.
These are difficult challenges, but partial solutions exist. The authors of the Facebook small-world experiment, to their credit, made available intermediate data, allowing some degree of replication by other researchers.
There is a field of research, 'statistical disclosure control', dedicated to examining the problem of maintaining individual privacy when disseminating statistical data.
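One of the simplest ideas from this field is k-anonymity: before release, every combination of quasi-identifying attributes (age band, region, and so on) must be shared by at least k records, so that no individual can be singled out. As a minimal sketch, using entirely hypothetical survey data and a made-up `violates_k_anonymity` helper:

```python
from collections import Counter

def violates_k_anonymity(records, quasi_identifiers, k=3):
    """Return the quasi-identifier combinations shared by fewer than
    k records; any such group risks re-identifying an individual."""
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical extract: age band and region are the quasi-identifiers.
records = [
    {"age_band": "30-39", "region": "North", "income": 28000},
    {"age_band": "30-39", "region": "North", "income": 31000},
    {"age_band": "30-39", "region": "North", "income": 29500},
    {"age_band": "60-69", "region": "South", "income": 45000},
]

risky = violates_k_anonymity(records, ["age_band", "region"], k=3)
# The lone 60-69/South record fails the check and would be suppressed
# or coarsened (e.g. into a wider age band) before release.
```

Real disclosure control goes far beyond this, of course, but the example illustrates the basic trade-off: protecting privacy means withholding or blurring exactly the fine-grained detail that replication often needs.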
Public institutions such as the UK Data Service and the ONS have a great deal of experience with the parallel problem that arises in official survey data. There are already secure facilities with both the technological capacity to handle large datasets and the controls needed to minimise risks to privacy: these offer a model to imitate.
The academy is beginning to recognise the extent of the problem, and develop solutions, both institutional and technological, for safely disseminating large proprietary datasets. The Economic and Social Research Council recently announced £14 million in funding for Business Datasafe, a repository designed for just such data.
This is a good first step, but alone it will not be enough. Business leaders must be persuaded to sign up to the broader 'data philanthropy' agenda: even without the Datasafe, they could expose data directly in ways more useful to researchers. And users need to have confidence that their personal data will remain private.
The social payoff to greater research using big data could be very large. But if we fail to address the access issue, some of the most fascinating and influential results in the next few decades of social science will remain unreviewed, untested and unreplicated. If that happens, then we will be squandering much of the value of this new world of research.