We live in a world of exponentially expanding data. Digitisation and the emerging internet of things have created a world in which our daily activities leave a digital trail.
To an organisation or an individual with the right skills, that digital trail becomes data, able to be probed and interrogated for meaning, for correlations and for trends. But in the rush to take advantage of this tsunami of zeroes and ones, it's important to remember that not all data is created equal.
Last week undercover economist Tim Harford took aim at big data in the Financial Times. One of his main critiques was that such data is a byproduct: "'found data', the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast". As a result, it suffers from many drawbacks: it's unlikely to be representative of a broader population of interest; the underlying collection mechanism may change unpredictably; it may not quite reflect the measures of interest to an analyst. It's a fair complaint, but it's not unique to big data.
Any quantitative investigation raises a choice: to collect a new dataset, or to use existing data from some other source. There are clear trade-offs between these two options.
Collecting new, 'primary', data is expensive, and since good data collection is hard, there is a high risk of unforeseen issues affecting data quality or quantity. Even with extensive testing, survey questions can be misinterpreted and scientific instruments miscalibrated: 'the field' is a messy place.
Existing 'secondary' data, while cheap and easily accessible, brings with it a different set of risks: it may not measure exactly what you'd like; methodological changes, hidden in footnotes, may induce breaks in series; and reuse of data, even by different researchers, raises serious - and poorly understood - risks of 'data snooping' (as the same data is used to both suggest and then test hypotheses).
The reality, of course, is not quite as simple as that. One researcher's primary data is another's secondary data, and in some situations 'found data' may be considered primary, although it shares the characteristics of secondary data. A better distinction might be the degree to which data has been collected to answer specific questions or address specific needs: whether the data is collected to fit the question, or the question is fitted to the available data.
Economics, my field and Harford's, has traditionally been especially dependent on secondary data (although this is beginning to change). This has had a huge influence on the questions we address, and all too often economic research feels like the drunk man searching for his keys under a streetlight.
For example, macroeconomic models contain an abstract, pure notion of 'income', and then in practice substitute a particular, flawed measure called GDP, which is assembled by teams of statisticians, on the basis of figures prepared by accountants, according to a mix of international conventions and idiosyncratic choices. Those accountants are working to the dictates of managers, not economic researchers, and while the statisticians can make certain adjustments and cross-checks, there's no getting past the fact that GDP is ultimately 'found data'. The only thing that really differentiates it from the 'digital exhaust' of Harford's complaint is that we've been collecting it for long enough to understand, with some certainty, exactly how wrong it is. (And while economists recognise and pay lip service to all this, they mostly continue to act as if income = GDP.)
There is, then, something rare and valuable about purposively collected, well-tested social data. So it comes as a relief that, also last week, the National Statistician Jil Matheson recommended that England and Wales retain a full decennial census, albeit with a pragmatic move to primarily online collection. The main alternative option under consideration by the ONS relied heavily on administrative data (e.g. NHS records), collected primarily for other purposes. It's an exciting idea in theory, but would have reduced the nation's flagship social dataset to 'found data', with all the risks that entails.
There is no question that the census is expensive - £480 million in 2011 - and that this expense means compromising other aspects, such as frequency of collection. Administrative data certainly offers unexploited potential for researchers hoping to better understand our society. But as any economist will tell you, there ain't no such thing as a free lunch, and in this as in other cases, you get the data you pay for.