John Maynard Keynes once said, in relation to investors' aversion to 'stepping out of the box', that “worldly wisdom dictates that it is better for the reputation to fail conventionally than to succeed unconventionally.” There are some similarities between those investors' situation and the risks faced by a researcher considering new sources of data, such as those falling under the vague umbrella of big data. In this case, data unconventionalism carries risks such as:
- That no-one will believe your findings if they go against the grain
- That everyone will yawn if your findings go with the grain
- That people will uncritically make the wrong decisions on the back of a shaky data foundation.
These are important challenges for us. We are involved in several projects that use new sources of data with the goal of generating new knowledge relevant for policymakers, entrepreneurs and managers. They include:
- Mapping the UK games industry with data from web sources like wikis, review aggregators and product directories
- Using social media data to study the effect of innovation events on the connectivity of participants (here’s a blog), or
- Measuring music consumption with data from file-sharing sites and digital distribution platforms (another blog)
…among others. The fact that none of these projects would have been practical until recently underscores the fresh opportunities opened up by new data sources. However, using these data sources also presents risks – such as biases and measurement errors like those highlighted in Tim Harford’s recent Financial Times article (see also Kate Crawford’s writing on this, and this blog by Nesta colleague Andrew Whitby).
How do we communicate the findings of this research? How can we simultaneously convey the novelty and value of our findings, caveat our findings, and avoid slipping into trivialities?
These old problems are intensified by quality concerns about new data sources. Here I map them - together with potential strategies to address them - using that trusty complexity-reduction tool, the two-by-two matrix.
Enter the matrix
The dimensions of this matrix capture two features of the situation where we are communicating research findings based on new and unconventional sources of data. They are:
- The sophistication of the audience: its understanding and appreciation of the analytical methods used to extract insights from data, and/or their knowledge of the domain (e.g. the industry or location) that we are studying.
- Whether the findings of the research are consistent with the audience's priors: that is, the extent to which our insights are expected or, on the contrary, unexpected or surprising. Less charitably, this would also capture whether the audience is biased or has a vested interest in some findings over others.
Let’s go through each of these quadrants in turn, starting with...
- The ‘Really?’ quadrant: Here, we are presenting surprising or counter-intuitive findings to a sophisticated audience. The question that will echo across the auditorium is: “Really? Are you sure that this is a robust finding, and not an artifact of this newfangled data source you are using?” One serious risk is that outliers, exceptions or mistakes in the data cast doubt on the research. To some extent, this is what happened to the Tech City Map when some journalists found companies there which were neither tech nor start-ups. To avoid this situation, we need to do due diligence on our data and work hard to ensure the findings are robust. Digging deep into the literature for theories and examples that explain any unexpected finding will also be important. So far, so academic.
- The ‘What’s New?’ quadrant: In this case, research findings are consistent with audience expectations (e.g. what the literature says), and the audience is sophisticated. The challenge is to convey our contribution: “What’s new, friend?” Going beyond the fact that replicating previous research is an essential aspect of scientific work, we can also turn our use of new and (perhaps) unproven data sources into a contribution. Put simply, confirming old results with new data tells us something about the old results – and about the new data. An example of how new data sources can play this role is MIT Media Lab’s Pantheon project, which uses Wikipedia articles to explore Elizabeth Eisenstein's theory about the link between changes in media and changes in culture. An important benefit of doing this work is that it might encourage others to dip their toes into new data sources.
- The ‘So what?’ quadrant: What if our findings confirm what everybody knows (or at least thinks they know)? Why should anyone care? In this quadrant, the audience is less interested in methods, replication or generalisation of findings. Some options to avoid their indifference include quantifying the magnitude of impacts, estimating striking statistics, illustrating findings with interesting examples, or presenting them in visually innovative ways, like many ‘infoporn’ visualisations do.
- The ‘Kool Aid’ quadrant: In the last quadrant, we have unexpected findings and an audience that is less sophisticated analytically, and/or in its domain knowledge. The risk here is that they might ‘drink the kool aid’, uncritically embracing findings based on potentially problematic data: imagine if the US health services had quickly acted on the information generated by Google Flu Trends when this application overestimated the number of flu cases in 2012-2013 (apparently due to changes in the way people use Google’s search engine to find information about the flu). Doing everything we can to ensure the robustness of our findings, triangulating them with conventional data, and being transparent and honest about the ‘experimental’ aspects of our research is definitely the way to go here (and this is in fact what Google did with Google Flu Trends from day one).
The matrix above doesn’t consider the purpose of the research we are communicating – the goals and challenges for maps and early warning systems aimed at ‘making the invisible visible’ will be different from those for impact evaluations, where establishing causality is more important. Perhaps we could add another dimension to the matrix and turn it into a cube? Next time.
For now, suffice it to highlight one strategy that can help regardless of the quadrant we're in: ensuring that we understand the data we are using and its limitations, being transparent about our methods, and being able to set our findings in the context of wider literatures and experiences (including the domain knowledge of people in the field). In other words, the kind of stuff taught in Research Methods 101. Hardly unconventional, but simply what’s needed if we are going to create long-lasting value from these new, exciting and (for now) unconventional data sources.
(We’ll be sharing our experience + findings + learnings with these data sources in future posts. If you have any questions, use the comment box below, or contact me at [email protected] or @JMateosGarcia).
(Image: Gzthermal by Scott Schiller).
 Say, found data from the web, which is not big in volume but varied in structure, and has velocity in that it’s recent.
 Even when we use traditional data collection methods, like surveys, the fact that we are looking at relatively new and poorly understood phenomena (for example, the adoption of data practices in our datavores stream of work), combined with the small sample sizes involved, means that the risks identified above still apply to some extent.
 Some caveats: of course this is a simplification of reality. Most research will fall somewhere on a continuum along each of our two dimensions.
 To be sure, research can have multiple audiences and be communicated at multiple levels.
 For simplicity, let’s assume that all the findings of the research are expected, or all unexpected, at the same time. If they aren’t, just consider each finding independently of the others.