Sampling society? The foundations of a new frontier in understanding society

Social media data, such as that produced by Twitter and Facebook use, is opening up new ways to understand society, but how do people obtain this data, and what are the challenges in using it to learn about our lives?

API days

While it is common knowledge in the tech community, how social media data is accessed is arguably less well known elsewhere in business, public policy and the social sciences. Direct public access to the data of social media platforms (as opposed to obtaining it indirectly by scraping/extracting it off webpages, an act that’s prohibited by many platforms) is typically via their public Application Programming Interfaces (APIs). APIs provide a set of instructions that allow people to write programmes/applications (apps) related to the platform. In doing so, APIs allow the collection of platform data which can be used in apps, but also for research.

Data is typically collected from the API by running queries that ask for platform information on activities that match certain criteria like keywords, a particular geographical area or time period. Responses are returned in a structured format, such as JSON or XML, which is then processed to extract the data for analysis. One challenge is narrowing a query so that it covers a topic of interest, as opposed to returning a lot of data that is less relevant for the user. Another issue is that the information returned may be drawn from a sample of the platform’s available data: for example, Twitter’s streaming API provides access to a proportion of all tweets (for a discussion on the sampling of Twitter’s APIs, see here). Furthermore, aside from the form of data requested, when collection is done on a large scale there are practical issues about how best to query the API and store the returned data. To cater for growing demand for data in this area, commercial providers specialising in the extraction, storage and analysis of social media data are increasingly springing up.
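
As a rough illustration of the query-and-process step described above, the Python sketch below parses a JSON response and keeps only the posts that match a keyword and fall inside a geographical bounding box. The response, its field names (results, text, lat, lon) and the bounding box are invented for illustration. They are not any platform's actual schema, and a real collection would also involve authentication and a call to the platform's API.

```python
import json

# Hypothetical API response. The structure and field names are assumptions
# made for this sketch, not the schema of any real platform.
SAMPLE_RESPONSE = """
{"results": [
  {"id": 1, "text": "Walking along the Thames at sunset", "lat": 51.503, "lon": 0.003},
  {"id": 2, "text": "Great talk at the conference", "lat": 48.857, "lon": 2.352},
  {"id": 3, "text": "Cable car over the Thames!", "lat": 51.500, "lon": 0.010}
]}
"""


def filter_posts(raw_json, keyword, bbox):
    """Return posts whose text contains `keyword` and whose location is in `bbox`.

    `bbox` is (min_lat, min_lon, max_lat, max_lon).
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    posts = json.loads(raw_json)["results"]
    matches = []
    for post in posts:
        in_box = (min_lat <= post["lat"] <= max_lat
                  and min_lon <= post["lon"] <= max_lon)
        if in_box and keyword.lower() in post["text"].lower():
            matches.append(post)
    return matches


# Approximate bounding box around Greater London (illustrative values).
LONDON_BBOX = (51.28, -0.51, 51.70, 0.33)

thames_posts = filter_posts(SAMPLE_RESPONSE, "thames", LONDON_BBOX)
matching_ids = [post["id"] for post in thames_posts]
print(matching_ids)
```

Tightening the keyword or the bounding box is one simple way of narrowing a query to the topic of interest, at the cost of missing relevant posts that use different wording or carry no location.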

The extent of the data access provided by APIs varies greatly across platforms, in part because platforms have different privacy policies, which also change over time. A full discussion of this would require several posts in itself, but, broadly speaking, platforms with a greater emphasis on public communication (e.g. Twitter and Flickr) consequently give greater public API access to data on users’ core platform activities. On platforms focusing on personal or commercial relationships, such as Facebook and LinkedIn, people are more likely to restrict access to much of their data, with resulting implications for what is available via the API. Some high-profile research has been done using Facebook and (to a lesser extent) LinkedIn data, but this has often been in direct partnership with the platforms themselves, who of course study the data intensively. A recent example of external analysis done on Facebook data is Wolfram Research’s data donor programme, where people shared their data.

The Thames going east from central London, as seen from cyberspace (spot the Millennium Dome and nearby Thames cable car)

Image based on the locations of over a million geo-tagged photos on Flickr by 38,255 photographers or, at least, photo accounts.

A select crowd

If social media holds a mirror to society, it is one where many people are, by choice, missing in the reflection or are appearing as distorted versions of themselves. This has potential implications for the conclusions that can be drawn from its data. While the skew to younger age groups will decline with time, the intrinsic self-selection of social media users and the information they share is unlikely to go away.

The amount of social media data collected in response to API queries will vary depending on the extent to which users have undertaken an activity meeting the query criteria, but also according to whether users or the platform have restricted public access to certain kinds of information. Alternatively, people may have decided not to share certain kinds of information via social media, may be unable to provide the information (perhaps they don’t have a phone that can geo-tag) or may have decided not to participate in social media at all. All of these different choices can potentially affect the robustness of conclusions drawn from social media data. This data is also evolving over time as people join/leave platforms, add/remove their information or change their privacy settings (to say nothing of changes by the platforms themselves). Data returned for a query at one point in time may therefore not be the same as that collected at another.

In terms of adjusting for the effects of self-selection, a limitation of social media data is that most people are understandably unwilling to allow public access to extensive systematic information about themselves. This means that it’s often difficult in practice to use personal characteristics to adjust for/understand behaviour, or see how representative a sample of users is of the population in general. A related issue is that the unit of the data may be inconsistent: is the data being collected on a person or an institution, for example? The Flickr data in the picture above includes photos from a number of companies, charities and government departments, as well as individuals.

In some cases, the use of social media is sufficiently widespread that there can be quite complete coverage of a group of interest. In our on-going research project looking at the impacts of events and festivals on social networks, around 80% of the LeWeb2012 London conference attendees we are studying were on Twitter, allowing us to analyse the connections made by participants at the event. Social-network data raises its own challenges, as understanding a network’s properties depends on having information on its members in a way other kinds of analysis do not. If a few key people are missing from the data, then network properties can look quite different: for example, two apparently separate groups of people may in fact be connected. Sampling a network therefore has different implications to sampling other kinds of data. This relates to social networks, rather than social media per se, but the rapid growth of such datasets via social media means that it is of increasing importance.
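
The point about missing people can be sketched with a toy network. In the Python example below, the names and connections are made up for illustration and are not taken from our conference data. One person, kim, links two pairs of attendees, and dropping her from the data makes a single connected network look like two separate groups.

```python
from collections import deque


def connected_groups(edges, missing=frozenset()):
    """Return the connected groups in a network, ignoring anyone in `missing`."""
    nodes = {person for edge in edges for person in edge if person not in missing}
    adjacency = {person: set() for person in nodes}
    for a, b in edges:
        # Keep a connection only if both people are present in the data.
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from `start`.
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            person = queue.popleft()
            group.add(person)
            for neighbour in adjacency[person]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(group)
    return groups


# Hypothetical connections: kim is the only link between the two pairs.
EDGES = [("ann", "bob"), ("bob", "kim"), ("kim", "dan"), ("dan", "eve")]

full_network = connected_groups(EDGES)
without_kim = connected_groups(EDGES, missing={"kim"})
print(len(full_network), len(without_kim))
```

With everyone present the function finds one group of five people; with kim missing it finds two groups of two, even though nothing about the remaining four people's behaviour has changed.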

The wide frontier

Its challenges aside, social media data has huge potential as a source of information to help us understand our world. It offers us large-sample spatial and temporal data on society at a comparatively low cost. Moreover, it should not be thought that its challenges are entirely unique to it: self-selection issues, of course, also occur in conventional survey data. Techniques have been developed to deal with these biases in surveys, such as collecting background information on those surveyed, repeated surveying of individuals over time (so-called panel data), providing financial incentives to participate, targeted surveying of specific groups, or state compulsion (e.g. the census). However, these can be very expensive when done on a large scale.

In principle, social media provides, at low cost, both a way of distributing surveys to large numbers of people and data that offers unique insights into human behaviour. It therefore seems very likely that the limitations of both social media data and surveys will increasingly be addressed in future by a hybrid of the two. Matching surveys to social media data has, for example, been used in recent work by Moira Burke and Robert Kraut to look at how people’s social networks affect the psychological impacts of becoming unemployed and the likelihood of finding another job. More generally, there is already convergence as surveys increasingly move online, onto mobile devices and through social networks.

Combining social media and surveys is just one approach in what is an evolving area. This is still a new medium, and how best to analyse it, the questions it can answer, and its implications will be explored by (among others) spies, sociologists, economists, and historians for decades to come and, on a more modest scope and timescale, by us in our research and a series of posts in the coming year.


John Davies


Principal Data Scientist, Data Analytics Practice

John was a data scientist focusing on the digital and creative economy. He was interested in the interface of economics, digital technology and data.
