How we’re doing data in the Political Futures Tracker

As our last blog on the Political Futures Tracker explained, we at Nesta believe that the extent to which politicians think about the future matters. The Political Futures Tracker will enable us to analyse the levels of future thinking and sentiment in texts, and to identify the key content and themes that politicians are writing and speaking about.

It flags up a fundamental problem with UK politics: political rhetoric is too focused on the short term, making political footballs out of important and often contentious topics. The political landscape in the UK encourages this short-term outlook and, as our recent Future Shock event showed, this sometimes means that politicians miss the next big thing.

But if we want to track what politicians are thinking, we first need to know what they're saying...

...so we initially focused our data collection efforts on two sources:

  • MPs' own Twitter accounts
  • Political party websites, in particular their blogs and press releases

These sources complement one another, with Twitter providing a continuous stream of short texts and party websites providing longer, more detailed articles at a slower rate.

Social media data tends to focus on specific events. It therefore captures commentary, but not extensive discourse signalling politicians' thoughts on themes that are not the subject of current debate. Tweets are of limited length and do not follow conventional word order or spelling patterns, making it more difficult to analyse the kernels of information that lie within. Sarcasm and abbreviations, which we take for granted and easily interpret, are tricky too (the software doesn't always have a good sense of humour), so we will develop functions that take the complexities of the English language into account.

The longer-form texts, such as blog posts or documents from party websites, elaborate on key themes. They complement the Twitter data, so along with the ability to feed in very long documents (such as manifestos and transcribed speeches) on an ad hoc basis, we will get a better impression of the whole policy landscape in the run-up to the 2015 election.

Twitter

Twitter provides a rich set of Application Programming Interfaces (APIs), which give us access to Twitter's data. They can be used by anyone (subject to certain rate limits) and let our developers see under the hood of how Twitter works. Think of the API as a thin, see-through layer that sits on top of Twitter and can pull through any of its raw material, like Tweets posted in real time, so that it can be used in new and exciting ways.

Of particular interest for us is the Streaming API, which delivers Tweets in real time as they are posted. For the Political Futures Tracker, we want to collect not only the original Tweets made by an MP, but also retweets of these and Tweets addressed directly to an MP (either questions or replies to one of the MP's Tweets). All of these are available via a single stream: you tell Twitter which users you are interested in and it delivers all Tweets, retweets and replies for those users.
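To illustrate the three kinds of Tweet described above, here is a minimal Python sketch that sorts decoded Tweets from such a stream into originals, retweets and replies. It is not the project's actual code. It assumes each Tweet arrives as a dictionary using the field names of Twitter's classic JSON format (`user.id_str`, `retweeted_status`, `in_reply_to_user_id_str`), and the function name `classify_tweet` and the set `mp_ids` are hypothetical.

```python
def classify_tweet(tweet: dict, mp_ids: set) -> str:
    """Label a decoded Tweet relative to a set of MP user ids.

    Returns "retweet" for a retweet of an MP's Tweet, "original" for a
    Tweet written by an MP, "reply" for a Tweet addressed to an MP by
    someone else, and "other" for anything that matches none of these.
    """
    retweeted = tweet.get("retweeted_status")
    if retweeted is not None:
        # A retweet embeds the Tweet it repeats; check who wrote that one.
        original_author = retweeted.get("user", {}).get("id_str")
        return "retweet" if original_author in mp_ids else "other"

    author = tweet.get("user", {}).get("id_str")
    if author in mp_ids:
        return "original"

    # Replies and direct questions carry the id of the user they address.
    if tweet.get("in_reply_to_user_id_str") in mp_ids:
        return "reply"

    return "other"
```

Checking for a retweet first matters in this sketch, because a retweet also has an author of its own and would otherwise be judged by who retweeted it rather than by whose Tweet it repeats.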

In a fortunate coincidence of timing, Iain Collins from BBC News made this Tweet in mid-October:

Tweet

The file he published on GitHub, an open resource website for the software developer community, was a list of (among other things) the name, constituency and Twitter username of every UK MP. The list contained some errors, but we were able to fix these by hand and build an updated version, which formed the basis of our Twitter data collection effort.

Since we started collecting Tweets in late October, we have gathered 341,000 separate Tweets.

The graph below shows the number collected in each one-hour window over the first ten days.

Tweet collection graph
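Counts per one-hour window like those in the graph can be produced by rounding each Tweet's timestamp down to the hour and tallying. The Python sketch below is an illustration rather than the code behind the graph. It assumes timestamps in the style of the `created_at` field in Twitter's classic JSON format (for example "Mon Oct 27 09:15:00 +0000 2014"), and the function name `hourly_counts` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Assumed timestamp layout, e.g. "Mon Oct 27 09:15:00 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def hourly_counts(created_at_values):
    """Count Tweets per one-hour window, keyed by the start of each hour (UTC)."""
    counts = Counter()
    for value in created_at_values:
        posted = datetime.strptime(value, CREATED_AT_FORMAT).astimezone(timezone.utc)
        # Round down to the start of the hour the Tweet was posted in.
        window_start = posted.replace(minute=0, second=0, microsecond=0)
        counts[window_start] += 1
    # Return the windows in chronological order, ready for plotting.
    return dict(sorted(counts.items()))
```

Hours in which nothing was collected do not appear in the result, so a plotting step would need to fill those gaps with zeros.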

Party websites

For party websites, we have started with a basic crawl following links from the parties' homepages, concentrating on parties that are represented at Westminster, Holyrood, and the Welsh and Northern Irish assemblies. So far this has collected a total of 45,000 web pages and 1,500 other potentially useful documents, such as PDFs (just over a gigabyte of data in total). We use the Heritrix web crawler, as employed by the Internet Archive, which will be able to re-crawl these websites on a regular basis and report which pages have changed since the last scan.
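One common way to tell which pages have changed between two crawls is to compare a fingerprint of each page's content. The Python sketch below illustrates that general idea only; it does not describe how Heritrix itself detects changes, and the names `fingerprint` and `changed_pages` are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest that identifies a page's exact content."""
    return hashlib.sha256(content).hexdigest()


def changed_pages(previous: dict, current: dict) -> dict:
    """Compare a new crawl against fingerprints saved from the last one.

    previous maps URL -> fingerprint from the earlier crawl.
    current maps URL -> raw page bytes from the new crawl.
    """
    report = {"new": [], "changed": [], "unchanged": []}
    for url, content in current.items():
        digest = fingerprint(content)
        if url not in previous:
            report["new"].append(url)
        elif previous[url] != digest:
            report["changed"].append(url)
        else:
            report["unchanged"].append(url)
    return report
```

An exact hash treats any difference as a change, including a rotating advert or a timestamp in the footer, so a real pipeline would probably strip that kind of noise before fingerprinting.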

The initial crawl took about 33 hours, which sounds like a long time but was mostly spent waiting: Heritrix is careful to be polite and not to send too many requests to the same web server in too short a period. Simply grabbing everything as fast as you can is likely to result in the server blocking your crawler completely, which would be somewhat counter-productive when we want to be able to check for updates at regular intervals...
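The politeness idea can be sketched as a scheduler that spaces out requests to the same host while leaving other hosts unaffected. This Python example is a simplified illustration, not Heritrix's implementation or configuration; the class name `PoliteScheduler` and the five-second default delay are assumptions chosen for the example.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Decide when each URL may be fetched so that no host is hit too often."""

    def __init__(self, delay_seconds: float = 5.0):
        # Minimum gap between two requests to the same host (assumed value).
        self.delay_seconds = delay_seconds
        # Earliest time at which each host may next be contacted.
        self._next_allowed = {}

    def schedule(self, url: str, now: float) -> float:
        """Return the time (in seconds) at which this URL may be requested."""
        host = urlsplit(url).netloc
        send_at = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = send_at + self.delay_seconds
        return send_at
```

With a five-second delay, two URLs on the same host queued at time 0 are scheduled for 0 and 5 seconds, while a URL on a different host can still go out at 0. Multiplied across tens of thousands of pages, gaps like these are why a crawl spends most of its time waiting.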

Next steps

We have developed a prototype of the Political Futures Tracker and will be analysing Twitter and website data in the coming weeks. Alongside this, we are honing topic and sentiment detection, and building the groundbreaking future thinking function.

Stay tuned for more information and regular blogs in the run-up to the 2015 General Election.

Author

George Windsor

Senior Policy Researcher

George was a Senior Policy Researcher in the Creative and Digital Economy team.

Ian Roberts

Ian is a Research Associate in the Natural Language Processing Group at the University of Sheffield.

Dr Diana Maynard

Dr Diana Maynard is a Research Fellow at the University of Sheffield, and holds a PhD in Automatic Term Recognition.

Dr Mark A Greenwood

Dr Greenwood is a Research Associate in the Natural Language Processing Group at the University of Sheffield.

Dr Kalina Bontcheva

Kalina Bontcheva is a senior research scientist at the University of Sheffield and the holder of an EPSRC career acceleration fellowship, working on text mining and summarisation of so…