Bringing arXiv data to life

In the past two decades, it has become the norm for entirely new industries to arise in just a few years. Traditional means of monitoring industrial and academic activity are relatively slow, which leads to lagging policy decisions and, in turn, means that the full benefits of these industries will not be distributed evenly. In response, we are developing 'arXlive', an open-source web application underpinned by Nesta's data analysis and production system, to monitor innovation trends in publications data in real time.

Just a few years ago, countries around the world began clamouring to stake their claims amidst an artificial intelligence (AI) gold rush, with national strategies for investment and development flying in thick and fast. In economist-speak, AI is now commonly described as a “general purpose technology” (GPT) which, like other GPTs such as the transistor or the combustion engine, is able to dramatically transform countless industries in the global economy. This kind of system-wide shock could, in principle, lead to a level playing field, where relative outsiders leapfrog strong economies to become market leaders in various industries. In practice, however, this is only likely to happen if an economy is already equipped with the talent and infrastructure to compete.

For all economies to truly be on an equal footing, they would need some insider information. If a region or country were able to identify emerging industries or technologies in real time, it could proactively equip itself with the talent and infrastructure needed to compete.

We began trying to understand this ecosystem some time ago, with our work analysing data from arXiv (pronounced ‘archive’), a popular pre-prints website where scientists share their findings before submitting them to journals and conferences. The resulting paper became one of the top 10% of most downloaded papers on SSRN within the last 12 months, and we have presented it to economics of innovation audiences at research institutes such as SPRU and ZEW. The arXlive project is the final stage in the evolution of this work.

arXlive will be an open-source platform for live monitoring of innovation activity in arXiv publications. Underpinning arXlive is a data analysis and production system, which orchestrates a stable pipeline of data collection, enrichment and machine learning. Initially, arXlive will have two main web apps. The first will effectively be a live version of our paper. The second will apply the Rhodonite algorithm, which we have developed at Nesta to identify emerging industries or technologies. By applying this algorithm live to the latest arXiv data, business leaders and policymakers around the world will have access to the insider information required to prepare for the next big tech disruptor.

We’ve set ourselves a soft deadline of September to go live with the first two apps. From there, we are considering several possible extensions, such as:

  • A service for powerful "search engine" exploration of arXiv data (e.g. with intelligent ranking and synonym matching).
  • Automatic identification of key funding bodies or informal collaborations from paper acknowledgements.
  • Paragraph-level topic tagging.

Please do get in touch if you’re interested in the project, or would like to get involved in the future!

Author

Russell Winch

Junior Data Engineer, Innovation Mapping

Russ was a Junior Data Engineer in the Innovation Mapping Team and worked on the development of data products and the implementation of a data production system.

Joel Klinger

Data Engineering Senior Lead, Data Analytics

Joel is Nesta’s Data Engineering Senior Lead.
