Data sources

A full list of data sources, with their strengths and weaknesses, is given below.

Our core dataset comprises 1.8 million papers from arXiv, bioRxiv and medRxiv, three preprint repositories used respectively by researchers in science, technology, engineering and maths (STEM) subjects (including computer science), biological sciences and medical sciences to disseminate their work prior to publication. There are several reasons why we have opted for this data over alternative sources such as CORD-19, a dataset of COVID-19 related research released by Semantic Scholar in partnership with leading data providers:

  • Researchers increasingly use preprint sites to share their results in close to real time, speeding up the dissemination of findings that may be relevant to tackling COVID-19 (Younes et al., 2020). At the same time, lower thresholds to publication might create quality problems such as those we mentioned in the introduction. This makes these datasets a good setting to explore the positives and negatives of the COVID-19 research rush that we have witnessed in recent months.
  • The CORD-19 dataset does not include any data from arXiv, yet as we demonstrated in previous work, this is an important outlet for the dissemination of artificial intelligence (AI) research by leading teams and private research labs (Klinger et al., 2018). We believe that incorporating it into the analysis will improve our coverage of AI research to tackle COVID-19.
  • Working with the whole corpus of AI research in arXiv, bioRxiv and medRxiv allows us to compare the topical, geographical and institutional composition of AI research to tackle COVID-19 with that of the broader AI research domain (as captured by those sources). This helps us measure which parts of AI research and which countries are over- or underrepresented in the fight against COVID-19.

Having collected the core data, we have fuzzy-matched it on paper titles with Microsoft Academic Graph (MAG), a bibliometric database hosted by Microsoft Academic that currently contains 236 million publications (Wang et al., 2020).[1] This allows us to extract the institutional affiliations of a publication’s authors, the fields of study for publications (which Microsoft extracts from their text using natural language processing) and their citations to other papers for a subset of them (we consider the representativeness of this sample in Findings – Knowledge creation and combination).
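As a rough illustration of fuzzy title matching, the sketch below normalises titles and scores candidates with Python's standard-library `difflib.SequenceMatcher`. It is not the algorithm used for this report (see footnote 1 for that); the function names, the 0.9 threshold and the example titles are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
import re


def normalise(title):
    """Lowercase a title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def best_match(title, candidates, threshold=0.9):
    """Return (candidate, score) for the closest title, or None if below threshold.

    The threshold of 0.9 is an illustrative assumption, not a value from the report.
    """
    target = normalise(title)
    scored = [
        (cand, SequenceMatcher(None, target, normalise(cand)).ratio())
        for cand in candidates
    ]
    if not scored:
        return None
    cand, score = max(scored, key=lambda pair: pair[1])
    return (cand, score) if score >= threshold else None


# Hypothetical titles, for illustration only.
mag_titles = [
    "Deep learning for chest X-ray diagnosis",
    "A survey of graph neural networks",
]
print(best_match("Deep Learning for Chest X-Ray Diagnosis.", mag_titles))
print(best_match("Quantum error correction codes", mag_titles))
```

Comparing every preprint against every candidate title would be too slow at the scale of MAG, so a production pipeline would need some form of blocking or indexing first; that step is left out of this sketch.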

We use the Global Research Identifier Database (GRID), an open database with metadata about around 97,000 research institutions, to further enrich the dataset with geographical information, specifically latitude and longitude coordinates for each affiliation that we can then geocode into countries and regions. The GRID data is particularly useful since it provides institute names and aliases (for example, the institute name in foreign languages).

Each institute name from MAG is matched to the comprehensive list from GRID (Klinger et al., 2018). In a number of instances, our approach fails to produce a valid match, either because the original paper did not include information about an institute, because of inconsistencies in the formatting of institute names, or because the institutes that participated in a paper are not present in the GRID database. A higher matching rate in a field suggests that it involves traditional research institutions that are more likely to be captured by GRID, while a lower matching rate points to the participation of new entrants and institutions.
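The matching rate per field described above could be computed along the following lines. This is a minimal sketch that uses exact, case-insensitive lookup against names and aliases rather than the fuzzy matching used in the report; the record layout, the function names, the approximate coordinates and the example affiliations are all hypothetical.

```python
from collections import defaultdict

# Hypothetical GRID-style records: canonical name, aliases and approximate coordinates.
grid_records = [
    {"name": "University of Oxford", "aliases": ["Oxford University"], "lat": 51.75, "lon": -1.25},
    {"name": "Wuhan University", "aliases": ["武汉大学"], "lat": 30.54, "lon": 114.36},
]


def build_lookup(records):
    """Map every canonical name and alias (case-folded) to its record."""
    lookup = {}
    for rec in records:
        for label in [rec["name"], *rec["aliases"]]:
            lookup[label.casefold().strip()] = rec
    return lookup


def match_rate_by_field(affiliations, lookup):
    """Share of affiliations matched per field.

    `affiliations` is an iterable of (field, institute_name) pairs, where the
    name is None when the paper carries no institute information.
    """
    matched = defaultdict(int)
    total = defaultdict(int)
    for field, name in affiliations:
        total[field] += 1
        if name and name.casefold().strip() in lookup:
            matched[field] += 1
    return {field: matched[field] / total[field] for field in total}


lookup = build_lookup(grid_records)
# Hypothetical affiliations, for illustration only.
affiliations = [
    ("Epidemiology", "Oxford University"),
    ("Epidemiology", "武汉大学"),
    ("Machine learning", "University of Oxford"),
    ("Machine learning", "Hypothetical AI Startup Ltd"),
    ("Machine learning", None),
]
rates = match_rate_by_field(affiliations, lookup)
print(rates)
```

In this toy example the unmatched affiliations are a company absent from the lookup and a paper with no institute listed, which correspond to two of the failure modes named above.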

We conclude our summary of the data by noting that the top 10 institutions publishing COVID-19 related papers in our dataset are Harvard University, Huazhong University of Science and Technology (based in Wuhan, ‘ground zero’ for the pandemic), University of Oxford, Wuhan University, Fudan University, the Chinese Academy of Sciences, Stanford University, the National Institutes of Health, Centers for Disease Control and Prevention and Imperial College London. This supports the idea that our data includes research outputs from highly reputable and relevant research institutions we would expect to be making important and timely contributions to tackling COVID-19.

Data sources, and their strengths and weaknesses

Data set: arXiv articles
Source: arXiv
Strengths: A single established repository of preprints for quantitative subjects, which is almost guaranteed to contain seminal works in fields such as AI.
Weaknesses: Data set is not intrinsically linked or enriched, so one must obtain geography, citation information and fields of study, or other enrichments, using other sources.

Data set: bioRxiv/medRxiv articles
Source: MAG
Strengths: Open repository of preprints which has seen exponential growth during the COVID-19 pandemic.
Weaknesses: Relatively small number of articles and the coverage of high-quality research is not known.

Data set: Geography
Source: GRID
Strengths: Institute matching for typical analyses is around 90 per cent.
Weaknesses: Anecdotal evidence that coverage loss is biased against institutions in developing nations.

Data set: Citations
Source: MAG
Strengths: Has a large user base, so their procedure is likely to be ‘battle hardened’ in terms of iterations.
Weaknesses: Citations are not normalised, to account for the field of study for example. Anecdotal evidence that coverage loss is biased against institutions in developing nations.

Data set: Fields of study
Source: MAG
Strengths: Extensive coverage of disciplines.
Weaknesses: Based on a hierarchical topic model. It is unclear what the caveats in their methodology are. Citations are not normalised, to account for the field of study for example. Anecdotal evidence that coverage loss is biased against institutions in developing nations.

Methodology used to identify COVID-19 and AI

We identify COVID-19 publications by following the same approach used by arXiv. We search for the terms ('SARS-CoV-2', 'COVID-19', 'coronavirus') in either the main body or title of an article. If any of them is found, we assume that the paper is related to COVID-19. This approach leads to identifying 5,450 COVID-19 related papers in our data.
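The term search can be sketched in a few lines of Python. The three terms come from the text above; the case-insensitive matching and the function and variable names are assumptions made for illustration.

```python
import re

# The three search terms listed in the text.
COVID_TERMS = ("SARS-CoV-2", "COVID-19", "coronavirus")

# Assumption: matching is case-insensitive; the report does not say either way.
COVID_PATTERN = re.compile(
    "|".join(re.escape(term) for term in COVID_TERMS), re.IGNORECASE
)


def is_covid_paper(title, body):
    """Flag a paper if any term appears in its title or main body."""
    return bool(COVID_PATTERN.search(title) or COVID_PATTERN.search(body))


# Hypothetical papers, for illustration only.
print(is_covid_paper("Forecasting hospital demand", "We model covid-19 admissions."))
print(is_covid_paper("Graph neural networks", "A survey of recent architectures."))
```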

We identify AI publications in bioRxiv, medRxiv and arXiv using a keyword-based approach. We process 1,789,542 documents and find short phrases in them (bigrams and trigrams). Then, we train a word2vec model, a method that finds a numerical representation of words based on their context. We query the trained word2vec with a list of AI terms such as ‘artificial intelligence’ and ‘machine learning’, and pick the most similar words to them. This query-expansion approach produces a list of terms (see annex) that we search for in the preprocessed documents. If any of them is found, we label the document as AI. This way, we identify 82,434 AI papers in the data.
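The query-expansion and labelling steps can be illustrated as follows. This sketch does not train a word2vec model: it assumes term vectors are already available and uses a handful of made-up two-dimensional vectors in their place. The underscore convention for joined phrases, the function names and the `top_n` value are likewise assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_terms(seeds, vectors, top_n=2):
    """Add to each seed term the top_n most similar terms in the vocabulary."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue
        ranked = sorted(
            (term for term in vectors if term != seed),
            key=lambda term: cosine_similarity(vectors[seed], vectors[term]),
            reverse=True,
        )
        expanded.update(ranked[:top_n])
    return expanded


def is_ai_paper(tokens, terms):
    """Label a preprocessed document as AI if it contains any expanded term."""
    return any(token in terms for token in tokens)


# Toy vectors standing in for a trained word2vec model (illustrative values only).
vectors = {
    "machine_learning": [0.9, 0.1],
    "deep_learning": [0.85, 0.2],
    "neural_network": [0.8, 0.3],
    "protein_folding": [0.1, 0.9],
    "clinical_trial": [0.2, 0.8],
}
ai_terms = expand_terms(["machine_learning"], vectors, top_n=2)
print(sorted(ai_terms))
print(is_ai_paper(["we", "train", "a", "neural_network"], ai_terms))
```

Because a single matching term is enough to label a document as AI, the quality of the expanded term list largely determines the precision of the resulting sample.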

Other analytical techniques used

Topic modelling

In broad terms, topic modelling algorithms exploit word co-occurrences in documents to infer distributions of words over topics (conceived as clusters of words that capture a theme in a corpus) and distributions of topics over documents (capturing the range of themes covered in each document).

Here, we analyse publication abstracts with Hierarchical Top-SBM. This algorithm, based on the stochastic block-model used in network analysis, transforms a corpus into a bipartite graph where words are connected based on their co-occurrence in documents, and documents are connected based on the words that co-occur in them (Gerlach et al., 2018). These two graphs are decomposed using community detection to extract communities of words (topics) and communities of documents (clusters). We use this approach to classify documents into topical clusters that are informative about the focus of a publication based on its abstract, and interpret the clusters based on those topics that are salient in them (we describe the findings in Findings – Topical composition).
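To make the graph representation more concrete, the sketch below builds a weighted document-word edge list from tokenised abstracts, which is one common way to encode a corpus as a bipartite graph. It covers only this construction step: the stochastic block-model inference that extracts topics and clusters is not shown. The function names and the toy abstracts are assumptions.

```python
from collections import Counter


def build_bipartite_edges(docs):
    """Return a Counter mapping (doc_id, word) to the word's count in that document.

    Each key is an edge between a document node and a word node; the count
    acts as the edge weight.
    """
    edges = Counter()
    for doc_id, tokens in docs.items():
        for word in tokens:
            edges[(doc_id, word)] += 1
    return edges


def shared_words(edges, doc_a, doc_b):
    """Words linked to both documents, i.e. what connects them in the graph."""
    words_a = {word for (doc, word) in edges if doc == doc_a}
    words_b = {word for (doc, word) in edges if doc == doc_b}
    return words_a & words_b


# Hypothetical tokenised abstracts, for illustration only.
docs = {
    "d1": ["virus", "model", "virus"],
    "d2": ["model", "network"],
}
edges = build_bipartite_edges(docs)
print(edges[("d1", "virus")])
print(shared_words(edges, "d1", "d2"))
```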

Two additional advantages of Top-SBM over other topic-modelling algorithms such as Latent Dirichlet allocation are that it detects the number of topics (and conversely clusters) in a corpus automatically and that it makes more realistic assumptions about the distribution that generates the corpus.

Analysis of distance

In Findings – Quality, we explore the impact of publishing COVID-19 research on the thematic trajectory of researchers. We frame this inquiry in terms of research diversity: assuming that researchers publish mainly in one broad field (for example, computer vision), publishing a paper in epidemiology and COVID-19 would increase their research diversity, as it would be thematically different from the rest of their outputs. We operationalise this idea by focusing on author-level contributions.

We keep the subset of authors that had at least two AI papers and one related to COVID-19 in order to measure the research diversity of their AI publications and how it changed after publishing COVID-19 work. The larger the change, the further out of their previous thematic focus the COVID-19 publications are.

We quantify author-level thematic diversity as follows:

$$\text{diversity} = \frac{\sum_{i=1}^{N} \text{cosine\_distance}(v_i, \text{centroid})}{N}$$

where $v_i$ is the vector of paper $i$, $N$ is the total number of an author's papers and the centroid is the average vector of papers for that author.
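The measure can be sketched in plain Python as the mean cosine distance between each paper vector and the author's centroid. The text does not specify how paper vectors are built, so the two-dimensional vectors below are made up, and recomputing the measure with the COVID-19 paper added is only one possible reading of "how it changed"; the function names are also assumptions.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def centroid(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def diversity(vectors):
    """Mean cosine distance between each paper vector and the author's centroid."""
    centre = centroid(vectors)
    return sum(cosine_distance(v, centre) for v in vectors) / len(vectors)


# Hypothetical author: two identical AI papers, then one thematically distant COVID-19 paper.
ai_papers = [[1.0, 0.0], [1.0, 0.0]]
covid_paper = [0.0, 1.0]

before = diversity(ai_papers)
after = diversity(ai_papers + [covid_paper])
print(before, after, after - before)
```

In this toy case the author's AI papers coincide, so diversity starts at zero and rises once the orthogonal COVID-19 paper is added, which is the direction of change the measure is meant to capture.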

  1. See Klinger et al. (2018) for a summary of the algorithm we use for this matching.


Juan Mateos-Garcia

Director of Data Analytics Practice

Juan Mateos-Garcia is Director of Data Analytics at Nesta.

Joel Klinger

Data Engineering Senior Lead, Data Analytics Practice

Joel is Nesta’s Data Engineering Senior Lead

Konstantinos Stathoulopoulos

Principal Researcher, Innovation Mapping

Konstantinos worked as a Principal Researcher on Nesta's Research Analysis and Policy team.
