We present emerging findings from an analysis of deep learning research trends in arXiv, a popular pre-prints database.
Artificial Intelligence and Machine Learning (AI/ML) are being described as a ‘general purpose technology’ that could transform whole industries and even re-invent the process of invention itself. This excitement is being accompanied by concerns about the disruption that AI/ML is likely to bring, not least the risk of mass unemployment and ethical challenges around algorithmic discrimination and manipulation.
Good evidence about AI/ML research trends can inform policies to steer its development in a way that maximises its public value. The culture of openness in the release of research findings and software tools governing much (if not all) research in the field makes it easier to amass this evidence. We can also use AI/ML tools such as natural language processing to make this process scalable and to some degree automated. AI/ML is after all transforming the process of research as well.
In this blog, we illustrate the opportunities for this work with an analysis of data from arXiv, a popular pre-prints website where scientists share their findings before submitting them to journals and conferences.
After introducing our analysis, we describe the approach we used to collect the data and identify papers in Deep Learning, the state-of-the-art machine learning technique we focus on, and present key findings showing the rapid diffusion of Deep Learning across Computer Science, and its impact on the geography of research in this discipline. We conclude with issues for further research.
ArXiv is a popular pre-prints website commonly used by mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance researchers to share their work. A preliminary exploration of papers presented at the 2017 NIPS conference suggests that the site is extensively used by AI/ML researchers too - more than 50% of the papers presented at the conference are also available from arXiv. All publications from Google DeepMind, one of the leading AI R&D labs in the world, are available from arXiv.
Since papers are, together with software, an important output and form of communication between AI/ML researchers, tracking them in arXiv can give us information which is relevant for our understanding of the evolution and diffusion of the ideas they contain.
How do we identify Deep Learning papers in the data?
Authors submitting their papers to arXiv classify them in different categories. In the case of Computer Science (cs), there are 40 categories ranging from cs.AI (Artificial Intelligence) to cs.SY (Systems and Control). One simple way to identify AI/ML papers would be to focus on categories where we might expect to find many relevant papers, such as cs.AI or stat.ML (Statistics: Machine Learning). Instead of doing this, we have used a topic modelling algorithm which detects clusters of interrelated words in documents, and classifies documents in those topics (the Technical Annex contains more detail on data and methods). This analysis reveals a topic related to Deep Learning (DL), a machine learning technique which has substantially enhanced the performance of ML applications, contributing to a boom of activity, investment and interest in this field.
This way, we are able to measure the diffusion of Deep Learning in different arXiv categories. Also, since the topic modelling algorithm we use identifies groups of (cor)related words inside abstracts, the results are more robust and precise than cruder keyword-based methods (e.g. searching for ‘deep learning’ in abstracts). Having said this, it is important to recognise that the findings we report below exclude papers that use other important and useful AI/ML methods: we are capturing the state-of-the-art of AI/ML research rather than the whole field.
Deep Learning is rapidly gaining importance in the Computer Science Discipline
Out of the 116,730 papers in our data, 10,674 are identified as having some DL content. The figure below shows the number of arXiv papers in Computer Science categories plus the stat.ML category which, according to our model, are not in DL (blue), and those that are (orange).
The figure shows rapid growth in total levels of activity in arXiv, consistent with the increasing importance and interest in IT-related topics, and even faster growth in the number of DL papers, particularly since 2014 or so. According to our analysis, almost 20% of the papers published in ArXiv in 2017 had a substantial DL component.
Deep Learning is being adopted in Computer Science sub-disciplines working with lots of unstructured data
The diffusion of DL is highly uneven across arXiv categories: the top 10 categories by adoption of DL comprise a third of all papers but 94.5% of all DL papers. We display them in the figure below.
The figure shows that much of the DL activity is in categories such as Computer Vision, Sound or Language where this technique has made it possible to work with large and unstructured datasets without the need for slow and costly manual ‘feature engineering’. The categories we identify echo Andrew Ng’s description of the sequence of DL adoption in the AI index: “Deep Learning first transformed speech recognition, then computer vision. Today, NLP and Robotics are also undergoing similar revolutions”.
High Deep Learning adoption areas are becoming more important in Computer Science
We have also considered the rate at which different arXiv categories have adopted DL techniques. In the figure below, we present the share of all papers published in a category that we identify as DL, and the share of arXiv categories in the Computer Science and stat.ML corpus we are analysing. In the bottom panel, we colour the categories with high rates of adoption of DL (those where DL papers constituted more than 10% of all papers published in 2016 and 2017).
In line with our previous point, the chart shows fast DL uptake in categories related to unstructured data (such as Computer Vision, Sound and Language), with an interesting bump in 2012 (which is the year of publication for “ImageNet Classification with Deep Convolutional Neural Networks”, an influential paper by Krizhevsky, Sutskever and Hinton which demonstrated the potential of Deep Learning for image recognition). More recently, we see increased adoption of DL in Robotics, Information Retrieval and Graphics, consistent with the expansion of this general-purpose technique into new domains.
Interestingly, the cs.AI category in ArXiv does not appear amongst our group of ‘high adoption’ fields. A cursory exploration of papers in that category suggests a stronger focus on traditional, probabilistic and symbolic approaches to AI, as well as conceptual papers and papers exploring the social implications of AI rather than the development of new DL methods. This underscores our previous caveat: the trends we report in this note tell us more about the development and adoption of state-of-the-art DL methods in different domains of Computer Science, than about AI/ML trends in general.
The bottom bar-chart in the figure, meanwhile, underscores the growing importance of categories with high levels of DL adoption within our corpus of documents: their share has doubled from less than 20% of all publications in 2007 to more than 40%. This is consistent with the idea that, even after we control for the general increase in IT and data-related research, activity in those areas where cutting-edge AI methods are being intensively applied is growing even faster. Paraphrasing Marc Andreessen, one could perhaps say that ‘if software is eating the world, then AI is (starting to) eat software’.
A new international AI order?
In his interview with AI Index, Andrew Ng pointed out that “since AI changes the foundation of many technology systems – everything ranging from web search, to autonomous driving, to customer service chatbots – it also gives many countries an opportunity to ‘leapfrog’ the incumbents in some application areas”. We use our data to explore whether the arrival of Deep Learning has brought with it changes in the geography of AI/ML research (see the annex for additional information about how we went from arXiv papers to institutions and countries).
To get a handle on this, we consider changes in the share of publications represented by the top 20 countries in terms of overall arXiv activity, focusing on those publications in the top quartile of citations for each year to control for quality, and using 2012 as a cut-off point indicating the ‘arrival of DL’. The figure below shows the results.
The top bar-chart shows changes in shares in ‘low DL’ arXiv categories (those with less than 5% of DL papers in 2016-2017), while the bottom one considers ‘high DL’ categories (i.e. those with more than 10% of DL papers in 2016-2017). We think of the ‘low DL’ activity as a control which allows us to see if high DL adoption areas of research have been particularly disrupted by the arrival of DL.
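To make the two thresholds concrete, here is a minimal sketch of how categories could be sorted into the ‘low DL’ and ‘high DL’ groups. The function name, the ‘intermediate’ label and the counts are made up for illustration; they are not taken from our data or our code.

```python
def classify_categories(dl_counts, total_counts, low=0.05, high=0.10):
    """Label each arXiv category by the DL share of its 2016-2017 papers."""
    labels = {}
    for category, total in total_counts.items():
        if total == 0:
            continue  # no papers in the window, so no share can be computed
        share = dl_counts.get(category, 0) / total
        if share > high:
            labels[category] = "high DL"
        elif share < low:
            labels[category] = "low DL"
        else:
            # Between the two thresholds: in neither group.
            labels[category] = "intermediate"
    return labels


# Made-up counts for illustration only (not figures from our data).
total_counts = {"cs.CV": 1000, "cs.LO": 400, "cs.IR": 500}
dl_counts = {"cs.CV": 450, "cs.LO": 4, "cs.IR": 35}
labels = classify_categories(dl_counts, total_counts)
```

Categories that fall between the two thresholds belong to neither group, which keeps the control and the ‘high DL’ sets clearly separated.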
Our analysis suggests that this has been the case: there is much more variance in changes in shares in the high DL categories than in the low DL ones, consistent with the idea of volatility and some international leapfrogging in those computer science categories where DL is having the biggest impact.
We add patterns to the bars in order to highlight those countries that have experienced changes of more than 30% in their share of activity (upwards or downwards) during the period we are considering. When we do this, we are able to identify some ‘winners’ in DL-intensive domains, such as China (which has more than doubled its share of activity), Canada, Australia, Singapore and Hong Kong. By contrast, countries such as France, Italy, Spain, Israel and India have lost importance, with strong drops in the global share of arXiv papers in high-DL categories that they represent.
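As a rough illustration of the comparison behind the chart, the sketch below computes each country's share of papers before and after the cut-off and flags changes larger than 30%. We assume here that the change is measured relative to the earlier share and that shares are computed among the listed countries only; the function name and the counts are hypothetical.

```python
def share_changes(before, after, threshold=0.30):
    """Relative change in each country's share of papers between two periods.

    Returns {country: (relative_change, exceeds_threshold)}.
    ASSUMPTION: change is relative to the earlier share, not in percentage points.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    results = {}
    for country, count in before.items():
        if count == 0:
            continue  # relative change is undefined without earlier activity
        share_before = count / total_before
        share_after = after.get(country, 0) / total_after
        change = (share_after - share_before) / share_before
        results[country] = (change, abs(change) > threshold)
    return results


# Made-up paper counts for three hypothetical countries.
before = {"A": 50, "B": 30, "C": 20}
after = {"A": 80, "B": 60, "C": 60}
changes = share_changes(before, after)
```

In this toy example country C raises its share from 20% to 30% of papers, a relative increase of about 50%, so it would be highlighted; country A loses about a fifth of its share and would not.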
Our findings show the growing importance of advanced AI/ML methods such as DL in Computer Science research, and particularly in those fields working with large volumes of unstructured data. Our analysis of changes in the geographical distribution of AI research is consistent with the idea of a shift of activity, with China, Canada and Australia gaining influence, and European countries such as France or Spain losing it. The transformation in AI/ML brought about by DL appears to bring with it opportunities for new entrants, and challenges for incumbents, warranting the policy interest in supporting and strengthening AI research we are seeing throughout the world, from the UK (with its ‘Sector deal’ for AI) to China with its National AI strategy or France with the Villani report.
We should highlight that our analysis is exploratory and based on an experimental pre-prints dataset which, while timely and relevant, is at risk of bias (for example if researchers in certain countries are less prone to disseminate their work through arXiv). Although our perception is that arXiv is an important channel for the dissemination of AI/ML research and that much high impact, high quality work by key players in the field is distributed there, we should verify the robustness of our findings by extending the analysis to other data such as research grants, peer reviewed publications and open source software. These are all issues we plan to address in future research.
Going forward, we also plan to study the mechanisms behind changes in the positions of countries, regions and institutions: is this explained by policy interventions, path-dependence (the fact that some countries like Canada were patient in their investments in AI/ML research while others like the US were more volatile), access to complementary resources or co-location with complementary organisations such as tech companies and start-ups?
We are also interested in using researcher information in our data to measure the diversity of the AI/ML research workforce publishing in arXiv in different countries and disciplines, and how this has evolved over time. Lack of diversity in the AI/ML workforce is a significant concern which we would like to evidence in future work.
We believe the findings we have presented, and the opportunities for further analysis building on what we have done already illustrate the potential for turning AI/ML into a subject for analysis using its own methods.
ArXiv is a ‘real-time’ open archive of academic preprints widely used by researchers in quantitative, physical and computational science fields. Data from each of over 1.3 million papers can be accessed programmatically via the arXiv API. As arXiv papers are self-registered, we ensure that papers are not simply ‘junk’ articles by requiring that all papers are matched to a journal publication in the Microsoft Academic Graph (see below). We also have ‘anecdotal’ evidence that the archive contains high quality papers, since a short study of conference proceedings from the prestigious Conference on Neural Information Processing Systems in 2017 reveals that over 55% of these were published on arXiv.
Using arXiv data as the root source of data has a number of advantages compared with accessing data directly from MAG. We can demonstrate this by considering that MAG requires keyword inputs for finding publications. Even if it were possible to select the exact set of keywords which are most predictive of ‘Deep Learning’ papers, it is not clear how one would also generate data for non-‘Deep Learning’ papers within the same field. Furthermore, it would not be trivial to generate control groups for fields in which Deep Learning has not become embedded. The ‘Subject classification’ field in the arXiv data naturally allows for control groups to be generated, once a topic modeling procedure (described below) has been implemented.
From the initial set of over 1.3 million papers, approximately 134,000 have been selected for analysis as they fall under the broad category of ‘Computer Science’ or the specific category of ‘Statistics - Machine Learning’.
Microsoft Academic Graph (MAG) is an open API offering access to 140 million academic papers and documents compiled by Microsoft and available as part of its ‘Cognitive Services’. For the purpose of this paper, MAG helps to ensure that articles retrieved from arXiv have been published in a journal, as well as providing citation counts, publication dates and author affiliations. The matching of the arXiv dataset described above is performed in two steps.
We begin by matching the publication title from arXiv to the MAG database. The database can be queried by paper title, although fuzzy-matching or near-matches are not possible with this service. Furthermore, since paper titles in MAG have been preprocessed, one is required to apply similar preprocessing before querying the MAG database. There is no public formula for achieving this, so we explicitly describe the steps we use to emulate the MAG preprocessing:
Identify any ‘foreign’ characters (for example, Greek or accented letters) as non-symbolic;
Replace all symbolic characters with spaces; and
Ensure no more than one space separates characters.
This procedure leads to a match rate of 90% for the set of arXiv articles used in this paper. We speculate that papers could be missing for several reasons: the titles on arXiv could be significantly different from those on MAG; the preprocessing procedure above may be insufficient for some titles; the arXiv paper may not be published in a journal; and MAG may not otherwise contain the publication. It may be possible to recover some of these papers; however, this is currently not a limiting factor in our analysis.
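The three title preprocessing steps listed above can be sketched in a few lines of Python. This is an illustration under our own assumptions rather than the exact code behind the analysis: the function name is hypothetical, and we rely on Python's Unicode-aware `str.isalnum()` to treat Greek and accented letters as non-symbolic.

```python
def preprocess_title(title):
    """Emulate the title clean-up applied before querying MAG (a sketch)."""
    # Steps 1 and 2: str.isalnum() is Unicode-aware, so letters such as
    # 'ö' or 'α' count as non-symbolic and are kept; every other character
    # (punctuation, hyphens, brackets, whitespace) becomes a space.
    spaced = "".join(ch if ch.isalnum() else " " for ch in title)
    # Step 3: collapse runs of spaces so that at most one space separates
    # characters, and trim the ends.
    return " ".join(spaced.split())


cleaned = preprocess_title("Schrödinger's α-stable   processes: a (short) review!")
```

Letter case is left untouched here because the steps above do not mention it.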
The Global Research Identifier Database (GRID) is used to enrich the dataset with geographical information, specifically a latitude and longitude coordinate for each affiliation. The GRID data is particularly useful since it provides institute names and aliases (for example, the institute name in foreign languages). Each institute name from MAG is matched to the comprehensive list from GRID as follows:
If there is an exact match amongst the institute names or aliases, then extract the coordinates of this match. Assign a ‘score’ of 1 to this match (see step 3 for the definition of ‘score’).
Otherwise, check whether a match has previously been found. If so, extract the coordinates and score of this previous match.
Otherwise, find the GRID institute name with the highest matching score, by combining the scores from various fuzzy-matching algorithms.
The equation we use when combining the fuzzy-matching scores ensures that the effect of a single poor fuzzy-matching score is to vastly reduce the preference for a given match. Therefore, good matches are defined as having multiple good fuzzy-matching scores, as measured according to different algorithms. We opt to use a prepackaged set of fuzzy-matching algorithms implementing the Levenshtein Distance metric, and specifically we use two algorithms applying a token-sort-ratio and a partial-ratio respectively. After this stage of data matching, approximately 140,000 unique institute-publication matches are found for analysis.
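The three matching steps can be sketched as follows. Several things here are assumptions made purely for illustration: the two similarity functions are simplified stand-ins built on Python's standard `difflib` rather than the prepackaged Levenshtein-based library, the combined score is a plain product of the two ratios (one possible formula with the property that a single poor score drags the match down, not necessarily the equation we used), and the names and coordinates are made up.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a, b):
    """Compare the strings after sorting their lower-cased tokens."""
    def sort_tokens(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(sort_tokens(a), sort_tokens(b))


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long
    window of the longer one."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0.0
    starts = range(len(long_) - len(short) + 1)
    return max(ratio(short, long_[i:i + len(short)]) for i in starts)


def combined_score(a, b):
    # ASSUMPTION: a product, so one poor score sharply lowers the result.
    return token_sort_ratio(a, b) * partial_ratio(a, b)


def match_institute(name, grid, cache):
    """grid maps every GRID name or alias to (latitude, longitude)."""
    if name in grid:                 # step 1: exact match, score of 1
        return grid[name], 1.0
    if name in cache:                # step 2: reuse a previous match
        return cache[name]
    best = max(grid, key=lambda g: combined_score(name, g))  # step 3
    cache[name] = (grid[best], combined_score(name, best))
    return cache[name]


# Illustrative entries only; coordinates are rounded placeholders.
grid = {"University of Oxford": (51.75, -1.25),
        "University of Cambridge": (52.21, 0.11)}
cache = {}
coords, score = match_institute("The University of Oxford", grid, cache)
```

Caching previous matches means the relatively expensive fuzzy search runs only once per distinct institute name, which matters when the same affiliation appears on many papers.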
We analyze the abstracts in our corpus using Natural Language Processing in order to identify papers in the ‘Deep Learning’ topic. This involves tokenizing the text of the abstracts and removing common stop-words, very rare words and punctuation. We lemmatize the tokens based on their part-of-speech tag, and we create bi-grams and tri-grams. Documents with fewer than twenty tokens are removed from the sample. After these steps, there are over 168,000 features (‘words’) in the dataset.
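A simplified version of this pipeline, using only the Python standard library, might look like the sketch below. It is illustrative rather than a record of our code: the stop-word list is a tiny placeholder, lemmatization is omitted because it needs an external NLP library, every adjacent bi-gram and tri-gram is kept, and we assume the minimum-length filter is applied to single tokens before n-grams are added.

```python
import re
from collections import Counter

# Tiny placeholder list; a real pipeline would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "we", "with"}


def tokenize(text):
    """Lower-case the text, drop punctuation and remove stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """All runs of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(abstracts, min_doc_freq=2, min_tokens=20):
    docs = [tokenize(a) for a in abstracts]
    # Very rare words: count the number of documents each token appears in.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    docs = [[t for t in doc if doc_freq[t] >= min_doc_freq] for doc in docs]
    # ASSUMPTION: short documents are dropped before n-grams are added.
    docs = [doc for doc in docs if len(doc) >= min_tokens]
    return [doc + ngrams(doc, 2) + ngrams(doc, 3) for doc in docs]


abstracts = [
    "We train deep neural networks for image recognition.",
    "Deep neural networks improve speech recognition.",
    "A logic for program verification.",
]
# Small thresholds so that the toy corpus is not filtered away entirely.
features = preprocess(abstracts, min_doc_freq=2, min_tokens=3)
```

In the toy corpus the third abstract shares no words with the others, so all of its tokens count as rare and the document is dropped, while the first two gain features such as `deep_neural_networks`.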
Since topic selection is inherently dependent on a set of tunable hyper-parameters, we perform our analysis four-fold with different selections of control and treatment samples. We do this by using two unrelated topic modeling algorithms, each with two sets of hyper-parameterizations.
CorEx: This topic modelling algorithm takes an information-theoretic approach to generate n combinations of features in the data which maximally describe correlations in the dataset. Using a one-hot bag-of-words representation, we find an optimal n = 28 topics by tuning n with respect to the ‘total correlation’ variable, as advised by the CorEx authors. The words in each generated topic are sorted by their contribution to total correlation. We assign topics to documents if they are above a minimum threshold.
Latent Dirichlet Allocation: We also use Latent Dirichlet Allocation (LDA), a topic modeling algorithm, to find the mixture of topics in the collection of abstracts and identify those that discuss Neural Networks or Deep Learning methods. We evaluate the performance of LDA by measuring its extrinsic topic coherence; the best performing model produced 100 topics. We manually examine the word distribution of the resulting set and identify two topics related to Neural Networks and Deep Learning. The ten most probable words in these topics are shown in the table below. Finally, we create a binary label that indicates if Neural Networks or Deep Learning were used in a paper. A positive label is assigned to publications with a probability higher than 10% of containing either topic. We select this low probability threshold in order to include a wide spectrum of articles which may discuss Deep Learning or Neural Networks. In further studies, we can look to vary this value in order to assess its impact on our results, although we consider the current threshold to be conservative.
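The labelling rule can be illustrated with a short sketch. We assume here that a paper is labelled positive when its probability for at least one of the two Deep Learning topics exceeds the threshold; the function name, the topic indices and the probabilities are made up for the example.

```python
def label_dl_papers(doc_topics, dl_topics, threshold=0.10):
    """Return one boolean per document: True if any DL topic has a
    probability above the threshold (ASSUMPTION: 'either topic' means any)."""
    return [any(probs[t] > threshold for t in dl_topics) for probs in doc_topics]


# Made-up topic mixtures over four topics; topics 1 and 3 play the role of
# the Neural Networks and Deep Learning topics.
doc_topics = [
    [0.70, 0.05, 0.20, 0.05],  # neither DL topic above 10%
    [0.30, 0.15, 0.50, 0.05],  # topic 1 above 10%
    [0.05, 0.02, 0.03, 0.90],  # topic 3 above 10%
]
is_dl = label_dl_papers(doc_topics, dl_topics=[1, 3])
```

Passing a different `threshold` value to the same function is one simple way to check how sensitive the resulting labels are to this choice.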