Mapping new industries with a machine learning mindset
Finding innovation needles in big data haystacks
New technologies are an elusive object of desire for research and innovation (R&I) policymakers. There are three reasons for this interest:
- Economies of Entry: Those companies and locations that first gain a comparative strength in a new technology can be hard for competitors to dislodge, and therefore end up capturing much of the market.[i]
- Embeddedness: Some of these new technologies have ‘general purpose’ aspects - they can be applied in other sectors to make them more innovative, productive and competitive.
- Emergence Failures: Early in the lifecycle of a technology, there will be uncertainty about its business model and skills needs, which could hamper or even abort its development. Policy can help avoid this situation by, for example, funding testbeds and demonstrators, providing access to finance, tackling skills shortages etc.
But how do we find evidence about the state of a new sector, its situation and challenges? Unfortunately, this is not easy. The frameworks and categories (industrial codes) we use to measure the economy fail to capture new sectors, so official data is not very useful for measuring them. The resulting lack of evidence hinders policymaking at all stages of the policy cycle, making it hard to understand the current situation of a new sector, identify stakeholders to engage with to develop suitable policies, and measure the impact of those new interventions.
The immersive economy, which includes companies developing technology, content and services related to virtual and augmented reality, is an excellent example of a high-potential area with multiple applications (from media and games to education and manufacturing) where the UK is perceived to have a strong comparative advantage, but where evidence about the state of the sector is sorely lacking.
Today, Innovate UK is launching a report we worked on that seeks to start addressing this issue, hopefully providing a better evidence base for the raft of policies being put in place to strengthen immersive in the UK, not least the Industrial Strategy ‘Audiences of the Future’ challenge.[ii]
You can download the report and read more about its findings here. In this blog we wanted to ‘get under the hood’ of the project, and show you some of the machine learning pipelines we developed to tackle the challenge of finding an innovative sector in a big data haystack, and generate relevant information for policymakers.
This is a (machine learning) pipe(line)
Machine learning (ML) is a discipline and ensemble of algorithms used to generate predictions from data. Supervised machine learning bases these predictions on examples (labelled datasets where we train an algorithm to find which characteristics of an observation predict its label), and unsupervised machine learning bases predictions on similarities between observations (that is, it puts them together in groups or ‘clusters’ we may be interested in).
At Nesta’s innovation mapping team, we use machine learning all the time to find innovative sectors in the big data haystack we work with (there is no way we could check every observation to find those we are interested in). As we will see, this requires creating pipelines that connect datasets in interesting ways, and a detective mindset where you think of the data you already have as a source of clues you can follow up in other datasets.
We will go through some stages of these pipelines in turn.
Using web text to find immersive companies
Since we do not have an industrial code for immersive, it is not possible to measure the sector with official statistics. There are trade bodies and industry networks (notably, Immerse UK, who commissioned the research) but we cannot be sure about their coverage. We follow the machine learning mindset and think of those industry networks as labelled datasets of relevant observations: we want to find others like them.
This is where Glass comes in. It is a big data start-up we partnered with for the project, which ‘reads’ data from hundreds of thousands of UK business websites. Glass trained a machine learning model on this data to identify which terms on a business website are highly predictive of whether a company self-identifies as immersive (based on the labelled dataset), and then used the model to predict other organisations likely to be immersive. This supervised approach was enhanced with a ‘keyword search’ for companies that mentioned immersive-related terms on their website, and validated by our friends at Immerse UK.[iii] Ultimately, this analysis gave us a list of around 2,000 individual organisations in the UK, together with metadata from Glass such as their sector and their address (based on postcodes from the website). Other website data, such as the size of the website, the number of inbound and outbound links, and the number of personal profiles and job ads on the site, became useful later when predicting organisation size.
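The logic of this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Glass's actual model: the website texts, the log-odds scoring rule, the keyword list and all function names are made up for the example. It learns which terms are more common on labelled immersive websites, scores unlabelled websites with those terms, and adds any website caught by a keyword search.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case a text and split it into alphabetic tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()

def train_term_weights(labelled_sites):
    """Learn a smoothed log-odds weight per term from (text, is_immersive) pairs."""
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for text, is_immersive in labelled_sites:
        terms = set(tokenize(text))  # count each term once per website
        if is_immersive:
            pos.update(terms)
            n_pos += 1
        else:
            neg.update(terms)
            n_neg += 1
    # Add-one smoothing stops terms seen in only one class getting infinite weight.
    return {
        term: math.log((pos[term] + 1) / (n_pos + 2))
        - math.log((neg[term] + 1) / (n_neg + 2))
        for term in set(pos) | set(neg)
    }

def score(text, weights):
    """Sum the weights of the distinct known terms found on a website."""
    return sum(weights.get(term, 0.0) for term in set(tokenize(text)))

def find_candidates(sites, weights, keywords):
    """Keep sites with a positive model score or at least one keyword hit."""
    return sorted(
        name for name, text in sites.items()
        if score(text, weights) > 0 or set(tokenize(text)) & keywords
    )

# Hypothetical labelled websites (the 'industry network' examples and some others).
labelled = [
    ("We build virtual reality training simulations", True),
    ("Augmented reality apps for museums and headsets", True),
    ("Immersive virtual experiences for headsets", True),
    ("Family bakery selling bread and cakes", False),
    ("Accountancy services for small business", False),
    ("Plumbing and heating repairs", False),
]
# Hypothetical unlabelled websites we want to classify.
unlabelled = {
    "studio_a": "Virtual reality studio making immersive games",
    "cafe_b": "Coffee, cakes and bread baked daily",
    "agency_c": "Marketing agency making holographic displays",
}
KEYWORDS = {"immersive", "holographic"}  # illustrative keyword list

weights = train_term_weights(labelled)
candidates = find_candidates(unlabelled, weights, KEYWORDS)
print(candidates)
```

In this toy data the model picks up the studio because it shares terms with the labelled immersive websites, while the agency uses a term the model never saw and is only caught by the keyword search. That is the reason for combining the two approaches.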
Using a business survey to generate a labelled dataset
Unfortunately, web data only takes you so far when you want to generate policy relevant information. Important financial information about a company (how many people it employs and its turnover, which we need in order to estimate the size of the immersive sector) is generally missing from websites. It is also hard to use web information to estimate a company’s level of involvement in immersive, something important for us given that, for example, many large manufacturers and brands are experimenting with virtual and augmented reality for prototyping and marketing, but this only involves a small fraction of their massive workforces and budgets.
It is also difficult to use web data to measure the perceived drivers of, and barriers to, success for a business, yet that information is relevant for policymakers who want to boost growth drivers and remove barriers. Sometimes, asking people about these things is the best strategy, and this is precisely what we did with MTM London, who set up an online survey targeted at the 2,000 organisations we had identified in the previous stage (another connection in the pipeline).
We surveyed all these organisations, receiving 278 unique responses with rich information about the situation of the immersive economy in the UK which is discussed in detail in the report. But not only that – the survey also worked as a ‘labelled dataset’ that we used to generate predictions about the level of immersive engagement, size and turnover of those organisations that did not respond to the survey (the diagram below presents the logic of this process). The predictors we used for this had to be shared across ‘labelled’ and ‘unlabelled’ datasets, and included website information from Glass, and other company metadata from Companies House (which we also merged with our data — more connections in the pipeline!)
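A minimal sketch of how such a table of shared predictors might be assembled is shown here. The company identifiers, field names (pages, job_ads, years_trading) and values are all hypothetical, and dropping companies that cannot be matched in both sources is an assumption made for the example rather than a description of what we did.

```python
def build_feature_tables(web_meta, registry_meta, survey_labels):
    """Join predictors by company id and split rows into labelled and unlabelled."""
    labelled, unlabelled = [], []
    for company_id, web in web_meta.items():
        registry = registry_meta.get(company_id)
        if registry is None:
            continue  # assumption: keep only companies matched in both sources
        row = {"company_id": company_id, **web, **registry}
        if company_id in survey_labels:
            # Survey respondents carry a label we can train on.
            labelled.append({**row, "specialist": survey_labels[company_id]})
        else:
            # Non-respondents share the same predictors but have no label yet.
            unlabelled.append(row)
    return labelled, unlabelled

# Hypothetical inputs standing in for website, registry and survey data.
web_meta = {
    "c1": {"pages": 40, "job_ads": 3},
    "c2": {"pages": 900, "job_ads": 25},
    "c3": {"pages": 12, "job_ads": 0},
}
registry_meta = {"c1": {"years_trading": 4}, "c2": {"years_trading": 30}}
survey_labels = {"c1": True}  # c1 answered the survey and is a specialist

labelled_rows, unlabelled_rows = build_feature_tables(
    web_meta, registry_meta, survey_labels
)
print(labelled_rows)
print(unlabelled_rows)
```

The point of the split is that both tables have exactly the same predictor columns, so a model trained on the labelled rows can be applied to the unlabelled ones.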
One risk with machine learning is that the model you train learns every little quirk of your training set but does not generalise well outside it (in other words, it ‘overfits’). We sought to avoid that pitfall using cross-validation, a standard approach that splits the data into ‘folds’, trains a model on all folds except one and then predicts the results for the left-out fold, repeating the process so that each fold is left out once. The chosen model is the one that, on average, generalises better from seen data to unseen data, which is what we want.
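To make the idea concrete, here is a small sketch of cross-validation used to choose between two candidate models. Everything in it is an assumption for illustration: the single made-up predictor, the labels (including one deliberately noisy label), the use of a nearest-neighbour classifier and the deterministic way folds are assigned. It does not reproduce the models used in the report.

```python
def k_fold_indices(n, n_folds):
    """Assign observation i to fold i % n_folds (deterministic, no shuffling)."""
    return [[i for i in range(n) if i % n_folds == f] for f in range(n_folds)]

def knn_predict(train_X, train_y, x, k):
    """Predict the majority label (0 or 1) among the k nearest training points."""
    by_distance = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = sum(train_y[i] for i in by_distance[:k])
    return 1 if votes * 2 > k else 0

def cross_val_accuracy(X, y, k, n_folds=4):
    """Average accuracy on each held-out fold when training on the other folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(X), n_folds):
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out)
        fold_scores.append(correct / len(held_out))
    return sum(fold_scores) / len(fold_scores)

# Made-up data: one predictor per organisation and a 0/1 'specialist' label.
# The observation with value 4.0 has a noisy label on purpose.
X = [(1.0,), (20.0,), (2.1,), (21.0,), (3.3,), (22.0,),
     (4.0,), (23.0,), (5.2,), (24.0,), (6.5,), (25.0,)]
y = [0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Candidate models: a 1-nearest-neighbour and a 3-nearest-neighbour classifier.
scores = {k: cross_val_accuracy(X, y, k) for k in (1, 3)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With one neighbour the model copies the noisy label onto nearby held-out points, so its average held-out accuracy is lower than with three neighbours, and cross-validation selects the smoother model.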
We used this strategy to predict whether organisations are ‘immersive specialists’ or not (that is, whether they generate more than 50% of their turnover in immersive), and their turnover and employment size-band. We used all this information to produce the estimates about the size of the sector and its economic scale that we present in the report.
Using Natural Language Processing and clustering to find public grants for immersive
In the project, we also wanted to measure the levels of public funding for immersive R&D in open grants databases such as the Gateway to Research (which contains information about Research Council and Innovate UK research grants), Innovate UK’s transparency dataset, and information from the European Union’s Horizon 2020 programme to support R&D in the EU (CORDIS). This would help us get a sense of the extent to which research funders are already on the case, supporting innovation in the immersive economy.
We combined data from the sources above into a single dataset with 74,000 projects. What we wanted to do next was identify those related to immersive. Since reading all their descriptions was out of the question, we turned to machine learning: enter Clio.
Clio is an information retrieval system that Kostas in the team is working on, and which can be used to query innovation databases. Our assumption here is that when a user queries an innovation dataset, they are interested in a wider set of results, including items related to their initial query. This is a good assumption to make if someone who is not an immersive expert wants to find projects that use terms which experts in the field use but which they are not familiar with. For example, if they are looking for augmented reality projects, we assume they are also interested in virtual reality and mixed reality projects, as well as projects using other terms which are semantically close (i.e. which have a similar meaning). This is a great example of an unsupervised ML problem: we want to find clusters of terms which are similar to the one we started with.
To do this, we train word2vec, a text mining model, on the project descriptions. Word2vec measures similarities between words based on the context in which they appear. Continuing with our example, ‘augmented reality’ will appear close to ‘virtual reality’ and ‘mixed reality’ in the representation of the text that word2vec creates. We use this to identify the 20 most similar words to our initial query. We then reduce the list’s length by removing very common and very rare words.
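The query expansion step can be illustrated with a much simpler stand-in for word2vec: counting which words appear near each other and comparing those counts with cosine similarity. This is not how word2vec works internally and it is not Clio's code. The project descriptions are invented, multi-word terms are assumed to have been joined with underscores beforehand, and the thresholds for ‘very common’ and ‘very rare’ are arbitrary.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(docs, window=2):
    """Represent each word by counts of the words within `window` positions of it."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def expand_query(query, docs, top_n=2, min_df=2, max_df_share=0.8):
    """Take the top_n words most similar to the query, then drop rare and common ones."""
    vectors = cooccurrence_vectors(docs)
    query_vec = vectors.get(query, Counter())
    ranked = sorted(
        (w for w in vectors if w != query),
        key=lambda w: cosine(query_vec, vectors[w]),
        reverse=True,
    )[:top_n]
    # Document frequency: in how many descriptions does each word appear?
    doc_freq = Counter(w for doc in docs for w in set(doc.lower().split()))
    return [w for w in ranked if min_df <= doc_freq[w] <= max_df_share * len(docs)]

# Invented project descriptions, with phrases pre-joined by underscores.
docs = [
    "augmented_reality headset for surgical training",
    "virtual_reality headset for surgical training",
    "mixed_reality headset for pilot training",
    "virtual_reality headset for pilot training",
    "mixed_reality headset for surgical training",
    "soil sampling for crop yield",
    "crop yield and soil chemistry",
]

similar_terms = expand_query("augmented_reality", docs)
print(similar_terms)
```

Because ‘virtual_reality’ and ‘mixed_reality’ appear in the same contexts as ‘augmented_reality’ in these descriptions, they come out as its closest neighbours, which is the behaviour the expansion step relies on.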
We then use the resulting list to do a keyword search in the project descriptions and return those projects that mention at least one of the query terms. In order to rank this list, we assume that the project containing the most query terms is the most relevant one and the rest should be sorted based on their similarity to it. To measure this, we find the vector representation of documents using doc2vec, which measures the similarities between documents along the same lines as word2vec, and sort the results using the KD Tree algorithm. This is another example of unsupervised ML in action (see below for another diagram).
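The search and ranking logic described above can be sketched as follows. Two simplifications are assumed and should be read as stand-ins rather than as Clio's implementation: documents are represented by plain word counts instead of doc2vec vectors, and the results are ordered with an ordinary sort instead of a KD Tree. The project identifiers and descriptions are invented.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a description by its word counts (a stand-in for doc2vec)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search_and_rank(projects, query_terms):
    """Keyword search, then rank hits by similarity to the best-matching project."""
    # Count how many distinct query terms each project description mentions.
    hits = {
        pid: sum(term in text.lower().split() for term in query_terms)
        for pid, text in projects.items()
    }
    matched = [pid for pid, n in hits.items() if n > 0]
    if not matched:
        return []
    # The project mentioning the most query terms is treated as most relevant.
    anchor = max(matched, key=lambda pid: hits[pid])
    anchor_vec = bag_of_words(projects[anchor])
    rest = sorted(
        (pid for pid in matched if pid != anchor),
        key=lambda pid: cosine(anchor_vec, bag_of_words(projects[pid])),
        reverse=True,
    )
    return [anchor] + rest

# Invented projects; the query terms are the kind of list the expansion step returns.
projects = {
    "P1": "virtual_reality and augmented_reality training for surgeons",
    "P2": "augmented_reality training for surgeons in hospitals",
    "P3": "virtual_reality games for museums",
    "P4": "soil chemistry and crop yield",
}
query_terms = ["augmented_reality", "virtual_reality", "mixed_reality"]

ranking = search_and_rank(projects, query_terms)
print(ranking)
```

Here P1 mentions two query terms and becomes the reference project, P2 ranks next because its description overlaps most with P1, and P4 is never returned because it mentions none of the terms. With tens of thousands of projects, a spatial index such as a KD Tree would avoid comparing the reference project against every other document one by one.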
Conclusion: Combining data ready-mades and data custom-mades to inform R&I policy
In Bit by Bit, his excellent introduction to research methods for digitally-enhanced social science, Matthew Salganik distinguishes between data ‘ready-mades’ that researchers encounter ‘in the wild’ and repurpose to address their research questions, and data ‘custom-mades’ that researchers design with the specific goal of addressing their questions. In our Immersive Economy project, business websites would be a data ready-made and the survey of immersive companies we ran, a data custom-made.
Salganik points out that some of the most exciting opportunities for research today involve creative combinations of data ready-mades and custom-mades. A machine learning mindset for prediction, classification and matching is the glue for those connections. We believe that this approach could be applied to many other sectors beyond Immersive, helping us to generate reliable information about emerging industries to inform better R&I policies.
Drop us a line at [email protected] or [email protected] if you want to find out more, and come to the Nesta Sparks on New Frontiers for Innovation Data on the 30th of May to learn about our methods and projects in a bit more detail.
[i] There are many ways in which this can play out. For example, there may be ‘learning by doing’ where the leading company becomes more efficient as it learns how to develop the technology, or ‘network effects’ if it sets up a platform where the larger the installed user base is, the bigger its value for new joiners. Localised knowledge spillovers between companies located close to each other could make it hard for other places to compete.
[ii] The research was commissioned by Immerse UK and the Knowledge Transfer Network (KTN), and the research team also included Glass, a business analytics start-up, and MTM London, a creative industries consultancy.
[iii] These terms were extracted from a set of immersive tech meetups we identified in a previous project mapping the creative industries in the UK.