This article is one in a series that walks through our key algorithms in the Open Jobs Observatory. Previous articles in this series have described how we created algorithms to extract locations from job adverts and detect skills mentioned within the text of the adverts.
All the aggregate data series in the Observatory can be downloaded from GitHub. We aim to update the data on a monthly basis. Unfortunately, we are unable to share the job adverts that we have collected.
This article explains how we built an algorithm that detects the most appropriate occupation group for any given job advert. For example, the algorithm will determine that a job advert for an ‘Economist’ should be placed into the occupational group called ‘Actuaries, economists and statisticians’. The occupation groups come from the Office for National Statistics (ONS) and are known as SOC2020 codes, or simply SOC codes.
Our SOC-assignment algorithm is a key component of the Open Jobs Observatory. The Observatory is an initiative to provide free and granular insights on the skills mentioned in UK job adverts. Breaking down skill demands by occupation groups enables us to supply career advisors with information on the latest skills required for a given occupation group. In the longer term, it will allow us to tell policy makers about the workers in occupations that are undergoing rapid skills-transformation, who may need additional support.
We began collecting job adverts in January 2021, with the permission of job boards. At the time of publication, the Observatory contained several million job adverts. In addition to skills and occupations, we are extracting information on a range of other variables from the adverts, including salaries and locations. You can learn more about the Observatory here.
The SOC2020 classification of occupations has four levels of granularity. The most granular level, called ‘4 digit SOC’, contains 412 different occupation groups and each of these has a unique 4 digit SOC code.
The challenge of assigning an advert to one of these codes is made slightly easier by the ONS’s SOC coding index. The SOC coding index lists the job titles associated with each SOC code. For example, the job title ‘Financier’ is assigned in the index to SOC code 2422, which is called ‘Finance and investment managers’. The index contains almost 30,000 job titles and their corresponding SOC codes.
The most basic approach to assigning SOC codes would involve ‘looking up’ the job title of each advert in the SOC coding index and identifying its corresponding SOC code. However, the SOC coding index is not exhaustive and many of the job titles in adverts are not found in the index. For that reason, our algorithm calculates the similarities between a job title from an advert and those job titles in the SOC coding index, with the aim of finding the most appropriate SOC code.
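The basic ‘look up’ approach can be sketched in a few lines of Python. The index below is a hypothetical two-entry sample (the real SOC coding index has almost 30,000 titles), and the code for ‘economist’ is illustrative only; the limitation is immediately visible when a title is missing from the index:

```python
# Toy sample of the SOC coding index (job title -> SOC code).
# 'financier' -> 2422 comes from the article; the 'economist' entry
# is a hypothetical placeholder for illustration.
soc_index = {
    "financier": "2422",   # Finance and investment managers
    "economist": "2433",   # hypothetical entry for illustration
}

def lookup_soc(job_title):
    """Return the SOC code for an exact title match, or None if absent."""
    return soc_index.get(job_title.strip().lower())

print(lookup_soc("Financier"))         # "2422"
print(lookup_soc("Junior Economist"))  # None - not an exact index entry
```

Because so many advert titles fail the exact match, the remainder of the pipeline works with similarity rather than equality.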
Before calculating the similarities between our job titles and those in the SOC coding index, there is a preprocessing stage. In this stage, we clean our set of job advert titles. This involves removing mentions of locations and contract types from titles (such as ‘London’ and ‘full-time’), expanding acronyms and removing common job advert terms such as ‘experience needed’ and ‘immediate start’.
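The cleaning step might look something like the sketch below. The location, boilerplate and acronym lists here are short toy examples, not the Observatory's actual (much longer) lists:

```python
import re

# Toy term lists for illustration; the Observatory's real lists are far longer.
LOCATIONS = ["london", "manchester"]
BOILERPLATE = ["full-time", "part-time", "experience needed", "immediate start"]
ACRONYMS = {"hgv": "heavy goods vehicle", "pa": "personal assistant"}

def clean_title(title):
    """Lower-case a title, strip locations and boilerplate, expand acronyms."""
    title = title.lower()
    for term in LOCATIONS + BOILERPLATE:
        title = title.replace(term, "")
    words = [ACRONYMS.get(word, word) for word in title.split()]
    # collapse whitespace and trim leftover punctuation
    return re.sub(r"\s+", " ", " ".join(words)).strip(" -,")

print(clean_title("HGV Driver - London, Immediate Start"))
# -> "heavy goods vehicle driver"
```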
The algorithm must be capable of detecting different, but similar-meaning, words that are used within the job titles that we have collected and those in the SOC coding index. To capture these meanings, we use an approach from Machine Learning that is able to create numerical representations (‘embeddings’) of phrases that can be compared mathematically. This is done with a type of neural network model known as a transformer, namely DistilBERT. The transformer is ‘pre-trained’ on large volumes of text to learn the relationships between different words in a particular language. Capturing these relationships as quantitative parameters gives the transformer the ability to subsequently generate numeric representations of any piece of text. We selected the DistilBERT transformer because it is a smaller, lighter and faster model than many other transformers, whilst preserving a high level of performance.
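As a rough illustration, a title embedding could be produced with DistilBERT via the Hugging Face transformers library. Mean-pooling over the token vectors is one common way to obtain a single phrase embedding; the Observatory's exact pooling choice may differ:

```python
# Sketch only: embed a job title with DistilBERT (Hugging Face transformers).
# Mean-pooling token vectors into one 768-dimensional phrase embedding is an
# assumption here, not necessarily the Observatory's exact approach.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(title):
    inputs = tokenizer(title, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_tokens, 768)
    return hidden.mean(dim=1).squeeze(0)            # mean-pool to a 768-d vector

vec = embed("finance and investment manager")
print(vec.shape)  # torch.Size([768])
```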
After representing a job title (from our dataset) using embeddings, the next step is to compare this representation to the embeddings of every job title in the SOC coding index. However, with almost 30,000 job titles in the index and millions of job adverts, doing a pairwise comparison would take too long. To speed up the process, we use an approach called FAISS, which was originally developed by Facebook. This approach stores the embeddings of job titles from the SOC coding index in a compressed format that allows for a very fast similarity search. We then calculate the similarity between the embedding of one job advert title and the compressed embeddings of the 100 most similar job titles in the SOC coding index (using the Manhattan distance). The final output of this stage is, for each of our job adverts, a list of the most similar job titles from the SOC coding index.
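The nearest-neighbour step can be sketched in plain NumPy. The Observatory uses FAISS, which compresses the index embeddings so the search scales to millions of queries, but the underlying distance calculation is the same; the embeddings below are random stand-ins:

```python
import numpy as np

def closest_titles(query_vec, index_vecs, k=100):
    """Return the indices and distances of the k nearest index titles,
    measured by Manhattan (L1) distance."""
    dists = np.abs(index_vecs - query_vec).sum(axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Toy data: random stand-ins for ~30,000 SOC index embeddings of size 768
rng = np.random.default_rng(0)
index_vecs = rng.random((30_000, 768))
query_vec = rng.random(768)   # one advert title's embedding

ids, dists = closest_titles(query_vec, index_vecs)
print(ids.shape, dists.shape)  # (100,) (100,)
```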
Having identified the most similar job titles from the SOC coding index, the next step is to select the most appropriate SOC code for a job advert’s title. To make this selection, we use a decision-tree approach that has three decision points.
The first two decision points are determined by distance thresholds that are set relative to the local semantic space: the distances of the 100 nearest job titles from the index.
The first decision point asks whether the closest job title from the index has a distance (divided by the median distance of the 100 closest job titles) of less than or equal to the first threshold. We set this threshold by examining, for a sample of job adverts, the distance of the closest index title divided by the median distance of the nearest 100 titles (the ‘local space’). This gives a relative measure for each local space, from which we identified a value that we could set with confidence. If the distance of the closest job title divided by the median of the local space is less than the threshold, then we simply select the SOC code that corresponds to that job title and the process of assigning a SOC code is complete.
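The first decision point reduces to a single relative-distance test. The threshold value below is purely illustrative; the article sets it empirically from a sample of adverts:

```python
import numpy as np

THRESHOLD_1 = 0.5  # hypothetical value; set empirically in practice

def first_decision(dists):
    """True if the closest index title is a confident match.
    `dists` holds the Manhattan distances of the nearest index titles."""
    relative = dists.min() / np.median(dists)
    return relative <= THRESHOLD_1

# One title stands out from the local space -> confident match
print(first_decision(np.array([0.9, 2.1, 2.3, 2.4, 2.6])))  # True
# All titles roughly equidistant -> fall through to the next decision point
print(first_decision(np.array([2.0, 2.1, 2.2, 2.3, 2.4])))  # False
```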
If the distance threshold is not met by the closest job title in the SOC coding index, this suggests that its SOC code may not be the best choice for our job title. In these instances, we group the 100 most similar job titles (from the SOC coding index) by their SOC codes and then calculate the median distance for each code. If the median distance of the closest SOC code from the title of the job advert (divided by the median of the local space) is less than or equal to the second threshold that we set, then we directly assign that SOC code to our job title and the process is complete. This threshold is set in the same way as the first, but using the median distance of the closest SOC code.
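The second decision point can be sketched as a group-by over the nearest neighbours. Again, the threshold is a hypothetical value, and the SOC codes and distances in the example are made up:

```python
from collections import defaultdict
from statistics import median

THRESHOLD_2 = 0.9  # hypothetical value; set empirically in practice

def second_decision(neighbours):
    """Assign a SOC code if one group of neighbours is confidently closest.
    `neighbours` is a list of (soc_code, distance) pairs for the nearest
    index titles; returns a SOC code, or None if no group passes."""
    by_code = defaultdict(list)
    for code, dist in neighbours:
        by_code[code].append(dist)
    local_median = median(d for _, d in neighbours)
    best_code, best_median = min(
        ((code, median(ds)) for code, ds in by_code.items()),
        key=lambda pair: pair[1],
    )
    if best_median / local_median <= THRESHOLD_2:
        return best_code
    return None

# Two titles cluster around code 3571, well inside the local space
print(second_decision([("3571", 1.0), ("3571", 1.2),
                       ("2422", 2.0), ("2422", 2.2)]))  # "3571"
```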
The third decision point is when the distance measure thresholds are not met by the distances between the title of the job advert and the job titles in the SOC coding index. In this scenario, the algorithm changes tack and begins to look at the words that make up our job advert’s title. We arrange all the titles in the SOC coding index by the number of words that they contain. Starting with the longest titles, the algorithm examines whether all the words in a title are included in the title of our job advert. This is essentially a substring method. If all the words are found in the title of the job advert, the SOC code corresponding to that title is assigned to the advert. If there is no match after repeating this process for every progressively shorter job title in the SOC coding index, then the job advert is not assigned a SOC code.
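The third decision point is a word-level containment check, scanning index titles from longest to shortest so that the most specific match wins. The title/code pairs below are hypothetical:

```python
# Toy sample of (index title, SOC code) pairs; the codes are hypothetical.
soc_titles = [
    ("recruitment consultant", "3571"),
    ("warehouse operative", "9252"),
    ("driver", "8211"),
]

def word_match(advert_title, index_titles):
    """Assign the SOC code of the longest index title whose words all
    appear (in any order) in the advert title, or None if none match."""
    advert_words = set(advert_title.lower().split())
    # most words first, so the most specific index title is tried first
    for title, code in sorted(index_titles, key=lambda t: -len(t[0].split())):
        if set(title.split()) <= advert_words:
            return code
    return None

print(word_match("Senior Recruitment Consultant (Fixed Term)", soc_titles))
# -> "3571"
```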
The chart below shows the results from applying the algorithm to a sample of 50,000 new adverts that were collected by the Observatory. We found that 27% of the adverts in our sample had job titles that closely matched a specific job title within the SOC coding index. A further 8% were sufficiently similar in meaning to multiple job titles associated with a SOC code that the job advert could be confidently assigned to that code. For example, the job title ‘Recruitment Manager in Financial Services’ is not in the SOC coding index, but it has a similar meaning to several of the job titles in SOC code 3571, which is ‘Human resources and industrial relations officers’.
Finally, 43% of adverts were assigned to a SOC code on the basis that the words within the title matched the words (but not necessarily the order) of a job title within the SOC coding index. Only 22% of the adverts could not be assigned to any code. We hope to reduce this percentage in future iterations of the algorithm.
In addition to opening up our code, we feel it is important to identify the strengths of our algorithm as well as areas for future development.
The design of the algorithm has two important strengths. The first is the use of a pre-trained model to create the sentence embeddings. This speeds up the process since we do not need to train a model from scratch for our task. Another strength of the algorithm is the multi-step approach which reduces the chance of not assigning a SOC code, compared to a single-step method.
We have identified three areas in which we may be able to improve the algorithm. The first is to undertake additional cleaning of the job titles that we have collected. Although the pre-processing stage already removes many non-informative terms from our job titles, such as location names, there is still room to detect and remove additional terms. For example, ‘Amazon’ in ‘Amazon Warehouse Associate’.
A second area for improvement lies in the method used to set the thresholds for similarity. There are two key thresholds in the algorithm and these determine whether a job title is sufficiently similar to those in SOC coding index. We tried different thresholds and, by manual inspection, we found values that balanced the risk of assigning the wrong code with the risk of missing a suitable code. With additional time, we would like to explore alternative approaches to setting these thresholds such as dynamic thresholding.
The final area for improvement is revisiting the method used to create sentence embeddings. One approach would be to create embeddings within embeddings. This would entail creating additional embeddings for longer job titles, consisting of unigrams or bigrams that were present in the title.
This article has described the first version of an algorithm that assigns titles from job adverts to occupation groups. The code for the algorithm is available here. As far as we are aware, this is the first open code for assigning SOC codes.
The Observatory is a pilot project and we welcome your feedback and suggestions for future improvements. We are also seeking funding to keep the Observatory running. If you have suggestions or are interested in supporting the work of the Observatory, please reach out to us by emailing [email protected].