Open Jobs Observatory: Extracting skills from online job adverts

This article is one in a series that walks through our key algorithms in the Open Jobs Observatory. Previous articles in this series have described how we created algorithms to extract locations from job adverts and assign adverts to occupation groups.

Learn more about the Observatory and see the latest insights

All the aggregate data series in the Observatory can be downloaded from GitHub. We aim to update the data on a monthly basis. Unfortunately, we are unable to share the job adverts that we have collected.

Introduction

The Open Jobs Observatory was created by Nesta, in partnership with the Department for Education. The aim of the Observatory is to provide insights from online job adverts about the demand for occupations and skills in the UK. We are collecting the adverts with the permission of job sites, and to date we have collected several million job postings. To extract and analyse the rich information contained in the job advert text, we have also developed a host of algorithms. Previous articles in this series have described how we extracted locations from job adverts and assigned adverts to occupation groups. This article explains the algorithm that was developed to extract skills from the descriptions of job adverts.

The importance of a skills detection algorithm

Detecting skills and competencies mentioned within a job advert is an essential component of the Open Jobs Observatory project. Developing an automatic approach for this task is critical, as we are already collecting tens of thousands of new job postings every week. Extracting the skills that are mentioned in job adverts allows us to build an understanding of the skills demanded in the UK and, in turn, support the work of policy makers, career advisors, training providers, employers and recruiters.

For local and national policy makers, the skills demanded within job adverts can inform regional skill mismatches, thus helping to address the misalignment between the supply and demand of skills and improve labour productivity and economic growth. For career advisors, information about the skills required by local employers can help to provide more tailored and timely reskilling and upskilling recommendations. Similarly, trainers and educators can use up-to-date insights on emergent and redundant skills to inform their course curricula and tailor them to the local labour market. Finally, by characterising and comparing the skills of different occupations and job titles, we can help employers and recruiters to identify candidates from other industries who might have the right skills for a given job.

Researchers and institutions commonly purchase skills intelligence from private providers, who collect and analyse the underlying data using their own proprietary algorithms. As a result, it can be difficult to discern how the skills were extracted and, due to restrictions on sharing, it is not possible to make comparisons between providers. Moreover, keeping the skills intelligence up to date inevitably incurs further costs, which encourages one-off pieces of analysis that are insightful but immediately out of date. There is a real need to create a reliable set of ongoing indicators for skill demands.

Another motivation for this work is to provide what we believe to be the first open algorithm for extracting skills that is being applied continuously and at scale. By publishing our code, and explaining our methodology in this article, we hope that other researchers who have their own datasets of job adverts will be able to use (and build on) our algorithm.

The challenges in extracting skills

The essential task is to detect all those words and phrases, within the description of a job posting, that relate to the skills, abilities and knowledge required by a candidate. The thousands of detected skills and competencies also need to be grouped in a coherent way, so as to make the skill insights tractable for users. These challenges are made harder by the fact that there is no definitive framework of UK skills, and no standard way of grouping skills into categories.

We see two general directions for approaching the challenge of skills detection: ‘top down’ and ‘bottom up’. For a ‘top down’ approach, one can take an established list of known skills, and detect the presence of these skills in job postings. While the UK does not have an official list of skills, there are other national and international frameworks that can be used for this purpose, such as O*NET, which was developed by the U.S. Department of Labor, and the multilingual ESCO, which was created by the European Commission. However, this approach is not as simple as looking up words or phrases from a list and finding them within the job advert text. The same skill can be expressed in a myriad of different ways and this variability must be taken into account. For example, the skill of “creative thinking” could also be expressed as “to think creatively”, or perhaps even “innovative thinking”. We will show further below how we tackled this problem.

The alternative ‘bottom up’ approach starts with the job advert text and either automatically or manually identifies words and phrases that appear to be related to skills and competencies. The extracted phrases should then be interpreted and related to some known skills or concepts (for example, to Wikipedia articles). Alternatively, new skills would need to be defined, based on extracted phrases, and the definitions would need to be reviewed by industry experts.

Both the ‘top down’ and ‘bottom up’ approaches have their own strengths and can complement each other. The ‘top down’ approach, which uses a skills framework assembled by experts, allows us to detect skills that are only mentioned rarely within online job adverts, such as skills related to agriculture. This approach can therefore shed light on the potential biases in online job adverts. Conversely, the limitation of these expert-derived frameworks is that they may not be comprehensive and can quickly fall out of date as new skills are constantly emerging. The ‘bottom up’ approach can help us to identify these missing skills from the frameworks.

Example of an ESCO skill entity with the preferred label ‘identify skills gaps’.

Our approach to extracting skills

We decided to pursue a ‘top down’ approach and used the European Commission’s ESCO as our reference framework of skills and competences. ESCO represents a major public effort to systematise occupational information across Europe, mapping relationships between 2,942 occupations and 13,485 unique occupational attributes. These attributes are all part of the ESCO skills pillar and they encompass knowledge, skills, attitudes and values. We used all of these attributes in our analysis, but for simplicity we refer to all of them as ‘skill entities’. Each skill entity consists of one preferred and several alternative labels, and a description that usually is one or several sentences long. A skill entity might also have links to other, broader or narrower skills, as well as a set of occupations where this skill is expected to be either an essential or optional requirement.

To detect a skill entity within an online job posting, we used the available textual information on the skill’s preferred and alternative labels, as well as its description. We used natural language processing techniques to generate from these texts a set of so-called ‘surface forms’ - simpler words and phrases that are intended to represent the underlying skills entity - and searched for them in the job postings. For example, for the skill to ‘handle helpdesk problems’, we generated simpler surface forms ‘helpdesk’ and ‘helpdesk problem’.
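The lookup step described above can be illustrated with a minimal sketch. It assumes that the job advert has already been preprocessed into a list of tokens, and that surface forms are stored in a simple dictionary that maps each form to its skill entity. The function name `detect_skills`, the `max_len` parameter and the choice to prefer the longest matching phrase are illustrative assumptions, not a description of the Observatory's code.

```python
# Hypothetical lookup table: surface form -> ESCO skill entity label.
SURFACE_FORMS = {
    "helpdesk": "handle helpdesk problems",
    "helpdesk problem": "handle helpdesk problems",
}


def detect_skills(tokens, surface_forms, max_len=3):
    """Return (surface form, skill) pairs found in a list of preprocessed tokens."""
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first (an assumption of this sketch),
        # so that 'helpdesk problem' wins over the shorter 'helpdesk'.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in surface_forms:
                match = (phrase, surface_forms[phrase])
                i += n  # skip past the matched phrase
                break
        if match:
            found.append(match)
        else:
            i += 1
    return found


advert_tokens = "resolve helpdesk problem for customer".split()
matches = detect_skills(advert_tokens, SURFACE_FORMS)
```

Here both surface forms point to the same skill entity, so either match would lead to the same detected skill. The longest-match rule matters mainly when overlapping forms belong to different entities.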

Conceptually, surface forms and their corresponding skill entities are related to the idea of surface and deep structures from the field of linguistics, and popularised by Noam Chomsky. The same conceptual framework has also been used by researchers developing commercial skills detection algorithms.

Summary of our approach to build the algorithm for detecting skills in job adverts.

Generating surface forms

For each skill entity, we generated two main types of surface forms. The simplest type was generated by cleaning (or ‘preprocessing’) the preferred and alternative labels of the skill entity, which involved removing punctuation, converting all text to lowercase, and lemmatising. Lemmatisation is the simplification of inflected word forms by converting them to their canonical, dictionary forms. The second type of surface form was generated by extracting linguistic features, called noun phrases, from each skill label and description (using the spaCy library). This allowed us to break up skill labels and descriptions into coherent chunks. For example, we extracted the noun phrase ‘vaccination procedure’ from the skill label ‘assist with vaccination procedures’.
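As a rough illustration of the first, simpler type of surface form, the sketch below applies the three cleaning steps to a skill label using only the Python standard library. The small `TOY_LEMMAS` dictionary is a hypothetical stand-in for a real lemmatiser (such as the one provided by spaCy), and the function name is our own. Noun phrase extraction is not shown here.

```python
import string

# Toy lemma lookup, standing in for a real lemmatiser (an assumption of this sketch).
TOY_LEMMAS = {
    "procedures": "procedure",
    "problems": "problem",
    "patients": "patient",
}


def preprocess_label(label, lemmas=TOY_LEMMAS):
    """Remove punctuation, lowercase and lemmatise a skill label."""
    no_punct = label.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    # Replace each token with its lemma where the toy lookup knows one.
    return " ".join(lemmas.get(token, token) for token in tokens)


surface_form = preprocess_label("Assist with vaccination procedures.")
```

In practice a dictionary lookup would be far too limited, which is why a full lemmatiser is needed. The sketch is only meant to show the order of the steps.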

The extraction of noun phrases was particularly useful in the case of the longer skill description texts, which contain additional, rich information about the skill and the context of its use. For example, we could link the term ‘chemotherapy’ to the skill to ‘treat acute oncology patients’.

In some instances, the same surface form was generated from two or more skill entities. To ensure that each surface form was linked to only one skill entity, we identified the most appropriate skill for each duplicated surface form and removed the lower priority duplicates. We selected the skill entity for which the surface form had been derived from its labels rather than its descriptions. We also prioritised those entities where the form had been derived using the simpler preprocessing approach, rather than noun phrase extraction. This methodology yielded approximately 130,000 surface forms with, on average, 10 surface forms per skill entity. In this way, we have created a rich initial set of terms for each ESCO skill; however, these still need to be assessed before they can be used to detect skills in job adverts.
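The two prioritisation rules can be expressed as a simple ranking, sketched below. The record layout (`form`, `skill`, `source`, `method`) and the decision to apply the label-versus-description rule before the preprocessing-versus-noun-phrase rule are assumptions made for illustration. The article does not specify how the two rules are combined.

```python
# Lower rank means higher priority.
SOURCE_RANK = {"label": 0, "description": 1}
METHOD_RANK = {"preprocessing": 0, "noun_phrase": 1}


def resolve_duplicates(candidates):
    """Keep one skill per surface form, chosen by (source, method) priority.

    candidates: list of dicts with keys 'form', 'skill', 'source', 'method'.
    Returns a dict mapping each surface form to a single skill.
    """
    best = {}
    for cand in candidates:
        key = (SOURCE_RANK[cand["source"]], METHOD_RANK[cand["method"]])
        if cand["form"] not in best or key < best[cand["form"]][0]:
            best[cand["form"]] = (key, cand["skill"])
    return {form: skill for form, (_, skill) in best.items()}


# Hypothetical candidates: the same form generated from three different skills.
candidates = [
    {"form": "x", "skill": "skill A", "source": "description", "method": "noun_phrase"},
    {"form": "x", "skill": "skill B", "source": "label", "method": "noun_phrase"},
    {"form": "x", "skill": "skill C", "source": "label", "method": "preprocessing"},
]
resolved = resolve_duplicates(candidates)
```

When two candidates have the same priority, this sketch keeps whichever was seen first. A real pipeline would need an explicit tie-breaking rule.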

Evaluating surface form quality

Developing a method to evaluate the quality of the surface forms was a critical component of the methodology. Initial inspection had revealed instances of forms that were too vague or not representative enough of their skill entities. For example, we found the surface form ‘dignity’ to be generated from the skill description for ‘conducting physiotherapy assessment’. While the term might be somewhat related to this skill, it is too general considering the specific definition of the skill. Manually reviewing every surface form would have been prohibitively time consuming. Instead, we derived a set of experimental quantitative indicators to assess the match between each surface form and its skill entity.

As the first automated step, we assessed how specific each surface form derived from noun phrases was to its skill entity. For this purpose we used a measure based on ‘term frequency-inverse document frequency’ (tf-idf) statistics. If, based on this metric, the surface form was found to be more specific to another skill entity than to its own, we deemed it to be ambiguous and removed the form from the detection algorithm.
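To make the idea concrete, the sketch below treats the text of each skill entity as one ‘document’ and computes a textbook tf-idf score for a single-word surface form in each of them. The exact measure used in the Observatory is not specified here, so this is only one plausible reading. The toy documents and skill names are hypothetical. Note that, for a given term, the idf factor is shared across all entities, so in this simplified version the comparison is driven by the relative term frequency.

```python
import math


def tf_idf(term, doc_tokens, all_docs):
    """Textbook tf-idf of a single-token term in one document."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    n_containing = sum(1 for doc in all_docs if term in doc)
    idf = math.log(len(all_docs) / n_containing) if n_containing else 0.0
    return tf * idf


def is_ambiguous(term, own_skill, skill_docs):
    """True if the term scores higher for some other skill than for its own."""
    docs = list(skill_docs.values())
    scores = {skill: tf_idf(term, tokens, docs) for skill, tokens in skill_docs.items()}
    return any(
        score > scores[own_skill]
        for skill, score in scores.items()
        if skill != own_skill
    )


# Hypothetical, tokenised skill texts (labels plus descriptions).
skill_docs = {
    "skill_a": "conduct assessment with respect for patient dignity and comfort".split(),
    "skill_b": "uphold dignity and protect dignity of service users".split(),
    "skill_c": "operate forklift".split(),
}
dignity_is_ambiguous_for_a = is_ambiguous("dignity", "skill_a", skill_docs)
```

In this toy example, ‘dignity’ makes up a larger share of the text of `skill_b` than of `skill_a`, so a form generated from `skill_a` would be flagged as ambiguous.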

We then manually reviewed the suitability of a subset of the surface forms, and used machine learning to predict the quality of all other surface form and skill entity pairs. To do this, we initially detected the exact matches of surface forms in the preprocessed job postings. By prioritising the more frequently occurring surface forms, we manually reviewed approximately 1,500 surface forms. These were either accepted, discarded or reassigned to a more appropriate skill entity. Among the surface forms that were discarded there were a number of commonly occurring boilerplate words and phrases. For example, the terms ‘application process’ or ‘job opportunity’ can relate to the competencies of the career guidance advisor occupation, but they are also commonly used in adverts for many other jobs (e.g. “...the first stage of the application process is to apply online”). There were also words or phrases related to the benefits offered by the employer rather than the job position. For example, ‘life insurance’ could relate to a job position in the insurance industry, or to an employee perk.

A supervised machine learning model was then trained to use a range of indicators comparing the surface forms and skills entities, and predict our manual assessment of whether a surface form ought to be discarded or not. These indicators were based both on the previously mentioned tf-idf statistics and on the more recent natural language processing approach of contextual sentence embeddings. The resulting prediction model reached almost 90% accuracy on the training data and about 75% accuracy when tested on new data. The model was then used to automatically reject about 3,600 surface forms that fell below an acceptance threshold, still leaving us with more than 100,000 surface forms.

Developing skills categories

After filtering out the lower quality surface forms, the skills detection algorithm was ready to be reapplied to the job posting data that we have so far collected. To aid the interpretation of the results, we aggregated the surface forms into coherent categories by using an unsupervised machine learning approach called clustering. Specifically, we built new numerical, vector representations for each surface form detected in our job posting data, by using a variant of the widely used word2vec approach (namely, the skill2vec approach). These vectors were then clustered into categories using a hierarchical, consensus community detection approach that we had developed in an earlier research study. In this way, surface forms that are frequently mentioned in the same job postings are also more likely to reside in the same cluster.

Prior to clustering, we also identified and set apart a set of skills that we deemed to be transversal, as previous work on skills clustering has shown that they might distort the results. This included both using the ESCO guidance on which skills are transversal, as well as employing a combination of measures from network science to identify core skills that are interconnected with various and different other skills. In addition, while reviewing results after clustering, we identified a further set of transversal skills that were too general for their respective skills clusters. Among the final set of more than 400 transversal competencies were spoken languages, fundamental digital skills such as using a computer or a handheld device, interpersonal and intrapersonal skills such as teamwork, communication and creative thinking, as well as general workplace functions like working in shifts and managing quality.

Ultimately, after the automated clustering procedure and some manual adjustments, we arrived at a skills entity hierarchy of three levels with 8, 15 and 41 categories. Whilst it would be possible to obtain even more granular groupings, we leave further skills taxonomy development for future work.

The full skills taxonomy is available on GitHub.

Data-driven skills categories that were inferred by using clustering analysis. The colour indicates Level 1 (highest) skills categories, whereas the size of each box indicates the number of unique surface forms associated with each skills category.

Note that we clustered surface forms rather than skill entities, to provide another handle for evaluating the agreement between surface forms and their entities. For most of the detected skill entities (85%) there was no disagreement about the cluster membership of their surface forms - all surface forms were grouped together by the clustering algorithm. For the ambiguous cases, the final cluster membership of a skill entity and its surface forms was decided based on a ‘majority vote’ of their surface forms, weighted by the frequency of their occurrences in the job posting data.
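The frequency-weighted ‘majority vote’ can be sketched in a few lines. The function name, the input layout and the example counts are hypothetical. The only logic taken from the description above is that each surface form votes for its own cluster with a weight equal to how often it was detected.

```python
from collections import defaultdict


def assign_cluster(form_clusters, form_counts):
    """Choose a cluster for a skill entity by a frequency-weighted vote.

    form_clusters: dict mapping each surface form to the cluster it landed in.
    form_counts: dict mapping each surface form to its detection frequency.
    """
    votes = defaultdict(float)
    for form, cluster in form_clusters.items():
        votes[cluster] += form_counts.get(form, 0)
    # Ties are not handled specially: the first cluster to reach the top score wins.
    return max(votes, key=votes.get)


# Hypothetical example: two forms in 'Cluster X', one frequent form in 'Cluster Y'.
form_clusters = {"form a": "Cluster X", "form b": "Cluster X", "form c": "Cluster Y"}
form_counts = {"form a": 10, "form b": 5, "form c": 40}
winner = assign_cluster(form_clusters, form_counts)
```

In this example, an unweighted vote would favour ‘Cluster X’ (two forms against one), but the weighting hands the decision to the single, much more frequently detected form.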

Sometimes the ambiguous skill entities turned out to be transversal. For example, while the surface forms ‘mathematics’, ‘maths’ and ‘numeracy’ are all provided as labels for the ESCO skill entity mathematics, the terms ‘maths’ and ‘numeracy’ were initially clustered in the Education cluster, whereas ‘mathematics’ was more strongly associated with Information & Communication Technologies. As another example, the terms ‘brainstorming’ and ‘think creatively’ were initially associated with the ‘Multimedia & Product Design’ cluster, whereas ‘innovative thinking’ was clustered in ‘Business & Project Management’. In cases like these we manually reassigned the skill to the ‘Transversal’ category. More generally, such clustering disagreements appear to provide an interesting lens on the use of language in job postings.

Visualisation of the numerical representations of surface forms that were the basis of developing the skills categories. Each circle corresponds to a surface form, with colours indicating the highest level skills categories, whereas circle size is proportional to the surface form detection frequency in our job advert data. Text labels highlight selected narrower skills categories and transversal skills. Note that the transversal category skills tend to be spread among other category skills groups.

Preliminary results

A preliminary analysis of job adverts scraped between January and mid-May 2021 indicates that the skills with the highest demand predictably reside in the Transversal category. Almost one third of all detected mentions of skills fall within this category, with the most frequent skills being communication (mentioned in 30% of all analysed job adverts), performing services in a flexible manner, planning, working in a team, and paying attention to detail.

The next most detected skills categories are ‘Business Administration, Finance and Law’ and ‘Sales and Communication’, followed by a tie between the ‘Information & Communication Technologies’ and ‘Engineering, Construction & Maintenance’ categories.

Notably, the Food, Cleaning & Hospitality category had the smallest share of detected skills within the job adverts we analysed. It is conceivable that these skills might be underrepresented in the data due to the ongoing impact of COVID-19. Another factor to consider is that the skills commonly required, for example, in the hospitality sector are in fact transversal or related to customer services and therefore they might reside in other clusters. It will be interesting to monitor the demand for these skills as the UK economy continues to recover from the shock of COVID-19, and as seasonal trends come into play.

Share of detected mentions of skills in each high level skills category, between January and mid-May 2021.

When breaking down the results by week, the overall demand for each skills category appears to be rather stable, with some minor fluctuations. Interestingly, the uptick in the detection of transversal skills around the eighth week of 2021 might reflect the temporarily high demand for census workers, whose job adverts mentioned requirements such as ‘performing services in a flexible manner’ and ‘working independently’, as well as the Welsh language for roles based in Wales.

Share of detected mentions of skills for narrower skills categories, between January and mid-May 2021. Note the uptick in language skills in the last week of February, reflecting the increased mentions of the Welsh language in census-related job postings.

By using the ESCO framework, we can also shed some light on the skills that we haven’t detected in our job postings data. So far, we have detected about 60% of all the 13,485 skills mentioned in the ESCO framework at least once. The undetected skills are most commonly related to working with machinery and specialised equipment, handling and moving, as well as agriculture, forestry, fisheries and veterinary types of knowledge. There could be a number of reasons for this, one being the bias of online job postings towards certain types of roles. There might also be discrepancies in the use of language in job adverts, compared to the rather standardised and detailed way the ESCO framework describes skills. Finally, there might be shortcomings in our skills detection process.

Ultimately, the most interesting and actionable insights will come from joining up the results from skills detection with the extracted information about locations, occupations and salaries. This will allow us to, for example, understand the regional demand for skills, identify the most demanded skills for different occupations, and assess how different skills are compensated in the labour market. You can find more details about this on the Open Jobs Observatory website.

Areas for future development

We view the described ‘top down’ skills detection method as a baseline approach, and there are several areas of future development that are immediately apparent.

Firstly, we would like to automatically detect sections of text within job adverts that are unrelated to skills requirements, such as boilerplate text, descriptions of the employer and employee benefits. Some of these text sequences might be easy to spot and manually label. A supervised machine learning approach could then be applied to discard these irrelevant portions of the job posting.

A second approach to detecting spurious surface forms would be to compare the semantic similarity (or ‘closeness in meaning’) of the surface forms detected within each advert. If a particular surface form bore little relationship with the other forms extracted from the advert, then it could be considered for removal. In our case, the semantic similarity could be captured by assessing the alignment of the vectors generated by the skill2vec approach.
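A minimal sketch of this idea is shown below, using cosine similarity as the measure of vector alignment. The two-dimensional vectors, the example surface forms and the threshold of 0.2 are toy values chosen purely for illustration. In practice the vectors would come from the skill2vec model and the threshold would need to be tuned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_outliers(vectors, threshold=0.2):
    """Return surface forms that are, on average, dissimilar to the rest.

    vectors: dict mapping each surface form detected in one advert to its embedding.
    """
    flagged = []
    for form, vec in vectors.items():
        sims = [cosine(vec, other) for name, other in vectors.items() if name != form]
        if sims and sum(sims) / len(sims) < threshold:
            flagged.append(form)
    return flagged


# Toy embeddings for the surface forms detected in one hypothetical advert.
advert_vectors = {
    "python": [1.0, 0.1],
    "sql": [0.9, 0.2],
    "life insurance": [0.0, 1.0],
}
suspects = flag_outliers(advert_vectors)
```

Here ‘life insurance’ points in a different direction from the two technical terms, so it is the only form flagged for review. Flagged forms would be candidates for removal rather than being discarded automatically.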

These two methods, together with our manual and automated evaluation processes, are primarily focussed on raising the precision of the algorithm, i.e. lowering the false positive rate. However, another important aspect is minimising the number of undetected skills and qualifications, i.e. the false negatives. For example, the ESCO framework is not yet capturing the full range of software and web development tools such as ‘jQuery’ or ‘Amazon Web Services’.

One approach to reducing the number of missed skills is to implement a complementary bottom-up approach, either by manually identifying the surface forms in a sample of job adverts, or by automatically extracting keyphrases. This adds a further complication of matching these new terms to existing skill entities, or even creating new entities. In the case when new skill entities are required, one approach would be to use Wikipedia articles as a base instead of ESCO and link each surface form to its closest article.

A final area for development is creating a robust method to evaluate the algorithm. This would require establishing a ‘ground truth’ by manually identifying the skills in a sample of job adverts, which is a very labour intensive process. While labelling specific skills might be challenging, the present algorithm and surface forms could make the task slightly easier via a semi-automated approach. Namely, the skills detection could be leveraged to train another, more flexible machine learning algorithm that can analyse previously unseen sequences of text and suggest skills labels to the individuals who are performing the manual labelling.

Further use cases

The primary aim of this work has been to contribute towards building open-source, real-time intelligence about local skills demand. However, the applications of a skills detection algorithm extend beyond this particular use case. Skills can be detected not only in job postings, but also in school curricula, course descriptions and apprenticeship standards. Hence, the same skills detection algorithm can be used to understand the supply of skills, and thus bring us closer to painting a full picture of skill mismatches.

Importantly, by using the same algorithm to assess skills demand and supply, we are taking a step closer to building a unified “language” or taxonomy of skills. This in turn could reduce the friction in communication between workers, employers, educators and learners.

The Observatory is a pilot project and we welcome your feedback and suggestions for future improvements. We are also seeking funding to keep the Observatory running. If you have suggestions or are interested in supporting the work of the Observatory, please reach out to us by emailing [email protected].

Authors

Karlis Kanders

Senior Data Foresight Lead, Discovery Hub

Karlis is a Senior Data Foresight Lead working in Nesta’s Discovery team.

Cath Sleeman

Head of Data Discovery, Data Analytics Practice

Dr Cath Sleeman is the Head of Data Discovery.
