Open Jobs Observatory: Identifying the locations of online job adverts

This article is one in a series that walks through our key algorithms in the Open Jobs Observatory. Previous articles in this series have described how we created algorithms to assign adverts to occupation groups and detect skills mentioned within the text of the adverts.

Learn more about the Observatory and see the latest insights

All the aggregate data series in the Observatory can be downloaded from GitHub. We aim to update the data on a monthly basis. Unfortunately, we are unable to share the job adverts that we have collected.

Introduction

This article describes the method that we use to collect standardised locations from job adverts, which is a key step to providing insights in our Open Jobs Observatory (OJO).

The Open Jobs Observatory is the UK’s first-ever open repository of insights about the skills requested by employers in job adverts. We began collecting job adverts in January 2021, and have already amassed several million job adverts. We created the Observatory to provide free and timely access to information on skill demands. Collecting locations from job adverts is particularly important because it allows us to examine how skill demands vary across the UK, and to then identify the ‘skill-specialities’ of any given region. Having a localised view of skill demands may also enable job seekers, educators and local authorities to tailor activities to suit their local skills landscape.

Through the Observatory we are also aiming to fill a methodological gap, which is the lack of open resources for analysing job adverts. We have published the code that we use to extract insights from job adverts and have written this series of articles that walk through our methodology. We hope that this will enable other users of job adverts to benefit from, and build upon, our efforts.

The challenge of extracting locations

There are a number of difficulties in identifying the location of a job from a job advert:

  • The locations mentioned within job adverts sit at differing levels of granularity. Some of the locations given may relate to a region (such as Greater Manchester), whereas others may relate to a much smaller area (such as Lilleshall, Shropshire).
  • The UK has a number of areas that share the same name. For example, there are multiple Newcastles, Newports and Abbeys. This creates an additional challenge of determining which ‘Newcastle’ is being referred to within a job advert.
  • The typical matching algorithms that we might use to assign a place name to an area can be very slow and computationally expensive. As the Observatory grows to contain millions of job adverts, it is critical that we have an efficient method for assigning locations.

A broader and less tractable challenge is that the location mentioned in the job advert may not correspond to the location of the individual appointed to the role. The COVID-19 pandemic has led to a sharp rise in remote working. Moreover, the location specified in the job advert may refer to the head office of the company or to the location of a recruitment company. We will continue to monitor this challenge as the Observatory grows.

Prototype method

The key component of the methodology is an 'index' of locations, which we use to look up the locations mentioned in job adverts. We have adopted the Index of Place Names from the Office for National Statistics (ONS), as it provides a large number of location names (almost 90,000) and gives a latitude and longitude for each. With this information, we can use Nesta's open source Python package nuts_finder to extract further details for each place in the index. NUTS stands for the Nomenclature of Territorial Units for Statistics, a geographical classification that subdivides the European Union and the UK. Given a latitude and longitude, the nuts_finder package returns the level 1, 2 and 3 NUTS regions for the location.
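To make the idea of the index concrete, here is a minimal sketch of how such a lookup table could be assembled. It is not the Observatory's code: the `Place` record, the `build_index` function and the `RegionLookup` signature are hypothetical names introduced for illustration, and `stub_lookup` is a stand-in that returns fixed regions where a real implementation might wrap a package such as nuts_finder.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple, Tuple


class Place(NamedTuple):
    """One row of the index: a place name, its coordinates and its NUTS regions."""
    name: str
    lat: float
    lon: float
    nuts1: str
    nuts2: str
    nuts3: str


# Hypothetical signature: (lat, lon) -> (NUTS1, NUTS2, NUTS3) region names.
RegionLookup = Callable[[float, float], Tuple[str, str, str]]


def build_index(
    rows: Iterable[Tuple[str, float, float]], lookup: RegionLookup
) -> Dict[str, List[Place]]:
    """Map each lowercased place name to every place that carries that name."""
    index: Dict[str, List[Place]] = {}
    for name, lat, lon in rows:
        nuts1, nuts2, nuts3 = lookup(lat, lon)
        place = Place(name, lat, lon, nuts1, nuts2, nuts3)
        # Names are not unique (there are several Newcastles), so keep a list per key.
        index.setdefault(name.lower(), []).append(place)
    return index


def stub_lookup(lat: float, lon: float) -> Tuple[str, str, str]:
    """Stand-in for a real region lookup; it always returns the same regions."""
    return ("Yorkshire and the Humber", "South Yorkshire", "Sheffield")


# Approximate coordinates, for illustration only.
index = build_index([("Sheffield", 53.38, -1.47)], stub_lookup)
print(index["sheffield"][0].nuts2)
```

Storing a list of places under each name is one way to keep places that share a name visible, rather than silently overwriting one with another.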

The diagram above depicts our approach to extracting a standardised location from a given job advert.

Building the algorithm

Our 'index' can be supplemented with additional sources of location data, as long as these have a latitude and longitude, from which we can extract a location hierarchy. A location hierarchy allows us to group locations by their size at each level of the hierarchy, and link locations between differing levels. For example, a very granular location may fit into a location hierarchy like so:

Hillsborough → Sheffield → South Yorkshire → Yorkshire and the Humber → England
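One simple way to represent a hierarchy like this is a mapping from each place to the next, broader place. The sketch below uses only the example chain given above; the names `CHAIN`, `PARENT` and `ancestors` are hypothetical and are not taken from the Observatory's code.

```python
from typing import Dict, List

# The example chain from the text, ordered from most to least granular.
CHAIN = [
    "Hillsborough",
    "Sheffield",
    "South Yorkshire",
    "Yorkshire and the Humber",
    "England",
]

# Each place points to the next, broader place in the chain.
PARENT: Dict[str, str] = dict(zip(CHAIN, CHAIN[1:]))


def ancestors(place: str, parent: Dict[str, str]) -> List[str]:
    """Return every broader area that contains `place`, nearest first."""
    result = []
    while place in parent:
        place = parent[place]
        result.append(place)
    return result


print(ancestors("Hillsborough", PARENT))
```

Walking up the mapping is what allows an advert tagged with a granular location to be counted at any broader level.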

Before using the index, the locations mentioned within adverts are lightly cleaned. This involves removing punctuation, and making all letters lowercase. We then attempt to match the location to our index, saving the results of the match.
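The cleaning and lookup steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the Observatory's implementation: `clean_location` and `match_location` are hypothetical names, the index is reduced to a plain dictionary, and collapsing repeated whitespace is an extra step that the text does not mention.

```python
import string
from typing import Dict, Optional

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean_location(raw: str) -> str:
    """Remove punctuation, lowercase, and (an assumption) collapse whitespace."""
    no_punct = raw.translate(_PUNCTUATION)
    return " ".join(no_punct.lower().split())


def match_location(raw: str, index: Dict[str, str]) -> Optional[str]:
    """Return the index entry for an exact match on the cleaned name, else None."""
    return index.get(clean_location(raw))


# The index keys are cleaned with the same function as the advert locations.
names = ["Sheffield", "Stoke-on-Trent"]
index = {clean_location(name): name for name in names}

print(match_location("SHEFFIELD.", index))
print(match_location("Parkfields", index))
```

A dictionary lookup takes roughly constant time per advert, which is one reason an exact-match design can stay fast as the number of adverts grows. Because removing punctuation also removes hyphens, the index keys need to pass through the same cleaning function as the advert locations.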

The diagram shows how a job advert location is matched to multiple levels of geography, including NUTS regions, Local Authorities and Health Regions.

An alternative approach to locating job adverts would be to build a set of location names from the job advert locations themselves, and then manually match each of these to a standard location hierarchy. We decided not to pursue this ‘bottom-up’ approach, as there are already many freely available datasets of locations that can be easily matched to standard location hierarchies. Using these datasets allowed us to rapidly incorporate location extraction into the Observatory. The ‘bottom-up’ approach may also have caused the location extraction algorithm to cater only for the job board on which the prototype was built. This could have resulted in a large amount of manual location matching every time a new job board was added to the Observatory.

Strengths and limitations

There are two key strengths in our preferred ‘top-down’ approach:

  • The locations in open datasets match very well to the locations defined in job adverts. As a result, our approach requires minimal preprocessing of the raw locations extracted from the job adverts, reducing computational cost.
  • As we grow the Observatory, the number of job adverts collected will rise. Our current approach can scale to this level without requiring additional time for maintenance.

Alongside these benefits there are also limitations to our method that we will need to carefully monitor as we collect more job adverts:

  • The location matching relies on finding an exact match in our index. Any misspellings in the name of an area would cause the location to be missed and the advert would be excluded from any location-based analysis. For example, an advert for a role in 'Parkfield' will not be matched if the advertised location is misspelled as 'Parkfields'.
  • Our prototype method, outlined above, has been built and tested on adverts from a single job board. As more boards are added, we will need to closely monitor the performance of the algorithm on job adverts from these new sites. Specifically, we will check for significant changes in the number of locations that are matched to the Index and the distribution of job adverts across the UK. Any large divergences may suggest that our index of place names needs to be expanded to work effectively with adverts from the new job boards.
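The match-rate check described in the last bullet could look something like the sketch below. The function names and the five percentage point tolerance are assumptions made for illustration. The sketch covers only the share of locations matched, not the distribution of adverts across the UK.

```python
from typing import Iterable, Set


def match_rate(cleaned_locations: Iterable[str], index_names: Set[str]) -> float:
    """Share of advert locations that have an exact match in the index."""
    locations = list(cleaned_locations)
    if not locations:
        return 0.0
    matched = sum(1 for loc in locations if loc in index_names)
    return matched / len(locations)


def needs_review(baseline_rate: float, new_rate: float, tolerance: float = 0.05) -> bool:
    """Flag a job board whose match rate diverges from the baseline.

    The default tolerance is an arbitrary, illustrative choice.
    """
    return abs(new_rate - baseline_rate) > tolerance


index_names = {"glasgow", "sheffield"}
new_board_rate = match_rate(["glasgow", "parkfields"], index_names)
print(new_board_rate, needs_review(0.913, new_board_rate))
```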

Preliminary insights

Our first application of the location extraction method matched 91.3% of adverts to a location. However, it also highlighted immediate areas for improvement. Of the remaining 8.7% of job adverts, almost 40% referred to broad regions, such as South East England, and Yorkshire and Humberside. When we aggregate to finer location levels (such as NUTS3 regions), these adverts will be excluded from the aggregation. We also discovered a significant number of job adverts based outside the UK (0.7% of adverts have a location of 'Czech Republic'). These adverts are also excluded from any aggregations.
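To put the broad-region figure in terms of all adverts, the percentages above can be combined as in the short calculation below. It treats 'almost 40%' as 40%, so the result is an approximate upper bound rather than a reported figure.

```python
matched_pct = 91.3
unmatched_pct = round(100 - matched_pct, 1)  # 8.7, as stated in the text

# "Almost 40%" of unmatched adverts refer to broad regions; 0.40 is an upper bound.
broad_share_of_unmatched = 0.40
broad_pct_of_all = unmatched_pct * broad_share_of_unmatched

print(f"Broad-region adverts: about {broad_pct_of_all:.1f}% of all adverts")
```

In other words, roughly 3.5% of all adverts name a broad region. These could be matched at a coarser level of the hierarchy even though they cannot be placed in a NUTS3 region.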

Three caveats apply to the chart below: the dataset is not particularly large, we are still improving the quality of the algorithm, and we are not yet applying seasonal adjustment. With those in mind, the chart shows the growth in the number of adverts between the first and second quarters of this year, broken down by the regions matched to the job adverts. Growth rates vary widely between regions, and our current algorithm indicates that every region except Wales saw an increase in the volume of job adverts between the two quarters. Given the caveats above, we will need to investigate these results further to ensure that they are not caused by inaccuracies in our algorithm. We are currently entering a period of data quality analysis for our location extraction algorithm, so that we can be confident in the results that it provides.

The graph above shows the growth in the number of new adverts collected between the first and second quarters of this year. Wales is the only area to have experienced a reduction in volume over this period.
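For readers who want to reproduce this kind of comparison on their own data, quarter-on-quarter growth by region can be computed as in the sketch below. The function name is hypothetical, and the region labels and counts are invented purely to show the calculation; they are not Observatory data.

```python
from typing import Dict


def quarterly_growth(q1: Dict[str, int], q2: Dict[str, int]) -> Dict[str, float]:
    """Percentage change in advert counts per region, for regions with Q1 adverts."""
    return {
        region: 100.0 * (q2.get(region, 0) - count) / count
        for region, count in q1.items()
        if count > 0  # growth is undefined when the first quarter has no adverts
    }


# Invented counts, NOT Observatory data.
q1_counts = {"Region A": 1000, "Region B": 500}
q2_counts = {"Region A": 1200, "Region B": 450}

growth = quarterly_growth(q1_counts, q2_counts)
print(growth)
```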

Areas for future development

As we collect more job adverts, there will be further opportunities to improve the performance of the location matching methodology within the Observatory:

  • We can continue to add more locations to our index, either through using new sources of location names, or through manually matching the most frequently unmatched locations to the standardised hierarchy. As we collect more job adverts, we will develop a greater understanding of the unmatched locations that occur most frequently, and this will allow us to prioritise certain locations. Through this process, we would hope to improve the precision and quantity of location matches.
  • We could also use auxiliary data from the description within a job advert to add additional context to the location. Job advert descriptions frequently contain additional location information, beyond what is captured in the location field. For example, the location field within a job advert may be defined as 'Glasgow', but there may be further location detail within the description of the advert, such as 'Partick'.
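Both ideas in the list above can be prototyped with a few lines of code. The sketch below is an assumption-laden illustration, not a description of planned Observatory features: `top_unmatched` ranks the locations that most often fail to match, and `places_in_description` scans a description for word sequences that appear in the index. Both function names, and the `max_words` limit, are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple


def top_unmatched(
    cleaned_locations: Iterable[str], index_names: Set[str], n: int = 10
) -> List[Tuple[str, int]]:
    """Most frequent advert locations that are missing from the index."""
    counts = Counter(loc for loc in cleaned_locations if loc not in index_names)
    return counts.most_common(n)


def places_in_description(
    description: str, index_names: Set[str], max_words: int = 3
) -> List[str]:
    """Index names found in the description, checking runs of 1..max_words words."""
    tokens = re.findall(r"[a-z]+", description.lower())
    found: List[str] = []
    for size in range(1, max_words + 1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in index_names and candidate not in found:
                found.append(candidate)
    return found


index_names = {"glasgow", "partick"}
print(top_unmatched(["parkfields", "parkfields", "glasgow"], index_names, n=1))
print(places_in_description("Based in our Partick office, Glasgow.", index_names))
```

A scan like this would need care in practice, because some place names are also everyday words and could produce false matches in a long description.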

As the Observatory grows, we will also have opportunities to extract additional insights:

  • Longitudinally tracking the changes of skill demands by region will enable us to understand the growth and stagnation of particular skills across different geographies.
  • Exploring the emergence of new skills across the country is another topic that we could shed light on using the Observatory. We are particularly interested in green skills, and in identifying the areas that are experiencing especially fast or slow growth in these skills.

The Observatory is a pilot project and we welcome your feedback and suggestions for future improvements. We are also seeking funding to keep the Observatory running. If you have suggestions or are interested in supporting the work of the Observatory, please reach out to us by emailing [email protected].

Author

Jack Vines

Data Engineer, Data Analytics Practice

Jack is a Data Engineer in the Data Analytics Practice.

Cath Sleeman

Head of Data Discovery, Data Analytics Practice

Dr Cath Sleeman is the Head of Data Discovery.
