Using machine learning to predict heat pump installation costs

If you want to know how much it would cost to get a heat pump installed in your home, you might struggle to find a definitive answer. To help address this problem, we’re working on a tool that can estimate the cost of a heat pump installation based on a small amount of information about a property.

Powering this tool is a model that we’ve built by applying machine learning to over 150,000 records of heat pump installation costs from the certification body MCS. By joining these records to the Energy Performance Certificate (EPC) database, which holds information on the physical characteristics of over 17 million properties in Great Britain, our model is able to capture variations in overall installation cost for different types of property. As far as we know, no other openly available online tools operate in this way.
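As a rough illustration of the joining step, the sketch below matches installation records to property records using a shared identifier. The field names, the `property_id` key and the sample values are all assumptions made for this example; they are not the real MCS or EPC schemas, and the actual matching process may work quite differently (for instance, by matching addresses).

```python
from dataclasses import dataclass


# Hypothetical record layouts: the real MCS and EPC schemas differ.
@dataclass
class Installation:
    property_id: str  # assumed shared identifier
    cost_gbp: float


@dataclass
class Certificate:
    property_id: str
    property_type: str
    habitable_rooms: int


def join_records(installations, certificates):
    """Inner join: keep only installations with a matching certificate."""
    # If a property has several certificates, the last one listed wins.
    epc_by_id = {c.property_id: c for c in certificates}
    joined = []
    for inst in installations:
        cert = epc_by_id.get(inst.property_id)
        if cert is None:
            continue  # no property characteristics available for this record
        joined.append({
            "property_id": inst.property_id,
            "cost_gbp": inst.cost_gbp,
            "property_type": cert.property_type,
            "habitable_rooms": cert.habitable_rooms,
        })
    return joined


# Made-up values for illustration only.
installations = [
    Installation("A1", 11500.0),
    Installation("B2", 9800.0),
    Installation("C3", 14200.0),
]
certificates = [
    Certificate("A1", "semi-detached", 5),
    Certificate("C3", "detached", 7),
]
joined = join_records(installations, certificates)
```

Each joined row pairs a cost with the property characteristics a model could learn from; installations without a matching certificate are dropped.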

A data-driven approach

The total cost of a heat pump installation can be broken down into equipment and non-equipment costs. The most costly piece of equipment is usually the heat pump itself, which needs to be bigger (and therefore more expensive) if a property has a higher heat demand. Other equipment such as a buffer tank and hot water cylinder may also be required. If a property’s existing heat distribution system is being upgraded then the equipment may include the cost of new radiators and pipework. Non-equipment costs include design, labour and commissioning costs, which depend on the complexity of the work and on the local labour market.

Rather than trying to estimate all of these costs individually, our model estimates the total installation cost based on factors associated with these components. For instance, the number of habitable rooms in a property holds some information about the property’s heating requirements (and so the size of heat pump needed), as well as the number of new radiators it may require.

Machine learning is a process by which a computer uses a large amount of data to identify links between particular variables (such as, in this case, property characteristics and installation costs), enabling it to make predictions. Using machine learning to build our model means that we can capture more subtle and complex interactions between property characteristics and costs than we could capture from research alone. It also means that the model can capture changes in cost over time without manual intervention and becomes more reliable as we collect more data.

It’s important to bear in mind that the accuracy of a data-driven model is highly dependent on the quality of the data used to build it. Like most datasets, the MCS and EPC databases are imperfect, with data entered by engineers or surveyors, often while they’re on site. However, the methods we’ve created to clean and process these datasets during our previous work have helped us to identify and correct data errors, giving us confidence in our results.

The data used to fit a model should also be representative of the types of installation that the model is used to predict. To ensure that the data we use is as relevant as possible, we’ve filtered out installations that we believe relate to new builds or mass retrofits, which are likely to have different costs. All of our cost data relates to installations by MCS-certified installers, who may charge a premium to reflect that certification. Given that these installations make up a high proportion of all installations, and that householders are required to choose MCS-certified installers when applying for the Boiler Upgrade Scheme, we don’t think this is likely to be a major issue.
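The filtering step could look something like the sketch below. The record layout, the `new_build` flag and the rule for spotting mass retrofits (several installations sharing a postcode and date) are assumptions made for this example, as is the threshold; they are not a description of the rules we actually apply.

```python
from collections import Counter


def filter_individual_retrofits(records, max_same_day_installs=2):
    """Drop suspected new builds and suspected mass retrofits."""
    # Count installations sharing a postcode and date; a large cluster
    # is treated here as a sign of a mass retrofit (an assumed heuristic).
    cluster_sizes = Counter((r["postcode"], r["install_date"]) for r in records)
    return [
        r for r in records
        if not r["new_build"]
        and cluster_sizes[(r["postcode"], r["install_date"])] <= max_same_day_installs
    ]


# Made-up records for illustration only.
records = [
    {"id": 1, "postcode": "AA1 1AA", "install_date": "2022-03-01", "new_build": False},
    {"id": 2, "postcode": "BB2 2BB", "install_date": "2022-03-04", "new_build": True},
    {"id": 3, "postcode": "CC3 3CC", "install_date": "2022-04-10", "new_build": False},
    {"id": 4, "postcode": "CC3 3CC", "install_date": "2022-04-10", "new_build": False},
    {"id": 5, "postcode": "CC3 3CC", "install_date": "2022-04-10", "new_build": False},
    {"id": 6, "postcode": "DD4 4DD", "install_date": "2022-05-20", "new_build": False},
]
kept = filter_individual_retrofits(records)
kept_ids = [r["id"] for r in kept]
```

Here record 2 is dropped as a new build and records 3 to 5 are dropped as a cluster, leaving only the individual retrofits.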

We’re hoping to find other sources for installation cost data that could help establish any potential bias, and to compare our model’s predictions with those of other tools so that we can investigate cases where they differ significantly. We’ll also need to keep a close eye on other factors that may impact the underlying data over time, such as the evolution of schemes and grants.

The model in context

Our model of installation costs doesn’t exist in isolation: it’s designed to be part of a tool that’s accessible to all. This brings a number of considerations that affect the model’s design.

One consideration is the balance between the model’s accuracy and how easy it is to enter the information it needs. No matter how well they predict costs, the property characteristics used to generate predictions need to be ones that a householder knows or can easily measure, even if they don’t have an EPC. For instance, even if we found that (say) a property’s annual energy consumption was highly predictive of the installation cost, we might want to exclude this feature or try to use a proxy instead if it’s too difficult for a householder to obtain their consumption data.

In order to be meaningful for users, our predictions also need to come with an explanation and a breakdown of the cost. A breakdown isn’t something that we can obtain directly from the data, which only gives a single value for the total cost of each installation. What the data does allow us to obtain is a range of values that an installation cost is likely to fall within. We’re working on how best to present this information so that it’s clear and consistent.
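One simple way to turn a set of single total-cost values into a likely range is to take percentiles of the observed costs for comparable properties. The sketch below does this with made-up figures; the choice of the 10th and 90th percentiles is an assumption for illustration, and this is not necessarily how our model derives its ranges.

```python
from statistics import quantiles


def likely_cost_range(comparable_costs, n=10):
    """Return the (lowest, highest) of the n-1 cut points of the costs.

    With n=10 these are the 10th and 90th percentiles, so roughly 80%
    of the comparable installations fall inside the returned range.
    """
    cuts = quantiles(comparable_costs, n=n)
    return cuts[0], cuts[-1]


# Made-up total costs (in pounds) for comparable installations.
comparable_costs = [8000, 8500, 9000, 9500, 10000, 10500,
                    11000, 11500, 12000, 12500, 13000]
low, high = likely_cost_range(comparable_costs)
```

Reporting `low` and `high` alongside a central estimate gives users a sense of the spread in costs without implying more precision than the data supports.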

Our progress so far

We currently have a prototype model, which we’re working on building into a user-friendly interface. For now, the model only predicts the cost of air source heat pump installations, and only in houses rather than flats. Over time we’ll refine the accuracy of this model by augmenting the underlying data, reviewing the machine learning methods we use to fit the model and making improvements based on real-world feedback.

Author

Christopher Williamson

Junior Data Scientist, Data Analytics Practice

Chris was a junior data scientist in the Data Analytics Practice, embedded in the sustainable future mission team.
