"They universally hate the process. It's the least favourite part of their job and they're grinding their teeth every time they have to do it. They don't like doing computer work. They want to be installing stuff."
This is what we heard when we spoke to heat pump installers about the admin involved in their work. Dozens of emails to local electricity network companies, known as distribution network operators (DNOs), different templates for every region, responses that are difficult to understand and delays that stretch to months. If it’s winter and a boiler fails, households often can’t afford to wait; they just get a replacement boiler instead of a heat pump.
This kind of problem turned out to be a good fit for an agentic AI solution. But how do you know when that’s the case?
An agent is an LLM that can use tools to achieve a goal, rather than just generate text, images, or videos. It can open up documents, send or receive emails, query databases, or take a range of other possible actions until the task is done.
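At its core, this is a loop: ask the model for the next action, run the chosen tool, feed the result back, and repeat until the goal is met. The sketch below illustrates that shape only - `stub_model`, `lookup_document` and `send_email` are hypothetical stand-ins for a real LLM and real tools, not any particular agent framework:

```python
# Minimal sketch of an agent loop. The "model" is a scripted stub standing
# in for a real LLM call; the tools are placeholders for real integrations.

def lookup_document(query):
    # Stand-in for opening a document or querying a database.
    return f"contents matching '{query}'"

def send_email(to, body):
    # Stand-in for an email-sending tool.
    return f"sent to {to}"

TOOLS = {"lookup_document": lookup_document, "send_email": send_email}

def stub_model(goal, history):
    # A real agent would ask an LLM to choose the next action given the
    # goal and the history of observations. This stub scripts two steps.
    if not history:
        return {"tool": "lookup_document", "args": {"query": goal}}
    if len(history) == 1:
        return {"tool": "send_email",
                "args": {"to": "dno@example.com", "body": history[-1]}}
    return {"done": True, "result": history[-1]}

def run_agent(goal, max_steps=10):
    history = []
    for _ in range(max_steps):
        action = stub_model(goal, history)
        if action.get("done"):
            return action["result"]
        # Run the chosen tool and feed the observation back into the loop.
        observation = TOOLS[action["tool"]](**action["args"])
        history.append(observation)
    raise RuntimeError("agent did not finish within the step budget")

print(run_agent("grid connection application"))
```

The important property is that the model, not the programmer, decides which tool to call next - which is also why verifying the outcome matters so much.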
At Nesta, we’ve been testing agentic AI in a number of applications across our missions and developed a framework of five questions to help you decide if building an agent makes sense. The first three are about whether your problem is right for an agent. The last two are about whether your organisation is ready.
Agents can use a wide range of tools and also produce a wide range of outcomes. To make sure an AI agent is working well, you need to check the final outcome, not just the steps it took to get there. Traditional models are usually easier to check because they produce a simple prediction or estimate. For agents, you must carefully plan how you will confirm that the agent's final output is correct.
Think about the problem you’re trying to solve - does it have a verifiable outcome? Can you check that “the correct action happened”?
Sometimes the outcome is straightforward. We built a ‘nutrition auditor agent’ whose job was to look at potentially incorrect nutrition information and determine if the data was suspicious. The outcome was a classification: ‘suspicious’ or ‘benign’. There were various combinations of tools it could use to get there, with no single correct path, but the outcome was still checkable: did it correctly flag something as suspicious or not?
But the outcome isn’t always so easy to pin down. With the DNO heat pump applications, the end goal, a successful application to the DNO, sometimes involved weeks of back and forth between the homeowner, the installer and the DNO. We couldn’t easily verify the entire process in one go. So we broke it into steps - at each stage there was a correct action the agent should take. We hand-labelled real email threads with the right action at each stage, and used that as our test set. If your end-to-end outcome is hard to verify, break the process into steps and verify those instead.
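Step-wise verification of this kind can be scored mechanically. The sketch below shows the shape of it; the stage names, labels and rules-based stand-in for the agent are entirely made up for illustration, not the real DNO data:

```python
# Sketch of step-wise verification: compare the agent's chosen action at
# each stage of an email thread against a hand-labelled correct action.
# Stages, labels and the rule table are illustrative, not real data.

def agent_choose_action(stage):
    # Stand-in for the real agent; a lookup table plays its role here.
    rules = {
        "new_enquiry": "send_application",
        "dno_requests_info": "forward_to_installer",
        "dno_approves": "notify_homeowner",
    }
    return rules.get(stage, "escalate")

# A hand-labelled thread: (stage, correct action) pairs.
labelled_thread = [
    ("new_enquiry", "send_application"),
    ("dno_requests_info", "forward_to_installer"),
    ("dno_approves", "notify_homeowner"),
]

def step_accuracy(thread):
    # Fraction of stages where the agent's action matches the label.
    correct = sum(agent_choose_action(stage) == label
                  for stage, label in thread)
    return correct / len(thread)

print(step_accuracy(labelled_thread))  # prints 1.0 for this toy agent
```

With a test set like this, every change to the agent's prompts or tools can be checked against the labelled threads before it goes anywhere near a real inbox.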
Beyond understanding the outcome, you need to know what’s involved to get there. Many tasks involve both routine, repeatable work (process) and subjective assessment (judgment/human expertise). An agent will attempt to do both, but generally isn't as good at the judgment part.
While building the ‘nutrition auditor agent’, we shadowed Nesta’s in-house nutritionist while she manually audited products. During that session, we identified both process and judgment. The agent could do most of what the nutritionist did, including checking the names of products, looking up reference numbers, and doing calculations and comparisons against known databases. But it sometimes got too focused on one piece of data without being able to take a step back. It would return an irrelevant product lookup, fail to realise it was irrelevant, and continue comparing as if it were relevant. A human would easily catch that.
Before building, map out your task step-by-step. Which parts are mechanical and repeatable? Which require human expertise and judgment? An agent can handle the first sort pretty well. The second part is where you’ll need guardrails or human involvement.
In the above example, we were looking at emulating a nutritionist. It is not a good use of a nutritionist’s time to be auditing thousands of rows of product data. Realistically, it wouldn’t be done. This is a bottleneck. Not just a slow process, but one where the alternative is not doing it at all.
Keep an eye out for these kinds of bottlenecks. In the case of the DNO heat pump applications, Renbee had already identified delays of weeks and months in email follow-ups, and there was no good alternative. If a household’s boiler fails in winter, they need heat urgently. If the process isn't fast, they will just get a boiler replacement instead. So fixing communications is a real bottleneck to the installation of more heat pumps.
It’s also worth asking, is anyone else solving this? Some problems are neglected - not because they’re unimportant, but because the people affected don’t have the resources to fix them. For us, this is part of what makes the work worthwhile. Ask yourself where you are uniquely positioned to act on a problem that others aren’t addressing.
It’s tempting to assume that putting a human in the loop solves your safety problem. If the agent does something wrong, a person will catch it, right?
While we were collecting early feedback on a prototype of the DNO communications manager, we discovered this logic doesn’t always hold up. We presented the agent as eliminating the admin process, so installers expected an auto-pilot mode and weren’t necessarily expecting to approve or review anything. They were at times happy to click through the human-in-the-loop stages without reviewing the details of the output.
This was a necessary wake-up call. A human in the loop isn’t a guardrail if the human isn’t really looking. But the deeper issue was not just that humans were not really looking; it was also that the agent was not well calibrated about when to involve them. Because we had prompted it to be highly responsive, it was overly cautious and asked for permission before every action. We realised we needed to spend more time with the end users to understand which decisions added value versus which could be delegated to the agent.
When designing these systems, you need to think about how to keep them on track even when things get auto-approved - because they will. That means being deliberate about when the agent should act autonomously, when it should ask for input, and when it should escalate because human judgement is genuinely needed. The goal is to remove obstacles so that human experts can focus on the high-value work they do best, rather than continuously supervising the AI.
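One way to make that deliberate is a simple tiered policy: routine actions run autonomously, consequential actions pause for a quick human approval, and anything unusual or low-confidence escalates fully. The action categories and confidence threshold below are illustrative assumptions, not rules from our system:

```python
# Sketch of a tiered approval policy. Instead of asking a human to approve
# every action, classify each one as safe to run, worth a quick confirm,
# or needing escalation. Sets and threshold are illustrative assumptions.

LOW_RISK = {"read_email", "lookup_status", "draft_reply"}
NEEDS_APPROVAL = {"send_email", "submit_application"}

def decide(action, confidence):
    if action in LOW_RISK:
        return "act"        # routine: run autonomously
    if action in NEEDS_APPROVAL and confidence >= 0.8:
        return "ask"        # consequential: one-click human approval
    return "escalate"       # unknown action or low confidence: human takes over

print(decide("lookup_status", 0.9))  # act
print(decide("send_email", 0.9))     # ask
print(decide("send_email", 0.5))     # escalate
```

The point of the tiers is that the ‘ask’ step stays rare enough that people actually read it, rather than clicking through on reflex.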
All of the above is moot if there isn’t buy-in from the people who have the final say. And leaders might be cautious - agents equipped with tools are powerful, and there are real considerations around data privacy, security and hallucinations, which can have serious consequences if agents aren’t deployed thoughtfully. It’s worth looping in senior leadership to discuss the risks and trade-offs.
But buy-in isn’t a ‘yes or no’ binary. It’s trust you build over time. While working with our external partner on the DNO communications manager, it wasn’t immediately clear how reimagining their workflows around a non-deterministic process would fit into their roadmap or software stack.
It required lots of collective time and effort, which wouldn’t have been prioritised without buy-in across both teams. At the end of the process, we had a working proof of concept that both sides felt ownership over. A few months later, the partner engineering team took over development and are actively exploring how else it can add value to installers.
To determine if a task is ready for agentic AI, first assess its technical suitability: is the output verifiable, is the process clearly defined, and is there a specific bottleneck to solve? By targeting routine, repeatable work rather than subjective judgment, you ensure the AI addresses real friction points rather than creating new ones.
Success then shifts to organisational readiness, which requires designing for actual human behaviour and securing stakeholder trust. When you clear these human and technical hurdles, you can successfully automate the low-value administrative ‘grind’, freeing your experts to focus on the high-impact work they do best.
This framework won’t give you the answer, but it will force you to ask the right questions before you start building.