
Six reasons why human-led qualitative research is safe from AI… for now

Being presented with a large dataset (such as lots of interviews) as a qualitative researcher is a double-edged sword. On the one hand, lots of data can give you more confidence in the themes and patterns that you identify. On the other hand, going through all those interviews takes a LOT of time. Artificial intelligence (AI) promises to help qualitative researchers carry out analysis more quickly, cutting the time spent on processing, coding and analysing large datasets like interviews, newspaper articles, focus group transcripts and ethnographic field notes.

We are two experienced senior researchers who have tested a variety of human-powered qualitative software throughout our years at Nesta. So we decided to test NotebookLM to see whether AI-powered analysis could offer a faster, more accessible alternative while maintaining analytical rigour.

Ultimately, we learnt that NotebookLM isn’t quite ready to replace human-led qualitative analysis for the purposes of our research. Here are six reasons why.

How we used NotebookLM

NotebookLM is Google's source-grounded AI tool. It lets you upload various types of data (video, audio, transcripts, notes etc) and interact with them via a chatbot function. Being source-grounded means it is designed to draw only on the data it is given. Like other large language models, it recognises patterns in human language, calculates probabilities and uses models to predict which words and patterns are more likely to come next. It can also create more dynamic outputs tailored to the user's requests, suggest further prompts and generate quite novel ways of presenting your data, like audio overviews, interactive quizzes or AI mind maps.

AI-powered analysis can be quicker than manual coding, help with sense-checking, and sometimes identify patterns or relationships that you might have missed. Established qualitative data analysis tools like NVivo are powerful but not hugely intuitive, and their steep learning curve makes it difficult to onboard new team members or share work across projects. For Nesta, finding tools that allow us to maintain standards of analysis while remaining accessible to colleagues with varying levels of qualitative experience matters for scaling our impact across our missions.

The use of tools like NotebookLM also poses new challenges. Some already well-known risks include hallucinations, fabricated quotes, and false positives. As an independent research organisation and charity, Nesta has a responsibility to test these tools critically.

We drew on two recent projects, "identifying the next three million heat pump owners" and "paying for heat pumps", to pool our reflections. The first project involved 200 structured, asynchronous interviews undertaken by an AI-powered chatbot on WhatsApp (the interviewer was an AI agent, conducting the interview based on a topic guide written by our researchers), with participants across eight household segments. We manually coded 120 of the interview transcripts in NVivo and analysed the remaining 80 with NotebookLM. The second project consisted of 24 semi-structured, hour-long interviews with six consumer segments, coded and analysed first in NVivo and then separately analysed in NotebookLM for comparison.

Both involved interview transcripts with segmented participant groups, giving us the chance to test NotebookLM's ability to identify themes, distinguish between groups, and support rigorous analysis. Here's what we learned.

Core insights

1. NotebookLM works best when you already know your data

The single most important factor in using NotebookLM effectively was familiarity with the source material. Becoming familiar with your data is a crucial step in any qualitative analysis, and normally this happens when you collect the data yourself or read through your sources to begin your analysis. NotebookLM offers the temptation to skip this step, with potentially detrimental consequences. Because we had conducted the interviews ourselves, read the transcripts closely and already manually analysed some of them, we could interrogate the tool's outputs and catch errors.

NotebookLM was reasonably good at producing general summaries across a whole sample. It was able to identify broad sentiment and overarching themes: the kind of synthesis a human would spend hours producing manually. But it struggled with nuance, particularly when we asked it to distinguish the rationale of different participant segments. It would pull out a single person's comment as a "theme" or treat different words for the same sentiment as distinct patterns.

Tip: If you haven't conducted the interviews yourself, do some manual coding of a subset first. Use that as a benchmark to reverse-QA what NotebookLM produces. Don't use it on data you're not already familiar with.

2. The tool has a strong tendency to produce false positives

NotebookLM displayed a feature common amongst large language models: a determination to please. This means that NotebookLM will tend to generate answers that users want to hear, instead of prioritising factual, objective answers. When prompted to find differences between participant segments, the tool would generate distinct themes for every group. When we reviewed these themes, we noticed two extremes of this sycophancy: either the sentiment of one specific individual was over-attributed to a whole group, or sentiment that was voiced across the whole sample was assigned to one specific group. Additionally, answers could be influenced by previously asked questions, with patterns identified earlier carried over into later answers where they were not relevant.

We tried different mitigation strategies. One approach was explicit prompting: "If patterns don't exist, don't make them up. Stick to the transcripts." Another was asking the same question in multiple ways without signalling an obvious ‘right answer,’ to see if responses remained consistent.

Both helped, but neither fully solved the problem. The tool still stretched to find distinctions, treating different vocabulary as evidence of different attitudes rather than recognising that people simply describe the same experience in different words.

Tip: Be sceptical of segment-level findings, particularly with small samples. Cross-check any patterns the tool identifies against your own reading of the transcripts.
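One way to make this cross-check concrete is to count how many participants in each segment actually raised a theme in your own manual coding. The sketch below is illustrative only and is not the process we followed: the data layout (participant, segment, set of manually coded themes), the function names and the 50% threshold are all assumptions you would adapt to your own codebook.

```python
from collections import defaultdict

def theme_support(coded, theme):
    """Map each segment to (participants who raised the theme, segment size)."""
    mentioned = defaultdict(set)
    members = defaultdict(set)
    for participant, segment, themes in coded:
        members[segment].add(participant)
        if theme in themes:
            mentioned[segment].add(participant)
    return {seg: (len(mentioned[seg]), len(members[seg])) for seg in members}

def assess_claim(coded, theme, claimed_segment, min_share=0.5):
    """Rough triage of a 'segment X cares about theme Y' claim.

    min_share is an arbitrary, hypothetical threshold, not a standard.
    """
    support = theme_support(coded, theme)
    hits, size = support.get(claimed_segment, (0, 0))
    if size == 0 or hits / size < min_share:
        # One voice presented as a group view.
        return "weak"
    other_shares = [h / n for seg, (h, n) in support.items()
                    if seg != claimed_segment]
    if other_shares and all(share >= min_share for share in other_shares):
        # Voiced across the whole sample, not specific to this segment.
        return "not distinctive"
    return "plausible"

# Hypothetical manual coding: (participant, segment, themes)
coded = [
    ("P1", "owners", {"cost"}),
    ("P2", "owners", {"cost", "noise"}),
    ("P3", "owners", {"cost"}),
    ("P4", "renters", {"cost", "landlord"}),
    ("P5", "renters", {"cost", "landlord"}),
]

for theme, segment in [("noise", "owners"), ("cost", "owners"), ("landlord", "renters")]:
    print(theme, segment, assess_claim(coded, theme, segment))
```

The two failure labels map onto the two extremes described above: "weak" for an individual's comment stretched to a group, and "not distinctive" for a sample-wide sentiment pinned on one segment. Even a "plausible" result is only a prompt to go back to the transcripts, especially with small segments.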

3. Source linking is useful in theory but unreliable in practice

One of NotebookLM's useful features is being able to link claims back to specific sources, in our case individual transcripts labelled by participant segment. In practice, this feature was inconsistent. The tool would attribute a quote to one participant when it actually came from another. Sometimes, when we searched the original transcript for the quoted text, we couldn't find the alleged quote at all.

This raised concerns about whether NotebookLM was paraphrasing or even fabricating quotes to fit a narrative. For qualitative research, where verbatim quotes are essential evidence, this is a significant limitation.

Tip: Always verify quotes against the original transcripts. Don't assume the source links are accurate - copy a phrase from the tool's output and search for it directly in your data.
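If you have many quotes to check, the same search can be scripted. This is a minimal sketch rather than a description of our workflow: it assumes your transcripts are available as plain text keyed by a participant ID, and the function names and example data are hypothetical. It only finds verbatim matches (after tidying up whitespace, case and curly quotes), so a paraphrased quote will correctly show up as not found.

```python
import re

def normalise(text: str) -> str:
    """Lower-case, straighten curly quotes and collapse whitespace."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()

def find_quote(quote: str, transcripts: dict[str, str]) -> list[str]:
    """Return the IDs of every transcript containing the quote verbatim."""
    needle = normalise(quote)
    return [pid for pid, text in transcripts.items()
            if needle in normalise(text)]

def check_attribution(quote: str, attributed_to: str,
                      transcripts: dict[str, str]) -> str:
    """Classify a quote as 'verified', 'misattributed' or 'not found'."""
    found_in = find_quote(quote, transcripts)
    if not found_in:
        return "not found"
    if attributed_to in found_in:
        return "verified"
    return "misattributed"

# Hypothetical transcripts keyed by participant ID
transcripts = {
    "P1": "Honestly the upfront cost\nis what puts me off.",
    "P2": "It was all right, I suppose. Not bad once it was installed.",
}

print(check_attribution("the upfront cost is what puts me off", "P1", transcripts))
print(check_attribution("Not bad once it was installed", "P1", transcripts))
print(check_attribution("I would never consider a heat pump", "P2", transcripts))
```

A "misattributed" result means the words exist but in someone else's transcript; "not found" means the quote may have been paraphrased or fabricated and should not be used as verbatim evidence.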

4. The output volume creates its own analysis burden

A less obvious challenge was the sheer volume of information NotebookLM produces. We saved all our prompts and outputs, intending to use them for reporting. But by the end, we had a 60-page Google Doc of prompt responses - a muddle of partially overlapping outputs that were difficult to navigate.

With established human-led software, analysis is condensed: you can click into a code, see the summary, and pull quotes quickly. With NotebookLM, we found ourselves needing an additional stage of analysis just to synthesise the AI's outputs into something usable. That partly undercuts the time-saving benefit.

Tip: Don't treat raw prompt outputs as finished analysis. Build in time to review, synthesise, and condense what the tool produces. Consider whether the time saved on initial coding is offset by the time spent making sense of outputs.

5. QA processes for AI-assisted analysis need rethinking

Quality assurance (QA) with traditional methods is well established: a second researcher codes a subset, you compare codebooks, review reporting of the analysis and discuss discrepancies. The process is subjective and iterative, but it ensures that the primary researcher's interpretation of the data is not overly influenced by their own assumptions, experiences, biases or human error.  

With NotebookLM, we were less sure of what good QA looks like. NotebookLM had introduced an additional layer of interpretation below that of the primary researcher. But unlike with human-led research, we were not able to discuss and review the logic behind NotebookLM’s determinations, as these are made in a ‘black box’ rather than explicitly stated. Running the same prompts independently, whilst useful for checking the replicability of the findings, doesn’t provide additional insight into the logic behind NotebookLM’s outputs.
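The replicability check itself can be given a rough number. The sketch below is one possible approach, not an established QA standard: it assumes you have already reduced each independent run of a prompt to a set of theme labels, and that you have mapped the tool's differently worded themes onto shared labels yourself. It uses Jaccard overlap (shared themes divided by all themes mentioned) averaged over every pair of runs.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of themes two runs have in common (1.0 = identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def run_consistency(runs: list[set[str]]) -> float:
    """Mean pairwise Jaccard overlap across independent runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical theme labels from three runs of the same prompt
runs = [
    {"cost", "disruption", "trust"},
    {"cost", "disruption"},
    {"cost", "noise"},
]

print(round(run_consistency(runs), 2))
print(set.intersection(*runs))  # themes that appear in every run
```

A low score flags unstable findings that deserve a manual look. A high score only shows that the tool is consistent; it says nothing about why it reached those themes or whether they are right.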

Evaluative frameworks for AI are being developed across different organisations and updated in research papers at a fast pace. There is no single standard, although there are widely agreed best-practice principles, such as NPC’s Listen and Learn principles for doing qualitative research.

Tip: If using NotebookLM for any analysis that will be published or used for decision-making, consider building in manual verification of a subset of findings.

6. Cultural and linguistic nuance gets lost

British understatement, irony, and deadpan language caused problems. NotebookLM picked up on explicit statements but missed implicit sentiment, sarcasm, and hedged positivity. Phrases like "not bad" or "it was all right" could indicate positive sentiment to a human reader familiar with the context of the interview, but the tool sometimes categorised these phrases ambiguously or negatively.

For research on topics where participants may be reluctant to express strong views, like personal finance during the paying for heat pumps project, this matters because it can under- or over-represent sentiment towards the points under discussion.

Tip: Be particularly cautious with culturally specific language, topics where participants tend to understate, or interviews where rapport and tone carry meaning beyond the literal words.

So, is AI ready to replace human-led qualitative analysis?

In short, not yet. NotebookLM is not a replacement for human-led qualitative analysis. It's a support tool that works best when the researcher already has some familiarity with the principles and application of qualitative data analysis. The value of the tool lies in speed and exploratory sense-checking, not in producing rigorous findings from scratch.

We think NotebookLM could be of use to support qualitative analysis in these scenarios:

  • Quick and dirty overviews of large transcript sets.
  • Exploratory analysis to identify areas worth investigating manually.
  • Sense-checking themes you've already identified through traditional methods.
  • Summarising across a whole sample when segment-level distinctions aren't critical.

NotebookLM struggles with:

  • Segment-level or correlational analysis with small samples.
  • Producing reliable, verifiable quotes.
  • Maintaining appropriate caution when patterns are weak or absent, especially with small samples.
  • Handling linguistic and cultural nuance.

To help researchers navigate this tool, here’s a suggested checklist before using NotebookLM for analysis:

  • Is the data set relatively small and manageable?
  • Are you already familiar with the data, having collected it yourself (or are you willing to read a substantial subset)?
  • Is the output for internal learning rather than academic publication?
  • Do you have someone available to QA the work?

If you answer "no" to any of these, NotebookLM may not be the right tool.

What's next?

We're continuing to test NotebookLM across projects, learning as we go. One area we'd like to explore further is whether the tool can be used to quality-assure human-produced codebooks: uploading a manual draft analysis and instructing NotebookLM to critique it or identify gaps, rather than producing analysis from scratch. We are experimenting with AI tools and use cases across our project lifecycle; you can read about our experiments with AI-powered interviews in this blog.

Author

Max Woollard

Analyst, sustainable future mission

Max joins Nesta as an analyst in the sustainable future mission.

David Bleines

Senior Researcher, Central Programmes

David is a senior researcher who works across Nesta's sustainable future mission and fairer start mission.
