Can AI models be trained to assess the accuracy of charts?

The Covid-19 pandemic gave new importance to data visualisations. While many of us are now familiar with concepts such as “flatten the curve” as a result, the factors that can make charts and graphs persuasive, trustworthy, or readable are themselves still relatively poorly understood. A team of computer scientists at King’s College London, University of Vienna, and Google felt that the sudden increase in the salience of data visualisations demanded further exploration.

“One day we were all suddenly watching government press conferences with slide decks full of charts, and there seemed to be an implicit assumption that this was the new way to do public messaging,” says KCL’s Elena Simperl, who led the research team. “That was a defining moment. We realised that charts were going to be everywhere.”

Simperl and her team have previously looked at how the quality of online information impacts how people consume it, and how that can in turn influence disinformation and polarisation. But she points out that charts as a medium have yet to be studied in a similar way—both in terms of how they are perceived, and how people then go on to use those charts (or the information within them).

“We wanted to understand how we could use AI to influence how people consume charts,” explains KCL postdoctoral computer science researcher Neal Reeves. “Firstly, by understanding how chart design influences people’s perceptions, and secondly, by seeing whether either AI technology or other humans ‘verifying’ the charts would make a difference. Are they more or less likely to trust the message if you change the messenger?”

What we did

The KCL team devised a three-stage experiment that recruited 12,179 participants from the crowdsourcing platforms Mechanical Turk, Prolific, and a third platform which cannot be named for confidentiality reasons. They planned to show participants a series of different charts, with controlled visual tweaks between each one, and collect data on how those tweaks affected the charts’ readability and trustworthiness. That data could then be used to train a machine learning algorithm to recognise the factors that make a chart readable or trustworthy.
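
The article does not describe the team’s actual modelling approach, but the basic idea can be sketched: treat each controlled design factor as an input feature and each crowd judgement as a label, then fit a simple classifier. The feature names and data below are hypothetical and serve only to illustrate the shape of such a pipeline.

```python
# Illustrative sketch only: the study's real features, labels, and model are not
# described in this article, so everything here is a placeholder.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical crowd responses: one row per (participant, chart) judgement.
responses = pd.DataFrame({
    "chart_type":  ["bar", "pie", "doughnut", "bar", "line", "pie"],
    "uses_colour": [True, False, True, True, False, False],
    "has_text":    [True, True, False, False, True, False],
    "trusted":     [1, 0, 1, 1, 0, 0],  # 1 = answered "yes" to the trust question
})

features = ["chart_type", "uses_colour", "has_text"]
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), features)])),
    ("classify", LogisticRegression()),
])
model.fit(responses[features], responses["trusted"])

# The learned coefficients hint at which design factors push trust up or down.
print(model.named_steps["classify"].coef_)
```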

The team created bespoke charts – 640 about the weather, 400 about air pollution – based on real-world data and designed to control for a single visual factor per chart, such as the type of chart, the presence of text explanations, or the use of colour. Weather and air pollution were chosen because they were considered “neutral” subjects, unlikely to inspire passionate opinions among respondents. The team also had to limit the variety of designs for the sake of practicality. “Creating a corpus that could represent all of the different design dimensions of real-world charts would take millions, if not billions, of charts,” says Simperl. “At the same time, these are experiments that have never been done before at this scale with online crowdsourcing.”
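
As a rough illustration of what “controlling for a single visual factor” means in practice, the sketch below renders the same made-up weather data as three chart types while holding everything else constant. The real corpus and its design dimensions are not described in this article.

```python
# Minimal sketch: vary one design factor (chart type) while the underlying data
# stays fixed. Data and factor values are invented for illustration.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
rainfall_mm = [2.0, 0.5, 7.3, 4.1, 1.2]

def render(chart_type: str, path: str) -> None:
    fig, ax = plt.subplots()
    if chart_type == "bar":
        ax.bar(days, rainfall_mm)
    elif chart_type == "line":
        ax.plot(days, rainfall_mm, marker="o")
    elif chart_type == "pie":
        ax.pie(rainfall_mm, labels=days)
    ax.set_title("Rainfall this week (mm)")
    fig.savefig(path)
    plt.close(fig)

# One chart per value of the controlled factor; everything else is identical.
for kind in ["bar", "line", "pie"]:
    render(kind, f"weather_{kind}.png")
```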

Stage one: Generate a baseline
The first step, says Reeves, “was to show us how people interacted with the charts without any kind of messaging.” Participants were each shown at least 12 of the weather charts, with variations in factors like type, labelling, and colour, and were asked two simple questions on trust (“Would you trust this chart to organise an outdoor event?”) and readability (“Could you explain this chart to a friend?”). They could answer yes or no, or skip the question altogether.

There was also a secondary objective: giving the team a better understanding of the three platforms. Different economic incentives are at play, for example: participants on Mechanical Turk and Prolific were paid £0.45 per chart (averaging £9/hour), while participation on the third platform is entirely voluntary. There is also some evidence that the participant pools of the three platforms vary significantly, and more published detail from the platforms themselves would be tremendously useful for researchers. It was also “a huge challenge” to design tasks that worked across all three, even for the simplest task design factors and in spite of many commonalities between the platforms. The team also had to filter out participants detected answering randomly in order to earn payment quickly, though this was only a minor issue with no bearing on the final results.

Stage two: Iteration
The team ran the same general experiment design again in the second stage, but with some tweaks. Participants were again shown 12 charts that each varied in design, but this time from the air pollution set. They were prompted with two different claims—“I could explain this chart to a friend” and “I would trust this chart to select a location to hold an outdoor event”—and asked to choose from four options ranging from “strongly agree” to “strongly disagree”.

Stage three: Crowd and AI influence on judgements
In this stage, “we showed people the same air pollution charts,” explains Reeves. “They’d see a chart, and a message like, ‘Our AI thought you wouldn’t be able to explain this,’ or, ‘The public thought you wouldn’t be able to understand this’—‘Do you agree or disagree?’” Participants saw the AI or crowd prompts either before or after seeing the charts. Those who made their judgement before seeing the AI or crowd prompt were then given a chance to change their minds, as a further point of comparison for how persuasive the prompts were.

What did we learn?

In stage one, several of the factors – such as chart type and colour, source citations, and error bars – had statistically significant impacts on readability and trust. However, these results came with a considerable caveat: while the findings from any one of the crowdsourcing platforms appeared clear, they weren’t consistent or replicable when compared across all three.

For example, while doughnut charts were considered the most readable on the third crowdsourcing platform, it was bar charts on Mechanical Turk and pie charts on Prolific. Other factors (like bar orientation) had statistically significant effects on trust and readability, positive or negative, on one or two of the platforms but not on all three.
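
To make the cross-platform comparison concrete, the sketch below runs the same kind of independence test separately for each platform. The counts are invented solely to show how one platform can yield a significant result while another does not; they do not reflect the study’s data.

```python
# Sketch of a per-platform check: is chart type associated with "readable"
# answers on each platform? Counts below are made up for illustration.
from scipy.stats import chi2_contingency

# Rows: chart types (bar, pie, doughnut); columns: "yes" / "no" readability answers.
counts_by_platform = {
    "mturk":    [[90, 30], [60, 55], [50, 60]],
    "prolific": [[70, 45], [85, 25], [55, 50]],
}

for platform, table in counts_by_platform.items():
    chi2, p_value, dof, _ = chi2_contingency(table)
    # A small p-value on one platform but not another is the kind of
    # inconsistency the team ran into.
    print(f"{platform}: chi2={chi2:.1f}, p={p_value:.3f}")
```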

“Despite the flexibility these platforms give us as researchers, their limitations mean we have three sets of answers, all statistically significant, but where it’s hard to compare those sets to each other reliably,” says Simperl. “Imagine if that happened with thermometers, and temperatures were measured differently between devices. That would never fly in any other field, but it hasn’t been critiqued enough in crowdsourcing. Now we have a whole side research project on this question of how reliable these crowdsourcing platforms are as scientific tools for collecting data.”

This problem led the team to run stages two and three on Prolific alone, which they chose in part because they considered its user demographic data—which included gender, age, language, country of birth, country of residence, and employment/student status—to be relatively trustworthy. Whether there’s a way to satisfactorily determine true demographic data on crowdsourcing platforms without also sacrificing participants’ privacy in the process remains another open question.

In the final stage, they found a weak – but still statistically significant – indication that people are more likely to be influenced by the judgements of other humans than by those of an AI if shown those judgements before making their own. If shown afterwards, however, they were more likely to be influenced by the AI and change their minds.

Conclusion

The lack of reproducibility between the three crowdsourcing platforms turned out to be a far bigger challenge than expected, and is arguably the team’s most significant discovery. It meant that the team had to invest resources in crafting multiple task designs and analysing differences across the platforms, and that they couldn’t use the crowd data to train a chart comprehension model as hoped.

The combination of low cost and large scale that these platforms offer makes them an attractive option for many AI and other researchers, but if they have fundamental structural issues that are biasing results, it could have huge ramifications in computer science and beyond.

“Every AI system of note is based on a dataset from at least one of these tools, whether it’s a public one like Mechanical Turk or something a company has built for itself,” says Simperl. “We need to do more to understand how the platforms we’re using impact the data that they produce, and the ethics of whether it should or shouldn’t be used for certain tasks. If there are concerns with the data, we need to be able to reproduce how the data came about and what can be done to fix it. Otherwise, an AI trained on poor data will only make these issues bigger.”

“All of us who care about ethical, responsible ways to produce any kind of digital product also need to recognise that there are huge economies based around crowdsourcing services in places like Southeast Asia. There is extremely loose regulation over how this data is collected, and I see far too many AI researchers who are still not questioning this.” The team has also produced a separate paper covering their findings on interoperability and consistency between the three platforms, which will be published later this year.

The team hopes that the issues they’ve raised—from crowdsourcing platform interoperability, to the role of AI in influencing judgement and agreement, to the need for new frameworks for understanding and assessing charts—will inspire other researchers to study them in other contexts.

For more information about this experiment, please contact [email protected] or [email protected].

The opinions expressed in this publication are those of the author. For more information, view our full statement on external contributors.

Author

Ian Steadman