Humans are fallible, but so are the algorithms and AIs we increasingly delegate our decisions to.
Dig below the surface of some of today’s biggest tech controversies and you are likely to find an algorithm misfiring.
These errors are not primarily caused by problems in the data that can make algorithms discriminatory, or their inability to improvise creatively. No, they stem from something more fundamental: the fact that algorithms, even when they are generating routine predictions based on non-biased data, will make errors. To err is algorithm.
The costs and benefits of algorithmic decision-making
We should not stop using algorithms simply because they make errors. Without them, many popular and useful services would be unviable. However, we need to recognise that algorithms are fallible, and that their failures have costs. This points to an important trade-off between more (algorithm-enabled) beneficial decisions and more (algorithm-caused) costly errors. Where does the balance lie?
Economics is the science of trade-offs, so why not think about this topic like economists? This is what I have done ahead of this blog, creating three simple economics vignettes, each looking at a key aspect of algorithmic decision-making.
The two sections that follow give the gist of the analysis and its implications. The appendix at the end describes the vignettes in more detail (with equations!).
Modelling the modelling
As the American psychologist and economist Herbert Simon once pointed out, ‘in an information-rich world, attention becomes a scarce resource’. This applies to organisations as much as it does to individuals.
The ongoing data revolution risks overwhelming our ability to process information and make decisions, and algorithms can help address this. They are machines that automate decision-making, potentially increasing the number of good decisions that an organisation can make. This explains why they have taken off first in industries where the volume and frequency of potential decisions go beyond what a human workforce can process.
What drives this process? For an economist, the main question is how much value the algorithm will create with its decisions. Rational organisations will adopt algorithms with high expected value.
An algorithm’s expected value depends on two factors: its accuracy (the probability that it will make a correct decision), and the balance between the reward from a correct decision and the penalty from an erroneous one. Riskier decisions (where penalties are big compared to rewards) should be made by highly accurate algorithms. You would not want a flaky robot running a nuclear power station, but it might be ok if it is simply advising you about what TV show to watch tonight.
We could bring in human supervisors to check the decisions made by the algorithm and fix any errors they find. This makes more sense if the algorithm is not very accurate (supervisors do not spend a lot of time checking correct decisions), and the net benefits from correcting the wrong decisions (i.e. extra rewards plus avoided penalties) are high. Costs matter too. A rational organisation has more incentive to hire human supervisors if they do not get paid a lot, and if they are highly productive (i.e. it only takes a few of them to do the job).
Following on from the earlier example: if a human supervisor fixes a silly recommendation on a TV website, this is unlikely to create a lot of value for the owner. The situation in a nuclear power station is completely different.
What happens when we scale up the number of algorithmic decisions? Are there any limits to this growth?
This depends on several things, including whether algorithms gain or lose accuracy as they make more decisions, and the costs of ramping up algorithmic decision-making. In this situation, there are two interesting races going on.
1. There is a race between an algorithm’s ability to learn from the decisions it makes, and the amount of information that it obtains from new decisions. New machine learning techniques help algorithms ‘learn from experience’, making them more accurate as they make more decisions. However, more decisions can also degrade an algorithm’s accuracy. Perhaps it is forced to deal with weirder cases, or new situations it is not trained to deal with. To make things worse, when an algorithm becomes very popular (makes more decisions), people have more reasons to game it.
My prior is that the ‘entropic forces’ that degrade algorithm accuracy will win out in the end: no matter how much more data you collect, it is just impossible to make perfect predictions about a complex, dynamic reality.
2. The second race is between the data scientists creating the algorithms and the supervisors checking these algorithms’ decisions. Data scientists are likely to ‘beat’ the human supervisors because their productivity is higher: a single algorithm, or an improvement in an algorithm, can be scaled up over millions of decisions. By contrast, supervisors need to check each decision individually. This means that as the number of decisions increases, most of the organisation’s labour bill will be spent on supervision, with potentially spiralling costs as the supervision process gets bigger and more complicated.
What happens at the end?
When considered together, the decline in algorithmic accuracy and the increase in labour costs I just described are likely to limit the number of algorithmic decisions an organisation can make economically. But whether and when this happens depends on the specifics of the situation.
Implications for organisations and policy
The processes I discussed above have many interesting organisational and policy implications. Here are some of them:
As I said, algorithms making decisions in situations where the stakes are high need to be very accurate to make up for high penalties when things go wrong. On the flipside, if the penalty from making an error is low, even inaccurate algorithms might be up to the task.
For example, the recommendation engines in platforms like Amazon or Netflix often make irrelevant recommendations, but this is not a big problem because the penalty from these errors is relatively low – we just ignore them. Data scientist Hilary Parker picked up on the need to consider the fit between model accuracy and decision context in a recent edition of the ‘Not So Standard Deviations’ podcast:
“Most statistical methods have been tuned for the clinical trial implementation where you are talking about people’s lives and people dying with the wrong treatment, whereas in business settings the trade-offs are completely different”
One implication from this is that organisations in ‘low-stakes’ environments can experiment with new and unproven algorithms, including some with low accuracy early on. As these are improved, they can be transferred to ‘high-stakes’ domains. The tech companies that develop these algorithms often release them as open source software for others to download and improve, making these spill-overs possible.
Algorithms need to be applied much more carefully in domains where the penalties from errors are high, such as health or the criminal justice system, and when dealing with groups who are more vulnerable to algorithmic errors. Only highly accurate algorithms are suitable for these risky decisions, unless they are complemented with expensive human supervisors who can find and fix errors. This will create natural limits to algorithmic decision-making: how many people can you hire to check an expanded number of decisions? Human attention remains a bottleneck to more decisions.
If policymakers want more and better use of algorithms in these domains, they should invest in R&D to improve algorithmic accuracy, encourage the adoption of high-performing algorithms from other sectors, and experiment with new ways of organising that help algorithms and their supervisors work better as a team.
Commercial organisations are not immune to some of these problems: YouTube has, for example, started blocking adverts in videos with fewer than ten thousand views. In those videos, the rewards from correct algorithmic ad-matching are probably low (they have low viewership) and the penalties could be high (many of these videos are of dubious quality). In other words, these decisions have low expected value, so YouTube has decided to stop making them. Meanwhile, Facebook just announced that it is hiring 3,000 human supervisors (almost a fifth of its current workforce) to moderate the content in its network. You can imagine how the need to supervise more decisions might put some brakes on its ability to scale up algorithmic decision-making indefinitely.
One way to keep supervision costs low and coverage of decisions high is to crowdsource supervision to users, for example by giving them tools to report errors and problems. YouTube, Facebook and Google have all done this in response to their algorithmic controversies. Alas, getting users to police online services can feel unfair and upsetting. As Sarah T. Roberts, an information studies professor, pointed out in a recent interview about the Facebook violent video controversy:
“The way this material is often interrupted is because someone like you or me encounters it. This means a whole bunch of people saw it and flagged it, contributing their own labour and non-consensual exposure to something horrendous. How are we going to deal with community members who may have seen that and are traumatized today?”
Even when penalties from error are low, it still makes sense to keep humans in the loop of algorithmic decision-making systems. Their supervision provides a buffer against sudden declines in performance if (or as) the accuracy of algorithms decreases. When this happens, the number of erroneous decisions detected by humans and the net benefit from fixing them both increase. Humans can also ring the alarm, letting everyone know that there is a problem with the algorithms that needs fixing.
This could be particularly important in situations where errors create penalties with a delay, or penalties that are hard to measure or hidden (say if erroneous recommendations result in self-fulfilling prophecies, or costs that are incurred outside the organisation).
There are many examples of this. In the YouTube advertising controversy, the big accumulated penalty from previous errors only became apparent with a delay, when brands noticed that their adverts were appearing alongside hate videos. The controversy over fake news after the US election is an example of hard-to-measure costs: algorithms’ inability to discriminate between real news and hoaxes creates costs for society, potentially justifying stronger regulations and more human supervision. Politicians have made this point when calling on Facebook to step up its fight against fake news in the run-up to the UK election:
“Looking at some of the work that has been done so far, they don’t respond fast enough or at all to some of the user referrals they can get. They can spot quite quickly when something goes viral. They should then be able to check whether that story is true or not and, if it is fake, blocking it or alerting people to the fact that it is disputed. It can’t just be users referring the validity of the story. They [Facebook] have to make a judgment about whether a story is fake or not.”
Before we use economic models to inform action, we need to define and measure model accuracy, penalties and rewards, changes in algorithmic performance due to environmental volatility, levels of supervision and their costs, and that is only the beginning.
This is hard but important work that could draw on existing technology assessment and evaluation tools, including methods to quantify non-economic outcomes (e.g. in health). One could even use rich data from an organisation’s information systems to simulate the impact of algorithmic decision-making and its organisation before implementing it. We are seeing more examples of these applications, such as the financial ‘regtech’ pilots that the European Commission is running, or the ‘collusion incubators’ mentioned in a recent Economist article on price discrimination.
In a Nature article last year, US researchers Ryan Calo and Kate Crawford called for “a practical and broadly applicable social-systems analysis [that] thinks through all the possible effects of AI systems on all parties [drawing on] philosophy, law, sociology, anthropology and science-and-technology studies, among other disciplines”. Calo and Crawford did not include economists in their list. Yet as this blog suggests, economics thinking has much to contribute to these important analyses and debates. Thinking about algorithmic decisions in terms of their benefits and costs, the organisational designs we can use to manage their downsides, and the impact of more decisions on the value that algorithms create can help us make better decisions about when and how to use them.
This reminds me of a point that Jaron Lanier made in his 2013 book, Who Owns the Future: “with every passing year, economics must become more and more about the design of the machines that mediate human social behaviour. A networked information system guides people in a more direct, detailed and literal way than does policy. Another way to put it is that economics must turn into a large-scale, systemic version of user interface design”.
Designing organisations where algorithms and humans work together to make better decisions will be an important part of this agenda.
This blog benefited from comments from Geoff Mulgan, and was inspired by conversations with John Davies. The image above represents a precision-recall curve in a multi-label classification problem. It shows the propensity of a random forests classification algorithm to make mistakes when one sets different rules (probability thresholds) for putting observations in a category.
Appendix: Three economics vignettes about algorithmic decision-making
The three vignettes below are very simplified formalisations of algorithmic decision-making situations. My main inspiration was Human fallibility and economic organization, a 1985 paper by Raj Sah and Joe Stiglitz in which the authors model how two organisational designs – hierarchies and ‘polyarchies’ (flat organisations) – cope with human error. Their analysis shows that hierarchical organisations, where decision-makers lower in the hierarchy are supervised by people further up, tend to reject more good projects, while polyarchies, where agents make decisions independently of each other, tend to accept more bad projects. A key lesson from their model is that errors are inevitable, and the optimal organisational design depends on context.
Let’s imagine an online video company that matches adverts with videos in its catalogue. This company hosts millions of videos, so it would be economically unviable for it to rely on human labour to do the job. Instead, its data scientists develop algorithms to do this automatically. The company looks for the algorithm that maximises the expected value of the matching decisions. This value depends on three factors:
- Algorithm accuracy (a): the probability (between 0 and 1) that the algorithm will make the correct decision.
- Decision reward (r): the reward when the algorithm makes the right decision.
- Error penalty (p): the cost of making the wrong decision.
We can combine accuracy, reward and penalty to calculate the expected value of the decision:
E = ar – (1-a)p 
This value is positive when the expected benefits from the algorithm’s decision outweigh the expected costs (or risks):
ar > (1-a)p 
Which is the same as saying that:
a/(1-a) > p/r 
The odds of making the right decision should be higher than the ratio between penalty and reward.
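This adoption rule is easy to check numerically. The sketch below is a minimal illustration of the vignette’s algebra; all the parameter values are invented for the example:

```python
def expected_value(a, r, p):
    """Expected value of one algorithmic decision: accuracy a,
    reward r for a correct decision, penalty p for an error."""
    return a * r - (1 - a) * p

def worth_adopting(a, r, p):
    """Adopt when the odds of a correct decision, a/(1-a),
    exceed the penalty/reward ratio p/r."""
    return a / (1 - a) > p / r

# Low-stakes TV recommender: a modestly accurate algorithm passes.
low_stakes = worth_adopting(a=0.7, r=1.0, p=0.5)    # odds 2.33 vs ratio 0.5

# High-stakes setting: even 95% accuracy fails when penalties are huge.
high_stakes = worth_adopting(a=0.95, r=1.0, p=100)  # odds 19 vs ratio 100
```

With these invented numbers, low_stakes is True and high_stakes is False, matching the flaky-robot intuition in the main text.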
We can reduce the risk of errors by bringing a human supervisor into the situation. This human supervisor can recognise and fix errors in algorithmic decisions. The impact of this strategy on the expected value of a decision depends on two parameters:
- Coverage ratio (k): the probability that the human supervisor will check a decision made by the algorithm. If k is 1, all algorithmic decisions are checked by a human.
- Supervision cost (cs(k)): the cost of supervising the decisions of the algorithm. The cost depends on the coverage ratio k because checking more decisions takes time.
The expected value of an algorithmic decision with human supervision is the following:
Es = ar + (1-a)kr – (1-a)(1-k)p – cs(k) 
This equation picks up the fact that some errors are detected and rectified, and others are not. We subtract E from Es to obtain the extra expected value from supervision. After some algebra, supervision adds value when:
(r+p)(1-a)k > cs(k) 
Supervision only makes economic sense when its expected benefit (which depends on the probability that the algorithm has made a mistake, that this mistake is detected, and the net benefits from flipping a mistake into a correct decision) is larger than the cost of supervision.
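The supervised case can be sketched in the same spirit; the parameter values below are again made up for illustration, not taken from any real deployment:

```python
def expected_value_supervised(a, r, p, k, cost):
    """Expected value per decision when a supervisor checks a fraction k
    of decisions and fixes any error found, at supervision cost c_s(k)."""
    return a * r + (1 - a) * k * r - (1 - a) * (1 - k) * p - cost

def supervision_pays(a, r, p, k, cost):
    """Supervision is worthwhile when (r + p)(1 - a)k > c_s(k)."""
    return (r + p) * (1 - a) * k > cost

# Inaccurate algorithm, risky setting: checking half the decisions at
# cost 0.8 pays off, because errors are frequent and expensive.
risky = supervision_pays(a=0.6, r=1.0, p=5.0, k=0.5, cost=0.8)    # 1.2 > 0.8

# Accurate algorithm, benign setting: supervision is wasted effort.
benign = supervision_pays(a=0.95, r=1.0, p=1.0, k=0.5, cost=0.8)  # 0.05 < 0.8
```

The comparison mirrors the TV-recommender versus nuclear-power-station contrast earlier in the blog: the same supervision budget is worthwhile in one context and not the other.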
Here, I consider what happens when we start increasing n, the number of decisions being made by the algorithm.
The expected value is:
E(n) = nar + n(1-a)kr – n(1-a)(1-k)p 
And the costs are C(n).
How do these things change as n grows?
I make some assumptions to simplify things: the organisation wants to hold k constant, and the rewards r and penalties p remain constant as n increases.
This leaves us with two variables that change as n increases: a and C.
Based on this, and some calculus, we get the changes in expected benefits as we make more decisions as:
∂E(n)/∂n = kr + (a + n(∂a/∂n))(1-k)(r+p) – (1-k)p 
This means that as more decisions are made, the aggregated expected benefits grow in a way that is modified by changes in the marginal accuracy of the algorithm. On the one hand, more decisions mean scaled up benefits from more correct decisions. On the other, the decline in accuracy generates an increasing number of errors and penalties. Some of these are offset by human supervisors.
This is what happens with costs:
∂C/∂n = (∂C/∂Lds)(∂Lds/∂n) + (∂C/∂Ls)(∂Ls/∂n) 
As the number of decisions increases, costs grow because the organisation has to recruit more data scientists and supervisors.
Since each occupation’s wage is the marginal cost of its labour, and the extra labour needed per decision is the inverse of its marginal productivity (z = ∂n/∂L), this is the same as saying:

∂C/∂n = wds/zds + ws/zs 
The labour costs of each occupation are directly related to its salary, and inversely related to its marginal productivity. If we assume that data scientists are more productive than supervisors, this means that most of the increases in costs with n will be caused by increases in the supervisor workforce.
The expected value (benefits minus costs) from decision-making for the organisation is maximised with an equilibrium number of decisions ne where the marginal value of an extra decision equals its marginal cost:
kr + (a + n(∂a/∂n))(1-k)(r+p) – (1-k)p = wds/zds + ws/zs 
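The equilibrium can be found numerically. In the sketch below every functional form and parameter is invented for illustration (a toy linear decline in accuracy, made-up wages and productivities); it only shows the qualitative story of aggregate value rising, peaking and then falling as n grows:

```python
def total_value(n, r=1.0, p=2.0, k=0.5,
                w_ds=50.0, z_ds=1e6, w_s=0.1, z_s=10.0):
    """Aggregate expected value of n decisions, net of labour costs.
    Data scientists are far more productive than supervisors
    (z_ds >> z_s), so supervision dominates the wage bill at scale."""
    a = max(0.5, 0.95 - 2e-7 * n)  # toy 'entropic' decline in accuracy
    benefit = n * (a * r + (1 - a) * k * r - (1 - a) * (1 - k) * p)
    # Labour cost per decision: wage divided by marginal productivity,
    # with supervisors only needed for the fraction k of checked decisions.
    cost = n * (w_ds / z_ds + k * w_s / z_s)
    return benefit - cost

# Scan a grid of scales: net value peaks at an interior equilibrium n_e.
grid = range(0, 2_000_001, 10_000)
n_e = max(grid, key=total_value)
```

With these made-up numbers the optimum sits well inside the grid: beyond it, falling accuracy and the growing supervision bill eat up the gains from extra decisions.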
Above, I have kept things simple by making some strong assumptions about each of the situations being modelled. What would happen if we relaxed these assumptions?
Here are some ideas:
First, the analysis does not take into account that different types of errors (e.g. false positives and negatives, errors made with different degrees of certainty etc.) could have different rewards and penalties. I have also assumed certainty in rewards and penalties, when it would be more realistic to model them as random draws from probability distributions. This extension would help incorporate fairness and bias into the analysis. For example, if errors are more likely to affect vulnerable people (who suffer higher penalties), and these errors are less likely to be detected, this could increase the expected penalty from errors.
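One way to sketch this extension is a small Monte Carlo: draw the reward and penalty of each decision from a distribution rather than treating them as fixed. The samplers below are hypothetical placeholders, not estimates of any real penalty distribution:

```python
import random

def expected_value_mc(a, draw_reward, draw_penalty, trials=100_000):
    """Monte Carlo estimate of E when rewards and penalties are random.
    draw_reward and draw_penalty are zero-argument samplers."""
    total = 0.0
    for _ in range(trials):
        if random.random() < a:   # correct decision with probability a
            total += draw_reward()
        else:                     # error: pay a (possibly random) penalty
            total -= draw_penalty()
    return total / trials

random.seed(0)
# Fixed reward, but a heavier-tailed penalty: most errors are cheap,
# while an occasional one (say, hitting a vulnerable user) costs far more.
est = expected_value_mc(0.8, lambda: 1.0,
                        lambda: random.choice([0.5, 0.5, 0.5, 8.0]))
```

The estimate converges on the analytic value (0.8 × 1 − 0.2 × 2.375 ≈ 0.33 here), but the framework makes it easy to see how a fatter penalty tail drags the expected value down even when the average error looks cheap.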
All of the above assumes that algorithms err but humans do not. This is clearly not the case. In many domains, algorithms can be a desirable alternative to humans with deep-rooted biases and prejudices. In those situations, humans’ ability to detect and address errors is impaired, and this reduces the incentives to recruit them (this is equivalent to a decline in their productivity). Organisations deal with all this by investing in technologies (e.g. crowdsourcing platforms) and quality assurance systems (including extra layers of human and algorithmic supervision) that manage the risks of human and algorithmic fallibility.
Before, I assumed that the marginal penalties and rewards remain constant as the number of algorithmic decisions increase. This need not be the case. The table below shows examples of situations where these parameters change with the number of decisions being made:
| Parameter | Increases with more decisions | Decreases with more decisions |
| --- | --- | --- |
| Reward (r) | The organisation gains market power, or is able to use price discrimination in more transactions | The organisation runs out of valuable decisions to make |
| Penalty (p) | The organisation becomes more prominent and its mistakes receive more attention | Users get accustomed to errors |
Getting an empirical handle on these processes is very important, as they could determine if there is a natural limit to the number of algorithmic decisions that an organisation can make economically in a domain or market, with potential implications for its regulation.
 I use the term ‘algorithm’ in a restricted sense, to refer to technologies that turn information into predictions (and depending on the system receiving the predictions, decisions). There are many processes to do this, including rule-based systems, statistical systems, machine learning systems and Artificial Intelligence (AI). These systems vary on their accuracy, scalability, interpretability, and ability to learn from experience, so their specific features should be considered in the analysis of algorithmic trade-offs.
 One could even say that machine learning is the science that manages trade-offs caused by the impossibility of eliminating algorithmic error. The famous ‘bias-variance’ trade-off between fitting a model to known observations and predicting unknown ones is a good example of this.
 Some people would say that personalisation is undesirable because it can lead to discrimination and ‘filter bubbles’, but that is a question for another blog post.
 In a 2016 Harvard Business Review article, Ajay Agrawal and colleagues sketched out an economic analysis of machine learning as a technology that lowers the costs of prediction. My way of looking at algorithms is similar because predictions are inputs into decision-making.
 This includes personalised experiences and recommendations in e-commerce and social networking sites, or fraud detection and algorithmic trading in finance.
 For example, if YouTube shows me an advert which is highly relevant to my interests, I might buy the product, and this generates income for the advertiser, the video producer and YouTube. If it shows me a completely irrelevant or even offensive advert, I might stop using YouTube, or kick up a fuss in my social network of choice.
 This is what happened with the Google Flu Trends system used to predict flu outbreaks based on Google searches – people changed their search behaviour, and the algorithm broke down.
 In many cases, the penalties might be so high that we decide that an algorithm should never be used, unless it is supervised by humans.
 Unfortunately, care is not always taken when implementing algorithmic systems in high-stakes situations. Cathy O’Neil’s ‘Weapons of Math Destruction’ gives many examples of this, ranging from the criminal justice system to university admissions.
 Mechanisms for accountability and due process are another example of human supervision.
 Using Albert Hirschman’s model of exit, voice and loyalty, we could say that supervision plays the role of ‘voice’, helping organisations detect a decline in quality before users begin exiting.
 The appendix flags up some of my key assumptions, and suggests extensions.
 This decision could be based on how well similar adverts perform when matched with different types of videos, on demographic information about the people who watch the videos, or other things.
 The analysis in this blog assumes that the results of algorithmic decisions are independent from each other. This assumption might be violated in situations where algorithms generate self-fulfilling prophecies (e.g. logically, a user is more likely to click an advert she is shown than one she is not). This is a hard problem to tackle, but researchers are developing methods based on randomisation of algorithmic decisions to address it.
 This does not distinguish between different types of error (e.g. false positives and false negatives). I come back to this at the end.
 I consider the implications of making different assumptions about marginal rewards and penalties at the end.