Crowdsourcing voices to train speech recognition software
Most of the software and voice data that power the personal assistants in our smart devices is locked up in privately owned systems. Getting access to good-quality data takes time and money. As a result, the cost of developing speech recognition and other software that relies on voice data is prohibitively high, giving a few companies an effective monopoly on these services. There is also little transparency about what data has been used to develop smart assistants, meaning that certain populations can remain underserved. These limitations make the technology less effective for some groups, such as non-native speakers with accents, and for languages spoken by small populations.
Common Voice is a Mozilla initiative that addresses this challenge by developing the world’s first open-source voice dataset and a speech recognition engine called Deep Speech. The concept is simple. Common Voice crowdsources voice contributions through an online platform where users are invited to record themselves reading sentences. All sentences are sourced from texts that are under a Creative Commons license, to ensure they can be freely reused by researchers and entrepreneurs in the future. Users can also listen to and validate contributions from others, to ensure that the data is of high enough quality to train an AI algorithm. The market’s leading voice technologies are powered by deep learning algorithms, which can require up to 10,000 hours of validated data to train.
As of January 2020, users had recorded almost 2,500 hours of their voices in 29 different languages for Common Voice. The aim of the project is to ensure that the data used to train voice recognition tools represents the full diversity of real people’s voices. Each data entry contains an audio file with the linked text, as well as any associated metadata about the contributor, if it is available. By making the datasets open, Mozilla is creating opportunities for a broader set of researchers, developers and public sector actors to develop voice technologies that can benefit a wider range of people. This accessibility can help to incentivise innovation and healthy competition for better tools. Mozilla released the first version of Deep Speech in 2017.
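For readers who want a concrete picture of such a data entry, the following Python sketch models one record and the peer-validation step described earlier. It is an illustration only: the field names, the vote counters and the two-approval threshold are assumptions made for this example, not the actual Common Voice schema or validation rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContributorMetadata:
    # Hypothetical optional fields; a contributor may leave any of them blank.
    age_group: Optional[str] = None
    gender: Optional[str] = None
    accent: Optional[str] = None


@dataclass
class VoiceEntry:
    audio_path: str          # the recorded audio file
    sentence: str            # the text the contributor read aloud
    language: str
    duration_seconds: float
    up_votes: int = 0        # listeners who judged the recording correct
    down_votes: int = 0      # listeners who judged it incorrect
    metadata: Optional[ContributorMetadata] = None  # present only if provided


def is_validated(entry: VoiceEntry, min_up_votes: int = 2) -> bool:
    """Assumed rule: enough approvals, and approvals outnumber rejections."""
    return entry.up_votes >= min_up_votes and entry.up_votes > entry.down_votes


def validated_hours(entries: List[VoiceEntry]) -> float:
    """Total duration, in hours, of the entries that pass validation."""
    return sum(e.duration_seconds for e in entries if is_validated(e)) / 3600.0
```

Summing validated hours in this way is how progress could be tracked against a training target such as the 10,000 hours mentioned above.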
Common Voice is an example of how a collective intelligence (CI) approach to data collection, one that emphasises diversity and open access, can improve the development of AI, which in turn can be used for other CI purposes.