Speech recognition remains a challenging problem in AI and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition (ASR) system that the company says enables “robust” transcription in multiple languages, as well as translation from those languages into English.
Numerous organizations have developed highly capable speech recognition systems, which are at the heart of software and services from tech giants like Google, Amazon and Meta. But what makes Whisper different, according to OpenAI, is that it has been trained on 680,000 hours of multilingual and “multitask” data collected from the Internet, which has led to better recognition of unique accents, background noise, and technical jargon.
“The primary intended users of [the Whisper] models are AI researchers who study the robustness, generalization, capabilities, biases and limitations of the current model. However, Whisper is also potentially very useful as an automatic speech recognition solution for developers, especially for English speech recognition,” OpenAI wrote in the GitHub repo for Whisper, from which different versions of the system can be downloaded. “[The models] display strong ASR results in ~10 languages. They may exhibit additional capabilities…if tailored for certain tasks such as voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in this area.”
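For developers curious what that looks like in practice, the GitHub repo ships a Python package alongside the model weights. The sketch below, based on the usage described in the openai/whisper README, wraps model loading and transcription in a small helper; the model name defaults to one of the published sizes, and the `translate` flag maps to Whisper's built-in translate-to-English task. Function and parameter names here are illustrative, not part of the package itself.

```python
def transcribe(path, model_name="base", translate=False):
    """Transcribe an audio file with Whisper.

    path       -- path to an audio file (wav, mp3, flac, ...)
    model_name -- one of the released sizes, e.g. "tiny", "base", "small"
    translate  -- if True, translate the speech into English instead of
                  transcribing it in the source language
    """
    # Heavyweight import kept inside the function; requires
    # `pip install openai-whisper` plus ffmpeg on the system.
    import whisper

    model = whisper.load_model(model_name)
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)
    # transcribe() returns a dict with the full text, the detected
    # language, and per-segment timestamps.
    return result["text"], result["language"]
```

A call might then look like `text, lang = transcribe("meeting.wav", translate=True)`, where `"meeting.wav"` is a placeholder for whatever recording you have on hand.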
Whisper has its limitations, especially when it comes to text prediction. Because the system is trained on a large amount of “noisy” data, OpenAI warns that Whisper may include words in its transcripts that weren’t actually spoken — possibly because it is simultaneously trying to predict the next word in the audio and to transcribe the recording itself. In addition, Whisper does not perform equally well across languages, with higher error rates for speakers of languages that are not well represented in the training data.
Despite all this, OpenAI sees Whisper’s transcription capabilities being used to improve existing accessibility tools.
“While Whisper models cannot be used directly for real-time transcription, their speed and size suggest that others may be able to build applications on top of them that enable near-real-time speech recognition and translation,” the company continues on GitHub. “The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models could have real economic implications… [W]hile we hope the technology will primarily be put to good use, making automatic speech recognition technology more accessible could also allow more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy of affordable automatic transcription enable high-volume transcription and translation of audio communications.”
The release of Whisper is not necessarily indicative of OpenAI’s future plans. While the company is increasingly focused on commercial endeavors such as DALL-E 2 and GPT-3, it also pursues several purely theoretical lines of research, including AI systems that learn by watching videos.