AI model from OpenAI automatically recognizes speech and translates it to English

Benj Edwards / Ars Technica

On Wednesday, OpenAI released a new open source AI model called Whisper that recognizes and translates audio at a level approaching human recognition ability. It can transcribe interviews, podcasts, conversations and more.

OpenAI trained Whisper on 680,000 hours of audio data and matching transcripts in 98 languages collected from the Internet. According to OpenAI, this open-collection approach has led to “improved robustness to accents, background noise and technical language.” It can also detect the spoken language and translate it to English.

OpenAI describes Whisper as an encoder-decoder transformer, a type of neural network that can use context obtained from input data to learn associations that can then be translated into the model’s output. OpenAI presents this overview of how Whisper works:

Input audio is split into 30-second chunks, converted to a log-Mel spectrogram, and then passed to an encoder. A decoder is trained to predict the associated text caption, mixed with special tokens that drive the single model to perform tasks such as language identification, sentence-level timestamps, multilingual speech transcription, and speech translation into English.
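To make the first step of that overview concrete, here is a minimal sketch of splitting audio into 30-second chunks, written in plain Python. It is not OpenAI's code. The 16 kHz sample rate and the 160-sample hop between spectrogram frames are assumptions taken from our reading of the reference implementation, not from OpenAI's overview, and the function names are hypothetical. The sketch stops before the log-Mel spectrogram itself and only reports how many frames such a spectrogram would contain.

```python
# Simplified illustration of the "30-second chunks" step.
# Assumed constants (not stated in the article): 16 kHz audio, 160-sample hop.
SAMPLE_RATE = 16_000   # samples per second (assumption)
CHUNK_SECONDS = 30     # chunk length given in OpenAI's overview
HOP_LENGTH = 160       # samples between spectrogram frames (assumption)

CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk


def split_into_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Split a sequence of samples into fixed-size chunks.

    The final chunk is padded with silence (zeros) so that every chunk
    has exactly `chunk_samples` entries.
    """
    chunks = []
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        chunks.append(chunk)
    return chunks


def frames_per_chunk(chunk_samples=CHUNK_SAMPLES, hop_length=HOP_LENGTH):
    """Number of spectrogram frames one chunk produces at the given hop."""
    return chunk_samples // hop_length
```

Under these assumptions, a 70-second recording yields three chunks (the last one mostly silence), and each chunk corresponds to 3,000 spectrogram frames. The real transcription loop may position its windows differently, so treat this as an illustration of the idea only.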


By open sourcing Whisper, OpenAI hopes to introduce a new foundational model that others can build on in the future to improve speech processing and accessibility tools. OpenAI has a significant track record in this area. In January 2021, OpenAI released CLIP, an open source computer vision model that has arguably ignited the recent era of rapidly advancing image synthesis technology such as DALL-E 2 and Stable Diffusion.

At Ars Technica, we tested Whisper using code available on GitHub and gave it multiple samples, including a podcast episode and a particularly hard-to-understand section of audio from a phone interview. While it took some time to run on a standard Intel desktop CPU (the technology doesn’t work in real time yet), Whisper did a good job of transcribing the audio into text via its Python demonstration program, far better than some AI-powered audio transcription services we’ve tried in the past.

Example console output from OpenAI's Whisper Demonstrator as it transcribes a podcast.

Benj Edwards / Ars Technica

With proper setup, Whisper can easily be used to transcribe interviews and podcasts and possibly translate podcasts produced in non-English languages into English for free. That’s a powerful combination that could ultimately disrupt the transcription industry.
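For readers who want to try this, the following is a minimal sketch of transcribing or translating a file in Python. It assumes the `whisper` package from OpenAI's GitHub repository and FFmpeg are installed; the "base" model size, the helper names, and the file name in the usage note are our own illustrative choices, not a recommended configuration.

```python
def pick_task(to_english: bool) -> str:
    """Return the task name to request: translate to English or transcribe."""
    return "translate" if to_english else "transcribe"


def transcribe_file(path: str, model_name: str = "base", to_english: bool = False) -> str:
    """Run Whisper on an audio file and return the recognized text.

    Assumes the `whisper` package is installed; it is imported here so
    the rest of this sketch can be loaded without it.
    """
    import whisper

    model = whisper.load_model(model_name)                   # downloads weights on first use
    result = model.transcribe(path, task=pick_task(to_english))
    return result["text"].strip()
```

Calling `transcribe_file("interview.mp3")` (a hypothetical file) would return a transcript in the spoken language, while passing `to_english=True` asks the model for an English translation instead. Larger model sizes are generally slower, which matters on a desktop CPU like the one we used.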


As with almost every major new AI model today, Whisper brings both benefits and the potential for abuse. On Whisper’s model card (under the “Broader Implications” section), OpenAI warns that Whisper could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes.”
