Concept: Meta has introduced Audio-Visual Hidden Unit BERT (AV-HuBERT), a speech recognition framework that understands speech by analyzing both sound and the movement of the speaker’s lips. Meta claims that AV-HuBERT achieves recognition accuracy 75% higher than other audiovisual speech recognition systems trained on the same number of transcriptions.
Nature of Disruption: AV-HuBERT leverages self-supervised machine learning. This multimodal framework learns to detect language from a combination of audio and lip-movement inputs. Unlike supervised learning, in which algorithms are trained on labeled example data until they can determine the underlying correlations between the examples and certain outputs, self-supervised learning classifies unlabeled data by analyzing it and learning from its inherent structure. Meta claims that the framework can also capture complex correlations between the two data types by merging visual cues, such as the movement of the lips and teeth during speech, with audio information. According to Meta, AV-HuBERT recognizes a person’s speech 50% better than audio-only models when loud music or noise is playing in the background. When voice and background noise are equally loud, AV-HuBERT achieves a word error rate (WER) of 3.2%, compared to 25.5% for the previous best multimodal model. Meta adds that AV-HuBERT uses only a tenth of the labeled data, making it potentially useful for languages with limited audio data.
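To make the WER figures above concrete, the metric can be sketched in a few lines of Python. This is the standard definition of word error rate (word-level edit distance divided by reference length), not Meta's evaluation code; the `wer` function and the example sentences are illustrative assumptions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Computed as the Levenshtein distance over words, via dynamic programming.
    Illustrative sketch only, not Meta's actual evaluation code.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One missed word out of a five-word reference gives a 20% WER.
print(wer("turn the music down please", "turn the music down"))  # 0.2
```

A WER of 3.2% therefore means roughly 3 word-level errors per 100 reference words, versus about 25 per 100 for the previous best multimodal model.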
Outlook: According to Meta, AV-HuBERT could open new opportunities for building conversational models for low-resource languages, such as Susu in the Niger-Congo family, because it requires less labeled data for training. It could also be used to develop speech recognition systems for people with speech impairments, to detect deepfakes, and to generate realistic lip motion for virtual reality avatars. In the future, AV-HuBERT has the potential to improve the performance of speech recognition technologies in noisy everyday situations, such as a party or a crowded street market. The technique could also benefit smartphone assistants, AR glasses, and smart speakers with cameras.
This article was originally published in Verdict.co.uk