
Writing Text for Video: Did Someone Say 'Autumn Aided Cap Shins'?


Not long after the invention of the modern computer came the notably incorrect assumption that computers would soon be competently processing natural-language data. People can typically communicate tolerably well by the time they're about 3 years old, so this didn't seem an unreasonable expectation: computers were already solving problems beyond the capabilities of the brightest of 3-year-olds. Speech comprehension—the faculty of sensing complex sound-pressure variations caused by another human being speaking and then assigning to them symbolic interpretations informed by the local culture and immediate context—turns out to be very difficult to teach to a computer.

With the explosive growth of streaming media in the past decade, substantial resources have been applied to the challenge of automatically captioning that video. Happily, some improvement has been made. The task of captioning is essentially this: Identify candidate speech sounds the speaker might be making; identify candidate words that fit the sequence of plausible sounds; choose the most probable sequence of candidate words; add appropriate punctuation; and segment the resulting text so it appears on screen in a way that can be easily and fluently read as it is spoken. Each of those tasks is difficult in its own right, and different automated captioning software tools are better at some than at others.
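To make those stages concrete, here is a minimal sketch of how such a pipeline might be organized. The function stubs and the character-budget segmentation rule are illustrative assumptions, not the internals of any particular captioning product; only the final segmentation stage is actually implemented here.

```python
# A schematic captioning pipeline. Each stub below stands in for a
# substantial subsystem in a real recognizer; only the final
# segmentation step is implemented, as a naive character-budget rule.

def recognize_phonemes(audio):       # stage 1: candidate speech sounds
    raise NotImplementedError

def propose_words(phonemes):         # stage 2: candidate words per sound span
    raise NotImplementedError

def best_word_sequence(candidates):  # stage 3: most probable word sequence
    raise NotImplementedError

def punctuate(words):                # stage 4: restore punctuation
    raise NotImplementedError

def segment_captions(text, max_chars=42):
    """Stage 5: break text into caption lines a viewer can read fluently.

    Real systems also respect phrase boundaries and speech timing;
    this sketch only enforces a per-line character budget.
    """
    lines, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            lines.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        lines.append(current)
    return lines

print(segment_captions(
    "identify candidate sounds, propose words, pick the most "
    "probable sequence, punctuate, and segment for the screen"))
```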

One of those tasks that has improved recently is the identification of phonemes—the vowel and consonant sounds of speech. This is a famously hard problem: Since everyone's voice is unique, speech recognizers need to be trained to learn the idiosyncrasies of each user. Improvement has come from two directions. On the client side, most of us carry small but powerful computers that have bad keyboards but decent microphones. Both mobile and desktop operating systems now feature voice-enabled assistants that continuously tune themselves to recognize your unique voice and the way you produce sounds with it. On the server side, we have classifiers: software that decides which category of similar, previously encountered data a new input belongs to. A server-side platform can compare your speech signal against enormous data sets of phonemic patterns and classify candidate sounds more accurately than its predecessors.
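As a toy illustration of the classifier idea, the sketch below assigns an incoming acoustic feature vector to whichever previously seen phoneme class it most resembles, using a nearest-centroid rule. The two-dimensional "features" and the phoneme labels are invented for the example; production systems use far richer features and neural classifiers trained on those enormous corpora.

```python
import numpy as np

# Toy training data: feature vectors (stand-ins for spectral
# measurements) labeled with the phoneme that produced them.
# Both the numbers and the labels are invented for illustration.
training = {
    "AA": np.array([[1.0, 0.2], [1.1, 0.1], [0.9, 0.3]]),  # "ah" as in "autumn"
    "T":  np.array([[0.1, 1.0], [0.2, 1.2], [0.0, 0.9]]),  # "t" as in "captions"
}

# Nearest-centroid classification: summarize each phoneme class by
# the mean of its examples, then label new input by the closest mean.
centroids = {label: vecs.mean(axis=0) for label, vecs in training.items()}

def classify(frame):
    return min(centroids, key=lambda label: np.linalg.norm(frame - centroids[label]))

print(classify(np.array([0.95, 0.25])))  # -> "AA"
print(classify(np.array([0.15, 1.05])))  # -> "T"
```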

Another of those tasks that has improved, and will continue to improve, is choosing the most probable sequence of words from the available candidates. This is traditionally done with a language model, which in its simplest form is a statistical analysis of how commonly different words occur together. The words "automated" and "captions" are more likely to appear together than the words "autumn aided cap shins." That likelihood is what language models capture.
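A toy bigram model makes the point. The miniature "corpus" below is invented for illustration, and the add-alpha smoothing is one simple choice among many; a real model is trained on billions of words. But the arithmetic is the same: score a word sequence as the product of conditional word probabilities, and the seen-together pair wins.

```python
from collections import Counter

# A miniature corpus, invented for this example.
corpus = ("automated captions help viewers . "
          "automated captions improve access . "
          "autumn leaves fall early .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2, alpha=0.1, vocab=len(unigrams)):
    # Add-alpha smoothing so unseen pairs get a small, nonzero probability.
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)

def sequence_score(words):
    score = 1.0
    for w1, w2 in zip(words, words[1:]):
        score *= bigram_prob(w1, w2)
    return score

# "automated captions" has been seen together; "autumn aided" has not.
print(sequence_score(["automated", "captions"]))  # comparatively high
print(sequence_score(["autumn", "aided"]))        # comparatively low
```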

The captioning of educational video is particularly ripe for driving speech recognition research. A school is a fairly closed ecosystem: we can easily identify which teacher is giving a lecture, and we can have that teacher train a custom speech model that is reused whenever she appears on video. Teachers at large research universities are bright minds recruited from all over the world, so their linguistic diversity is extreme; these custom-tuned speech models are critical for accurate captioning when your speakers come from such varied linguistic backgrounds.
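In practice, that reuse amounts to keeping a registry of trained per-speaker models and selecting the right one at captioning time. A minimal sketch, with invented IDs, file paths, and fallback, might look like this:

```python
# Hypothetical registry mapping a lecturer's ID to their trained
# acoustic-model artifact. IDs, paths, and the fallback are invented.
speaker_models = {
    "prof_chen":   "models/prof_chen_v3.am",
    "prof_okafor": "models/prof_okafor_v1.am",
}
DEFAULT_MODEL = "models/generic_lecture.am"

def model_for(video_metadata):
    """Pick the lecturer's custom model, or fall back to a generic one."""
    return speaker_models.get(video_metadata.get("lecturer"), DEFAULT_MODEL)

print(model_for({"lecturer": "prof_chen"}))  # -> models/prof_chen_v3.am
print(model_for({"lecturer": "visiting"}))   # -> models/generic_lecture.am
```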

Educational video typically includes technical vocabulary and jargon that a standard recognizer would struggle to identify. However, we have access to the visual aids the teacher used in the video (typically slides), and those aids can be mined for contextually relevant vocabulary. This is exactly what Microsoft Garage's Presentation Translator does. It is critical that these atypical jargon words be captioned accurately; get a lecture's key terms wrong and the captions actively mislead.
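A crude version of that mining step can be approximated in a few lines: pull the words out of the slide text, drop ordinary everyday words, and keep the rest as a candidate vocabulary list to bias the recognizer toward. The stop-word list and filtering rule below are invented for illustration and are certainly not Presentation Translator's actual method.

```python
import re

# A tiny stop-word list; a real system would use a full frequency
# lexicon to decide which slide terms are "ordinary" English.
COMMON = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
          "for", "with", "on", "this", "that", "we", "it"}

def jargon_from_slides(slide_text):
    """Return candidate technical terms worth boosting in the recognizer."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]+", slide_text)
    return sorted({w.lower() for w in words if w.lower() not in COMMON})

slides = """Phoneme classification with hidden Markov models.
            Viterbi decoding and n-gram smoothing."""
print(jargon_from_slides(slides))
# -> ['classification', 'decoding', 'hidden', 'markov', 'models',
#     'n-gram', 'phoneme', 'smoothing', 'viterbi']
```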

Universities are where many of the top researchers in speech recognition are working and where the need for accurate automatic captioning is desperate. It is a perfect example of where the triple missions of universities—to educate, to research, and to provide public service—demand cooperative action.

[This article appears in the June 2018 issue of Streaming Media magazine as "Autumn Aided Cap Shins."]
