Searchable Video: Accurate Searches, Precise Targeting
Great floods begin with the pitter-patter of tiny raindrops, and so it is with video on the Web. The tiny, jerky videos that we see on websites today will soon become a flood of full-frame, full-motion video integrated into many, if not most, corporate and entertainment websites. And just as Yahoo, Excite, and Alta Vista allow us to navigate billions of words of Web-based text, so will video search engines enable us to wade through millions of hours of Web-based streaming video.
A variety of techniques are employed in video search - text analysis, speech recognition, language processing, video analysis, and even face recognition. The available video search technologies use one or a combination of these methods - often tuned to the type of video to be searched. By bringing all of these techniques together, a truly effective search can be achieved. "You can now set up very complex searches," said Carlos Montalvo, vice president of products and marketing at Virage Inc., a video search provider with an impressive list of customers, including CNN, ABC News and CNET. "For example, 'Find me every clip where Max is on air, on screen, Carlos' voice is off-screen interviewing Max, and they're talking about the economy on the CNN channel.' You could find exactly that clip."
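To make Montalvo's example concrete, here is a minimal sketch of such a compound query in Python. It assumes each clip already carries metadata from the face, speaker, and topic analyses; the Clip record, its field names, and the find_clips function are hypothetical illustrations, not Virage's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    # Hypothetical per-clip metadata produced by earlier analysis passes.
    clip_id: str
    channel: str
    faces_on_screen: set = field(default_factory=set)
    speakers: set = field(default_factory=set)
    topics: set = field(default_factory=set)


def find_clips(clips, *, on_screen, off_screen_voice, topic, channel):
    """Return clips that satisfy every condition at once."""
    return [
        c for c in clips
        if c.channel == channel
        and on_screen in c.faces_on_screen
        and off_screen_voice in c.speakers              # voice is heard...
        and off_screen_voice not in c.faces_on_screen   # ...but face is not seen
        and topic in c.topics
    ]


clips = [
    Clip("a1", "CNN", {"Max"}, {"Max", "Carlos"}, {"economy"}),
    Clip("a2", "CNN", {"Max", "Carlos"}, {"Max", "Carlos"}, {"economy"}),
    Clip("a3", "ABC", {"Max"}, {"Max", "Carlos"}, {"economy"}),
    Clip("a4", "CNN", {"Max"}, {"Max", "Carlos"}, {"weather"}),
]

matches = find_clips(
    clips, on_screen="Max", off_screen_voice="Carlos",
    topic="economy", channel="CNN",
)
print([c.clip_id for c in matches])  # ['a1']
```

Only clip a1 survives: a2 shows Carlos on screen, a3 is on the wrong channel, and a4 is about the wrong topic.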
Video content is compelling in itself, and that kind of precision can create an even stronger pull on users - and an equally strong incentive for in-stream advertisers, who can use it to target messages at consumers who reveal their own commercial leanings with the search terms they use.
Text-Based Search Methods
Video search applications are relatively new, but the concept is based on older technology. In fact, sophisticated versions of the old text-based "find and replace" technology form the basic building blocks of today's cutting-edge video search engines.
Excalibur Technologies of Vienna, VA, an 18-year veteran of the pattern recognition business that has deployed text retrieval software in large government facilities and corporate intranets worldwide, is one of a handful of video search pioneers. David Nunnerly, senior VP in Excalibur's media services group, points out, "Even though we're doing video searching on the Internet, it's still going to be 99 percent text. At the end of the day, people enter a text query to find what they want. They want video back, but the actual index and searching is text. And if you don't have a good text engine, you really can't do good video searching."
A searchable text index can be derived from textual metadata already attached to the video, such as closed-captioning, or by employing a speech-to-text engine such as those developed by Lernout & Hauspie, IBM, Microsoft, and others. In controlled environments such as broadcast, training, or corporate communications, speech recognition developers claim up to 90 percent accuracy. "[Speech recognition] is not a perfect system, but if you're just trying to derive an index to be used for search, then it doesn't have to be," said Krishna Pendyala, co-founder and vice-chairman at MediaSite, another major player in the video search engine space. "It's not a dictation test. We are playing a clue game, and the question is how many clues can you gather into the video?"
Virage's Montalvo agrees that video search does not require perfect precision to be effective. "If we had a segment of video that said, 'Good afternoon, this is Sam Donaldson live on the steps of the Supreme Court, where today the justices found Clinton guilty of contempt,' and the only things I recognized in that clip were 'Sam Donaldson', 'Supreme Court', 'Clinton', and 'guilty', those are four search terms that I can now use to get back to that clip," he said. "I don't need a 100 percent accurate transcript to get to the clip."
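The point that an imperfect transcript still yields a usable index can be sketched with a simple inverted index, the standard structure behind keyword search. The Python below is an illustration only: the clip IDs and the deliberately garbled transcript are invented, and build_index and search are hypothetical names, not any vendor's API.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(transcripts):
    """Map each recognized word to the set of clips it appears in."""
    index = defaultdict(set)
    for clip_id, text in transcripts.items():
        for word in tokenize(text):
            index[word].add(clip_id)
    return index


def search(index, query):
    """Return the clips that contain every word in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented transcripts; clip-1 simulates speech-recognition errors
# ("after noon", "just is", "con tempt").
transcripts = {
    "clip-1": "good after noon this is sam donaldson live on the steps of "
              "the supreme court where to day the just is found clinton "
              "guilty of con tempt",
    "clip-2": "the supreme court will hear arguments next week",
}

index = build_index(transcripts)
print(sorted(search(index, "Donaldson Clinton guilty")))  # ['clip-1']
print(sorted(search(index, "supreme court")))             # ['clip-1', 'clip-2']
print(sorted(search(index, "contempt")))                  # []
```

The misrecognized word "contempt" is lost, but the correctly recognized terms are enough to get back to the clip.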
Once the text is acquired, it can then be analyzed to identify changes in topic. MediaSite's Pendyala explains, "Language-processing techniques score words based on their frequency and relevance compared to a large body of historical text. For example, the word 'space' in a movie about space travel will not get a high rating because it'll be used all over the place." Excalibur's Nunnerly offers another example: "If I type in 'Pandas in China,' our semantic network will allow me to differentiate between 'China' the country and 'china' tea cups," he said. "It understands that 'china' has different meanings and you can exploit that in different ways."
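One widely used way to score words in this spirit is TF-IDF (term frequency times inverse document frequency). The sketch below is not MediaSite's actual method; it assumes the comparison text is simply the other segments of the same program, and the segment strings are invented.

```python
import math
from collections import Counter


def tfidf_scores(segments):
    """Score each word in each segment by TF-IDF.

    A word that appears in every segment gets a score of zero,
    because log(n / n) == 0.
    """
    n = len(segments)
    tokenized = [seg.lower().split() for seg in segments]

    # Document frequency: in how many segments does each word occur?
    df = Counter()
    for words in tokenized:
        df.update(set(words))

    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({
            w: (tf[w] / len(words)) * math.log(n / df[w]) for w in tf
        })
    return scores


# Invented segments from a program about space travel.
segments = [
    "space crew launches into space",
    "space station docking in space",
    "space crew repairs engine",
]

scores = tfidf_scores(segments)
print(scores[0]["space"])                          # 0.0
print(scores[0]["launches"] > scores[0]["crew"])   # True
```

As in Pendyala's example, "space" scores zero because it is used all over the place, while a word unique to one segment, such as "launches", stands out.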
Audio Analysis: Playing It By Ear
Virage also offers a separate analysis engine that recognizes each distinct speaker in a video archive. Using this technology, individual news stories or corporate presentations can be distinguished by noting a change in speakers. "It turns out that every individual has a unique speech 'signature'," Montalvo said. "So at the same time that we're converting the speech to text, we are tagging the speaker as Sam or Carlos or whomever. You can try to alter your voice, but it doesn't matter. It still recognizes your speech and it's over 90 percent accurate."
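Once each stretch of audio carries a speaker tag, splitting the video at speaker changes is straightforward. This Python sketch assumes the speaker recognition has already run and produced (start time, speaker) pairs at fixed intervals; that input format and the segment_by_speaker function are assumptions made for illustration.

```python
from itertools import groupby


def segment_by_speaker(labeled_windows):
    """Group consecutive windows with the same speaker into segments.

    labeled_windows: list of (start_seconds, speaker) in time order.
    Returns a list of (speaker, first_start, last_start) tuples.
    """
    segments = []
    for speaker, group in groupby(labeled_windows, key=lambda w: w[1]):
        windows = list(group)
        segments.append((speaker, windows[0][0], windows[-1][0]))
    return segments


# Invented speaker tags, one every five seconds.
windows = [(0, "Sam"), (5, "Sam"), (10, "Carlos"), (15, "Carlos"), (20, "Sam")]

print(segment_by_speaker(windows))
# [('Sam', 0, 5), ('Carlos', 10, 15), ('Sam', 20, 20)]
```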
Speech is not the only information on the audio track that can be used as metadata to search the video. Virage offers an audio classification engine that recognizes unique audio signals - applause, laughter, a gunshot, or a commercial jingle. "These audio signals are often tied to a video event," Montalvo notes. "For example, a laugh track is always tied to a joke. Search for every laugh track and you can now search for every gag in a sit-com. Or if a particular product has an associated jingle, you can find that the jingle plays four times and know the ad played four times, because the jingle only plays with the ad."
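Montalvo's jingle example amounts to counting labeled audio events. The sketch below assumes an audio classifier has already emitted (timestamp, label) pairs; the labels and function names are hypothetical.

```python
from collections import Counter


def count_events(events):
    """Count how many times each audio label was detected."""
    return Counter(label for _, label in events)


def timestamps_for(events, label):
    """Return the times (in seconds) at which a given label was detected."""
    return [t for t, detected in events if detected == label]


# Invented classifier output: (seconds into the video, label).
events = [
    (12.0, "laughter"),
    (47.5, "jingle"),
    (95.2, "laughter"),
    (301.0, "jingle"),
    (640.8, "jingle"),
    (912.3, "jingle"),
]

counts = count_events(events)
print(counts["jingle"])                    # 4
print(timestamps_for(events, "laughter"))  # [12.0, 95.2]
```

Four jingle detections imply four plays of the ad, and the laughter timestamps point to where the gags are.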
Video Algorithms: Analyzing Video Patterns
The first step in analyzing visual information is to segment the video into a manageable sequence of discrete clips represented by a "storyboard" of video thumbnails. Using pixel-analysis algorithms that compare each frame to the one before it, clips are first segmented based on scene transitions such as cuts, fades and dissolves. Excalibur's Nunnerly explains, "We have an algorithm looking for 'cut' thresholds, [pertaining to] how many of the pixels have changed, and 'fade' thresholds, looking for changes in the pixels in terms of luminance and brightness. Each one of these effects has an algorithm behind it that is comparing pixels mathematically and running in real time."
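The threshold idea Nunnerly describes can be sketched in a few lines of Python. This is a simplified illustration, not Excalibur's algorithm: frames are assumed to be flat lists of grayscale values from 0 to 255, and the threshold values and function names are invented for the example.

```python
def changed_fraction(prev, curr, pixel_delta=30):
    """Fraction of pixels whose value changed by more than pixel_delta."""
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_delta)
    return changed / len(curr)


def mean_luminance(frame):
    return sum(frame) / len(frame)


def detect_transitions(frames, cut_threshold=0.5, fade_step=5.0, fade_length=3):
    """Return a list of ("cut" | "fade", frame_index) events.

    A cut is a frame where most pixels change at once. A fade is a run of
    fade_length frames whose average brightness keeps drifting the same way.
    """
    events = []
    run, direction = 0, 0
    for i in range(1, len(frames)):
        if changed_fraction(frames[i - 1], frames[i]) >= cut_threshold:
            events.append(("cut", i))
            run, direction = 0, 0
            continue
        delta = mean_luminance(frames[i]) - mean_luminance(frames[i - 1])
        step = (delta > fade_step) - (delta < -fade_step)  # +1, -1, or 0
        if step != 0 and step == direction:
            run += 1
        elif step != 0:
            run, direction = 1, step
        else:
            run, direction = 0, 0
        if run == fade_length:
            events.append(("fade", i - fade_length + 1))
    return events


# Tiny invented "video": 4-pixel frames with a hard cut, then a fade up.
frames = [
    [200] * 4, [200] * 4,             # steady bright scene
    [20] * 4, [20] * 4,               # hard cut to a dark scene at frame 2
    [40] * 4, [60] * 4, [80] * 4,     # gradual brightening from frame 4
    [80] * 4,
]

print(detect_transitions(frames))  # [('cut', 2), ('fade', 4)]
```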
Different types of video content reflect different segmentation patterns, and the sensitivity of the algorithms can be adjusted accordingly. MediaSite's Pendyala offers an example: "When you look at football, it's very easy to segment it - the entire motion comes to a stop before a play. In that case, you do motion analysis, and when it stops… that's the beginning of a play."
Excalibur goes a step further by offering pre-defined algorithm settings corresponding to a number of different genres of video. "There aren't a lot of cuts within sporting events; there are a lot of big, sweeping pans as they follow people running down the field," said Dan Agan, vice president of marketing. "But in a music video or in a drama, to build suspense, there are a lot of cuts. So by increasing or decreasing the sensitivity of various algorithms you enhance the accuracy across those various genre types. You can think of them like pre-sets. It's a time-saving thing for customers."
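Genre pre-sets of this kind can be pictured as a small table of threshold values. In the hypothetical Python sketch below, the genre names and numbers are invented; the assumption is that sports footage gets a higher cut threshold so that sweeping pans, which change many pixels at once, are not mistaken for cuts.

```python
# Hypothetical pre-sets: the fraction of pixels that must change
# between two frames before the change is counted as a cut.
PRESETS = {
    "sports": {"cut_threshold": 0.8},       # tolerate big, sweeping pans
    "music_video": {"cut_threshold": 0.4},  # catch rapid-fire edits
    "default": {"cut_threshold": 0.6},
}


def count_cuts(change_fractions, genre="default"):
    """Count frame-to-frame changes that exceed the genre's threshold."""
    threshold = PRESETS.get(genre, PRESETS["default"])["cut_threshold"]
    return sum(1 for f in change_fractions if f >= threshold)


# Invented per-frame change fractions for one stretch of video.
changes = [0.1, 0.7, 0.2, 0.9, 0.5, 0.65]

print(count_cuts(changes, "sports"))       # 1
print(count_cuts(changes, "music_video"))  # 4
print(count_cuts(changes))                 # 3
```

The same footage yields one, four, or three cuts depending on the pre-set, which is why matching the setting to the genre matters for accuracy.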
Other visual events, such as zooms and lighting changes, can also be used to segment the video. Excalibur's Nunnerly offers this example: "Imagine you have ingested and analyzed video of your nightly news. And in that video, often there is an image of the anchor with an insert in the corner of a graphic indicating a news story. You can use that as a clue, and what you'll get back could be that same anchor, but with all the different graphics for different stories. So you've essentially established a way of finding the in-point for different stories."
Virage recently announced two new analysis techniques - the ability to read on-screen text that is keyed over the video (such as a person's name, sports times and scores, or a show's title) and the ability to recognize faces. The latter is particularly useful for broadcast, corporate communications, sales and training, and the conference industry, where the majority of the video content consists of talking heads. "Face-recognition algorithms look at the unique geometry and feature vectors of the eyes and nose and that portion of the face," Montalvo said. "So it's not really impacted by changes in facial hair, whether it be longer hair, shorter hair, or beards, or even aging, because that part of your face doesn't change that much. It's actually quite accurate."
Video Search in Action
Broadcast news organizations were among the earliest adopters of video search capabilities on the Web. Users enter text keywords to quickly locate stories they're interested in - a convenient way to personalize their video news and watch it whenever they choose. But does this added convenience justify all the effort - not to mention expense - of developing or licensing an effective video search engine? The answer lies in the much-anticipated "convergence" between television and the Internet, and its potential for stimulating e-commerce.
CNET's model may foreshadow how converged media will work. The online content pioneer has logged more hours of technology-related television material than any other network. When you run a search for Palm Pilot, you not only get reviews, user groups, and buyer's guides; you also get video from CNET's vast library. The video search capability turns CNET's television material into tightly focused, personal, and interactive content on the website.
Developers of video search engines are betting that the ability to target advertising based on search criteria will drive demand for their products. "If the Internet is going to continue to grow based on an ad model, it's going to be based on targeted ads, selective ads, and ads that are germane to the community, not the interstitial ads that we've come to know from the broadcast space," Montalvo said. "If a user searches for a clip based on a particular keyword, show him this ad. If he gets to the same clip but based on a different keyword, show him a different ad. Individuals that use searchable video can pull the video they want, and depending on how they pull that video, the content providers can target and select the advertising with that pull. It becomes less intrusive and more engaging."
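The keyword-driven targeting Montalvo describes can be sketched as a simple lookup from search terms to ads. The Python below is a hypothetical illustration; the keywords, ad names, and pick_ad function are invented.

```python
# Hypothetical mapping from search keywords to ad campaigns.
AD_BY_KEYWORD = {
    "palm pilot": "handheld-organizer-ad",
    "economy": "brokerage-ad",
}


def pick_ad(query, default="house-ad"):
    """Choose an ad based on the keyword the viewer searched for."""
    q = query.lower()
    for keyword, ad in AD_BY_KEYWORD.items():
        if keyword in q:
            return ad
    return default


# Two viewers can reach the same clip through different searches
# and be shown different ads.
print(pick_ad("Palm Pilot review"))   # handheld-organizer-ad
print(pick_ad("economy outlook"))     # brokerage-ad
print(pick_ad("koi pond"))            # house-ad
```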
The Corporate Angle
As important as video search may be to the expansion of e-commerce on the Web, demand for video search capabilities on corporate intranets may dwarf demand on the public Internet. According to Virage, Boeing has archived over 4 million hours of video and produces over 100,000 new hours each year. Lockheed Martin has over 300,000 hours in training and communications alone. It doesn't take much imagination to appreciate the value of making these massive video archives easily searchable and accessible to thousands of corporate employees.
For example, companies like Coca-Cola, Procter & Gamble and General Motors produce up to 10,000 hours of focus group footage each year. Up to three times as much money is then spent analyzing the video as went into production. Today, that's a human process - people watching the videotape, shuttling back and forth, seeing what people are saying about their products. With video search, individual product managers and engineers can quickly see what they want, when they want.
Fast Forward
The current applications for searchable video are only the tip of the iceberg. As video becomes a widespread, easily accessible resource on the Internet, the potential for companies that provide search technologies - and that provide searchable content - will grow dramatically.
The foreseeable applications are many. You may have decided to build that "do-it-yourself" koi pond in the backyard. Why not log on to the Web and search for a "how to" video? Or imagine you're in the kitchen, wrestling with the Coq au Vin. You might like to see a video that shows exactly what Julia Child means by "flame the chicken with brandy." We haven't even begun to explore the myriad ways in which video-on-demand will be used. But as Montalvo observes, "All things analog will go digital, and content will become currency. Like with any currency, if you can't find it, you don't have it."