Google Video Intelligence Analyzes Images in Videos
Google wants to help solve one of the biggest problems facing every organization that generates video: there are too many files and not enough time or people to extract the value sitting in the content. Right now, the idea of identifying every clip where there's a cat in the hat is very daunting.
"(My media clients) have hundreds of thousands or millions of video assets. They're either doing live, recorded, or archive content, and they need to be able to process that and/or repurpose and make some business decisions about that content," says Neil Anderson, CEO of New Media Research Consultancy, which provides media content workflow solutions to many large media companies. "There's never enough staff to watch this content in real time."
The Metadata Problem
Until now, creating detailed video metadata has required a human to make timecode-based annotations in real time, which is both time-consuming and expensive. Searching for an object in a video clip requires a list of all the content within the clip. There is no search bar for video content: finding cats meant an editor or viewer had to scan through all of a company's content, which is impossible at the volume of content now being created. Automated image recognition is part of Google Cloud's new Video Intelligence API. It can scan video content and create image tags associated with timecodes.
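The value of timecode-based tags is that they make video searchable like text. A minimal sketch of the idea, with hypothetical clip IDs and tags standing in for the output of an automated recognition pass:

```python
from collections import defaultdict

def build_tag_index(annotations):
    """Build a searchable index mapping each tag to the clips and
    timecodes where it appears. `annotations` is a list of
    (clip_id, tag, start_seconds) tuples."""
    index = defaultdict(list)
    for clip_id, tag, start in annotations:
        index[tag.lower()].append((clip_id, start))
    return index

def search(index, tag):
    """Return every (clip_id, start_seconds) where `tag` was detected."""
    return index.get(tag.lower(), [])

# Hypothetical recognition output for two clips.
annotations = [
    ("clip_001", "cat", 12.5),
    ("clip_001", "hat", 12.5),
    ("clip_042", "cat", 310.0),
]
index = build_tag_index(annotations)
print(search(index, "cat"))  # → [('clip_001', 12.5), ('clip_042', 310.0)]
```

Instead of a human scanning hours of footage, a query for "cat" jumps straight to every timecode where one was tagged.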
Cantemo is using Google Cloud to offer video image recognition in their new product iconik, a cloud-based media asset management (MAM) platform. "Ingested assets are analyzed using Google Video Intelligence (GVI) API. We collect the timecode-based metadata and associate them with the asset, (to) make it searchable for users," says Parham Azimi, CEO of Cantemo. "Customers (can) search for specific items within their assets which would otherwise be undiscoverable. This also makes it possible for users to detect specific shots and scenes within a video." Image recognition is only available for archived video content at this time.
Separating Signal from Noise
The GVI API has been trained on content from millions of YouTube videos. Each frame of a video is analyzed and compared to a library of images. Classifications by GVI are broken down into video topic, shot information, and frame analysis. Frame analysis is just what it sounds like—every frame is looked at to identify content based on common objects or activities. Shot analysis identifies the most common objects within all the frames of an edit. Shot analysis can take into account activities that may be happening to the objects within a frame; a single frame may be a person standing, while multiple frames could be a person running or exercising. Video-level annotations can be used to identify the subject of an entire video. It's optimized for short-form content of 3 to 5 minutes, according to Google product manager Ram Ramanathan.
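The three annotation levels described above can be pictured as one response grouped three ways. The sketch below uses plain dicts as a stand-in for the API response; the field names are illustrative (the real client library returns objects with a similar, but not identical, shape):

```python
def summarize_labels(annotation):
    """Group GVI-style label annotations by the three levels the
    article describes: whole-video labels, per-shot labels, and
    per-frame labels with timestamps and confidence scores."""
    return {
        "video": [l["description"] for l in annotation.get("segment_labels", [])],
        "shot": [l["description"] for l in annotation.get("shot_labels", [])],
        "frame": [(f["time"], l["description"], f["confidence"])
                  for l in annotation.get("frame_labels", [])
                  for f in l["frames"]],
    }

# Stub response: a single frame shows a person, the shot reads as
# "running", and the whole video is classified as "exercise".
stub = {
    "segment_labels": [{"description": "exercise"}],
    "shot_labels": [{"description": "running"}],
    "frame_labels": [{"description": "person",
                      "frames": [{"time": 1.5, "confidence": 0.92}]}],
}
summary = summarize_labels(stub)
print(summary["video"])  # → ['exercise']
```

The stub mirrors the article's example: a single frame may show a person standing, while the shot and video levels capture that the person is running or exercising.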
A low-resolution 720p scan is created, and Google Cloud detects all shots and provides individual tags for a piece of video. In this way the production-grade files can stay where they are, says Mikael Wahlberg, VP of product development at Cantemo. The service returns suggestions with a confidence score that users can approve or reject; the score allows a human to go back, check the analysis, and remove inaccurate tags.
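One natural way to use those confidence scores is to triage suggested tags before a human ever sees them. The thresholds below are illustrative assumptions, not values from the article or the API:

```python
def triage_tags(tags, auto_accept=0.9, reject_below=0.5):
    """Split suggested (tag, confidence) pairs three ways:
    high-confidence tags are accepted automatically, very
    low-confidence ones are dropped, and the rest are queued
    for human review. Thresholds are hypothetical."""
    accepted, review, rejected = [], [], []
    for tag, confidence in tags:
        if confidence >= auto_accept:
            accepted.append(tag)
        elif confidence < reject_below:
            rejected.append(tag)
        else:
            review.append((tag, confidence))
    return accepted, review, rejected

# Hypothetical suggestions returned for one clip.
tags = [("cat", 0.97), ("hat", 0.72), ("dog", 0.31)]
accepted, review, rejected = triage_tags(tags)
print(accepted, review, rejected)  # → ['cat'] [('hat', 0.72)] ['dog']
```

Only the middle band reaches a reviewer, which is the point of attaching a score to every tag: humans spend their time on the uncertain cases instead of re-watching everything.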
Training
How will something specific, like a certain type of car, be recognized if the intelligence is not already in the GVI library? This is where machine learning is applied, both so the image recognition can identify content in the first place and so it gets better at identifying images in the future. For visuals outside of known objects and activities, users will need to train the application to identify specific images, says Anderson. "This is the problem with machine learning. By its very nature, it has to learn in order to recognize something."
To automatically recognize logos, for example, all the versions of the logo need to be uploaded into the system first. Users then have to authenticate which logos are correct, says Anderson. "None of these algorithms are ever 100% accurate and the only way to make them better is to train them."
"Being able to custom train machine intelligence platforms specifically for customers to recognize objects that are relevant for them will be an important part of our offering in the future," says Azimi. "Analyzing a video of a sneaker will result in tags such as 'shoe,' 'sneaker,' 'sports shoe,' etc. But a custom-trained system could say the exact model and manufacturer of that sneaker."
Cloud vs. On-Prem
Anderson says 95% of his clients are using on-prem MAMs. Will media companies want to put their content in the cloud? "Some are interested, but I can tell you MTV is not going to put all their content in the cloud," he says. One concern his customers have when considering GVI (or other image recognition services) is whether they want to make the data in their intellectual property available.
"The service provider—i.e. Google, Amazon, and/or IBM Watson—(owns the data). The customer will get the metadata back, but they're paying the service provider to train their own data sets," says Anderson. "So the challenge is this: The only people that have the MTV archive is MTV." The question for MTV or any other media company is what can be negotiated in terms of data ownership and what tradeoff makes sense.
Looking Ahead
There are many potential uses for image recognition: content personalization for viewers, better ROI for media companies that can identify content within their media libraries, the ability to flag inappropriate content, identifying highlights for rough cuts, or simply finding the highest-quality images. Cantemo and Google hope to make it entirely possible to find every clip in a library that contains a specific image type.
Cantemo's iconik is in beta now and will be released at IBC in the fall. The GVI API is open for developers to use.