Tutorial: High-Touch Encoding With Microsoft Expression Encoder 2
A couple of months ago, I was chosen as a member of the inaugural Streaming Media All-Star team. There was a fun video montage presenting images of each of us on baseball cards, with a voice-over by Colorado Avalanche announcer Alan Roach. Good stuff, but the compression wasn’t quite up to my standards for this rare intersection of compression obsession and personal vanity. So I asked if I could take my own whack at it.
This article is a detailed walkthrough of that process, as an example of what is sometimes called "high-touch" compression. This is the opposite of the high-volume automated encoding that a site such as YouTube does, and it is much more like the labor-intensive, quality-critical workflow that gets used for A-list DVD and Blu-ray titles. While not appropriate for most compression projects, for clips that will be seen widely, it can pay off. If something is going to be seen tens of thousands of times, even a slight improvement in quality can make a difference in communicating its message, and a 10% savings in bitrate can pay off in total bandwidth costs.
Specifically, this article shows off some of the encoding features in Microsoft’s new Expression Encoder 2: high-quality compression, authoring with Silverlight skins, and publishing to the Silverlight Streaming service.
The Source
The most labor-intensive encoding projects are usually challenging because of source issues, not codec tweaking. This was no exception to that rule. The first thing I noticed viewing the original clip was that the background graphics and a few of the animations were interlaced.
While a normal deinterlacing filter would have fixed that problem, it can introduce additional softness, as seen below. The heavyweight motion-adaptive deinterlacers available for tools such as AviSynth can be finicky to configure and extremely slow. And in the end, nothing beats getting the source fixed in the first place. Compression is the art of getting output as close to the original as possible with the available bits; a higher-quality source can provide a much bigger improvement than anything in the codec.
I contacted the post house, and they fixed the background interlacing. It was just a matter of properly flagging the stock footage source as interlaced in After Effects, a common error with stock footage provided in the PNG or Animation codecs. They re-rendered it for me as a lossless RGB PNG-codec QuickTime .mov file (PNG is easier to edit and generally produces smaller files than the old default Animation codec). However, two shots snuck through in which one layer was still interlaced.
I didn’t want to wait for another disc, so I dove into After Effects (all difficult preprocessing jobs seem to wind up in After Effects). Traditional deinterlace methods distorted the text on the cards too much, so for the interlaced frames I used the "Reduce Interlace Flicker" filter with a softness of one to blend the two fields together. However, the softness increase from that filter wound up causing a slight visual discontinuity when it kicked in mid-shot. So I broke out the two sequences with interlacing into layers, and then used a five-frame cross-dissolve from the original progressive frames to the first interlaced frame, hiding the slight loss of focus. Both interlaced sequences ended on a hard cut, so I was able to switch back to the original video without a transition.
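For readers who think better in code, here is a minimal Python sketch of the two ideas in that fix: blending fields and easing into the blended version over five frames. It is not how After Effects implements either operation. The function names, the simple two-line average, and the linear dissolve weights are all assumptions made for illustration, and frames are modeled as plain lists of rows of luma values.

```python
def blend_fields(frame):
    """Soften interlace combing by averaging each line with its neighbor.

    Adjacent lines belong to opposite fields, so averaging them mixes the
    two fields together. This is an illustrative stand-in, not the actual
    algorithm behind any After Effects filter.
    """
    height = len(frame)
    blended = []
    for y in range(height):
        # Use the next line as the neighbor; the last line falls back to
        # the line above it.
        neighbor = y + 1 if y + 1 < height else y - 1
        blended.append([(a + b) / 2 for a, b in zip(frame[y], frame[neighbor])])
    return blended


def cross_dissolve(sharp, soft, steps=5):
    """Yield `steps` frames that move linearly from `sharp` toward `soft`.

    The final frame yielded is fully `soft`. A linear ramp is an assumption;
    other easing curves would work the same way.
    """
    for i in range(1, steps + 1):
        w = i / steps
        yield [
            [(1 - w) * p_sharp + w * p_soft for p_sharp, p_soft in zip(row_a, row_b)]
            for row_a, row_b in zip(sharp, soft)
        ]


# A tiny "combed" frame: alternating lines come from different fields.
combed = [[0, 0], [100, 100], [0, 0], [100, 100]]
softened = blend_fields(combed)
transition = list(cross_dissolve(combed, softened, steps=5))
```

The dissolve spreads the loss of sharpness across several frames instead of letting it arrive all at once, which is why the change in focus is harder to spot.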
I then rendered the new version out from After Effects into the Lagarith codec in YV12 mode, which uses the native 8-bit 4:2:0 color space of VC-1 and other codecs. This means that Expression Encoder doesn’t need to do any color space conversion, making compression slightly faster.
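To make the 4:2:0 layout concrete, the sketch below compares the uncompressed size of one frame stored as 8-bit RGB with the same frame stored as YV12. The 640x480 frame size is only an example, not the size of the clip in this article, and the function names are made up for illustration.

```python
def frame_bytes_rgb24(width, height):
    """Uncompressed size of an 8-bit RGB frame: three bytes per pixel."""
    return width * height * 3


def frame_bytes_yv12(width, height):
    """Uncompressed size of an 8-bit YV12 (planar 4:2:0) frame.

    One full-resolution luma plane plus two chroma planes, each at half
    the width and half the height.
    """
    if width % 2 or height % 2:
        raise ValueError("4:2:0 needs even width and height")
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return luma + 2 * chroma


# Example frame size only; substitute the real dimensions of your source.
rgb_size = frame_bytes_rgb24(640, 480)
yv12_size = frame_bytes_yv12(640, 480)
print(f"RGB24: {rgb_size} bytes, YV12: {yv12_size} bytes")
```

Because each chroma plane carries a quarter of the samples of the luma plane, a YV12 frame holds half as many bytes as its RGB equivalent before any compression is applied.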
Markers and Manual Keyframes
As most of you know, efficient codecs make use of both intra- and interframe compression. You can think of intraframe compression as being like a .jpg: it produces a self-contained frame of video. These self-contained frames are I-frames, which are also known as keyframes. Interframe compression predicts the contents of a frame from other frames, earlier or later, that ultimately trace back to an I-frame. Frames coded this way are sometimes called intermediate frames. A keyframe together with the frames that depend on it, up to the next keyframe, is called a group of pictures (GOP).
Keyframes are required for seeking or random access. If you jump to a point in a video that is a keyframe, the seek will be instant. If you jump to any other kind of frame (P for predicted, or B for bidirectional, described below), the decoder has to start at the previous keyframe and decode every P-frame from there up to the frame you asked for. As a result, fewer keyframes mean worse seek performance, but fewer keyframes can also reduce keyframe popping and can increase compression efficiency. Like everything else in compression, we face a classic tradeoff. Note that modern codecs will insert "natural keyframes" at hard cuts and moments of significant motion.
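The seek cost described above can be sketched in a few lines of Python. This is a simplified model, not how any particular decoder is written: it assumes a stream with only I- and P-frames, so every frame between the previous keyframe and the target has to be decoded in order. The function names and the keyframe spacings are hypothetical.

```python
from bisect import bisect_right


def frames_to_decode(target, keyframes):
    """Return how many frames must be decoded to display frame `target`.

    `keyframes` is a sorted list of frame indices that are keyframes.
    Decoding starts at the nearest keyframe at or before `target` and
    runs forward through the target frame itself.
    """
    i = bisect_right(keyframes, target) - 1
    if i < 0:
        raise ValueError("no keyframe at or before the target frame")
    return target - keyframes[i] + 1


def worst_case_seek(total_frames, keyframes):
    """Largest decode cost over every possible seek target."""
    return max(frames_to_decode(f, keyframes) for f in range(total_frames))


# Hypothetical 720-frame clip with two different keyframe spacings.
sparse = list(range(0, 720, 240))  # a keyframe every 240 frames
dense = list(range(0, 720, 48))    # a keyframe every 48 frames

print(worst_case_seek(720, sparse), worst_case_seek(720, dense))
```

Landing exactly on a keyframe costs a single frame, while landing just before the next keyframe costs the whole spacing. That is the seek-performance side of the tradeoff; the efficiency side comes from spending fewer bits on expensive I-frames.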