When It Comes to Video Quality Measurements, Average Won't Cut It
Most video-quality metric tools score a video file by computing a score for each frame and then averaging those scores. Like the old saw about being comfortable because you have one foot in the oven and the other in an ice bucket, average scores can be deceiving, mostly because they can hide transient quality issues that degrade your viewers' quality of experience (QoE). Fortunately, there are several mechanisms for avoiding this trap, though they tend to be tool-specific.
For example, SSIMWAVE's SSIMPLUS VOD Monitor produces two SSIMPLUS scores, one computed using a simple average and another using an algorithm called the weighted average index, or WAI. Here's a blurb from an SSIMWAVE white paper: "Perceptual video QoE is not a static process and the impact of different parts of the video to the overall QoE are different. Humans tend to remember the relatively low-quality moments. WAI … is designed to capture the influence of quality fluctuation to overall QoE." What's great about WAI is that you can control how much weight to assign to these quality fluctuations via a simple numerical control.
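SSIMWAVE doesn't publish the WAI formula in the excerpt above, so the following Python sketch is purely illustrative and is not SSIMWAVE's algorithm; it just shows the general idea of a pooling function with one numerical control that shifts weight toward the lowest-quality frames. The function name and its emphasis parameter are invented for illustration.

```python
def weighted_pool(frame_scores, emphasis=2.0):
    """Pool per-frame scores, giving extra weight to low-quality frames.

    emphasis=0 reproduces the plain arithmetic mean; larger values give
    progressively more weight to the worst frames (analogous in spirit to a
    tunable weighted average, not SSIMWAVE's actual WAI formula).
    """
    worst = min(frame_scores)
    best = max(frame_scores)
    span = (best - worst) or 1.0
    # Weight each frame by how far it sits below the best score in the file.
    weights = [1.0 + emphasis * (best - s) / span for s in frame_scores]
    return sum(w * s for w, s in zip(weights, frame_scores)) / sum(weights)

scores = [95, 94, 60, 92, 93]               # one transient quality drop
print(weighted_pool(scores, emphasis=0))    # 86.8, the plain average
print(weighted_pool(scores, emphasis=4))    # noticeably lower; the drop counts more
```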
Moscow State University (MSU) adopted a similar approach with its recent version 12 release of the Video Quality Measurement Tool (VQMT), which produces both an average score for each video file and the harmonic mean. Here's an explanation from my contact at MSU: "For metrics where higher scores are better, such as SSIM, PSNR [peak signal-to-noise ratio], and VMAF [Video Multimethod Assessment Fusion], viewers may be unsatisfied if 15 seconds of a 50-second video had a PSNR value of 20 and 35 seconds had PSNR value of 50. The arithmetic mean will be (50*35 + 20*15)/50 = 41 which is quite good. The harmonic mean will be 50/(35/50 + 15/20) = 34.5 which is closer to reality, because low values have higher priority." So, it's the same principle as SSIMWAVE's WAI, but there's no flexibility because you can't control how much weight to give the variations. The formula is the formula.
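To see those numbers come out of an actual calculation, here's a minimal Python sketch that reproduces MSU's example, treating the clip as one PSNR value per second (pooling per-frame scores works exactly the same way):

```python
# MSU's example: 15 seconds of a 50-second clip at PSNR 20, 35 seconds at PSNR 50.
scores = [20] * 15 + [50] * 35

arithmetic = sum(scores) / len(scores)               # (20*15 + 50*35)/50 = 41.0
harmonic = len(scores) / sum(1 / s for s in scores)  # 50/(15/20 + 35/50) ≈ 34.5

print(round(arithmetic, 1), round(harmonic, 1))      # 41.0 34.5
```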
If you compute VMAF using FFmpeg, which you can learn to do on Streaming Learning Center, you can choose among three pooling methods—min, harmonic mean, or mean. Since the VMAF filter can produce PSNR, SSIM, and MS-SSIM scores along with VMAF, you can compute the harmonic mean for all four metrics. I checked and didn't find the same option in the Netflix vmafossexec.exe tool, though I could be mistaken. Of course, you could always import the individual frame scores into Google Sheets or Excel and compute the harmonic mean, but that's a lot of work.
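If spreadsheet work doesn't appeal, a few lines of Python can pool the per-frame scores for you. This is a minimal sketch assuming you've already produced a per-frame JSON log with FFmpeg's libvmaf filter; the filter options and JSON key names vary across FFmpeg and libvmaf versions, so treat the file path and keys here as placeholders to adjust for your build:

```python
import json

# Assumes a per-frame JSON log from FFmpeg's libvmaf filter, produced with
# something like: -lavfi libvmaf=log_fmt=json:log_path=vmaf.json
# Key names ("frames", "metrics", "vmaf") differ between libvmaf versions,
# so adjust them to whatever your log actually contains.
with open("vmaf.json") as f:
    log = json.load(f)

scores = [frame["metrics"]["vmaf"] for frame in log["frames"]]

mean = sum(scores) / len(scores)
# Clip near-zero scores to avoid division by zero in the harmonic mean.
harmonic_mean = len(scores) / sum(1 / max(s, 1e-6) for s in scores)
minimum = min(scores)

print(f"mean={mean:.2f}  harmonic_mean={harmonic_mean:.2f}  min={minimum:.2f}")
```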
Another measure of variability is the standard deviation. So, if two video files posted a VMAF score of 90, but one had a standard deviation of 3 and the other 7, you'd know that the latter file had more variability. Hybrik's Media Analyzer was the first tool I used that included the standard deviation plus the value and location of the highest and lowest score in the file. Not surprisingly, besides adding the harmonic mean, MSU now computes the standard deviation and adds the low and high frame values and locations to all metric calculations in version 12 of VQMT.
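Neither Hybrik's nor MSU's internal calculations are reproduced here, but computing the same sort of variability figures from a list of per-frame scores is straightforward. This illustrative Python sketch (the variability_report function is invented for the example) reports the standard deviation plus the value and frame location of the best and worst scores:

```python
import statistics

def variability_report(scores):
    """Summarize per-frame scores: mean, standard deviation, and the value
    and first frame number of the worst and best scores."""
    worst_value = min(scores)
    best_value = max(scores)
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.pstdev(scores),                # population std dev over all frames
        "worst": (worst_value, scores.index(worst_value)),  # (value, frame number)
        "best": (best_value, scores.index(best_value)),
    }

print(variability_report([93, 95, 91, 62, 94, 96]))
```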
Note that the SSIMPLUS VOD Monitor, VQMT, and Hybrik all have results plots that graphically display the metric score over the duration of the file, so you can eyeball the variability yourself, which is useful in a one-off situation but hard to build into an automated workflow. With the first two (but not Hybrik), you can also click the graph and view the actual frames to verify that the low score reflects the actual subjective appearance. This is a great feature when comparing encoding techniques or codecs, though again, it's not automatable.
The bottom line is that if you're working with a simple average of the frame scores, you're missing variations that could impact QoE, which is what these metrics are meant to predict in the first place. So, find a tool that gives you some of the additional data points described above.