Rotoscope with AI Assist
Rotoscoping is a technique of tracing over video frames either to create a matte that can be used to remove parts of the frame—usually the background—or to create an animation based on what the camera captured. Invented in 1915 by animator Max Fleischer, the technique was famously used in works such as Ralph Bakshi's 1978 The Lord of the Rings, Steve Barron's 1985 "Take On Me" video for A-Ha, and Richard Linklater's 2001 Waking Life. With the dramatic improvement of AI in scene-detection tasks over the past several years, AI-assisted rotoscoping is now widely available.
For the matte-generation type of AI rotoscoping work, Personify was an early player, initially generating mattes using Microsoft Xbox Kinect cameras—which had both visible and infrared sensors—to identify the person in the frame. It later developed software to identify the person using only a visible-light camera and then licensed that technology to Logitech for its C922 webcam. Similar techniques are now built into Zoom, Microsoft Teams, and other videoconferencing software, which use scene detection and matte generation to remove or blur backgrounds during video calls. These features are of great value for educational video, both to protect privacy and to let presenters show informative digital visual aids in the background instead of a room.
There are other tools that can be of value to education, such as those that use rotoscoping as a means of generating an animation from camera footage. One such tool is EbSynth, which is currently free in beta. A highly entertaining tutorial is available from filmmaker Joel Haver, who has built much of his YouTube channel on the possibilities opened up by rotoscoping (go2sm.com/roto). A much less time-intensive technique that I've used is based on a paper by Xinrui Wang and Jinze Yu at the University of Tokyo and the TensorFlow implementation they published on GitHub (go2sm.com/whitebox). TensorFlow is a widely used open source machine-learning platform typically used with the Python programming language. The paper describes training generative adversarial networks (GANs) on pairs of photographs and hand-drawn cartoons of those photographs so that the GANs can generate new cartoons. By training the GANs on a different dataset of photograph/drawing pairs, you can tune the style of cartoon they produce, although the results in the paper were all from the same training set.
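To make the adversarial idea concrete, here is a minimal, generic GAN training step in TensorFlow. It is only an illustration of the setup, with toy-sized generator and discriminator networks and plain binary cross-entropy losses; it is not the white-box losses or architectures from Wang and Yu's paper.

# Illustrative only: a generic photo-to-cartoon GAN training step, not the
# white-box cartoonization method itself. Networks are deliberately toy-sized.
import tensorflow as tf

# Toy generator: maps a photo to a same-sized "cartoon" image.
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(256, 256, 3)),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(3, 3, padding="same", activation="tanh"),
])

# Toy discriminator: scores whether an image looks like a hand-drawn cartoon.
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(256, 256, 3)),
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1),
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(photos, cartoons):
    """One adversarial update on a batch of photo/cartoon training pairs."""
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = generator(photos, training=True)
        real_logits = discriminator(cartoons, training=True)
        fake_logits = discriminator(fakes, training=True)
        # The generator tries to make its output score as a "real" cartoon...
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
        # ...while the discriminator learns to tell the two apart.
        d_loss = (bce(tf.ones_like(real_logits), real_logits) +
                  bce(tf.zeros_like(fake_logits), fake_logits))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss

The sketch shows only the adversarial part; the full method in the paper adds further losses that keep the generated cartoon faithful to the structure of the input photograph.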
Video is basically a flipbook, so we can use Wang and Yu's technique to redraw a video as a cartoon, page by page and frame by frame. The first step is to probe our video. I encoded a clip from a public domain NOAA video down to 15 frames per second for simplicity and saved it in the test code folder of the project on GitHub.
>ffmpeg -i NOAA_SharkClip_15.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'NOAA_SharkClip_15.mp4':
Duration: 00:00:47.49, start: 0.000000, bitrate: 2582 kb/s
Stream #0:0[0x1](eng): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(progressive), 960x540 [SAR 1:1 DAR 16:9], 2457 kb/s, 15 fps, 15 tbr, 30k tbn (default)
Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 125 kb/s (default)
The important thing to note from this output is that the zeroth stream is the video, and it is 15 fps and has a resolution of 960x540. I then exported all of the frames of this video as PNG files:
>ffmpeg -i NOAA_SharkClip_15.mp4 -an -r 15 -s 960x540 test_images/frame%06d.png
The final parameter of the command tells it to write the PNG files to the test_images folder and to name them "frame," followed by a six-digit, zero-padded frame number and then the .png extension. If your video is long, you'll need to increase the number of digits in that pattern. After that, run the cartoonize.py program to convert all of the photographic images in test_images to cartoon images with the same filenames in the cartoonized_images folder. This process can take a very long time unless you have a graphics card with CUDA compute capability of 3.5 or higher.
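If you want to check programmatically whether six digits are enough, you can pull the duration and frame rate with ffprobe and estimate the frame count. The sketch below is only illustrative: the ffprobe flags are standard, but the helper script is not part of the cartoonization repo.

# Illustrative helper: estimate how many digits the frame filenames need.
# Assumes ffprobe is on the PATH; not part of the cartoonization repo.
import json
import subprocess

def frame_digits(video_path):
    """Return the estimated frame count and the digits needed to number the frames."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate:format=duration",
         "-of", "json", video_path],
        capture_output=True, text=True, check=True)
    info = json.loads(result.stdout)
    num, den = info["streams"][0]["r_frame_rate"].split("/")
    fps = float(num) / float(den)
    duration = float(info["format"]["duration"])
    frames = int(duration * fps) + 1
    return frames, len(str(frames))

frames, digits = frame_digits("NOAA_SharkClip_15.mp4")
print(f"~{frames} frames, so the pattern needs at least %0{digits}d")

For the roughly 47-second, 15 fps clip above, that works out to about 712 frames, comfortably within the six digits that %06d provides.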
On an older Tesla K20 card, it took about 6 minutes to process all of the frames of this 48-second video. Once cartoonize.py finishes its work, use FFmpeg to generate a new video from those frames, muxing in a copy of the audio stream from the original video file.
>ffmpeg -r 15 -i cartoonized_images/frame%06d.png -i NOAA_SharkClip_15.mp4 -map 0:0 -map 1:1 -vcodec libx264 -tune animation -b:v 2M -s 960x540 -acodec copy NOAA_SharkToon_15.mp4
The keys here are that we're using the -r frame rate flag on the image-sequence input so that only 15 frames per second are used to create the new video stream, and that the -map flags instruct FFmpeg to use the zeroth input's zeroth stream (the cartoonized frames) and the first input's first stream (the audio). Side-by-side results of this process are available at go2sm.com/noaa. I had two use cases in mind for this technique. One was to suppress irrelevant details in footage of simulated events so students wouldn't get distracted. The other was similar to the motivation for blurring background footage: Video can be an intimate teaching mode, and many teachers are uncomfortable broadcasting their face to remote students.
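To run the whole process repeatedly on new clips, the three steps above can be chained from a single script. The sketch below simply shells out to FFmpeg and to cartoonize.py (run with no arguments from the repo's test_code folder, as described above); the file names and folders are assumptions you would adapt to your own setup.

# Illustrative pipeline wrapper: extract frames, cartoonize, then remux.
# Assumes it is run from the repo's test_code folder, where cartoonize.py
# reads test_images/ and writes cartoonized_images/, and that ffmpeg is on the PATH.
import subprocess

SRC = "NOAA_SharkClip_15.mp4"   # source clip (assumed name)
OUT = "NOAA_SharkToon_15.mp4"   # cartoonized result (assumed name)
FPS = "15"

# 1. Export every frame of the source video as a PNG.
subprocess.run(["ffmpeg", "-i", SRC, "-an", "-r", FPS, "-s", "960x540",
                "test_images/frame%06d.png"], check=True)

# 2. Run the cartoonizer over everything in test_images/.
subprocess.run(["python", "cartoonize.py"], check=True)

# 3. Rebuild the video from the cartoonized frames and copy the original audio.
subprocess.run(["ffmpeg", "-r", FPS, "-i", "cartoonized_images/frame%06d.png",
                "-i", SRC, "-map", "0:0", "-map", "1:1",
                "-vcodec", "libx264", "-tune", "animation", "-b:v", "2M",
                "-s", "960x540", "-acodec", "copy", OUT], check=True)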