Videos are but a sequence of consecutive images (or frames) with small differences painted in rapid succession to provide the illusion of motion. Before folks chase me with pitchforks, angered at the gross oversimplification of what goes into storing and playing digital videos of this age - the keyframes, the deltas, the interpolation and all the intelligent algorithms that allow us to encode every required bit of information into a much more compressed format as opposed to a naive sequence of full-frame images - allow me to capture the intent of my conversation: all animation, digital or otherwise, is built on this basic founding premise.
For normal video playback, the primary input variable is nothing but a synthesized numerical value that is repeatedly updated in accordance with how we human beings perceive the passage of “time”. Given a specific value, we know which frame to display. Done repeatedly, we have a motion picture.
It’s not hard to imagine that this input variable can be fed in by sources other than the customary time axis. What about space co-ordinates? Say, the user’s scroll position on a page? Or any action the user takes that can be crunched through a mathematical function and reduced to a value on a number line? Such patterns are fairly well established and sometimes commonplace. Occasionally, they help build quite the creative user experience. Apple Inc., for one, has time and again exhibited their affinity for such patterns, most recently with their AirPods Pro website.
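To make this concrete, here’s a rough sketch (in plain JavaScript, with purely illustrative numbers - not code from any of those sites) of how a scroll position could be crunched down to a frame index:

```javascript
// Reduce a scroll position to a fraction in [0, 1] of the scrollable range.
function scrollFraction(scrollTop, scrollHeight, viewportHeight) {
  const maxScroll = scrollHeight - viewportHeight;
  if (maxScroll <= 0) return 0;
  return Math.min(Math.max(scrollTop / maxScroll, 0), 1);
}

// Map that fraction onto a frame index in [0, frameCount - 1].
function frameIndexFor(fraction, frameCount) {
  return Math.min(Math.floor(fraction * frameCount), frameCount - 1);
}
```

Every approach discussed below only really differs in how the frames behind that index are produced and painted.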
Almost every time, implementation details reveal that to present us with such animations, a large set of images representing individual frames is downloaded and selectively displayed in rapid succession on the screen in response to an input signal such as a scroll event. That means downloading a lot of image files whose content, by design, varies very little from one frame to the next. In doing so, are we throwing all the advancements we’ve made together as a tech community in the world of video compression out of the window?
From my understanding, this is mostly because of the limitations of web APIs (or the lack thereof) that would allow us to efficiently seek back and forth to paint a specific frame from a video loaded on a web page in a manner that is fast and responsive. The sentiment is perhaps shared, and the limitation acknowledged too.
With all that being said, this article is an attempt to proverbially dip my feet into the water of how such experiences are built, and hopefully share some learnings from a bunch of quick prototypes of potential web video frame extraction and scrubbing techniques, within the confines of today’s limitations. The overarching theme is trying to extract the necessary frames out of a video either on the client (in-browser) or aided by a server (as in the example above), such that they can later be used to provide a video scrubbing experience based on page scrolling.
The video used for these demos is taken from a public list of samples that I found: a 15-second, 1280x720p video with a download size of ~2.5MB. My tests were run on Chrome 78 on a 2015 15” MacBook Pro (desktop) and Chrome 78 for Android on a OnePlus 5 (Snapdragon 835 SoC with 8GB RAM) mobile phone, all over a fairly good WiFi connection.
#1: video-current-time (demo)
This mechanism simply loads the video in an HTML5 video tag and sets the currentTime property of the loaded video to scrub it when scrolling. We do not specifically extract frames from the video; instead, we just let the normal video playing experience on the web take care of it and see how it does.
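A minimal sketch of this approach could look like the following - the wiring here is an assumption for illustration, not the exact demo code:

```javascript
// Clamp a scroll fraction to a time offset within the video's duration,
// so we never try to seek past the end.
function timeFor(fraction, duration) {
  return Math.max(0, Math.min(fraction * duration, duration));
}

// On every scroll event, seek the video to the corresponding time.
function attachTimeScrubber(video) {
  window.addEventListener('scroll', () => {
    const maxScroll = document.documentElement.scrollHeight - window.innerHeight;
    const fraction = maxScroll > 0 ? window.scrollY / maxScroll : 0;
    video.currentTime = timeFor(fraction, video.duration);
  }, { passive: true });
}
```

Usage would be as simple as `attachTimeScrubber(document.querySelector('video'))` once the video’s metadata has loaded.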
This somewhat worked out on high-end devices (such as my 15” MacBook Pro), especially with a not-too-high-quality video - or perhaps as long as the browser is fast and powerful enough to quickly seek back and forth and paint frames out of the provided video. But it can’t be trusted beyond that. As expected, on mobile devices (even on a decently well-to-do phone such as the OnePlus 5 which I use as my primary mobile device), this was quite miserable, with no frame updates happening while the scrolling was in motion, until the UI thread had the breathing room to update pixels on the page. I also have a hunch that the browser (tested on Chrome 78 for Android) may be purposefully doing things (mobile optimisations?) that it doesn’t do on the desktop version, making this mechanism not work well on the mobile browser.
It’s important to realise that browsers internally do a lot of magic to understand and optimise what’s the best way to display a video and update it on a page… and unless we’re making the browser’s life easy, it’s going to leave us feeling stupid.
I’ll admit that the videos I had been playing around with were not additionally optimised or specifically encoded in a way to facilitate extremely fast seeking - and we may anecdotally know that a better experience might have been achievable if we had done so - but the frame drops I observed were stupendous, getting drastically worse as I increased the resolution of the video (even at 720p); and given the type of experience we’re trying to build here, that resolution will probably be quite hard to sacrifice if we want a great experience.
#2: video-play-unpack-frames-canvas (demo)
So the two-line tactic did not work out. Great. Let’s evolve from there.
What we do here is load the video in a hidden HTML5 video tag and unpack video frames from it by playing the video and listening to the timeupdate events fired at regular intervals on the video element as it plays, at which point we pause the video and grab the current frame by painting it on an OffscreenCanvas and collecting the frame’s image bitmap from its 2D context. When done, we start playing the video again, looping through the process until the video has come to an end.
The basic idea is to generate a set of static images from the source video by the end of this exercise. We use an OffscreenCanvas for possible performance benefits over a normal canvas element, though a regular canvas would work as well.
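Hedging that the exact demo code differs, the unpack loop could be sketched roughly like this (assuming the video’s metadata has already loaded, and keeping in mind that how often timeupdate fires varies by browser):

```javascript
// Play/pause the video, grabbing an ImageBitmap at every timeupdate event,
// until the video ends. Resolves with the frames collected along the way.
async function extractFramesByPlaying(video, width, height) {
  const canvas = new OffscreenCanvas(width, height);
  const ctx = canvas.getContext('2d');
  const frames = [];

  return new Promise((resolve) => {
    video.addEventListener('timeupdate', async () => {
      video.pause();
      ctx.drawImage(video, 0, 0, width, height);
      frames.push(await createImageBitmap(canvas));
      if (video.ended || video.currentTime >= video.duration) {
        resolve(frames);
      } else {
        video.play(); // resume until the next timeupdate fires
      }
    });
    video.play();
  });
}
```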
Once the extraction of frames is done (a set of ImageBitmap objects representing the frames is retained in memory), for scrubbing we figure out the correct frame to paint based on the input signal (scroll position) and then draw it on a visible canvas element on the page.
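The scrubbing side then boils down to something like this sketch, where the 2D context and the frames array are assumed to come from the extraction step:

```javascript
// Pick the retained ImageBitmap matching the scroll fraction and paint it
// onto the visible canvas's 2D context. Returns the chosen index.
function paintFrameFor(ctx, frames, fraction) {
  const index = Math.min(Math.floor(fraction * frames.length), frames.length - 1);
  ctx.drawImage(frames[index], 0, 0);
  return index;
}
```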
The scrubbing part itself worked out fairly well - it was fast enough to scroll and scrub around without any visible lag on pretty much all devices (desktop and mobile) I tested on. Retaining a representation of the frames as a set of image bitmaps in memory, which can be painted rapidly on a canvas (as opposed to trying to encode and put them into img elements that are then shown or hidden in quick succession), must have contributed significantly to making the scrubbing experience smooth by making the browser do less work.
#3: video-seek-unpack-frames-canvas (demo)
This is quite similar to approach #2 above, but it tries to eliminate the glaring video playback duration wait problem by seeking instead of playing while extracting frames. Quite obvious really when you think about it.
In the current prototype, a predefined number of frames is unpacked, but this can also easily be changed to a frame-rate-based approach rather than an overall count.
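A rough sketch of the seek-driven extraction, again assuming the video’s metadata has loaded and treating frameCount as our chosen knob:

```javascript
// Seek the video to a timestamp and resolve once the seek has landed.
function seekTo(video, time) {
  return new Promise((resolve) => {
    video.addEventListener('seeked', resolve, { once: true });
    video.currentTime = time;
  });
}

// Seek to evenly spaced timestamps instead of playing the video through,
// grabbing an ImageBitmap at each stop.
async function extractFramesBySeeking(video, frameCount, width, height) {
  const canvas = new OffscreenCanvas(width, height);
  const ctx = canvas.getContext('2d');
  const frames = [];
  for (let i = 0; i < frameCount; i++) {
    await seekTo(video, (i / frameCount) * video.duration);
    ctx.drawImage(video, 0, 0, width, height);
    frames.push(await createImageBitmap(canvas));
  }
  return frames;
}
```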
Once frames are extracted, the scrubbing experience works the same.
Turns out, this is indeed much faster! On the same test setup, the same 15-second 1280x720p video took about 9 seconds to extract out 244 frames (first hit) and 6 seconds when the video was cached (subsequent hits). That’s a 2x-3x improvement for the same number of frames.
But yeah. I’d agree that 6 seconds in itself is not a number to proudly strive for.
#4: video-seek-media-stream-image-capture (demo)
Again, this is largely similar to approaches #2 and #3 above in terms of seeking through the video using an HTML5 video tag. But instead of pausing it and drawing on a canvas context to extract the frame’s image bitmap data, I wanted to check if we could capture the video element’s stream and then use the captured stream’s ImageCapture interface to grab the image bitmap data of a frame at the desired point in time. Well, it works.
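Roughly, and hedging that this is a sketch rather than the exact demo code (captureStream() and the ImageCapture API were Chrome-centric at the time of writing):

```javascript
// Capture the video element's stream and grab frames off its video track
// with the ImageCapture API while seeking through the video.
async function extractFramesViaImageCapture(video, frameCount) {
  const [track] = video.captureStream().getVideoTracks();
  const capture = new ImageCapture(track);
  const frames = [];
  for (let i = 0; i < frameCount; i++) {
    // Wait for the seek to land before grabbing the frame.
    await new Promise((resolve) => {
      video.addEventListener('seeked', resolve, { once: true });
      video.currentTime = (i / frameCount) * video.duration;
    });
    frames.push(await capture.grabFrame()); // resolves with an ImageBitmap
  }
  track.stop();
  return frames;
}
```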
For scrubbing, the same approach is followed.
I’ll be honest - while the approach of using the MediaStream APIs had originally struck me as somehow more elegant in concept, in reality it turned out to be a bit of a bummer! It was slower than approach #3 performance-wise, taking as much as 12 seconds (first hit) and 9 seconds (subsequent hits when the video was cached), which is about a 1.3-1.5x degradation compared to directly drawing the video element onto an OffscreenCanvas and extracting the image bitmap from it, on the same test setup. Now, I am not 100% certain that I’ve not made any fundamental mistakes in terms of best practices for using these streaming APIs (I believe I haven’t goofed up), but in retrospect this was perhaps to be expected, due to all the internal complexity the browser has to take care of to open a media stream and then do things with it. That’s okay - I don’t quite believe this use-case is something the MediaStream APIs are intended to solve anyway.
#5: video-server-frames (demo)
Perhaps the simplest mechanism of all, it relies on the server to provide a bunch of video frames as images that are downloaded and scrubbed through.
This works out really well when you know upfront exactly what content (the video and hence the image frames) you’re going to load and scrub through, which is legitimately a fair assumption to make in the use-case we’ve been discussing here. You can easily pre-generate and store a set of frames at build time on your server or CDN and serve them when required by the client. Within the context of the discussed use-cases, it also goes along well with another great software design principle I love and quote from time to time: Avoid doing at runtime what you can do at design time.
For the same number of frames (244), pre-computed and delivered from the server, the network bytes transferred were about 20% larger (~3MB as opposed to the ~2.5MB video), but getting the frames ready for scrubbing took about 2.5 seconds (first hit) and 1.3 seconds (subsequent hits when the frame images were cached), which is 3x-4.5x faster than having to download the video and then extract frames from it as fast as we can (approach #3). I should mention though that all of this happened over an HTTP/2 connection (which is today’s reality) to the same CDN (which surely worked in favour of having to make those 244 requests).
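On the client, fetching those pre-computed frames could look something like this sketch - the URL pattern here is purely an assumption for illustration, and the frames themselves would have been generated at build time (e.g. with ffmpeg):

```javascript
// Build the URL for a given frame image, e.g. '/frames/frame-007.jpg'.
function frameUrl(base, index) {
  return `${base}-${String(index).padStart(3, '0')}.jpg`;
}

// Fetch all frame images in parallel (HTTP/2 multiplexes the requests over
// one connection) and decode each into an ImageBitmap ready for painting.
async function fetchFrames(base, frameCount) {
  const requests = Array.from({ length: frameCount }, (_, i) =>
    fetch(frameUrl(base, i))
      .then((res) => res.blob())
      .then((blob) => createImageBitmap(blob))
  );
  return Promise.all(requests);
}
```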
Initially, it seemed that downloading an image sprite with a bunch of frames, as opposed to making individual requests for every frame, would be a good idea - but it turned out to be very tricky. Depending on the actual frame images and parameters like how many frames to fetch, sprites can actually degrade performance by visibly increasing the size of downloads, or at least reduce flexibility. In a world with HTTP/2, distinct images fare better - we could even prioritise certain frames and bootstrap the scrubbing experience faster.
Definitely an idea to pursue, although I haven’t yet been able to test this in action.
The idea is to exploit WebAssembly to have an in-browser ffmpeg module loaded which can then be invoked to extract out frames pretty fast. This should be possible today in theory with projects like ffmpeg.js.
Honestly, I tried going through this but have so far given up, having faced several difficulties compiling low-level modules into a build of ffmpeg.js that would be necessary for this experiment - somehow, the default ffmpeg.js builds are not built with the options required for performing frame extracts. Oops!
I do hope to try again in the future and write another blog post on how that goes.
One thing to consider for sure though - for typical small-sized videos, or when the actual content in question is known not to be very dynamic in nature, this sounds like a fairly over-engineered idea. For one, the WASM library build for ffmpeg.js itself is humongous in size (~14MB) to download and instantiate before any actual work can happen, which is fairly cost-prohibitive for what I had been trying to achieve here. This might, however, break even for other frame extraction use-cases which fit the bill better - say, when we’re dynamically changing a lot of video content, scrubbing through it, saving it back and so on (e.g. in an in-browser video frame extractor and editor).
From the numbers, sending pre-computed frames from the server (approach #5) turned out to be the most efficient for the practical network and device conditions such use-cases would be exposed to, in terms of overall cost-benefit, complexity and user experience. So, it looks like Apple’s approach was right given the circumstances. If I had to compute frames on the client though, I’d go with approach #3.
As for users with constrained network connections and device power, I strongly think that such experiences shouldn’t even go out to those users - probably find alternate experiences for them that provide more value. For the sake of completeness, I did try these out on slower network connections: approach #5 still worked more reliably than trying to pull the video, which somehow got stuck or kept buffering.
At a high level, one of the major costs we’re trading off here is network consumption vs. device compute. From the observations, it clearly seems that unless the total download time (a factor of size and round-trips) of our image frames is massively larger than that of the video (so much as to reach an inflection point), it distinctly works out in favour of downloading pre-computed image frames rather than downloading the video and then computing the frames from it. A progressive enhancement to approaches #2 through #4 could definitely be to store the computed frames in a local cache and avoid having to generate them every time the page is loaded - but still, the initial costs far outweigh the benefits when we know what content (the video and hence the frames) is to be scrubbed. The other obvious trade-off is the flexibility of the content itself - but that’s not really a problem if our content is not truly dynamic.
Given the state of Web APIs, and use-case in question, pre-computed frames from the server is probably the best way to go about it now for production scenarios. That’s the opinion I’m going to stick with for now.
As a bonus, this also opens up pathways for adapting experience parameters such as the number of frames to download (the animation frame-rate), the image format or the compression level, etc., which can easily be negotiated with the server so as to download only what will be used for an optimal experience on that specific device, based on information about client-side capabilities (device computation power, memory, network speed, data-saver modes and so on) - as compared to having to download one of a few pre-defined videos and then extract the usable pieces (some frames) from it.
Do you have other approaches in mind? Do share in the comments below - I’d be excited to give them a try!
Here’s hoping that a future becomes a reality where browsers natively support unpacking frames from a video quickly and efficiently - or at least expose some native API that lets us write custom logic to perform efficient processing on video streams (think codecs) - so that we won’t have to be limited to the current antics. But it’s perhaps a bit too early to say.
Perhaps there is hope with WebCodecs?
While playing around with these experiments, I decided to quickly hack together a video frame extraction tool that can take any uploaded video as input and extract frames from it, conveniently downloaded as a bunch of JPEG images within a single ZIP file.
It isn’t an extremely powerful tool as such, but it is a little bit configurable - such as how many frames to extract, or at what frame rate - and it gets the job done simply and fairly well.
Be sure to check it out! I’m also eager to hear any feedback you have.