Benefiting from AI and deep learning for video summarization

Divya Jain, Director of Machine Learning at Adobe Sensei, discusses AI's key role in video summarization and in gauging interest in content.

The global video market is taking center stage. According to Forbes, over 500 million hours of video are watched on YouTube every day. Google adds that almost 50% of internet users look for videos related to a product or service before visiting a store. Statistics like these show that video content is growing and will remain a mainstream means of sharing information. We are already seeing a shift from copy and text to snapshot stories and visual posts (on Instagram, for instance) for sharing content. Artificial intelligence (AI) is also playing a large role in this shift to video. We can use AI to improve video quality through stabilization, to understand and classify content for editing purposes, or to deliver and target content more effectively.

AI is also playing a key role in video summarization, the process of shortening a video by selecting keyframes or segments that capture its main points. Summarization has many use cases, one of the most significant being the ability to gauge interest in the content. A flashcard summary can determine how many people will actually watch an entire video. Even a single thumbnail plays a crucial role in determining how many people will click on a video to play it. Beyond driving clicks, video summarization is also necessary for viewing material efficiently and for adapting video length to different mediums, such as Instagram and Facebook.

Graph on how the global video market is increasing and taking the center stage

Recently, there have been many advances in using deep learning to improve image processing. The ability of AI to understand an image’s context has rapidly improved in accuracy. Similar techniques can be used to understand videos too, but this is a much more complex process. A video is not just a large collection of frames or images; it is multi-dimensional, combining audio, motion, and a time-series dimension. Each of these dimensions is key to understanding a video, and depending on what the summarization is targeting, different dimensions can be crucial.

The anatomy of AI video summarization

Video summarization can be categorized into two broad areas of machine learning: supervised and unsupervised. Supervised summarization entails learning patterns from previously annotated videos and examples. This works very well for videos where a pattern exists, such as sporting events. For these videos, we can annotate some sequences and learn from them. However, the biggest challenge with supervised learning is the labeled data. These well-defined datasets are costly to create, labeling requires domain knowledge, and the approach does not work well for the wide variety of content present on the web.

The other form of summarization is unsupervised, where a small number of frames are selected from the original video by detecting changes in the video. Low-level features such as color, motion, and texture have commonly been used to build histograms and clusters that identify similar frames within a video. A few frames are then selected for the summary based on the information they convey from the original video. These techniques work best when the video has distinct visual content, for example, a video taken across the different days of a vacation. However, these summaries often lack context and come out as disjointed images.
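As a rough illustration of the histogram-based change detection described above, here is a minimal sketch in plain Python. It assumes each frame is already available as a flat list of grayscale pixel values (0 to 255); the function names, the bin count, and the 0.4 threshold are hypothetical choices for this example, not part of any particular product or library.

```python
from typing import List, Sequence

# Assumption: a frame is a flat sequence of grayscale pixel values in 0..255.
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Return the normalized intensity histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [count / total for count in counts]


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def select_keyframes(frames: Sequence[Frame], threshold: float = 0.4,
                     bins: int = 8) -> List[int]:
    """Return indices of frames whose histogram differs enough from the last keyframe."""
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame
    reference = histogram(frames[0], bins)
    for index in range(1, len(frames)):
        current = histogram(frames[index], bins)
        if histogram_distance(reference, current) > threshold:
            keyframes.append(index)
            reference = current  # compare later frames against the new keyframe
    return keyframes


# Five tiny synthetic "frames": two dark, two bright, one mid-gray.
frames = [[10] * 16, [12] * 16, [200] * 16, [205] * 16, [90] * 16]
print(select_keyframes(frames))  # [0, 2, 4]
```

Because the selection looks only at low-level color statistics, it picks one frame per visually distinct segment and knows nothing about what the frames mean, which is why such summaries can read as disjointed images.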

Recent forms of deep learning look very promising in addressing the challenges mentioned above, and they lend themselves to much more effective creation of video summaries. While supervised deep learning techniques popularized the process, unsupervised techniques such as generative adversarial networks (GANs) and reinforcement learning are showing great promise, with advantages that are making them front-runners in video summarization.

The power of emerging unsupervised deep learning techniques in video summarization

For videos that don’t adhere to any pattern and are completely different from each other, GANs work very nicely. GANs have two neural nets:

  1. A generator that tries to produce data that mimics the real data.
  2. A discriminator that tries to learn whether a given sample is real or generated.

This helps GANs learn the data distribution very effectively and create data that is very difficult to distinguish from the original dataset. In this case, each video can be treated as a dataset, with the GAN producing a subset of frames that is most representative of the given video. This generates unique summaries for videos while preserving the context and meaning of the videos themselves. Marketers can use this technique to create shorter versions of full-length ads or campaigns tailored to the device and the target audience. Creative artists can also use it to give a preview of their upcoming releases.

Flow chart on unsupervised learning using generative adversarial networks (GANs)
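For reference, the competition between a GAN's two networks, conventionally called the generator G and the discriminator D, is commonly written as the minimax objective below. This is the standard GAN formulation, not something specific to the summarization approach described here. Reading G as a component that proposes a subset of frames and D as a judge of whether a reconstruction from that subset resembles the original video is an assumption made for illustration.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Here x is a real sample, z is the generator's input, and D outputs the probability that its input is real. D is trained to increase the objective while G is trained to decrease it.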

For videos that have a common structure, like sporting events, reinforcement learning is more effective than supervised learning because it does not require labeled data. Here, the neural nets learn which frames to choose based on a reward function. They can learn from previous summaries by observing whether certain frames were watched or skipped. Reward functions can also be defined so that no previous information is required, for example by scoring frame diversity and representativeness or frame category classification. Campaign managers can employ such techniques to create more watchable and memorable summaries from past experience and to engage with their customers effectively.
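To make the idea of a label-free reward more concrete, the sketch below scores a candidate summary by frame diversity and representativeness. It is one possible formulation under stated assumptions: each frame is represented by a feature vector (in practice these would come from a neural network, here they are hand-written lists), and all function names are hypothetical.

```python
import math
from typing import List, Sequence

# Assumption: each frame is described by a feature vector of floats.
Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diversity_reward(features: Sequence[Vector], selected: List[int]) -> float:
    """Mean pairwise dissimilarity (1 - cosine similarity) among selected frames."""
    if len(selected) < 2:
        return 0.0
    total = 0.0
    for i in selected:
        for j in selected:
            if i != j:
                total += 1.0 - cosine_similarity(features[i], features[j])
    return total / (len(selected) * (len(selected) - 1))


def representativeness_reward(features: Sequence[Vector], selected: List[int]) -> float:
    """exp(-mean squared distance from every frame to its nearest selected frame)."""
    if not selected or not features:
        return 0.0
    total = 0.0
    for frame in features:
        total += min(
            sum((x - y) ** 2 for x, y in zip(frame, features[s])) for s in selected
        )
    return math.exp(-total / len(features))


def summary_reward(features: Sequence[Vector], selected: List[int]) -> float:
    """Combined reward: higher means a more diverse and more representative summary."""
    return diversity_reward(features, selected) + representativeness_reward(features, selected)


# Four frames forming two visually distinct scenes.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(summary_reward(features, [0, 2]))  # one frame per scene
print(summary_reward(features, [0, 1]))  # two frames from the same scene
```

In a reinforcement learning setup, a score like this would be fed back to the frame-selection network after each proposed summary, so picking one frame from each scene is rewarded more than picking two near-duplicates.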

These new unsupervised techniques are just the start of a new era in deep learning technology when it comes to video summarization. Many advances will be made in the near future to create and optimize the best summaries based on the audience, delivery medium, and intent of summarization. Together with efforts across the industry, we’ll make video summarization highly scalable, reliable, and incredibly efficient.

Divya Jain is Director of Machine Learning at Adobe Sensei. She can be found on Twitter @divyajain1.

