TY - JOUR
T1 - A deep audio-visual model for efficient dynamic video summarization
AU - El-Nagar, Gamal
AU - El-Sawy, Ahmed
AU - Rashad, Metwally
N1 - Publisher Copyright:
© 2024 Elsevier Inc.
PY - 2024/4
Y1 - 2024/4
N2 - The adage “a picture is worth a thousand words” resonates in the digital video domain: a video, composed of countless frames, can be seen as millions of these words. Video summarization condenses lengthy videos into cohesive visual units while retaining crucial content. Although techniques based on keyframes or keyshots are effective, integrating the audio component is equally important. This paper presents a deep learning model that generates dynamic summaries enriched with audio. To address this gap, the model employs audio-visual features, yielding more robust and informative video summaries. It selects keyshots based on their significance scores, safeguarding essential content; assigning these scores to specific shots is a pivotal yet demanding task in video summarization. The model is evaluated on the TVSum and SumMe benchmark datasets. Experimental results demonstrate its efficacy, with considerable performance gains: F-Scores of 79.33% on TVSum and 66.78% on SumMe, surpassing previous state-of-the-art techniques.
AB - The adage “a picture is worth a thousand words” resonates in the digital video domain: a video, composed of countless frames, can be seen as millions of these words. Video summarization condenses lengthy videos into cohesive visual units while retaining crucial content. Although techniques based on keyframes or keyshots are effective, integrating the audio component is equally important. This paper presents a deep learning model that generates dynamic summaries enriched with audio. To address this gap, the model employs audio-visual features, yielding more robust and informative video summaries. It selects keyshots based on their significance scores, safeguarding essential content; assigning these scores to specific shots is a pivotal yet demanding task in video summarization. The model is evaluated on the TVSum and SumMe benchmark datasets. Experimental results demonstrate its efficacy, with considerable performance gains: F-Scores of 79.33% on TVSum and 66.78% on SumMe, surpassing previous state-of-the-art techniques.
KW - SumMe
KW - TVSum
KW - VGGish
KW - Video Skimming
KW - Visualization of score curves
UR - http://www.scopus.com/inward/record.url?scp=85189521975&partnerID=8YFLogxK
U2 - 10.1016/j.jvcir.2024.104130
DO - 10.1016/j.jvcir.2024.104130
M3 - Article
AN - SCOPUS:85189521975
SN - 1047-3203
VL - 100
JO - Journal of Visual Communication and Image Representation
JF - Journal of Visual Communication and Image Representation
M1 - 104130
ER -