TY - JOUR
T1 - A deep audio-visual model for efficient dynamic video summarization
AU - El-Nagar, Gamal
AU - El-Sawy, Ahmed
AU - Rashad, Metwally
N1 - Publisher Copyright:
© 2024 Elsevier Inc.
PY - 2024/4
Y1 - 2024/4
N2 - The adage “a picture is worth a thousand words” resonates in the digital video domain: a video, composed of countless frames, can be seen as millions of these words. Video summarization condenses lengthy videos into cohesive visual units while retaining crucial content. Although techniques based on keyframes or keyshots are effective, integrating the audio component is equally important. This paper presents a deep learning model that generates dynamic summaries enriched with audio. To address this gap, the model employs audio-visual features, yielding more robust and informative video summaries. It selects keyshots based on their significance scores, safeguarding essential content; assigning these scores to specific shots is a pivotal yet demanding task in video summarization. The model is evaluated on the TVSum and SumMe benchmark datasets. Experimental results demonstrate its efficacy, with considerable performance gains: F-Scores of 79.33% on TVSum and 66.78% on SumMe, surpassing previous state-of-the-art techniques.
AB - The adage “a picture is worth a thousand words” resonates in the digital video domain: a video, composed of countless frames, can be seen as millions of these words. Video summarization condenses lengthy videos into cohesive visual units while retaining crucial content. Although techniques based on keyframes or keyshots are effective, integrating the audio component is equally important. This paper presents a deep learning model that generates dynamic summaries enriched with audio. To address this gap, the model employs audio-visual features, yielding more robust and informative video summaries. It selects keyshots based on their significance scores, safeguarding essential content; assigning these scores to specific shots is a pivotal yet demanding task in video summarization. The model is evaluated on the TVSum and SumMe benchmark datasets. Experimental results demonstrate its efficacy, with considerable performance gains: F-Scores of 79.33% on TVSum and 66.78% on SumMe, surpassing previous state-of-the-art techniques.
KW - SumMe
KW - TVSum
KW - VGGish
KW - Video Skimming
KW - Visualization of score curves
UR - http://www.scopus.com/inward/record.url?scp=85189521975&partnerID=8YFLogxK
U2 - 10.1016/j.jvcir.2024.104130
DO - 10.1016/j.jvcir.2024.104130
M3 - Article
AN - SCOPUS:85189521975
SN - 1047-3203
VL - 100
JO - Journal of Visual Communication and Image Representation
JF - Journal of Visual Communication and Image Representation
M1 - 104130
ER -