Abstract
Image caption generation is a stimulating multimodal task. Deep learning has advanced substantially, notably in computer vision and natural language processing, yet human-generated captions are still considered better, which makes caption generation a challenging application for interactive machine learning. In this paper, we compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors with several state-of-the-art transfer learning models and feed them, together with embedded caption text, into an encoder-decoder network based on stacked LSTMs with soft attention to generate high-quality captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.
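The abstract summarizes the pipeline without code; below is a minimal PyTorch sketch of the kind of architecture described: a pretrained CNN (ResNet-50 here, purely as an illustrative transfer-learning backbone) produces a grid of image feature vectors, a soft-attention module weights those vectors at each decoding step, and two stacked LSTM cells consume word embeddings plus the attended image context to emit a next-word distribution. All layer sizes, the choice of backbone, and the class names are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Transfer-learning encoder: a pretrained ResNet-50 trunk (illustrative
    choice) whose convolutional output serves as a grid of feature vectors."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the average-pool and classifier; keep spatial feature maps.
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.trunk.parameters():   # freeze the pretrained weights
            p.requires_grad = False

    def forward(self, images):              # images: (B, 3, 224, 224)
        feats = self.trunk(images)          # (B, 2048, 7, 7)
        B, C, H, W = feats.shape
        return feats.view(B, C, H * W).permute(0, 2, 1)  # (B, 49, 2048)

class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over the feature grid."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):       # feats: (B, 49, F), hidden: (B, H)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)     # attention weights over regions
        context = (alpha * feats).sum(dim=1)  # weighted feature vector (B, F)
        return context, alpha

class StackedLSTMDecoder(nn.Module):
    """Two stacked LSTM cells that consume a word embedding concatenated with
    the attended image context, then project to vocabulary logits."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = SoftAttention(feat_dim, hidden_dim, attn_dim=256)
        self.lstm1 = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.lstm2 = nn.LSTMCell(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim

    def forward(self, feats, captions):     # captions: (B, T) token ids
        B, T = captions.shape
        h1 = c1 = h2 = c2 = feats.new_zeros(B, self.hidden_dim)
        emb = self.embed(captions)          # (B, T, E)
        logits = []
        for t in range(T):
            context, _ = self.attn(feats, h2)
            h1, c1 = self.lstm1(torch.cat([emb[:, t], context], dim=1), (h1, c1))
            h2, c2 = self.lstm2(h1, (h2, c2))
            logits.append(self.fc(h2))
        return torch.stack(logits, dim=1)   # (B, T, vocab_size)

# Smoke test with dummy data.
encoder, decoder = EncoderCNN(), StackedLSTMDecoder(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
print(decoder(encoder(images), captions).shape)  # torch.Size([2, 12, 10000])
```

In the usual training setup, these logits would be scored against the shifted ground-truth captions with cross-entropy (teacher forcing); at inference the decoder would instead feed its own predictions back in, e.g. via greedy or beam search.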
| Original language | English |
| --- | --- |
| Pages (from-to) | 42-55 |
| Number of pages | 14 |
| Journal | Fusion: Practice and Applications |
| Volume | 4 |
| Issue number | 2 |
| DOIs | |
| State | Published - 2021 |
Keywords
- CNN (Convolutional Neural Network)
- Image Captioning
- RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory)
- Transfer Learning