AGIC: Attention-Guided Image Captioning to Improve Caption Relevance

L D M S Sai Teja¹, Ashok Urlana², Pruthwik Mishra³
¹ NIT Silchar • ² TCS Research, Hyderabad • ³ SVNIT, Surat

Abstract

Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.

Paper example illustration
LLaVA: Two little girls are playing with bubbles in the park.
Qwen: Two young friends share a joyful moment, creating a world of bubbles in the park.
Fuyu: Two little girls blowing bubbles in the park.
BRNN: Two girls playing in the park.
LSTNet: A young girl is blowing bubbles, holding them in her hands.
R²M: Two young girls playing with bubbles in grass.
AGIC: Two young girls wearing floral dress blowing bubbles in a park covered with grass.

Figure 1: Comparison of various image caption generation models. Red: zero-shot, cyan: supervised, violet: unsupervised, and blue: our approach.

Method

AGIC processes the input image through a pre-trained vision transformer to extract attention weights that highlight semantically relevant regions. The attention matrix A^{l,h} at layer l and head h is computed from the query and key projection matrices, and the per-token attention weights a_i^l are averaged across all heads to capture contextual relevance.

A^{l,h} = \mathrm{softmax}\left( \frac{(X^{l-1} W_Q^{l,h})(X^{l-1} W_K^{l,h})^{\top}}{\sqrt{d_h}} \right)
a_i^{l} = \frac{1}{H} \sum_{h=1}^{H} A_i^{l,h}
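
The sketch below illustrates this step with Hugging Face transformers: it extracts the self-attention maps of a pre-trained ViT and averages them over heads as in the second equation. The checkpoint name, the choice of the final layer, and the use of the [CLS] query row as the per-patch relevance map are illustrative assumptions on our part, not details taken from the paper.

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Generic pre-trained ViT (assumed checkpoint, not necessarily the one used by AGIC)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, each of shape (batch, heads, tokens, tokens)
attn_last = outputs.attentions[-1]        # A^{l,h} for the final layer (our choice)
attn_mean = attn_last.mean(dim=1)         # average over heads -> a^l
cls_to_patches = attn_mean[0, 0, 1:]      # [CLS] attention to the image patches

# Reshape to the patch grid and normalise to [0, 1] to obtain the relevance map a(i, j)
side = int(cls_to_patches.numel() ** 0.5)
a = cls_to_patches.reshape(side, side)
a = (a - a.min()) / (a.max() - a.min() + 1e-8)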

Important image features are amplified using the attention weights and an amplification factor k, so that I_a(i,j) = I_o(i,j) · (1 + k · a(i,j)). The amplified image is then passed to the captioning model. For decoding, we employ a hybrid strategy that combines beam search with Top-k and Top-p sampling and temperature scaling to enhance diversity and fluency.

I_a(i,j) = I_o(i,j) \cdot \left(1 + k \cdot a(i,j)\right)
x_t \sim \text{Top-}p\left(\text{Top-}k\left(\mathrm{softmax}(z_t / T)\right)\right)
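
Continuing the sketch above (again an illustration under stated assumptions, not the released implementation), the snippet below upsamples the relevance map a to pixel resolution, amplifies the image as I_a(i,j) = I_o(i,j) · (1 + k · a(i,j)), and decodes with beam search combined with Top-k/Top-p sampling and temperature scaling using a BLIP captioner. The BLIP checkpoint, k = 0.5, and the sampling hyperparameters are assumed values.

import numpy as np
import torch.nn.functional as F
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

def amplify(image, a, k=0.5):
    """Scale each pixel by (1 + k * a(i, j)); a is a [0, 1] patch-grid relevance map."""
    io = np.asarray(image).astype(np.float32)                      # H x W x 3
    a_full = F.interpolate(a[None, None], size=io.shape[:2],
                           mode="bilinear", align_corners=False)[0, 0]
    ia = io * (1.0 + k * a_full.numpy()[..., None])
    return Image.fromarray(np.clip(ia, 0, 255).astype(np.uint8))

# Assumed captioning backbone for this illustration
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

amplified = amplify(image, a, k=0.5)
caption_inputs = blip_processor(images=amplified, return_tensors="pt")

# Hybrid decoding: beam search combined with Top-k / Top-p sampling and temperature
# scaling, i.e. x_t ~ Top-p(Top-k(softmax(z_t / T)))
out = captioner.generate(**caption_inputs, num_beams=5, do_sample=True,
                         top_k=50, top_p=0.9, temperature=0.7, max_new_tokens=30)
print(blip_processor.decode(out[0], skip_special_tokens=True))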

Diagrams

Figure 2: The Attention-Guided Image Captioning (AGIC) pipeline.

Inference Time

Figure 3: Comparative analysis of inference time across zero-shot, supervised, and unsupervised captioning frameworks.

Error Analysis

We conduct a detailed error analysis on 50 images from each dataset, manually examining four aspects: hallucination, omission, irrelevance, and ambiguity. While AGIC generally improves caption relevance, our analysis reveals that it occasionally omits salient objects present in the image and tends to hallucinate on Flickr30k samples. These observations highlight areas for further refinement.

Figure 4: Qualitative error analysis of image captions generated by the AGIC model compared to ground truth (GT) descriptions. Each example highlights a specific type of error: hallucination, omission, irrelevance, vagueness, and ambiguity. One example demonstrates a correct caption with no notable issues.

Dataset     Hallucination   Omission   Irrelevance   Ambiguity
Flickr8k    7               12         3             5
Flickr30k   11              14         3             4

Table 1: Error analysis of the AGIC model on 50 image samples from each dataset.

Dataset     Relevancy   Correctness   Completion
Flickr8k    0.80        0.86          0.75
Flickr30k   0.77        0.77          0.88

Table 2: Human evaluation of AGIC: inter-rater reliability (ICC) between two human annotators.

Citation

@misc{teja2025agic,
    author = "Teja, L D M S Sai and Urlana, Ashok and Mishra, Pruthwik",
    title = "{AGIC}: Attention-Guided Image Captioning to Improve Caption Relevance",
    note = "arXiv preprint arXiv:2508.06853",
    year = "2025",
    url = "https://arxiv.org/abs/2508.06853"
}