Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.
Figure 1: Comparison of various image caption generation models. Red: zero-shot, cyan: supervised, violet: unsupervised approaches, and blue: our approach.
AGIC processes input images through a pre-trained vision transformer to extract attention weights highlighting semantically relevant regions. The attention matrix A_{l,h} at layer l and head h is computed from the query and key projection matrices, and the per-token attention weights a_i^l are aggregated across all heads to capture contextual relevance.
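A minimal sketch of this attention-extraction step, assuming a Hugging Face ViT backbone with head-averaged, CLS-token attention as the per-patch relevance score; the specific checkpoint and aggregation choice are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from transformers import ViTImageProcessor, ViTModel
from PIL import Image

# Assumed backbone; AGIC's actual vision transformer may differ.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vit(**inputs, output_attentions=True)

# outputs.attentions holds one (batch, heads, tokens, tokens) tensor per layer.
# Average over heads to aggregate the head-wise attention matrices A_{l,h}.
last_layer = outputs.attentions[-1]               # (1, H, T, T)
head_avg = last_layer.mean(dim=1)                 # (1, T, T)

# Use the CLS token's attention over patch tokens as per-patch relevance a_i.
cls_to_patches = head_avg[0, 0, 1:]               # (T - 1,)
num_patches = int(cls_to_patches.numel() ** 0.5)  # 14 for 224 / 16
attn_map = cls_to_patches.reshape(num_patches, num_patches)
attn_map = attn_map / attn_map.max()              # normalise to [0, 1]
```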
Important image features are amplified using the attention weights with a factor k, where I_a(i, j) = I_o(i, j) · (1 + k · a(i, j)). The amplified image is then processed by the captioning model. We employ a hybrid decoding strategy that combines beam search with top-k/top-p sampling and temperature scaling for enhanced diversity and fluency.
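A minimal sketch of the amplification formula and the hybrid decoding call, reusing attn_map from the snippet above and a BLIP captioner as an assumed stand-in for the captioning model; the checkpoint, the factor k, and the decoding hyperparameters are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

def amplify(image: Image.Image, attn_map: torch.Tensor, k: float = 0.5) -> Image.Image:
    """Apply I_a(i, j) = I_o(i, j) * (1 + k * a(i, j)) per pixel and channel."""
    arr = np.asarray(image).astype(np.float32)                      # (H, W, 3)
    # Upsample the patch-level attention map to the full image resolution.
    a = F.interpolate(attn_map[None, None], size=arr.shape[:2],
                      mode="bilinear", align_corners=False)[0, 0].numpy()
    amplified = np.clip(arr * (1.0 + k * a[..., None]), 0, 255).astype(np.uint8)
    return Image.fromarray(amplified)

# Assumed captioner; AGIC's actual captioning model may differ.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")
amplified_image = amplify(image, attn_map, k=0.5)
pixel_values = processor(images=amplified_image, return_tensors="pt").pixel_values

# Hybrid decoding: beam search combined with top-k / top-p sampling and
# temperature scaling (do_sample=True with num_beams > 1 gives beam sampling).
caption_ids = captioner.generate(
    pixel_values=pixel_values,
    num_beams=3,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=30,
)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```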
Figure 2: The Attention-Guided Image Captioning (AGIC) pipeline.
Figure 3: Comparative analysis of inference time across zero-shot, supervised, and unsupervised captioning frameworks.
We conduct a detailed error analysis on 50 images from each dataset, manually examining four aspects: hallucination, omission, irrelevance, and ambiguity. While AGIC generally improves caption relevance, our analysis reveals that it occasionally omits salient objects present in the image and tends to hallucinate on Flickr30k samples. These observations highlight areas for further refinement.
Figure 4: Qualitative error analysis of image captions generated by the AGIC model compared to ground truth (GT) descriptions. Each example highlights a specific type of error: hallucination, omission, irrelevance, vagueness, and ambiguity. One example demonstrates a correct caption with no notable issues.
| Dataset | Hallucination | Omission | Irrelevance | Ambiguity |
|---|---|---|---|---|
| Flickr8k | 7 | 12 | 3 | 5 |
| Flickr30k | 11 | 14 | 3 | 4 |
Table 1: Error analysis of the AGIC model on 50 image samples from both datasets.
| Dataset | Relevancy | Correctness | Completion |
|---|---|---|---|
| Flickr8k | 0.80 | 0.86 | 0.75 |
| Flickr30k | 0.77 | 0.77 | 0.88 |
Table 2: Human evaluation of AGIC: inter-rater reliability (ICC) between two human annotators.
@misc{teja2025agic,
author = "Teja, L D M S Sai and Urlana, Ashok and Mishra, Pruthwik",
title = "{AGIC}: Attention-Guided Image Captioning to Improve Caption Relevance",
note = "arXiv preprint arXiv:2508.06853",
year = "2025",
url = "https://arxiv.org/abs/2508.06853"
}