Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.
Figure 1: Comparison of various image caption generation models. Red: zero-shot, cyan: supervised, violet: unsupervised approaches, and blue: our approach.
AGIC processes input images through a pre-trained vision transformer to extract attention weights highlighting semantically relevant regions. The attention matrix A_{l,h} at layer l and head h is computed from the query and key projection matrices, and the per-token attention weights a_i^l are aggregated across all heads to capture contextual relevance.
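A minimal sketch of this attention-extraction step, assuming a Hugging Face ViT backbone with head-averaged, CLS-token attention as the per-patch relevance score; the specific checkpoint and aggregation choice are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from transformers import ViTImageProcessor, ViTModel
from PIL import Image

# Assumed backbone; AGIC's actual vision transformer may differ.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vit(**inputs, output_attentions=True)

# outputs.attentions holds one (batch, heads, tokens, tokens) tensor per layer.
# Average over heads to aggregate the head-wise attention matrices A_{l,h}.
last_layer = outputs.attentions[-1]               # (1, H, T, T)
head_avg = last_layer.mean(dim=1)                 # (1, T, T)

# Use the CLS token's attention over patch tokens as per-patch relevance a_i.
cls_to_patches = head_avg[0, 0, 1:]               # (T - 1,)
num_patches = int(cls_to_patches.numel() ** 0.5)  # 14 for 224 / 16
attn_map = cls_to_patches.reshape(num_patches, num_patches)
attn_map = attn_map / attn_map.max()              # normalise to [0, 1]
```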
Important image features are amplified using the attention weights with a factor k, where I_a(i, j) = I_o(i, j) · (1 + k · a(i, j)). The amplified image is then processed by the captioning model. We employ a hybrid decoding strategy that combines beam search with top-k/top-p sampling and temperature scaling for enhanced diversity and fluency.
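A minimal sketch of the amplification formula and the hybrid decoding call, reusing attn_map from the snippet above and a BLIP captioner as an assumed stand-in for the captioning model; the checkpoint, the factor k, and the decoding hyperparameters are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

def amplify(image: Image.Image, attn_map: torch.Tensor, k: float = 0.5) -> Image.Image:
    """Apply I_a(i, j) = I_o(i, j) * (1 + k * a(i, j)) per pixel and channel."""
    arr = np.asarray(image).astype(np.float32)                      # (H, W, 3)
    # Upsample the patch-level attention map to the full image resolution.
    a = F.interpolate(attn_map[None, None], size=arr.shape[:2],
                      mode="bilinear", align_corners=False)[0, 0].numpy()
    amplified = np.clip(arr * (1.0 + k * a[..., None]), 0, 255).astype(np.uint8)
    return Image.fromarray(amplified)

# Assumed captioner; AGIC's actual captioning model may differ.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")
amplified_image = amplify(image, attn_map, k=0.5)
pixel_values = processor(images=amplified_image, return_tensors="pt").pixel_values

# Hybrid decoding: beam search combined with top-k / top-p sampling and
# temperature scaling (do_sample=True with num_beams > 1 gives beam sampling).
caption_ids = captioner.generate(
    pixel_values=pixel_values,
    num_beams=3,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=30,
)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```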
Figure 2: The Attention-Guided Image Captioning (AGIC) pipeline.
Figure 3: Comparative analysis of inference time across zero-shot, supervised, and unsupervised captioning frameworks.
We conduct a detailed error analysis on 50 images from each dataset, manually examining four aspects: hallucination, omission, irrelevance, and ambiguity. While AGIC generally improves caption relevance, our analysis reveals that it occasionally omits salient objects present in the image and tends to hallucinate on Flickr30k samples. These observations highlight areas for further refinement.
Figure 4: Qualitative error analysis of image captions generated by the AGIC model compared to ground truth (GT) descriptions. Each example highlights a specific type of error: hallucination, omission, irrelevance, vagueness, and ambiguity. One example demonstrates a correct caption with no notable issues.
| Dataset | Hallucination | Omission | Irrelevance | Ambiguity |
|---|---|---|---|---|
| Flickr8k | 7 | 12 | 3 | 5 |
| Flickr30k | 11 | 14 | 3 | 4 |
Table 1: Error analysis of the AGIC model on 50 image samples from both datasets.
| Dataset | Relevancy | Correctness | Completion |
|---|---|---|---|
| Flickr8k | 0.80 | 0.86 | 0.75 |
| Flickr30k | 0.77 | 0.77 | 0.88 |
Table 2: Human evaluation of AGIC: inter-rater reliability (ICC) between two human annotators.
@misc{teja2025agic,
author = "Teja, L D M S Sai and Urlana, Ashok and Mishra, Pruthwik",
title = "{AGIC}: Attention-Guided Image Captioning to Improve Caption Relevance",
note = "arXiv preprint arXiv:2508.06853",
year = "2025",
url = "https://arxiv.org/abs/2508.06853"
}