Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Summary by Marius Orehovschi and Ahmed Kamal.
The authors of this paper propose the Neural Image Caption (NIC) generator – a novel, end-to-end architecture for automatically generating natural-language descriptions of images. The architecture has two main components: 1. a deep Convolutional Neural Network (CNN) for image feature extraction, and 2. a Long Short-Term Memory (LSTM) recurrent unit for text generation. The authors show that the network significantly outperforms state-of-the-art methods on several quantitative metrics across several datasets.
The first component of the network is a convolutional neural network that embeds images into a high-dimensional space, with the aim of capturing some of the semantic content of each image. This component can largely be thought of as an ordinary image classifier; however, rather than outputting a class label, it passes the high-dimensional vector to the second part of the network – a recurrent LSTM unit. The LSTM iteratively outputs words, conditioned initially on what it has seen in the image and thereafter on what it recalls from the image together with all previously generated words. After exploring different training techniques and data formats, the authors trained all weights using stochastic gradient descent with a fixed learning rate and no momentum. Weights were randomly initialized except for the CNN weights, which were pre-trained on ImageNet. Dropout layers and model ensembles are used to mitigate overfitting.
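The encoder–decoder pipeline described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the random vector stands in for CNN features, all names and sizes (`TinyLSTM`, `W_embed`, `W_out`, the six-word vocabulary) are our own hypothetical choices, and the weights are untrained, so the generated sequence carries no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
D = 8  # shared size for image features and word embeddings (assumption)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """One-layer LSTM cell with randomly initialized weights."""
    def __init__(self, d):
        self.W = rng.normal(0, 0.1, (4 * d, 2 * d))  # all gates, input [x, h]
        self.b = np.zeros(4 * d)
        self.d = d

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def generate_caption(image_feat, lstm, W_embed, W_out, max_len=10):
    """Greedy decoding: the image is fed once, then one word per step."""
    h = c = np.zeros(lstm.d)
    h, c = lstm.step(image_feat, h, c)       # image shown only at step 0
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        h, c = lstm.step(W_embed[word], h, c)
        word = int(np.argmax(W_out @ h))     # most likely next word
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption

# Pretend these came from a pre-trained CNN and learned embedding matrices.
image_feat = rng.normal(size=D)
W_embed = rng.normal(0, 0.1, (len(VOCAB), D))
W_out = rng.normal(0, 0.1, (len(VOCAB), D))
caption = generate_caption(image_feat, TinyLSTM(D), W_embed, W_out)
print(caption)
```

In the actual model the CNN features and word embeddings are learned jointly, which is what makes the architecture trainable end-to-end.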
The authors demonstrate the performance of their architecture on multiple datasets with several metrics. They point out the difficulty of reliably evaluating the image captioning task quantitatively – the most direct method of evaluation, having human raters score the generated captions, is expensive and time-consuming. The authors nevertheless report human ratings of captions produced by NIC versus a reference method on the Flickr8k dataset to show that their method produces convincing descriptions. For the most part, however, they rely on BLEU-n, an automated metric for evaluating machine translation that correlates with human assessment. In addition, they report scores on three other metrics – METEOR, CIDEr, and description ranking (even though they point out that metric's inadequacy for generative tasks). Overall, the authors perform a thorough evaluation of their proposed method, showing human rating scores, honest visual results, and scores on a wealth of metrics to illustrate the power of their architecture.
Some of the paper's strengths:
- The method is data-driven and can be trained end-to-end.
- The authors present a thorough ablation study in section 4.
- The main units in the architecture – the CNN for image feature extraction, the LSTM for text generation, and the BeamSearch decoder for finding the best description at inference time – are described clearly and thoroughly.
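To make the BeamSearch decoder concrete, here is a hedged sketch of the idea: at each step, keep the k highest-scoring partial sentences instead of greedily committing to a single word. The `toy_probs` function is a made-up stand-in for the trained LSTM's softmax output; the words and probabilities are purely illustrative.

```python
import math

def beam_search(next_word_probs, beam_size=2, max_len=5):
    """Return the best sentence under a summed log-probability score."""
    beams = [([], 0.0)]  # (partial sentence, log-probability)
    for _ in range(max_len):
        candidates = []
        for sent, score in beams:
            if sent and sent[-1] == "<end>":   # finished sentences pass through
                candidates.append((sent, score))
                continue
            for word, p in next_word_probs(sent).items():
                candidates.append((sent + [word], score + math.log(p)))
        # keep only the beam_size best partial sentences
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return max(beams, key=lambda b: b[1])[0]

# Toy next-word distribution for demonstration only.
def toy_probs(sent):
    if not sent:
        return {"a": 0.6, "the": 0.4}
    if sent[-1] in ("a", "the"):
        return {"dog": 0.7, "cat": 0.3}
    return {"<end>": 0.9, "runs": 0.1}

print(beam_search(toy_probs))  # ['a', 'dog', '<end>']
```

With beam_size=1 this reduces to greedy decoding; the paper reports that a larger beam finds noticeably better sentences.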
Some of the paper's weaknesses:
- In their evaluation of the network's performance, the authors devote a lot of attention to scores on the BLEU-n metric and show that their scores are drastically higher than the state of the art. But even though this metric is widely accepted for tasks like text-to-text or image-to-text translation, it is not ideal, because it relies on comparing output sentences against a set of reference sentences. Since, in a real-life scenario, a sentence can be correctly rendered in numerous non-overlapping ways, a higher BLEU-n score is not always guaranteed to correspond to a better description (as noted in the Wikipedia entry on BLEU). Furthermore, the authors report results on ranking pre-existing descriptions, even though they themselves state that ranking and generation are very different tasks and that the metric is inadequate. As readers, we would have appreciated it if the authors had shown fewer tables of impressive but ultimately unreliable quantitative metrics and had instead focused on more meaningful ways to evaluate the method, such as more human-rating results or more images with their output descriptions.
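The criticism above can be illustrated directly. The sketch below implements modified n-gram precision, the core of BLEU (clipping candidate n-gram counts by the maximum reference counts; brevity penalty omitted for brevity): a caption that describes the scene correctly but shares few words with the reference scores poorly. The example sentences are our own.

```python
from collections import Counter

def ngram_precision(candidate, references, n):
    """Modified n-gram precision: clipped matches / candidate n-grams."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(zip(*[ref[i:] for i in range(n)]))
        for g, c in ref_counts.items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

refs = [["a", "dog", "runs", "on", "grass"]]
# A valid paraphrase with different word choices is penalized:
print(ngram_precision(["a", "puppy", "runs", "outside"], refs, 1))   # 0.5
# while an exact copy of the reference scores perfectly:
print(ngram_precision(["a", "dog", "runs", "on", "grass"], refs, 1))  # 1.0
```

This is exactly why more references (or human raters) make the evaluation more trustworthy: each extra reference adds another acceptable way of wording the description.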
The authors propose a novel architecture for automated image captioning, explain its components thoroughly, and show its impressive performance on several datasets. They also provide a good summary of pre-existing methods, point out the differences between older approaches and theirs, and credit the architectures that inspired their method. Although there are small shortcomings in the way they present their results, these are mostly due to the nature of the task, which is difficult to assess numerically. Overall, we feel that the paper is well written, the methods are thoroughly explained and evaluated, and the proposed ideas are well connected to the rest of the field.