block: bilinear superdiagonal fusion for vqa and vrd 

VQA architecture

In Figure 2.1 we show the architecture of a classical VQA system. An image representation (blue rectangle) is provided by a deep visual feature extractor that yields one or several vectors. In parallel, a textual representation (red rectangle) is produced by a language model that processes the question. Both representations are then merged (green rectangle), possibly with complex strategies based on multi-modal fusion and high-level reasoning schemes such as iterative processing or visual attention mechanisms. Finally, a prediction module (gray rectangle) provides its estimate of the answer to the question. The modules that compose the VQA system are usually designed to be end-to-end trainable on a dataset $\mathcal{D} = \{(v_i, q_i, a_i)\}_{i=1:N}$, which contains ground-truth triplets where the question $q_i$ about the image $v_i$ has answer $a_i \in \mathcal{A}$.
A VQA model can be seen as a parametric function $f_\Theta$ that takes (image, question) pairs as input and yields an answer prediction. Using the training data $\mathcal{D}$, we can define an empirical loss function that quantifies how far the predictions of $f_\Theta$ are from the true answers:
$$\mathcal{L}(\Theta, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\Theta(v_i, q_i), a_i\big) \qquad (2.1)$$
where $\ell$ measures the difference between the model prediction $f_\Theta(v_i, q_i)$ and the ground truth $a_i$. As a result of the free-form answer annotation process, answers in $\mathcal{A}$ may be composed of multiple words. For this reason, early attempts (Malinowski et al. 2016; Gao et al. 2015) model the answer space as sentences, and learn to sequentially decode each word of the true answer. However, the most widely adopted framework is to cast answer prediction as classification. In this setup, the scope of possible answers is fixed, each answer corresponds to a class, and the model computes a probability distribution over the set of classes given an image/question pair. As proposed in (M. Ren et al. 2015; Ma et al. 2016), the classes are obtained by taking the most frequent answers in the training set, regardless of whether they contain one or multiple words. Following this setup, the VQA model outputs a probability distribution over possible answers $f_\Theta(v, q) \in [0, 1]^{|\mathcal{A}|}$, where each coordinate contains the estimated probability of the corresponding answer. To train the model, we use the cross-entropy loss function defined as: $\ell(f_\Theta(v_i, q_i), a_i) = -\log f_\Theta(v_i, q_i)[a_i]$.
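To make the classification formulation concrete, here is a minimal PyTorch sketch of a model $f_\Theta$ trained with the cross-entropy loss of Equation (2.1). The module names, the element-wise fusion and all dimensions are illustrative placeholders, not the architecture used in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationVQA(nn.Module):
    """Minimal VQA model f_Theta: fuses an image vector and a question vector,
    then scores the |A| most frequent training answers as classes."""

    def __init__(self, dim_v=2048, dim_q=2400, dim_h=1200, num_answers=2000):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_h)   # image projection
        self.proj_q = nn.Linear(dim_q, dim_h)   # question projection
        self.classifier = nn.Linear(dim_h, num_answers)

    def forward(self, v, q):
        # simple element-wise fusion, a stand-in for the richer strategies above
        h = torch.tanh(self.proj_v(v)) * torch.tanh(self.proj_q(q))
        return self.classifier(h)               # unnormalized scores over A

# One training step: cross-entropy implements -log f_Theta(v_i, q_i)[a_i]
model = ClassificationVQA()
v = torch.randn(32, 2048)                 # batch of image features
q = torch.randn(32, 2400)                 # batch of question embeddings
a = torch.randint(0, 2000, (32,))         # ground-truth answer indices in A
loss = F.cross_entropy(model(v, q), a)    # Equation (2.1), averaged over the batch
loss.backward()
```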

Textual embedding

In classical VQA systems, a sentence encoder provides an algebraic representation of the question. This representation should encode precise and fine-grained semantic information about questions of variable length. Multiple models exist for such encoders, with different levels of complexity and expressivity. Their choice and design play a critical role in question understanding, and in the final performance of the VQA model.
To manipulate text in natural language, we first need to define the atomic linguistic element. We could consider characters, words, bi-grams of words, etc. In the context of VQA, the usual atomic linguistic unit is the word. Before representing arbitrarily long sentences, we need to define how words can be processed by ML models.
Word representation. The simplest way to represent a word is by its one-hot encoding. Given a finite list of words that constitute a vocabulary $\mathcal{W}$, each word $w$ is assigned an integer index $i_w$. The one-hot encoding of a word $w$ in the vocabulary $\mathcal{W}$ is a binary vector $v_w$ whose size is $|\mathcal{W}|$, and whose $k$-th dimension is defined as:
$$v_w[k] = \begin{cases} 1 & \text{if } k = i_w \\ 0 & \text{otherwise} \end{cases} \qquad (2.4)$$
This very high-dimensional vector is usually replaced by the more compact word embedding, which provides a learnable representation of words. Each word $w$ is assigned a vector of parameters $x_w \in \mathbb{R}^d$, referred to as the embedding of $w$. The dimension $d$ is a hyperparameter, whose typical value is between 50 and 500. These vectors are initialized randomly, or using pre-trained models such as Word2Vec (Mikolov et al. 2013) or GloVe (Pennington et al. 2014). Depending on the task on which these vectors have been trained, semantic and syntactic properties of words can be captured in their associated embedding. In particular, the Euclidean distance between two embedding vectors reflects some type of semantic similarity between their associated words.
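As an illustration, the following sketch contrasts the one-hot encoding of Equation (2.4) with a learnable embedding table; the toy vocabulary and the dimension d = 300 are arbitrary choices, and in practice the table would typically be initialized from Word2Vec or GloVe vectors.

```python
import torch
import torch.nn as nn

# Toy vocabulary W: each word w is mapped to an integer index i_w.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "tie": 4}

def one_hot(word, vocab):
    """Binary vector of size |W| with a single 1 at index i_w (Equation 2.4)."""
    v = torch.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Learnable word embeddings x_w in R^d (d is a hyperparameter, here 300).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

indices = torch.tensor([vocab[w] for w in "what color is the tie".split()])
word_vectors = embedding(indices)   # shape (5, 300), one row per word
```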

Towards visual reasoning

We saw in Section 2.2 that the input modules of a VQA system often take the form of a DL representation extractor. In particular, an image may be represented as a single vector, or a bag of vectors where each embedding is spatially grounded in the image. Then we introduced the different types of fusion layers in Section 2.3, trained to provide a representation that models the interaction between two vector spaces. In this section, we are interested in how different VQA architectures induce different structures in the model. These structures can be seen as providing visual reasoning capacities to the systems, and are crucial to their performance.

Exploiting relations between regions

In the aforementioned methods, all regions are usually considered independently of their context. This may prevent the network from learning to answer questions related to the spatial layout of objects, or to the semantic relations between them. This is why methods that exploit relations between objects for VQA constitute a growing line of research.

In (Z. Chen et al. 2017), a structured attention mechanism is used to encode cross-region relations. The authors remark that the weighted sum used in classical attention schemes can be seen as computing the expectation of a region selection process. They model the joint probability of this process with a grid-structured Conditional Random Field (CRF) that considers each region's 4-neighbourhood. For each region, a unary potential is computed as a score that measures the likelihood of selecting it with respect to the question. Similarly, pairwise potentials are calculated for each pair of neighbouring regions. Finally, approximate inference algorithms take these unary and pairwise scores as input to compute the marginal probability of selecting each region. It is however not straightforward to adapt this type of local propagation method to representations with a non-regular spatial arrangement. In particular, modern VQA architectures rely on the powerful bottom-up features (see Section 2.2.1), in which region vectors are associated with bounding boxes whose position and size vary from one image to another.

Relational modeling has been adapted to these object detection features by (Norcliffe-Brown et al. 2018). They develop a method that generalizes convolution to features that are not arranged on a regular grid but on a graph, where each node corresponds to a region. First, the bottom-up representation of each region is fused with the question embedding to provide a per-region multi-modal representation. This set of vectors defines a semantic neighbourhood structure, where the value of the edge between two regions corresponds to the scalar product between their representations. Over this graph, spatial convolutions based on Gaussian kernels are used to propagate node information and make each region vector aware of its local context.

Modeling relations in VQA can also be done in a vectorial manner, as in (Santoro et al. 2017). Given a set of region vectors, the Relation Network (RN) computes a representation for each pair of objects. This vector encodes the ways in which two objects are related, with respect to a specific question. All these pair embeddings are summed to provide an aggregated image-level representation based on the pairwise relations between objects. Finally, this aggregated relational information is given to an MLP that predicts the answer.
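The following PyTorch sketch illustrates the pairwise aggregation at the core of the Relation Network; dimensions are illustrative, the batch dimension is omitted, and the MLPs g and f are simplified with respect to the original architecture of (Santoro et al. 2017).

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """Pairwise relational aggregation: a shared MLP g encodes every
    (region_i, region_j, question) triplet, the pair embeddings are summed,
    and an MLP f maps the aggregate to answer scores."""

    def __init__(self, dim_obj=2048, dim_q=2400, dim_h=512, num_answers=2000):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * dim_obj + dim_q, dim_h), nn.ReLU(),
            nn.Linear(dim_h, dim_h), nn.ReLU())
        self.f = nn.Sequential(
            nn.Linear(dim_h, dim_h), nn.ReLU(),
            nn.Linear(dim_h, num_answers))

    def forward(self, objects, q):
        # objects: (num_regions, dim_obj), q: (dim_q,)
        n = objects.size(0)
        o_i = objects.unsqueeze(1).expand(n, n, -1)   # region i in each pair
        o_j = objects.unsqueeze(0).expand(n, n, -1)   # region j in each pair
        q_rep = q.expand(n, n, -1)                    # question for every pair
        pairs = torch.cat([o_i, o_j, q_rep], dim=-1)
        relations = self.g(pairs).sum(dim=(0, 1))     # sum over all pairs
        return self.f(relations)                      # answer scores
```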

Composing neural architectures

An interesting idea has gained growing interest within the VQA community. It states that to answer a question, we should first rewrite this question as a program that takes the image as input and returns the answer. This program is composed of modules, each able to accomplish some elementary perceptual task. Not only do we need to learn each perceptual module, we also want to learn to assemble these modules into programs.

In (Andreas et al. 2016b), the Neural Module Network (NMN) is proposed as a general architecture for discretely composing heterogeneous, jointly-trained neural modules into deep networks. Each module, corresponding to a composable vision primitive, manipulates attention maps over the image. Modules are associated with concepts such as find[c], which takes the image as input and yields an attention map locating the argument c (e.g. dog, red, …), or transform[c], which applies the transformation c to an attention map (e.g. above, …). The network is assembled using a parsed version of the question. The parser transforms a sentence like "What color is the tie?" into a network with the structure describe[color](find[tie]).

By the same authors, the Dynamic Neural Module Network (D-NMN) (Andreas et al. 2016a) generates multiple network layout candidates for a question. These layouts are scored with respect to the question by a neural network f, and the best-scoring layout is selected to perform the forward-backward pass on the image. Selecting a network is a non-differentiable operation, which implies that the layout scorer f cannot be learned by simple backpropagation. Here, tools from the reinforcement learning community are used to compute the gradients on the layout scorer's parameters, and to efficiently learn to choose a layout from the candidates, given a question.
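As a rough illustration of module composition, the sketch below assembles a region-based find and a describe module according to the layout describe[color](find[tie]); the real NMN operates on dense spatial attention maps, uses more module types, and derives the layout from a parser, so everything here (module definitions, dimensions, concept embeddings) is a simplified assumption.

```python
import torch
import torch.nn as nn

class Find(nn.Module):
    """find[c]: produces an attention map over image regions for a concept c."""
    def __init__(self, dim_v=2048, dim_c=300):
        super().__init__()
        self.score = nn.Linear(dim_v + dim_c, 1)

    def forward(self, regions, concept):
        # regions: (num_regions, dim_v), concept: (dim_c,)
        c = concept.expand(regions.size(0), -1)
        logits = self.score(torch.cat([regions, c], dim=-1)).squeeze(-1)
        return torch.softmax(logits, dim=0)            # attention over regions

class Describe(nn.Module):
    """describe[c]: reads out the attended regions and scores the answers."""
    def __init__(self, dim_v=2048, dim_c=300, num_answers=2000):
        super().__init__()
        self.classifier = nn.Linear(dim_v + dim_c, num_answers)

    def forward(self, regions, attention, concept):
        attended = (attention.unsqueeze(-1) * regions).sum(dim=0)
        return self.classifier(torch.cat([attended, concept], dim=-1))

# Layout produced by the parser for "What color is the tie?":
# describe[color](find[tie])
find, describe = Find(), Describe()
regions = torch.randn(36, 2048)                  # region features of an image
tie, color = torch.randn(300), torch.randn(300)  # concept embeddings
answer_scores = describe(regions, find(regions, tie), color)
```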
To support this line of research, the CLEVR Dataset (Johnson et al. 2017a) has been conceived. In this dataset, very complex questions are asked about images coming from a simple visual world. In the example we show in Figure 2.7, the visual scene is composed of simple objects such as cubes, spheres and cylinders.
They exist in a discrete and limited set of colors, shapes and textures. In contrast to this very simple visual world, the questions require strong reasoning capacities, such as spatial relationships, logical operations or attribute identification. Importantly, each of these natural language questions is associated with a functional program that can be executed on the scene graph representing an image, yielding the answer to the question. This allowed the development of methods that rely on such program annotations to train VQA models.
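To give a concrete, if simplified, flavour of these program annotations, the toy example below executes a hand-written CLEVR-like program on a miniature scene graph; the attribute names, function names and question are invented for illustration and do not follow the actual CLEVR annotation format.

```python
# Miniature scene graph: one dictionary of attributes per object.
scene = [
    {"shape": "cube",     "color": "red",  "size": "large"},
    {"shape": "sphere",   "color": "blue", "size": "small"},
    {"shape": "cylinder", "color": "red",  "size": "small"},
]

def filter_attr(objects, attribute, value):
    """Keep the objects whose attribute matches the requested value."""
    return [o for o in objects if o[attribute] == value]

def count(objects):
    return len(objects)

# "How many red objects are there?"  ->  count(filter_color[red](scene))
answer = count(filter_attr(scene, "color", "red"))   # 2
```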

MUTAN architecture

Multi-glimpse attention. We embed our MUTAN fusion module in the multi-glimpse attention mechanism presented in Section 2.4.1. It takes as input a question embedding $q \in \mathbb{R}^{d_q}$ and the visual representation $V \in \mathbb{R}^{h \times w \times d_v}$ provided by a Fully Convolutional Network (FCN) (see Section 2.2.1). An attentional system with $G$ glimpses computes $G$ independent attention maps $\alpha_g \in \mathbb{R}^{h \times w}$. Each coefficient $\alpha_g[i, j]$ is obtained by a MUTAN fusion between the question representation and the corresponding region vector. More precisely: $\alpha_g[i, j] = \text{MUTAN}_g(q, V[i, j])$.
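The sketch below illustrates the multi-glimpse attention wiring described above; the MUTAN fusion itself is replaced by a generic nn.Bilinear scorer as a placeholder (the actual Tucker-based fusion is detailed in Chapter 3), and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiGlimpseAttention(nn.Module):
    """G-glimpse attention: each coefficient alpha_g[i, j] comes from a fusion
    between the question q and the region vector V[i, j]. The per-glimpse
    fusion is a simple bilinear scorer standing in for MUTAN_g."""

    def __init__(self, dim_q=2400, dim_v=2048, num_glimpses=2):
        super().__init__()
        self.fusions = nn.ModuleList(
            [nn.Bilinear(dim_q, dim_v, 1) for _ in range(num_glimpses)])

    def forward(self, q, V):
        # q: (dim_q,), V: (h, w, dim_v)
        h, w, dim_v = V.shape
        regions = V.view(h * w, dim_v)
        q_rep = q.expand(h * w, -1)
        glimpses = []
        for fusion in self.fusions:
            scores = fusion(q_rep, regions).squeeze(-1)         # (h*w,)
            alpha = torch.softmax(scores, dim=0)                # attention map
            glimpses.append((alpha.unsqueeze(-1) * regions).sum(dim=0))
        return torch.cat(glimpses, dim=-1)   # concatenated attended vectors

attention = MultiGlimpseAttention()
q = torch.randn(2400)                 # question embedding
V = torch.randn(14, 14, 2048)         # FCN feature map
v_att = attention(q, V)               # shape (2 * 2048,)
```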

Table of contents:

résumé
contents
list of figures
list of tables
acronyms
1 introduction 
1.1 Context
1.1.1 Joint image/text understanding
1.1.2 Visual Question Answering
1.2 Contributions
1.3 Industrial context
2 related works 
2.1 VQA architecture
2.2 Mono-modal representations
2.2.1 Image representation
2.2.2 Textual embedding
2.3 Multi-modal fusion
2.3.1 Fusion in Visual Question Answering (VQA)
2.3.2 Bilinear models
2.4 Towards visual reasoning
2.4.1 Visual attention
2.4.2 Image/question attention
2.4.3 Exploiting relations between regions
2.4.4 Composing neural architectures
2.5 Datasets for VQA
2.6 Outline and contributions
3 mutan: multimodal tucker fusion for vqa 
3.1 Introduction
3.2 Bilinear models
3.2.1 Tucker decomposition
3.2.2 Multimodal Tucker Fusion
3.2.3 MUTAN fusion
3.2.4 Model unification and discussion
3.2.5 MUTAN architecture
3.3 Experiments
3.3.1 Comparison with leading methods
3.3.2 Further analysis
3.4 Conclusion
4 block: bilinear superdiagonal fusion for vqa and vrd 
4.1 Introduction
4.2 BLOCK fusion model
4.2.1 BLOCK model
4.3 BLOCK fusion for VQA task
4.3.1 VQA architecture
4.3.2 Fusion analysis
4.3.3 Comparison to leading VQA methods
4.4 BLOCK fusion for VRD task
4.4.1 VRD Architecture
4.4.2 Fusion analysis
4.4.3 Comparison to leading VRD methods
4.5 Conclusion
5 murel: multimodal relational reasoning for vqa 
5.1 Introduction
5.2 MuRel approach
5.2.1 MuRel cell
5.2.2 MuRel network
5.3 Experiments
5.3.1 Experimental setup
5.3.2 Qualitative results
5.3.3 Model validation
5.3.4 State of the art comparison
5.4 Conclusion
6 conclusion 
6.1 Summary of Contributions
6.2 Perspectives for Future Work
bibliography
