Convolutional Neural Network and Transfer Learning
First of all, let’s have a quick reminder about neural networks and CNNs. A neural network is a superposition of several layers. It takes a numerical input which goes through several hidden layers to finally give a numerical output.
Each layer works in the same way. With an input $X$ of size $n_I$, the output is given by $Y = f(W^T X + b)$, where $f$ is an activation function, $W$ is a weight matrix and $b$ is a bias. The output size depends on the number of units in the hidden layer: if there are $n_h$ units, the sizes of $W$ and $b$ are $n_h \times n_I$ and $n_h$. The activation function $f$ can be a relu ($f(x) = \max(0, x)$), a sigmoid ($f(x) = \frac{1}{1 + e^{-x}}$) or a hyperbolic tangent ($f(x) = \tanh(x)$), for example. The output will be of size $n_h$.
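As an illustration, here is a minimal sketch of such a layer, assuming NumPy as the numerical backend (the explanation above is not tied to any particular library); the sizes are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense_layer(x, W, b, activation=relu):
    # With W stored as an n_h x n_I matrix, the affine map W @ x + b has size n_h.
    return activation(W @ x + b)

n_I, n_h = 4, 3                      # input size and number of hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(n_h, n_I))      # weight matrix of size n_h x n_I
b = np.zeros(n_h)                    # bias of size n_h
x = rng.normal(size=n_I)
y = dense_layer(x, W, b)             # output of size n_h
```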
To train the neural network, we choose a loss function to minimize. It can be the $l_2$ loss ($L(y, \tilde{y}) = \|y - \tilde{y}\|^2$) or the cross entropy ($L(y, \tilde{y}) = -y \log(\tilde{y}) - (1 - y) \log(1 - \tilde{y})$), for example. We then feed the neural network with batches of samples and, for each weight, we compute and apply the gradient of the loss function. This algorithm requires several passes through the whole dataset; one pass is called an epoch.
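This training procedure can be sketched as follows, assuming PyTorch (the framework is not specified here) and dummy data standing in for a real dataset.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.BCEWithLogitsLoss()              # a cross-entropy loss for binary labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy data standing in for the real samples (illustration only).
X = torch.randn(64, 10)
y = torch.randint(0, 2, (64, 1)).float()
dataset = torch.utils.data.TensorDataset(X, y)
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

for epoch in range(10):                         # one pass over the dataset = one epoch
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)         # loss on the current batch
        loss.backward()                         # compute the gradient for each weight
        optimizer.step()                        # apply it
```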
The most widely used source is Imagenet. It is a competition for object recognition and localization which provides millions of pictures spread over one thousand different labels. A lot of models are tested on this dataset, and it is supposed to be one of the most complex and complete datasets of pictures. To show some examples of transfer learning, we can cite the paper [6], which applies transfer learning in order to recognize breast cancer. Another example is the image style transfer proposed by [3], which uses the last layer of VGG, a CNN trained on Imagenet, as a representation of the content of a picture. This last one is very interesting, because we can think that a good representation of the content of a picture can fit well in our context.
Nowadays, state-of-the-art results on Imagenet are obtained by Resnet [5]. It is a CNN with residual connections between some layers. The structure of this CNN is clearly described in the paper, so we refer the curious reader to it for more information.
To perform transfer learning, we remove one (or more) layers at the end of a pre-trained model. This means that we remove at least the classifier and replace it with another one. It can be another neural network layer, but it could also be a standard classifier (SVM, gradient boosting, random forest, …). We keep the weights of the unchanged layers as they are, and we only train the layers and the classifier we added, on our task.
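A minimal sketch of this setup, assuming torchvision’s pre-trained ResNet-50 (the report does not say which implementation of Resnet is used): the original classifier is replaced by a new head and the pre-trained weights are frozen.

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(pretrained=True)   # pre-trained on Imagenet
for p in backbone.parameters():
    p.requires_grad = False                   # keep the pre-trained weights as they are

n_features = backbone.fc.in_features          # 2048 for ResNet-50
backbone.fc = nn.Linear(n_features, 2)        # new classifier, here for two labels
# Only the new head (or any classifier put on top of the features) is trained on our task.
```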
Transfer learning has several advantages. The major one is that it gives us the possibility to use a very complex and heavy model without training it, which could take days or weeks with our computation power. However, with this method, we cannot customize the representation as much as we want. In fact, we will use the representation given by Resnet and we will build something to customize it a little, but we will never go back to the raw pictures. It means that if we are interested in information which is not present in the representation given by Resnet, we will miss it and we will not be able to recover it.
If we want a more customized model, we have to create our own neural network and train it. To do so, we have two different options.
Auto-encoder
The first option would be to use a classic neural network, convolutional or not, on top of our raw pictures. It would require a labeled dataset and we could train it in a classic way. However, the pictures we have are very high dimensional (we used the same preprocessing as the one for Resnet, which gives pictures of size 224 × 224 × 3) and we do not have a lot of samples (640). With a number of features more than 200 times higher than the number of samples, training would be a serious problem: the model could learn the dataset perfectly without learning anything generalizable. That is why we did not try this method.
The second option is to train the neural network in an unsupervised way with the help of auto-encoders. Instead of classifying pictures, auto-encoders aim to reconstruct their input. However, in order to make the network learn something useful, we impose some constraints on its hidden layers. The most common one is to lower the size of the hidden layers. With this constraint, the neural network has to learn to compress the information. Moreover, if the dimension is too low to contain all the information, it has to select which part of the information it will keep in order to reconstruct the input as closely as possible.
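Here is a minimal sketch of a standard auto-encoder with a low-dimensional bottleneck, assuming PyTorch; the layer sizes are illustrative, not the ones used later in this report.

```python
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_in=2048, n_code=64):
        super().__init__()
        # The encoder compresses the input into a code of size n_code < n_in.
        self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                     nn.Linear(256, n_code))
        # The decoder tries to reconstruct the input from the code.
        self.decoder = nn.Sequential(nn.Linear(n_code, 256), nn.ReLU(),
                                     nn.Linear(256, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes a reconstruction loss, e.g. nn.MSELoss(), between the
# output and the input itself, so no labels are required.
```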
There are several variants of auto-encoders. We can cite the denoising auto-encoder [14] or the deconvolutional auto-encoder [11], for example. However, they all work on the same principle, so we will stick to the standard auto-encoder in this explanation.
With this second method, we overcome the problem of the labels, but we are still facing the issue of dimension: here again, the auto-encoder could learn the dataset perfectly. This is why we chose not to train a model from scratch.
Description and tuning of the models
The solution to our lack of data is to use a neural network which has been trained on another dataset and on another task. We chose the most standard source, Imagenet, and a model which provides state-of-the-art results, Resnet. We removed the last layer, i.e. the classifier, and we obtained a representation with 2048 attributes.
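In practice, this amounts to keeping everything up to the penultimate layer, for instance as in the following sketch (again assuming torchvision’s ResNet-50; replacing the final layer by the identity leaves the 2048-dimensional activations as the output).

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(pretrained=True)
resnet.fc = nn.Identity()                # drop the Imagenet classifier
resnet.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)  # placeholder for preprocessed pictures
    features = resnet(batch)             # representation of shape (4, 2048)
```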
We could think that the penultimate layer of Resnet gives us the representation we want, but it could contain some information which is not relevant to our context, and the piece of information we want may only be a small part of it.
First of all, let’s challenge the representation given by Resnet a little.
Naive Approach
We performed several tests to see if Transfer Learning with Resnet can help us in our context.
The first one is simply to run the entire Resnet model, with its Imagenet classifier, and to look at the results it gives us. Results are given in figure 3. Even if Imagenet does not have a « house » or « building » label, it has « mobile home » and « patio », which cover more than two thirds of our data. The other pictures are split between some building types (« monastery », « boathouse », « church », …) and objects which could be present near a house (« picket fence », « tile roof », …).
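This first test can be sketched as follows, assuming torchvision’s pre-trained ResNet-50 with its original Imagenet classifier; for each picture we simply keep the Imagenet class with the highest score.

```python
import torch
from torchvision import models

resnet = models.resnet50(pretrained=True)
resnet.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)   # placeholder for preprocessed pictures
    logits = resnet(batch)                # one score per Imagenet class, shape (4, 1000)
    predicted = logits.argmax(dim=1)      # index of the predicted Imagenet label
```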
These labels do not fit the pictures perfectly, but they are far from being totally absurd. To confirm that Resnet can recognize a house, we performed a second test. We fed Resnet with pictures from our dataset and pictures extracted from the Imagenet database, and we trained a classifier on top of the penultimate layer to recognize whether a picture comes from our dataset or from the Imagenet dataset, knowing that all pictures received the same preprocessing (resizing and normalization).
In other words, we trained a classifier to recognize whether a picture is a house or not. With a simple classifier (a logistic classification), we got an accuracy of 1.0 on the training set and of more than 99 % on the testing set. These scores mean that Resnet can recognize a house almost perfectly.
Now that we know that Resnet is not completely irrelevant in our context, we have to check whether it is useful for the classification of the user’s interest. To do so, we created a classifier trained to recognize the labels we gave to the photos.
To train our model, we split the dataset into a training set (479 pictures) and a testing set (161 pictures). The testing set is not used at all during the whole training phase. We just compute a score on this set to have an idea of the generalization power of our model.
We kept the same class frequencies on the training and the testing datasets.
We then performed a cross validation. We split the training set again into five buckets, in which we also kept the class frequencies. We then trained our model on four buckets and computed a validation score on the fifth. We did this for each 4-1 split, and we chose the hyperparameters which maximized the average validation score on these five splits. Hyperparameters are parameters whose values are chosen before the beginning of the learning phase (penalization term, learning rate, …). Then, we trained the model with the best hyperparameters on the whole training set, and we computed a final score on the testing set.
Here, we chose to classify the pictures with a logistic regression with $l_2$ penalization. In this model, if we denote by $(x_i, y_i)_{i < n}$ the $n$ available observations and their labels, we estimate the value of $y$ with a function $f(x) = w^T x + c$, and we have to find the best values for $w$ and $c$. To find these weights, we minimize the logistic loss $L_{\log}(y, x) = \log(1 + e^{-y f(x)})$ for each observation, with a penalization term $\frac{1}{2C} w^T w$. $C$ is the inverse of the regularization coefficient, i.e. the lower $C$ is, the higher the penalization. This gives us the final optimization problem
$$\min_{w, c} \; \sum_{i=0}^{n-1} \log\left(1 + e^{-y_i (w^T x_i + c)}\right) + \frac{1}{2C} w^T w.$$
After a quick manual search, we found that a value of 0.008 for $C$ gives the best accuracy during the cross validation (a mean accuracy of 0.88 on the training set, 0.77 on the validation set and 0.79 on the testing set). With this value of $C$ and a training on the whole training dataset, we get a training score of 0.88 and a testing score of 0.81.
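As an illustration of the whole procedure (stratified split, five-fold cross validation over $C$, final training), here is a sketch assuming scikit-learn; `features` and `labels` are placeholders for the Resnet representation of our 640 pictures and the labels we gave them, and the grid of $C$ values is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(640, 2048))        # placeholder for the Resnet representation
labels = rng.integers(0, 2, size=640)          # placeholder for our labels

# Stratified split: 479 training samples, 161 testing samples.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=161, stratify=labels, random_state=0)

best_C, best_score = None, -np.inf
for C in [0.002, 0.004, 0.008, 0.016, 0.032]:  # grid around the value found manually
    scores = []
    for tr, va in StratifiedKFold(n_splits=5).split(X_train, y_train):
        clf = LogisticRegression(C=C, penalty="l2", max_iter=1000)
        clf.fit(X_train[tr], y_train[tr])
        scores.append(clf.score(X_train[va], y_train[va]))   # validation accuracy
    if np.mean(scores) > best_score:
        best_C, best_score = C, np.mean(scores)

# Retrain with the best C on the whole training set and score on the testing set.
final = LogisticRegression(C=best_C, penalty="l2", max_iter=1000).fit(X_train, y_train)
test_accuracy = final.score(X_test, y_test)
```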
We can draw two conclusions from this result. First, knowing that the naive accuracy, i.e. the accuracy we can get by always predicting the same label, is 0.69, we can see that this classifier learned useful information. It means that Resnet is pertinent for our context.
The second conclusion is less certain. During the final training phase, the training set is bigger than the one used during the cross validation (almost 100 samples more). This increase leads to an improvement of 3 % in the accuracy, so we can think that the results could be improved a lot by adding more data.
Now that the relevance of transfer learning is confirmed, we can think about how to improve the representation obtained with Resnet. The first property we will work on is the dimensionality of this representation. In the next part, we will try to get similar or better results with fewer attributes per sample.
Classic Approach
A classic approach is to add a new hidden layer at the end of Resnet, before the classifier. In order to lower the dimensionality, we impose the number of units to be lower than 2048, the dimension of the Resnet representation. This hidden layer is trained with a classification task, i.e. we train the neural network to find the labels we gave to the pictures. With this training, we can hope that the hidden layer will capture information useful for our problem while removing the noise coming from irrelevant information.
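A sketch of this architecture, assuming PyTorch: a hidden layer with fewer than 2048 units is inserted between the (frozen) Resnet representation and the classifier; 128 units are used here, as in the first experiment below.

```python
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(2048, 128),   # new hidden layer: the lower-dimensional representation
    nn.ReLU(),
    nn.Linear(128, 2),      # classifier trained on the labels we gave to the pictures
)
```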
In order to test this representation, the perfect measure would be to run the data exploration algorithm on a bunch of user interests, and to see how well it performs depending on the representation we are working on. However, we only have two sets of labels at our disposal, and it would be very resource consuming to run the entire online algorithm with each configuration.
Instead, we chose to train the neural network to classify a single user’s interest in a batch context. It means that it trained on 75 % of the database, and we tested its result on the remaining 25 %. The score we obtained with this batch classification gives us an approximation of the upper bound in the online context. It is an approximation because the data exploration algorithm is supposed to choose its training set to maximize its accuracy, rather than splitting the database randomly.
To run this experiment, we took the Resnet representation of our dataset. We then trained a hidden layer and a classifier on our dataset with the same labels as in the previous experiment.
Like before, we split the dataset into a training and a testing dataset, and we performed a cross validation. However, we did not retrain the model on the whole training set to get a final testing score, because we needed a validation set to stop the training of the neural network. For each model, we trained the neural network and kept the weights obtained when the validation score was the highest. The results we present are therefore averaged over the five folds of the cross validation.
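This model selection can be sketched as follows, assuming PyTorch and dummy data standing in for one training/validation fold: after each epoch, the validation accuracy is computed and the weights giving the best validation score so far are kept.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2048, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy data standing in for one fold of the cross validation.
X_tr, y_tr = torch.randn(100, 2048), torch.randint(0, 2, (100,))
X_va, y_va = torch.randn(30, 2048), torch.randint(0, 2, (30,))

best_state, best_val = copy.deepcopy(model.state_dict()), 0.0
for epoch in range(50):
    optimizer.zero_grad()
    criterion(model(X_tr), y_tr).backward()
    optimizer.step()
    with torch.no_grad():
        val_acc = (model(X_va).argmax(dim=1) == y_va).float().mean().item()
    if val_acc > best_val:                               # keep the best checkpoint
        best_val = val_acc
        best_state = copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)                        # restore it at the end
```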
To measure the performance, the score we chose is the accuracy. It is the ratio of the number of correctly classified samples to the total number of samples.
First experiment with dropouts
First, we trained a neural network with a hidden layer of 128 units, with different values for the dropout. Dropout consists in randomly sampling units each time you feed some data to the neural network, and shutting them down for this training step. Each time you feed data, you sample again the units you shut down. We can add dropout to the input, to the hidden layer(s) or to both.
Dropout is a classic technique to reduce the overfitting of a neural network. We say that a model is overfitting when it performs a lot better on the training dataset than on the testing dataset. This can be caused by a lack of data, for example: when the training dataset is too small, the model will memorize the dataset instead of learning general information. Since we have a small database, we strongly suspected that we would face overfitting issues.
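For reference, here is a sketch of this configuration assuming PyTorch, whose Dropout layer takes a drop probability, i.e. 1 minus the keeping probability used in the figures; the keeping probabilities below are illustrative.

```python
import torch.nn as nn

keep_input, keep_hidden = 0.5, 0.5             # keeping probabilities (illustrative)

head = nn.Sequential(
    nn.Dropout(p=1.0 - keep_input),            # dropout on the input representation
    nn.Linear(2048, 128),
    nn.ReLU(),
    nn.Dropout(p=1.0 - keep_hidden),           # dropout on the hidden layer
    nn.Linear(128, 2),
)
# head.train() enables dropout during training; head.eval() disables it for evaluation.
```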
The results are presented in figures 15 and 16 in the annexes. They are presented in a grid, the second figure being the right part of the first one. When you go down, the dropout on the input decreases (the keeping probability increases). When you go right, the dropout on the hidden layer decreases. The keeping probability takes the values 0.1, 0.25, 0.5, 0.7, 0.8, 0.9 and 1.0, where 1.0 corresponds to the case without dropout.
Each graph shows the accuracy during the training phase on the training, validation and testing sets. These accuracies are averaged over the five 4-1 splits of the cross validation. Figure 5 is one of these graphs, for a dropout of 0.1 on the input and on the hidden layer. The training accuracy is represented by the red curve, the green one stands for the validation, and the blue one for the testing.
With these results, we can make some interesting observations on the effect of the dropout. First, let’s look at the first row, and at the red curve representing the training accuracy. More precisely, we will consider the point where the model starts to perform better than a model always predicting yes or always predicting no; this type of model can reach an accuracy of 69 %. We can see that the neural network with the highest dropout begins to learn something useful after about 1500 epochs. We can also see that, when the dropout on the hidden layer decreases, this point is reached more quickly (something like 500 epochs for the second model, and almost instantaneously for the others).
Then, let’s keep considering the first row and the training accuracy, and let’s look at the point where the model reaches an accuracy of 95 %. For the neural network with the highest dropout, this point is reached between 8000 and 10000 epochs, whereas it is reached in less than 6000 epochs for the second one. We can see that this point is reached faster when we decrease the dropout on the hidden layer.
We can complete these two observations by saying that, if we fix the dropout on the input to another value than 0.1, i.e. if we take another row, these two remarks still hold. Moreover, if we fix the dropout on the hidden layer and look at the effect of a variation in the dropout on the input, these two remarks still hold again.
With these observations, we can say that the dropout on the input and the dropout on the hidden layer have the same effect: they both slow down the training of the neural network. However, this is not the effect we are interested in. We want to see the impact of dropout on the generalization power of the neural network. To do so, let’s look at the validation and testing accuracies.
We can observe that there is not a big difference between the different graphs, except when there is a very high dropout, which just slows down the process. In fact, all validation and testing accuracies reach the same point and stay more or less constant afterwards. This means that the dropout does not have a big influence on the training process here.
Finally, we can notice that the validation and testing accuracies are close to the value we found in the first experiment with the whole Resnet representation (around 77 % of accuracy). However, our model is strongly overfitting the training dataset, as we anticipated. It means that we could maybe find a solution to prevent this overfitting and thus increase the validation and testing accuracies. Dropout was a first option, which did not work. Let’s try data augmentation.