Recursive feature elimination with random forest regression

Data description

The dataset contains brightness temperatures, reflectances and other variables useful for determining the cloud top pressure from the imager instrument AVHRR on the two satellites NOAA-17 and NOAA-18 during the time period 2006-2009. The dataset also contains numerical weather prediction (NWP) variables calculated using mathematical models. In addition, the dataset contains observed cloud top pressure and cloud top height estimates from the more accurate instrument on the satellite CALIPSO. For variables where many missing values were prevalent, a default value decided from scientific knowledge of those specific variables was used for those observations. Other observations containing missing values for any variable were removed. After removal of missing values, the dataset contains in total 574,828 observations. Each observation represents a pixel. The observations in the data can be divided into the three different cloud classes low, medium and high, which are derived by an algorithm from the CALIOP data. CALIOP is the active lidar on the satellite CALIPSO. There are in total 276 variables used as predictors and one response variable. The variables are the following:
Continuous variables:
Azimuth difference
Brightness temperatures of the 11, 12 and 3.7 micron channels
Emissivity of the 3.7, 11 and 12 micron channels
Longitude and latitude
Reflectances of the 0.6 and 0.9 micron channels
The satellite zenith angle
The solar zenith angle
Texture of reflectances of the 0.6 micron channel
Texture of different temperatures
Threshold values for different reflectances and temperatures
Cloud top pressure
Cloud top height
Total optical depth
Continuous numerical weather prediction (NWP) variables:
Fraction of land
Simulated cloud-free brightness temperature of the 11 and 12 micron channels measured over land and sea level
Height for 60 different levels in the atmosphere
Mean elevation
Pressure for 60 different levels in the atmosphere
Tropopause pressure and temperature
Surface height
Surface land temperature
Surface temperature
Temperature at 500, 700, 850 and 950 hPa
Pressure at surface level
Surface sea temperature
Temperature at 850 hPa
Temperature for 60 different levels in the atmosphere
Column integrated water vapour
Discrete variables:
Cloud type
Cloud type conditions
Cloud type quality
Cloud type status
Variables used for conversion from cloud top pressure to interpolated cloud top height:
Height and pressure at different levels in the atmosphere
Surface pressure and surface height
Variables used for dividing the observations into cloud classes:
Number of cloud layers
Feature classification flags

Data preprocessing

The response variable describing cloud top pressure and the variable describing cloud top height are measured with the more accurate instrument on the satellite CALIPSO. The variables describing cloud top height and total optical depth are not among the predictors used in the thesis.
The pressure, temperature and height for 60 different levels in the atmosphere are treated as 60 different variables, one variable for each level. Only 36 out of the 60 variables representing pressure at different levels are used as predictors, since the last 14 variables have the same value for each observation and are therefore not useful as predictors.
The difference between the brightness temperatures of the 11 and 12 micron channels is also added as a variable to the dataset.
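
As a minimal illustration of these two preprocessing steps, the following sketch, using hypothetical column names and toy values, adds the 11-12 micron brightness temperature difference and drops variables that are constant across all observations:

```python
import pandas as pd

# Toy data with hypothetical column names; the real dataset has 276 predictors.
df = pd.DataFrame({
    "bt_11": [245.1, 230.4, 260.2],              # 11 micron brightness temperature
    "bt_12": [243.9, 229.8, 259.0],              # 12 micron brightness temperature
    "nwp_pressure_level_59": [1.0, 1.0, 1.0],    # a level that is constant everywhere
})

# Derived predictor: difference between the 11 and 12 micron brightness temperatures.
df["bt_diff_11_12"] = df["bt_11"] - df["bt_12"]

# Drop variables that take the same value for every observation, as they
# carry no predictive information.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
```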

The four discrete variables describing cloud type, cloud type conditions, cloud type quality and cloud type status were transformed with one-hot encoding into as many dummy variables as there are categories for each variable. This is necessary because the neural network and random forest models would otherwise interpret these variables as numeric, which they are not, since their categories have no specific order. Standardization of the data is important for neural networks: it makes training of the network faster, helps prevent the network from getting stuck in a local minimum and helps ensure convergence. In random forests, however, there is no need to standardize the data. The inputs and outputs of the neural network model were standardized to have mean 0 and standard deviation 1 by using the formula:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the variable.
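
A minimal sketch of these two transformations, with hypothetical column and category names (the thesis data has four categorical variables, not one):

```python
import pandas as pd

# Toy data: one categorical cloud-type variable and one continuous predictor.
df = pd.DataFrame({
    "cloud_type": ["cirrus", "stratus", "cirrus", "cumulus"],
    "bt_11": [245.1, 230.4, 260.2, 251.7],
})

# One-hot encode the categorical variable into one dummy column per category.
df = pd.get_dummies(df, columns=["cloud_type"])

# Standardize a continuous input to mean 0 and standard deviation 1.
df["bt_11"] = (df["bt_11"] - df["bt_11"].mean()) / df["bt_11"].std()
```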

Methods

Sampling methods

Different sampling methods were used before splitting the data into training, validation and test data. Either all of the data or only part of the data was sampled, using both simple random sampling and stratified random sampling.
A filtered dataset, containing only observations where the total optical depth is larger than the threshold value 0.5, was also used. The filtered data was used to see whether the performance of the neural network model and the random forest model would improve once observations for thinner clouds, which can be hard to detect by an instrument, were removed.
When cross-validation is used for parameter tuning of the two predictive models in the thesis, 100,000 observations are sampled using simple random sampling and then split into data used for cross-validation and test data: 80% of the sampled data is used for cross-validation and 20% as test data. When using simple random sampling on all the data, the observations were randomly sampled and 50% of the observations were used as training data, 25% as validation data and the remaining 25% as test data. When using stratified random sampling, the data is divided into different subgroups and observations are randomly selected from each subgroup according to some sample size. The stratified random sampling was based on the three different cloud classes low, medium and high derived by an algorithm from the CALIOP data. Two strategies were used: one where an equal number of observations was sampled from each cloud class, and one where a larger number of observations was sampled from the low cloud class. 50% of the sampled observations from each cloud class were used as training data, 25% as validation data and the remaining 25% as test data. The same split into training, validation and test data is used for the two predictive models in the thesis when the same sampling method is used.
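
As an illustration, the sketch below performs stratified sampling by cloud class (equal numbers per class) and a 50/25/25 split into training, validation and test data; the column names and sample sizes are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data with a cloud-class label derived from the CALIOP data.
df = pd.DataFrame({
    "cloud_class": rng.choice(["low", "medium", "high"], size=1000, p=[0.5, 0.3, 0.2]),
    "bt_11": rng.normal(250.0, 10.0, size=1000),
})

# Stratified random sampling: draw an equal number of observations per cloud class.
sampled = df.groupby("cloud_class").sample(n=100, random_state=0)

# Shuffle and split into 50% training, 25% validation and 25% test data.
sampled = sampled.sample(frac=1.0, random_state=0)
n = len(sampled)
train = sampled.iloc[: n // 2]
val = sampled.iloc[n // 2 : 3 * n // 4]
test = sampled.iloc[3 * n // 4 :]
```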

Random forest regression

The non-linear statistical method random forest was chosen as one of the methods to predict cloud top pressure in this thesis because of its many advantages, such as its good predictive power when trained on large amounts of data and its robustness to outliers. Random forest regression constructs several regression trees. At each split in a tree, a random subset of variables is chosen from all the variables. This keeps the predictions from the different trees from being highly correlated, since the most important variable will not always be used in the top split of every tree. The algorithm for random forest regression is [9]:
Algorithm 2.1 Random forest regression
1. For b = 1 to B:
(a) Sample N observations at random with replacement from the training data
(b) Grow a random forest tree T_b on the sampled N observations, by recursively repeating the following steps for each terminal node of the tree, until a minimum node size n_min is reached
i. Select m variables at random from the p variables
ii. Select the best variable/split-point among the m variables
iii. Split the node into two daughter nodes
2. Output the ensemble of trees {T_b}, b = 1, ..., B
The following formula is used to make a prediction at a new point x:

\hat{f}_{rf}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)

Parameters to tune when performing random forest regression are the number of trees, the number of variables selected at each split and the maximum depth of the trees. These three parameters were tuned by performing a grid search with 3-fold cross-validation, and the model chosen is the one with the minimum average mean squared error (MSE). Random forest is robust to overfitting because of the randomness in the trees of the algorithm. When building a random forest model, N observations are sampled at random with replacement, a method called bootstrapping; the observations not selected by the model are called out-of-bag samples. Because each tree is built on a bootstrapped dataset and a number of variables are selected at random at each split in the tree, the random forest algorithm is less likely to overfit than a single regression tree would be.
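
The grid search with 3-fold cross-validation could look like the following sketch with scikit-learn, which the thesis reports using for the random forest models; the synthetic data and the parameter values in the grid are assumptions, not those searched in the thesis:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; in the thesis X would hold the predictors and
# y the CALIPSO cloud top pressure.
X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)

# 80% of the sampled data for cross-validation, 20% held out as test data.
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid over the three tuned parameters.
param_grid = {
    "n_estimators": [100, 300],      # number of trees
    "max_features": [5, 10],         # variables considered at each split
    "max_depth": [10, None],         # maximum depth of the trees
}

# 3-fold cross-validation; the model with the lowest average MSE is selected.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_cv, y_cv)
best_rf = search.best_estimator_
print(search.best_params_)
```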

Recursive feature elimination with random forest regression

Recursive feature elimination (RFE) with random forest regression is a variable selection method which uses a backward selection algorithm and the variable importance measure calculated by the random forest model to determine which variables to eliminate in each step. The recursive feature elimination starts out with all variables, fits a random forest model and calculates the variable importance for each variable to determine the ranking of the variables. For each subset size used in the algorithm, the most important variables are kept and a new random forest is fit with the kept variables. The performance of each subset is measured by the root mean squared error (RMSE). To take into account the variation in the performance estimates as a result of the variable selection, 3-fold cross-validation is used as an outer resampling method. The chance of overfitting the predictors is diminished by using cross-validation. When using 3-fold cross-validation, two-thirds of the data are used to perform the variable selection while one-third of the data is held back and used to evaluate the performance of the model for each subset of variables. The variable selection process is thus performed three times, using a different hold-out set each time. The optimal number of variables in the final model is determined at the end by the three hold-out sets. The variable importances for each resampling iteration and each subset size are then used to estimate the final list of predictors to keep in the model.
The algorithm can be described by the following steps [8]:
Algorithm 2.2 Recursive feature elimination with resampling
1. for each resampling iteration do
2. Partition data into training and test/hold-back set via resampling
3. Train a random forest model on the training set using all predictors
4. Predict the held-back samples
5. Calculate variable importance
6. for every subset size Si, i=1…N do
7. Keep the Si most important predictors
8. Train the model on the training data using Si predictors
9. Predict the held-back samples
10. end
11. end
12. Calculate the performance measures for the Si using the held-out samples
13. Determine the appropriate number of predictors
14. Estimate the final list of predictors to keep in the final model
15. Fit the final model based on the optimal Si using the original training set
For RFE, 1000 trees are used in the random forest model and the number of variables randomly selected at each split is p/3, where p is the total number of variables. The subset with the lowest average RMSE is chosen as the optimal subset of variables from the recursive feature elimination. In random forest, the variable importance of a variable is a measure of the mean decrease in accuracy of the predictions on the out-of-bag samples when the specific variable is not included in the model [15].
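
The thesis performs this selection with the caret and randomForest packages in R. As a rough Python analogue, the sketch below uses scikit-learn's RFECV, which also eliminates variables recursively based on the forest's importances and scores subsets with cross-validation, although its resampling scheme differs in detail from Algorithm 2.2; the data, the 200 trees (instead of 1000) and the scoring choice are assumptions made to keep the sketch small:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the satellite predictors and cloud top pressure.
X, y = make_regression(n_samples=1000, n_features=30, n_informative=10,
                       noise=0.1, random_state=0)

# Random forest with p/3 variables per split; fewer trees than in the thesis.
rf = RandomForestRegressor(n_estimators=200, max_features=1 / 3, random_state=0)

# Recursive feature elimination with 3-fold cross-validation, ranking variables
# by the forest's importances and scoring each subset by RMSE.
selector = RFECV(rf, step=1, cv=3, scoring="neg_root_mean_squared_error", n_jobs=-1)
selector.fit(X, y)

print("optimal number of predictors:", selector.n_features_)
print("kept predictors (mask):", selector.support_)
```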

The multilayer perceptron

The multilayer perceptron is a type of artificial neural network that is widely used for non-linear regression problems. Advantages of the multilayer perceptron are that it learns by example and that it requires no statistical assumptions about the data [16]. Because of the vast number of possible neural network models and types of networks, the neural network models in this thesis were limited to the simple multilayer perceptron with one hidden layer trained with backpropagation.
The multilayer perceptron has a hierarchical structure and consists of different layers. The different layers consist of interconnected nodes (neurons). The multilayer perceptron represents a nonlinear mapping between inputs and outputs. Weights and output signals connect the nodes in the network. The output signal is a function of the sum of all inputs to a node, transformed by an activation function. The activation functions make it possible for the model to capture nonlinear relations between inputs and outputs. The multilayer perceptron belongs to the class of feedforward neural networks, since an output from a node is scaled by the connecting weight and fed forward as an input to the nodes in the next layer [6].

A multilayer perceptron has three types of layers: one input layer, which is only used to pass the inputs to the model, one or more hidden layers, and one output layer. Figure 3.1 shows a multilayer perceptron architecture with one hidden layer. The more neurons there are in the hidden layer, the more complex the neural network is. The multilayer perceptron is a supervised learning technique and learns through training. If, when training the multilayer perceptron, the output for an input is not equal to the target output, an error signal is propagated back through the network and used to adjust the weights, resulting in a reduced overall error. This procedure is called the backpropagation algorithm and consists of the following steps [6]:
Algorithm 2.3 Backpropagation algorithm
1. Initialize network weights
2. Present the first input vector from the training data to the network
3. Propagate the input vector through the network to obtain an output
4. Calculate an error signal by comparing actual output and target output
5. Propagate error signal back through the network
6. Adjust weights to minimize overall error
7. Repeat steps 2-7 with the next input vector until the overall error is satisfactorily small
The output of a multilayer perceptron with one hidden layer can be defined by the following equation [1]:

y_k^o = f_k^o\left( b_k^o + \sum_{i=1}^{S} w_{ik}^o y_i^h \right) = f_k^o\left( b_k^o + \sum_{i=1}^{S} w_{ik}^o f_i^h\left( b_i^h + \sum_{j=1}^{N} w_{ji}^h x_j \right) \right)   (3.1)

where k = 1, ..., L and L is the number of neurons in the output layer, S is the number of neurons in the hidden layer and N is the number of neurons in the input layer. Quantities belonging to the hidden layer carry the superscript h and quantities belonging to the output layer the superscript o. The weight that connects neuron j of the input layer with neuron i of the hidden layer is denoted w_{ji}^h, and the weight that connects neuron i of the hidden layer with neuron k of the output layer is denoted w_{ik}^o. b_i^h is the bias of neuron i of the hidden layer and b_k^o is the bias of neuron k of the output layer. Using biases in a neural network makes it possible to shift the activation functions, which can be useful for the network to learn. f_i^h is the activation function of neuron i of the hidden layer and f_k^o is the activation function of neuron k of the output layer [1].
In the multilayer perceptron an activation function is used for each neuron in the hidden layer and output layer. In this thesis the activation function for the hidden layer is the tangent hyperbolic activation function and for the output layer the identity activation function is used. The identity activation function is useful when predicting continuous targets such as cloud top pressure with a neural network, since its output is not restricted to a bounded interval such as (-1, 1), unlike the tangent hyperbolic function.
The activation function tangent hyperbolic has the form [11]:
f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}   (3.2)
The identity activation function has the form:
f(z) = z (3.3)
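
A minimal numerical sketch of equation (3.1) with these two activation functions, using NumPy and small, arbitrary layer sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

N, S, L = 4, 3, 1               # neurons in the input, hidden and output layers
x = rng.normal(size=N)          # one input vector

# Weights and biases, following the notation of equation (3.1).
W_h = rng.normal(size=(S, N))   # w_ji^h: input neuron j -> hidden neuron i
b_h = np.zeros(S)               # b_i^h
W_o = rng.normal(size=(L, S))   # w_ik^o: hidden neuron i -> output neuron k
b_o = np.zeros(L)               # b_k^o

# Hidden layer with the tangent hyperbolic activation (equation 3.2).
y_h = np.tanh(W_h @ x + b_h)

# Output layer with the identity activation (equation 3.3).
y_o = W_o @ y_h + b_o
print(y_o)
```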

Mini-batch stochastic gradient descent

An optimization function is used to determine the change in weights during the backpropagation algorithm. One commonly used optimization function is stochastic gradient descent. Because of the computational time and the amount of data used in this thesis, a parallelized version of the stochastic gradient descent optimization algorithm is used in the backpropagation algorithm to train the neural network. The parallelized version performs mini-batch training to speed up the neural network training. A mini-batch is a group of observations in the data. In mini-batch training, the average of the subgradients for several observations is used to update the weights and biases, in contrast to the traditional stochastic gradient descent algorithm where only one observation at a time is used [5].
Choosing the right mini-batch size is important for optimal training of the neural network; a too large mini-batch size can greatly decrease the rate of convergence [14]. The mini-batch size used when training the neural network is set to 250. The parameters learning rate, momentum and weight decay must be chosen properly for the training of the network to be effective. The learning rate is a parameter that determines the size of the changes in the weights that occur during training of the network [2]; the smaller the learning rate, the smaller the changes of the weights in the network. The momentum is a parameter that adds a part of the previous weight changes to the current weight changes; a high value of the momentum makes the training of the network faster [2]. Weight decay helps the model generalize well to unseen data. Weight decay is a form of regularization which adds a penalty to the cost function that the neural network tries to minimize through backpropagation. In this thesis the mean squared error is used as the cost function. Weight decay penalizes larger weights, which can harm the generalization of the neural network, by not letting the weights grow too large if it is not necessary [17].
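
As an illustration of how such a network might be set up, the following sketch uses Keras, which the thesis reports using for the multilayer perceptrons. The layer sizes, learning rate, momentum and the L2 realization of weight decay are assumptions; only the tanh hidden layer, linear output, mean squared error cost and mini-batch size of 250 come from the text:

```python
from tensorflow import keras

n_inputs, n_hidden = 100, 30    # hypothetical layer sizes

model = keras.Sequential([
    keras.layers.Input(shape=(n_inputs,)),
    # One hidden layer with the tangent hyperbolic activation; an L2 penalty
    # on the weights is one common way to realize weight decay.
    keras.layers.Dense(n_hidden, activation="tanh",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),
    # Identity (linear) activation in the output layer for the continuous target.
    keras.layers.Dense(1, activation="linear"),
])

# Mini-batch stochastic gradient descent with momentum; the mean squared error
# is the cost function minimized through backpropagation.
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=sgd, loss="mse")

# Training would then use mini-batches of 250 observations, e.g.:
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=250, epochs=5000)
```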

The choice of values for the parameters learning rate, momentum and weight decay is important for whether the neural network will overfit or not. To prevent the neural network from overfitting one can use a method called "early stopping". To evaluate whether the network is overfitting one can monitor the validation error. In the beginning of training, the training error and validation error usually both decrease, but after a certain number of epochs the validation error starts to increase while the training error keeps decreasing. At that point the neural network should stop training, since beyond that point it starts to overfit the data. One epoch is a forward and backward pass of all the training data. If the neural network overfits the data, the network will not perform well on unseen data. If there is no decrease in validation error after a certain number of epochs one should stop training the network. Since the neural network can reach a local minimum, resulting in a decrease of validation error followed by an increase of validation error, the number of consecutive epochs without a decrease in validation error after which training is stopped should be chosen properly. In this thesis the training is stopped if there is no decrease in validation error for 300 consecutive epochs. The weights of the neural network are initialized by random sampling from the uniform distribution and the biases are initialized to 0.
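
A minimal sketch of how this stopping rule and initialization could be expressed with Keras; the uniform interval for the weight initialization is an assumption, since the thesis does not state it:

```python
from tensorflow import keras

# Stop training when the validation error has not decreased for 300 consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=300)

# Weights drawn from a uniform distribution (interval assumed), biases set to 0.
init = keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)
layer = keras.layers.Dense(30, activation="tanh",
                           kernel_initializer=init,
                           bias_initializer="zeros")

# The callback is passed to model.fit, e.g.:
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=250, epochs=10000, callbacks=[early_stop])
```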

Technical aspects

For the recursive feature elimination with random forest regression and resampling, the packages caret and randomForest were used in the programming language R. For the random forest regression models and the multilayer perceptrons the programming language Python was used. For the random forest regression models the package scikit-learn was used. The package Keras was used for the multilayer perceptrons [3].

Table of contents:

1. Introduction 
1.1. SMHI
1.2. Background
1.3. Previous work
1.4. Objective
2. Data 
2.1. Data description
2.2. Data preprocessing
3. Methods 
3.1. Sampling methods
3.2. Random forest regression
3.3. Recursive feature elimination with random forest regression
3.4. The multilayer perceptron
3.4.1. Mini-batch stochastic gradient descent
3.5. Performance evaluation metrics
3.6. Technical aspects
4. Results 
4.1. Recursive feature elimination with random forest regression
4.2. Random forest regression
4.2.1. Results using simple random sampling
4.2.2. Results using stratified random sampling
4.2.3. Results using simple random sampling on filtered data
4.2.4. Results using stratified random sampling on filtered data
4.3. The multilayer perceptron
4.3.1. Results using simple random sampling
4.3.2. Results using stratified random sampling
4.3.3. Results using simple random sampling with a log transformed response
4.3.4. Results using stratified random sampling with a log transformed response
4.3.5. Results using simple random sampling on filtered data with a log transformed response
4.3.6. Results using stratified random sampling on filtered data with a log transformed response
4.4. Comparison of the multilayer perceptron and random forest model
5. Discussion 
6. Conclusions 
Bibliography 
A. Appendix
