R Random Forest Classification


Bagging regression trees is a technique that can turn a single tree model with high variance and poor predictive power into a fairly accurate prediction function. Unfortunately, bagging regression trees typically suffers from tree correlation, which reduces the overall performance of the model. Random forests are a modification of bagging that builds a large collection of de-correlated trees, and they have become a very popular "out-of-the-box" learning algorithm that enjoys good predictive performance. This tutorial will cover the fundamentals of random forests.


tl;dr: This tutorial serves as an introduction to random forests and will cover the following material:

  1. Replication requirements: what you'll need to reproduce the analysis in this tutorial.
  2. The idea: a quick overview of how random forests work.
  3. Basic implementation: implementing random forests in R.
  4. Tuning: understanding the hyperparameters we can tune and performing grid search with ranger & h2o.
  5. Predicting: applying your final model to a new data set to make predictions.
  6. Learning more: where you can learn more.

Replication requirements

This tutorial leverages the following packages. Some of these packages play a supporting role; however, we demonstrate how to implement random forests with several different packages and discuss the pros and cons of each. The analysis uses the Ames housing data, split into training (70%) and test (30%) sets; only the opening of the splitting code survives:

  # Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data
  # Use set.seed for reproducibility
  set.seed(123)
  ames_split <- ...   # (call truncated)
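The package list itself is not shown above and the split call is truncated. Judging from the functions used later in the tutorial, a plausible set of packages, together with a sketch of the 70/30 split (the rsample functions and the prop value are assumptions, not the author's original code), is:

  # Packages inferred from the functions used later in this tutorial
  library(rsample)        # initial_split(), training(), testing()
  library(randomForest)   # basic random forest implementation
  library(ranger)         # a faster implementation
  library(h2o)            # a distributed implementation
  library(dplyr)          # data wrangling
  library(tidyr)          # gather()
  library(broom)          # tidy()
  library(ggplot2)        # plotting
  library(scales)         # dollar-formatted axis labels

  # Sketch of the 70/30 split described in the comments above
  set.seed(123)
  ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
  ames_train <- training(ames_split)
  ames_test  <- testing(ames_split)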

The idea

Random forests can be used to solve regression and classification problems. In regression problems the dependent variable is continuous; in classification problems it is categorical. Random forests (or random decision forests) are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and, for classification, outputting the class that is the mode of the individual trees' predicted classes. The random forest algorithm is a supervised learning algorithm; as the name suggests, it builds a forest of trees and makes it random. The algorithm proceeds as follows:

  1.  Given training data set
  2.  Select number of trees to build (ntrees)
  3.  for i = 1 to ntrees do
  4.    Generate a bootstrap sample of the original data
  5.    Grow a regression tree to the bootstrapped data
  6.    for each split do
  7.      Select m variables at random from all p variables
  8.      Pick the best variable / split-point among the m
  9.      Split the node into two child nodes
  10.   end
  11.   Use typical tree model stopping criteria to determine when a tree is complete (but do not prune)
  12. end

Since the algorithm randomly selects a bootstrap sample to train on and the predictors to use at each split, tree correlation is lessened relative to bagged trees.
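To make the steps concrete, here is a minimal from-scratch sketch of the algorithm above. It is an illustration only, not the randomForest implementation: rpart grows the individual trees, and for brevity the m candidate variables are drawn once per tree rather than at every split as step 7 specifies.

  library(rpart)

  simple_rf <- function(formula, data, ntrees = 100, m = NULL) {
    response   <- all.vars(formula)[1]
    predictors <- setdiff(names(data), response)
    if (is.null(m)) m <- max(1, floor(length(predictors) / 3))

    trees <- vector("list", ntrees)
    for (i in seq_len(ntrees)) {
      # Step 4: bootstrap sample of the original data
      boot <- data[sample(nrow(data), replace = TRUE), ]
      # Step 7 (simplified): sample m candidate variables for this tree
      vars <- sample(predictors, m)
      # Steps 5 and 11: grow a deep, unpruned regression tree on the bootstrap sample
      trees[[i]] <- rpart(
        reformulate(vars, response),
        data    = boot,
        control = rpart.control(cp = 0, minsplit = 5, xval = 0)
      )
    }
    structure(list(trees = trees), class = "simple_rf")
  }

  # Predictions are averaged across all trees (regression)
  predict.simple_rf <- function(object, newdata, ...) {
    rowMeans(sapply(object$trees, predict, newdata = newdata))
  }

For example, simple_rf(Sale_Price ~ ., ames_train, ntrees = 50) would grow 50 such trees on the Ames training data sketched earlier.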

OOB error vs. test set error

Similar to bagging, a natural benefit of the bootstrap resampling process is that random forests have an out-of-bag (OOB) sample that provides an efficient and reasonable approximation of the test error.

This provides a built-in validation set without any extra work on your part, and you do not need to sacrifice any of your training data for validation. It also makes it more efficient to identify the number of trees required to stabilize the error rate during tuning; however, as illustrated below, some differences between the OOB error and the test error are expected.

Figure: Random forest out-of-bag error versus validation error.

Furthermore, many packages do not keep track of which observations were part of the OOB sample for a given tree and which were not. If you are comparing multiple models to one another, you'd want to score each on the same validation set to compare performance. Also, although it is technically possible to compute certain metrics, such as root mean squared logarithmic error (RMSLE), on the OOB sample, this is not built in to all packages. So if you are looking to compare multiple models or use a slightly less traditional loss function, you will likely still want to perform cross-validation.
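As a concrete illustration of the built-in OOB estimate, a minimal sketch, assuming a regression forest named m1 fit with the randomForest package as in the basic implementation section below:

  # OOB MSE, cumulative by number of trees, is stored on the fitted object
  oob_rmse <- sqrt(m1$mse)
  min(oob_rmse)        # lowest OOB RMSE achieved
  which.min(m1$mse)    # number of trees at which it occurs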

Advantages & Disadvantages

Advantages:

- Typically have very good performance.
- Remarkably good "out-of-the-box" - very little tuning required.
- Built-in validation set - don't need to sacrifice data for extra validation.


- No pre-processing required.
- Robust to outliers.

Disadvantages:

- Can become slow on large data sets.
- Although accurate, often cannot compete with advanced boosting algorithms.
- Less interpretable.

Basic implementation

There are over 20 random forest packages in R. To demonstrate the basic implementation we illustrate the use of the randomForest package, the oldest and most well-known implementation of the random forest algorithm in R. However, as your data set grows in size, randomForest does not scale well (although you can parallelize with foreach).

Moreover, to explore and compare a variety of tuning parameters there are more effective packages. Consequently, in this section we also illustrate how to use the ranger and h2o packages for more efficient random forest modeling.

randomForest::randomForest can use the formula or separate x, y matrix notation to specify the model. Below we apply the default randomForest model using the formulaic specification. The default random forest grows 500 trees and randomly selects floor(p / 3) predictor variables as split candidates at each split (for regression).


Averaging across all 500 trees provides an OOB error estimate for the default model. To see how the OOB error compares with a true validation error, we split the training data again into training and validation sets and plot both error curves against the number of trees; only fragments of the code survive:

  # for reproducibility
  set.seed(123)

  # default RF model
  m1 <- ...            # (call truncated)

  # create training and validation data
  set.seed(123)
  valid_split <- ...   # (call truncated)

  ... %>%
    gather(Metric, RMSE, -ntrees) %>%
    ggplot(aes(ntrees, RMSE, color = Metric)) +
    geom_line() +
    scale_y_continuous(labels = scales::dollar) +
    xlab("Number of trees")
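A sketch of what the truncated calls above likely looked like, assuming the response is Sale_Price and the objects from the replication sketch are in scope:

  library(randomForest)
  library(rsample)

  # default RF model: 500 trees, mtry = floor(p / 3) for regression
  set.seed(123)
  m1 <- randomForest(formula = Sale_Price ~ ., data = ames_train)

  # create training and validation data from ames_train (the .8 proportion is an assumption)
  set.seed(123)
  valid_split   <- initial_split(ames_train, prop = .8)
  ames_train_v2 <- training(valid_split)
  ames_valid    <- testing(valid_split)

  x_test <- ames_valid[setdiff(names(ames_valid), "Sale_Price")]
  y_test <- ames_valid$Sale_Price

  # xtest/ytest make randomForest track test-set error alongside the OOB error
  rf_oob_comp <- randomForest(
    formula = Sale_Price ~ .,
    data    = ames_train_v2,
    xtest   = x_test,
    ytest   = y_test
  )

  # RMSE by number of trees for both error estimates
  oob_vs_test <- tibble::tibble(
    `Out of Bag Error` = sqrt(rf_oob_comp$mse),
    `Test error`       = sqrt(rf_oob_comp$test$mse),
    ntrees             = 1:rf_oob_comp$ntree
  )

Piping oob_vs_test into the surviving gather()/ggplot() chain shown above reproduces the comparison plot.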

Random forests are one of the best "out-of-the-box" machine learning algorithms. They typically perform remarkably well with very little tuning required. For example, as we saw above, we were able to get an RMSE of less than $30K without any tuning, improving on two fully-tuned models from earlier tutorials by over $6K and $4K, respectively. However, we can still seek improvement by tuning our random forest model.

Tuning

Random forests are fairly easy to tune since there are only a handful of tuning parameters. Typically, the primary concern when starting out is tuning the number of candidate variables to select from at each split. However, there are a few additional hyperparameters that we should be aware of. Although the argument names may differ across packages, these hyperparameters should be present:

- ntree: the number of trees. We want enough trees to stabilize the error, but using too many trees is unnecessarily inefficient, especially when working with large data sets.
- mtry: the number of variables to randomly sample as candidates at each split. When mtry = p the model equates to bagging; when mtry = 1 the split variable is completely random, so all variables get a chance but this can lead to overly biased results. A common suggestion is to start with 5 values evenly spaced across the range from 2 to p.
- sampsize: the number of samples to train on. The default value is 63.25% of the training set, since this is the expected proportion of unique observations in a bootstrap sample. Lower sample sizes can reduce training time but may introduce more bias than necessary; increasing the sample size can increase performance, but at the risk of overfitting because it introduces more variance. Typically, when tuning this parameter we stay in the 60-80% range.
- nodesize: the minimum number of samples within the terminal nodes, which controls the complexity of the trees. A smaller node size allows deeper, more complex trees, while a larger node size results in shallower trees. This is another bias-variance tradeoff: deeper trees introduce more variance (risk of overfitting) and shallower trees introduce more bias (risk of not fully capturing unique patterns and relationships in the data).
- maxnodes: the maximum number of terminal nodes. Another way to control the complexity of the trees: more nodes equates to deeper, more complex trees, while fewer nodes result in shallower trees.
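For reference, a sketch of how these hyperparameters map onto randomForest() arguments; the specific values are illustrative, not recommendations, and Sale_Price/ames_train are the assumed objects from earlier sketches:

  rf_manual <- randomForest(
    formula  = Sale_Price ~ .,
    data     = ames_train,
    ntree    = 500,                              # number of trees
    mtry     = 24,                               # candidate variables per split
    sampsize = floor(0.7 * nrow(ames_train)),    # samples used to grow each tree
    nodesize = 5,                                # minimum size of terminal nodes
    maxnodes = NULL                              # optional cap on terminal nodes (NULL = no cap)
  )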

Initial tuning with randomForest

If we are interested in just starting out and tuning the mtry parameter, we can use randomForest::tuneRF for a quick and easy tuning assessment. tuneRF will start at a value of mtry that you supply and increase by a certain step factor until the OOB error stops improving by a specified amount. For example, the code below starts with mtry = 5 and increases by a factor of 1.5 until the OOB error stops improving by 1%. Note that tuneRF requires a separate x, y specification. We see that the optimal mtry value in this sequence is very close to the default mtry value (floor(p / 3) for regression); only the opening comment of the code survives:

  # names of features
  features <- ...   # (call truncated)
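A sketch of the tuneRF call described above; the feature setup and the Sale_Price response are assumptions carried over from the earlier sketches:

  # names of features
  features <- setdiff(names(ames_train), "Sale_Price")

  set.seed(123)
  m2 <- tuneRF(
    x          = ames_train[features],
    y          = ames_train$Sale_Price,
    ntreeTry   = 500,
    mtryStart  = 5,       # starting value of mtry
    stepFactor = 1.5,     # multiply mtry by this factor at each step
    improve    = 0.01,    # stop when the OOB error improves by less than 1%
    trace      = FALSE    # suppress real-time progress
  )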

To compare speed and to run a larger grid search we switch to ranger; only fragments of the speed comparison and of the grid-search setup survive:

  # randomForest speed
  system.time(
    ames_randomForest <- ...   # (call truncated)
  )

  # hyperparameter grid search
  hyper_grid <- ...            # (call truncated)

Once the optimal combination is identified (see the grid-search sketch below), we can refit the model and plot the 25 most important variables:

  optimal_ranger$variable.importance %>%
    tidy() %>%
    dplyr::arrange(desc(x)) %>%
    dplyr::top_n(25) %>%
    ggplot(aes(reorder(names, x), x)) +
    geom_col() +
    coord_flip() +
    ggtitle("Top 25 important variables")
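A sketch of the manual ranger grid search the fragments above refer to; the grid values, the Sale_Price response, and the refit settings are illustrative assumptions rather than the author's exact code:

  library(ranger)
  library(dplyr)

  # hyperparameter grid (values are illustrative)
  hyper_grid <- expand.grid(
    mtry        = seq(20, 30, by = 2),
    node_size   = seq(3, 9, by = 2),
    sample_size = c(.55, .632, .70, .80),
    OOB_RMSE    = 0
  )

  for (i in seq_len(nrow(hyper_grid))) {
    model <- ranger(
      formula         = Sale_Price ~ .,
      data            = ames_train,
      num.trees       = 500,
      mtry            = hyper_grid$mtry[i],
      min.node.size   = hyper_grid$node_size[i],
      sample.fraction = hyper_grid$sample_size[i],
      seed            = 123
    )
    hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)   # OOB MSE -> RMSE
  }

  # best combination by OOB RMSE
  best <- hyper_grid %>% arrange(OOB_RMSE) %>% slice(1)

  # refit with the best combination, keeping impurity-based variable importance
  # for the plot above
  optimal_ranger <- ranger(
    formula         = Sale_Price ~ .,
    data            = ames_train,
    num.trees       = 500,
    mtry            = best$mtry,
    min.node.size   = best$node_size,
    sample.fraction = best$sample_size,
    importance      = "impurity",
    seed            = 123
  )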

Full grid search with H2O

If you ran the grid search code above, you probably noticed it took a while to run. Although ranger is computationally efficient, as the grid-search space expands the manual for-loop process becomes less efficient. h2o is a powerful and efficient Java-based interface that provides parallel distributed algorithms. Moreover, h2o allows for different optimal search paths in our grid search, which lets us be more efficient in tuning our models. Here, I demonstrate how to tune a random forest model with h2o.


Let's go ahead and start up h2o:

  # start up h2o (I turn off progress bars when creating reports/tutorials)
  h2o.no_progress()
  h2o.init(max_mem_size = "5g")
  ## Connection successful!

Only fragments of the grid-search code itself survive:

  # create feature names
  y <- ...                # (truncated)

  # hyperparameter grid
  hyper_grid.h2o <- ...   # (truncated)

  # Grab the model_id for the top model, chosen by validation error
  best_model_id <- ...    # (truncated)

  ... %>% sqrt()
  ## [1] 23104.67

So the top model identified by the grid search achieves a validation RMSE of roughly $23,105; a sketch of the full workflow follows.
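A sketch of the h2o grid-search workflow the fragments above describe; the grid values, the validation split, the Sale_Price response, and the Cartesian search strategy are illustrative assumptions:

  library(h2o)

  # training data as an h2o object, with a validation split for model selection
  train_h2o   <- as.h2o(ames_train)
  splits      <- h2o.splitFrame(train_h2o, ratios = 0.8, seed = 123)
  train_frame <- splits[[1]]
  valid_frame <- splits[[2]]

  # create feature names
  y <- "Sale_Price"
  x <- setdiff(names(ames_train), y)

  # hyperparameter grid (values are illustrative)
  hyper_grid.h2o <- list(
    ntrees      = seq(200, 500, by = 150),
    mtries      = seq(15, 30, by = 5),
    sample_rate = c(.55, .632, .75, .80)
  )

  # full cartesian grid search
  grid <- h2o.grid(
    algorithm        = "randomForest",
    grid_id          = "rf_grid",
    x                = x,
    y                = y,
    training_frame   = train_frame,
    validation_frame = valid_frame,
    hyper_params     = hyper_grid.h2o,
    search_criteria  = list(strategy = "Cartesian")
  )

  # collect the results, sorted by MSE
  grid_perf <- h2o.getGrid(grid_id = "rf_grid", sort_by = "mse", decreasing = FALSE)

  # Grab the model_id for the top model, chosen by validation error
  best_model_id <- grid_perf@model_ids[[1]]
  best_model    <- h2o.getModel(best_model_id)

  # validation RMSE of the best model
  sqrt(h2o.mse(best_model, valid = TRUE))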

Predicting

Once we've identified our preferred model, we can use the traditional predict function to make predictions on a new data set. We can use this for all our model types (randomForest, ranger, and h2o), although the outputs differ slightly. Also, note that the new data for the h2o model needs to be an h2o object.

  # randomForest
  pred_randomForest <- ...   # (call truncated)
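A sketch of predictions with each model type; the object names (m1, optimal_ranger, best_model) are the assumed objects from the sketches above:

  # randomForest: returns a numeric vector
  pred_randomForest <- predict(m1, ames_test)
  head(pred_randomForest)

  # ranger: returns an object; the predictions live in $predictions
  pred_ranger <- predict(optimal_ranger, ames_test)
  head(pred_ranger$predictions)

  # h2o: new data must first be converted to an H2OFrame
  pred_h2o <- predict(best_model, as.h2o(ames_test))
  head(pred_h2o)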

The following recipes illustrate related ensemble methods on the iris dataset; each ends by cross-tabulating the predictions against the true species:

  table(predictions, iris$Species)

Learn more about the PART function.

Bagging CART

Bootstrapped Aggregation (Bagging) is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset. The predictions from each separate model are combined to provide a superior result. This approach has been shown to be particularly effective for high-variance methods such as decision trees. The following recipe demonstrates bagging applied to the recursive partitioning decision tree for the iris dataset.

  table(predictions, iris$Species)

Learn more about the randomForest function.
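The code for the bagged-CART recipe described above did not survive; a minimal sketch using the ipred package (one plausible choice, not necessarily the original author's) is:

  # Bagged CART on iris
  library(ipred)

  set.seed(7)
  fit <- bagging(Species ~ ., data = iris, nbagg = 25)

  # summarize accuracy on the training data
  predictions <- predict(fit, iris)
  table(predictions, iris$Species)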


Gradient Boosted Machine

Boosting is an ensemble method, developed for classification, that reduces bias by adding models that learn the misclassification errors of existing models. It has been generalized and adapted in the form of Gradient Boosted Machines (GBM) for use with CART decision trees for classification and regression. The following recipe demonstrates the Gradient Boosted Machines (GBM) method on the iris dataset.
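As with the previous recipe, the GBM code did not survive; a minimal sketch using caret's interface to gbm (an assumption about the package used) is:

  # GBM on iris via caret
  library(caret)

  set.seed(7)
  fit <- train(Species ~ ., data = iris, method = "gbm", verbose = FALSE)

  # summarize accuracy on the training data
  predictions <- predict(fit, iris)
  table(predictions, iris$Species)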