Tutorial: Creating a random forest regression model in R and using it for scoring

February 24, 2015

Azure ML Studio recently added a feature that allows users to create a model using any R package and use it for scoring. This experiment serves as a tutorial on creating and using an R model within Azure ML Studio. For this tutorial, we use the Bike Sharing dataset and build a random forest regression model.
This dataset can be retrieved from the [Kaggle](http://www.kaggle.com/c/bike-sharing-demand/data) website, specifically their “train” data. The “train” Bike Sharing data has 10,886 observations, each one pertaining to a specific hour from the first 19 days of each month from 2011 to 2012. The dataset consists of 12 columns that record date-time, weather, and bike rental information.

# Model

We begin with feature engineering and preprocessing. It is highly recommended that you read the [in-depth tutorial](http://datasciencedojo.com/build-a-random-forest-in-azure-ml-using-r/) to understand the rationale behind each step:

* Convert the “datetime” column to string type using the **metadata editor**. This will allow for feature engineering in R using the **Execute R Script** module.
* Break down the datetime column into weekday, month, and year columns using the **Execute R Script** module. Also, remove the single observation where weather = 4. This observation would throw an error if it ended up in the test split, because the model expects categorical variables in the test split to have the same levels as in the training split.
* Identify categorical attributes and cast them into categorical features using the **metadata editor**. The following attributes were cast into categorical values:
    * Weekday, hour, month, year, holiday, workingday, weather
* Remove the extraneous columns:
    * Registered, casual, datetime
* Tell Azure ML what it is trying to predict by casting the response class into a label using the **metadata editor**.
* Randomly split and partition the data into 70% training and 30% scoring using the **split** module.

# Creating an R Model

The Create R Model module can be used in place of Azure ML’s native models. It requires an R script for both training and scoring the model. This module’s strength is that it makes the flexibility and features of R libraries available inside Azure ML.
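
The preprocessing steps and the trainer and scorer scripts described above can be sketched locally as follows. This is a minimal illustration and not the experiment's actual scripts: the data frame `bikes` is synthetic and only mimics a few of the Kaggle columns, and the names `train_idx` and `predicted_count` are made up for the example. It assumes the randomForest package is installed and that, inside the Create R Model module, the trainer script receives its input as `dataset` and defines `model`, while the scorer script receives `model` and `dataset` and defines a data frame named `scores`.

```r
library(randomForest)

set.seed(42)
n <- 500

# Hypothetical stand-in for the Kaggle "train" data (synthetic values,
# only a subset of the real columns).
bikes <- data.frame(
  datetime = format(
    as.POSIXct("2011-01-01 00:00:00", tz = "UTC") + 3600 * sample(0:10000, n),
    "%Y-%m-%d %H:%M:%S"
  ),
  weather = sample(1:4, n, replace = TRUE, prob = c(0.6, 0.3, 0.09, 0.01)),
  temp    = runif(n, 0, 40),
  count   = rpois(n, 100),
  stringsAsFactors = FALSE
)

# Feature engineering: break the datetime string into separate columns.
dt <- as.POSIXct(bikes$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
bikes$hour    <- as.integer(format(dt, "%H"))
bikes$weekday <- as.POSIXlt(dt)$wday          # 0 = Sunday ... 6 = Saturday
bikes$month   <- as.integer(format(dt, "%m"))
bikes$year    <- as.integer(format(dt, "%Y"))

# Drop the rare weather = 4 rows and the now-redundant datetime column.
bikes <- bikes[bikes$weather != 4, ]
bikes$datetime <- NULL

# Cast categorical attributes to factors before splitting, so that the
# training and scoring splits share the same factor levels.
for (col in c("hour", "weekday", "month", "year", "weather")) {
  bikes[[col]] <- factor(bikes[[col]])
}

# 70% training / 30% scoring split.
train_idx <- sample(nrow(bikes), size = round(0.7 * nrow(bikes)))

# --- Trainer script: `dataset` in, `model` out ---
dataset <- bikes[train_idx, ]
model <- randomForest(count ~ ., data = dataset, ntree = 100)

# --- Scorer script: `model` and `dataset` in, `scores` out ---
dataset <- bikes[-train_idx, ]
scores <- data.frame(predicted_count = predict(model, newdata = dataset))
```

Casting to factors before the split is what keeps the levels consistent between the two splits, which is the same concern that motivates removing the weather = 4 observation.
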
We used the R library randomForest to create our model.

# Results

The Evaluate Model module does not yet support R models. Any R script can be used to evaluate the results; however, we look forward to the convenience of Azure ML when it starts supporting R models. We imported the results into an **Execute R Script** module to calculate the mean squared error (MSE).

# Related

1. [In-Depth Tutorial: Creating a random forest regression model in R and using it for scoring](http://datasciencedojo.com/build-a-random-forest-in-azure-ml-using-r/)
2. [Tutorial: Obtaining feature importance using variable importance plots](https://gallery.azureml.net/Details/964dfc4151e24511aa5f78159cab0485)
3. [Tutorial: Building a classification model in Azure ML](https://gallery.azureml.net/Details/01b2765fa75147ce99679e18482d280f)
4. [Demo: Predicting Bicycle Demand](http://demos.datasciencedojo.com/demo/titanic/)
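
For the MSE calculation mentioned under Results, a minimal sketch in plain R might look like the following. The `results` data frame and its values are invented for illustration, and the column name `predicted_count` is an assumption. In an **Execute R Script** module the scored data would come from the module's input port instead of being typed in.

```r
# Hypothetical scored results: actual counts next to model predictions.
results <- data.frame(
  count           = c(16, 40, 32, 13, 1),
  predicted_count = c(20, 35, 30, 15, 3)
)

# Mean squared error: the average of the squared prediction errors.
errors <- results$count - results$predicted_count
mse <- mean(errors^2)

# Root mean squared error, expressed in the same units as the label.
rmse <- sqrt(mse)

print(data.frame(mse = mse, rmse = rmse))
```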