Sample 4: Cross Validation for Regression: Auto Imports Dataset

December 16, 2014
This experiment demonstrates the use of cross validation in regression.
# Cross Validation: Regression

This experiment demonstrates how to use **Cross Validate Model** with regression models. We used the Auto Imports dataset and trained multiple regression models using cross validation to predict the price of a car based on its specifications.

## Data

The dataset contains 25 features and one label column. The features are of multiple types, including numerical and categorical. The following diagram shows an excerpt of the data:

![][image_dataset]

## Creating the Experiment

The following diagram shows the overall workflow of the experiment:

![][image_experiment]

### Missing Data Handling

First, we added the dataset to the experiment and used the **Clean Missing Data** module to replace all missing values with zeros, as shown here:

![][image_missing]

### Cross Validation

After the data had been cleaned, we divided the experiment into three branches, each using a different regression algorithm and **Cross Validate Model** to train a regressor.

Cross validation is useful for reducing the bias that can result from training a model on a single training set. Instead of dividing the data into just two sets, one for training and one for testing, cross validation partitions the entire dataset into multiple _folds_. For each fold, a model is trained on the remaining data and then tested against the held-out fold, and error measurements are reported for each model. This information reveals how sensitive the model is to the choice of training set, and gives a better indication of the model's ability to generalize to new data.

**Cross Validate Model** takes two inputs: a machine learning model and a dataset. For this regression problem, we chose three different regression methods: **Linear Regression** with the online gradient descent option, **Boosted Decision Tree Regression**, and **Poisson Regression**. By default, **Cross Validate Model** uses ten-fold cross validation.
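To make the fold-by-fold procedure concrete, here is a minimal sketch of k-fold cross validation in plain Python. It is an illustration only, not the implementation behind **Cross Validate Model**: the `k_fold_indices` and `cross_validate` helpers are hypothetical names, and a simple one-dimensional least-squares line stands in for the real regression learners.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 with the given seed and split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k=10, seed=0):
    """For each fold, fit y = a*x + b by least squares on the other folds
    and report the mean absolute error on the held-out fold."""
    folds = k_fold_indices(len(xs), k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [m for j, fold in enumerate(folds) if j != i for m in fold]
        tx = [xs[j] for j in train_idx]
        ty = [ys[j] for j in train_idx]
        mx, my = sum(tx) / len(tx), sum(ty) / len(ty)
        var = sum((x - mx) ** 2 for x in tx) or 1.0  # guard the degenerate case
        a = sum((x - mx) * (y - my) for x, y in zip(tx, ty)) / var
        b = my - a * mx
        mae = sum(abs(ys[j] - (a * xs[j] + b)) for j in test_idx) / len(test_idx)
        errors.append(mae)
    return errors
```

The key point the sketch captures is that each fold is held out exactly once, so every observation contributes to one test error, and the spread of the per-fold errors indicates how sensitive the model is to the training set.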
If you want to specify a different number of folds, you can use the **Partition and Sample** module and select the **Assign to Folds** option. Use the option **Specify number of folds to split evenly into** to set the number of folds. The following diagram shows where to find these settings:

![][image_partition]

Because **Cross Validate Model** trains a model, you must specify the response variable. In this experiment, we selected the column `price` and left **Random seed** at its default value of 0, to randomize the distribution of instances into the folds.

![][image_parameters]

For this experiment, we used the default ten-fold cross validation for **Linear Regression** and **Boosted Decision Tree Regression**, and five-fold cross validation for **Poisson Regression**.

### Results

**Cross Validate Model** has two outputs: the first contains the scored results for the training data; the second provides a set of evaluation metrics for each of the folds.

The following figure shows the output of the first port of **Cross Validate Model** used with **Linear Regression**. The "Scored Labels" column shows the predicted value.

![][image_output1]

The following figure shows the second output, containing performance data for each fold. Because a regression model is being evaluated, the metrics used are mean absolute error, root mean squared error, relative squared error, and coefficient of determination.

![][image_output2]

Based on the cross validation results, you can tune the model parameters or decide which model to use in the scoring experiment.

<!-- Images -->
[image_dataset]:
[image_missing]:
[image_partition]:
[image_experiment]:
[image_output1]:
[image_output2]:
[image_parameters]:
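As a closing note, the four per-fold metrics named above have standard definitions that are easy to compute by hand. The following sketch in plain Python (a hypothetical `regression_metrics` helper, not part of the experiment) shows how each one is derived from the actual and predicted values:

```python
import math

def regression_metrics(actual, predicted):
    """Compute the four per-fold regression metrics: mean absolute error,
    root mean squared error, relative squared error, and the coefficient
    of determination (R^2)."""
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # squared error
    sst = sum((a - mean_y) ** 2 for a in actual)                # total variance
    return {
        "mean_absolute_error": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "root_mean_squared_error": math.sqrt(sse / n),
        "relative_squared_error": sse / sst,
        "coefficient_of_determination": 1 - sse / sst,
    }
```

Note that relative squared error and the coefficient of determination are complementary: a model no better than predicting the mean has a relative squared error of 1 and an R² of 0.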