Sample 2: Dataset Processing and Analysis: Auto Imports Regression Dataset

February 2, 2015


This sample demonstrates how to use the Metadata Editor, Clean Missing Data, and Project Columns modules for basic data processing, and how to compute basic statistics using the Descriptive Statistics, Probability Function Evaluation, and Linear Correlation modules.
##Data Processing and Analysis##

This sample demonstrates how to use some of the basic data processing modules (**Metadata Editor**, **Clean Missing Data**, **Project Columns**) as well as modules for computing basic statistics on a model or dataset (**Descriptive Statistics**, **Probability Function Evaluation**, and **Linear Correlation**).

##Data##

In this sample, we read the Auto Imports dataset from its source on the public UCI repository, and then clean the data in preparation for modeling. Problems in the downloaded data include missing column headers and a number of missing data entries.

The following diagram shows the entire experiment.

##Data Processing##

Because no header information was provided with the input data file, the columns are assigned the default names Col1, Col2, ..., Col26 when you read the data. Upon selecting Col2, we observe that it has 51 unique values and that 41 rows have missing values. We can also generate a histogram of the data.

![][image1]

Our first task is to use the **Metadata Editor** module to give meaningful names to some of the important columns. This module can change the name, data type, and status of one or multiple columns. You can also specify whether a column is categorical, or indicate whether the column should be used as a feature, label, or score.

In the first instance of **Metadata Editor**, we changed the name of `Col1` to `symboling`, changed the data type to `Integer`, and made it a categorical variable. In the second instance of **Metadata Editor**, we made similar fixes to `Col2` and `Col26`, changing the column names to `normalized-losses` and `price` respectively. We made both columns noncategorical and changed their data type to `Integer`.

![][image2] ![][image3] ![][image4]

Next, the experiment splits into three branches to demonstrate three different methods for handling missing data using the **Clean Missing Data** module.
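
As a rough illustration of what replacing missing values involves, the following Python sketch applies two simple strategies to a single column (a fixed substitution value and the column median) and also builds a true/false indicator of which entries were missing. It is not the implementation of the **Clean Missing Data** module, and it does not cover Probabilistic PCA. The function names, the use of `None` as the missing marker, and the sample numbers are all assumptions made for this example.

```python
from statistics import median

MISSING = None  # assumed marker for a missing entry in this sketch


def fill_constant(values, fill=0):
    """Replace every missing entry with a fixed, user-chosen value."""
    return [fill if v is MISSING else v for v in values]


def fill_median(values):
    """Replace every missing entry with the median of the observed entries."""
    observed = [v for v in values if v is not MISSING]
    if not observed:
        raise ValueError("column has no observed values to take a median from")
    mid = median(observed)
    return [mid if v is MISSING else v for v in values]


def missing_indicator(values):
    """Return True where the original entry was missing, False elsewhere."""
    return [v is MISSING for v in values]


# Made-up numbers for illustration, not rows from the Auto Imports dataset.
column = [None, 164, 158, None, 192]

print(fill_constant(column))      # [0, 164, 158, 0, 192]
print(fill_median(column))        # [164, 164, 158, 164, 192]
print(missing_indicator(column))  # [True, False, False, True, False]
```

Note that the median is computed from the observed entries only, so the substituted value does not depend on how many entries are missing.
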
In the first branch, the missing values are replaced with a `Custom substitution value` (a fixed user-defined value), which is zero (0) in this case.

![][image5] ![][image6]

In the second branch, the missing values are replaced with the median value for the column. For the **`normalized-losses`** column, the median value is 115.

![][image7] ![][image8]

In the third branch, the missing values are replaced using the **Probabilistic PCA** option, which computes different values for different rows based on the other feature values in the same row and on overall data statistics. After applying Probabilistic PCA, the first three rows of the `normalized-losses` column have the values 149, 146, and 135 respectively. These values are calculated from the values of the other columns in each row.

![][image9] ![][image10]

When we used the **Probabilistic PCA** option, we also selected the **Generate missing value indicator column** option. Creating indicator columns gives us six new columns in the output of the module, increasing the number of columns in the dataset from 26 to 32. Each new column is named after the column from which it is derived: for example, `normalized-losses_IsMissing`, `Col19_IsMissing`, and so on. The new columns contain true/false values that indicate whether a value was missing for that row in the corresponding column.

![][image11]

##Data Analysis##

We added an instance of the **Descriptive Statistics** module in each branch to produce some basic statistics.

![][image12]

Next, we used the **Probability Function Evaluation** module to compute some more advanced statistics. The module can evaluate any one of 29 different distributions. To specify which statistic to return, set the following parameters:

**Kind of distribution:** the distribution to evaluate.

**Distribution parameters:** For example, if you select the **Normal** distribution, the module displays the **Mean** and **Standard deviation** parameters.
If you select the **TStudent** distribution, you get a different set of options: you can specify the **Number of degrees of freedom** and choose the method for evaluating the probability function.

**Method for evaluating probability function:** This module supports multiple methods for computing any of the selected functions: _PDF_ (probability density or mass function), _CDF_ (cumulative distribution function), or _inverseCDF_ (inverse cumulative distribution function).

When using this module, any columns that you select as input must be numeric, and the range of the data must be valid for the selected probability function; otherwise an error or NaN result might occur. For a sparse column, elements that correspond to background zeros are not processed.

### Understanding the Three Methods

***Pdf***: The probability density function describes the relative likelihood of each value that a variable can take. For a discrete variable, the return value of Pdf is a list containing each value that the variable can have and its associated probability.

***Cdf***: The cumulative distribution function gives the cumulative probability associated with a distribution. Specifically, it gives the area under the probability density function up to the value you specify. Cdf can be used to determine the probability of a response being lower than a certain value, higher than a certain value, or between two values. We can also use the Cdf to calculate p-values: for an upper-tailed test, the p-value is 1 - Cdf evaluated at the observed test statistic.

***Inverse Cdf***: The inverse cumulative distribution function gives the value associated with a specific cumulative probability. Use the inverse Cdf to determine the value of the response associated with a specific probability.

Finally, we added an instance of the **Linear Correlation** module to compute the Pearson correlation between numerical features in the dataset. To ensure that only numerical features are used as input to the module, we used **Project Columns**.
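
The statistics discussed in this section can be approximated outside the Studio with Python's standard library. The sketch below is only an illustration, not the implementation of the **Probability Function Evaluation** or **Linear Correlation** modules. It evaluates the Pdf, Cdf, and inverse Cdf of a Normal distribution using `statistics.NormalDist`, and computes a Pearson correlation with a small hand-written `pearson` helper. The helper name, the distribution parameters, and the sample numbers are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist, fmean

# Assumed parameters: a standard Normal distribution (mean 0, standard deviation 1).
dist = NormalDist(mu=0.0, sigma=1.0)

x = 1.96
density = dist.pdf(x)                     # Pdf: height of the density curve at x
lower_tail = dist.cdf(x)                  # Cdf: P(X <= x), area under the density up to x
upper_tail = 1.0 - lower_tail             # P(X > x), the upper-tailed p-value
between = dist.cdf(1.0) - dist.cdf(-1.0)  # P(-1 <= X <= 1)
recovered = dist.inv_cdf(lower_tail)      # inverse Cdf maps the probability back to x

print(round(lower_tail, 3))  # 0.975
print(round(upper_tail, 3))  # 0.025
print(round(between, 3))     # 0.683
print(round(recovered, 2))   # 1.96


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up numeric features, not values from the Auto Imports dataset.
feature_a = [97, 109, 130, 152, 181]
feature_b = [7000, 9500, 13000, 16500, 21000]
r = pearson(feature_a, feature_b)
print(round(r, 3))
```

The helper raises an error for constant columns, where the correlation is undefined, and it accepts only numeric sequences.
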
![][image13]

The following is the result of the left-most **Linear Correlation** module.

<!-- Images -->
[image1]:
[image2]:
[image3]:
[image4]:
[image5]:
[image6]:
[image7]:
[image8]:
[image9]:
[image10]:
[image11]:
[image12]:
[image13]: