Tutorial: Using R package ggplot2 in Azure ML. Histograms, density plots and violin plots

February 26, 2015


The R library ggplot2 allows you to create more colorful and complex graphs with far less code than base R graphics. It is a simple way to produce professional-looking graphs for your Azure ML experiments. This tutorial is an introduction to using the R graphing library ggplot2 inside Azure ML.

# Data
The diamonds dataset is an example dataset packaged with the R library ggplot2. It contains 53,940 rows and 10 variables, where each row describes the attributes of a particular diamond. The variables are: price, carat weight, quality of cut, color, clarity, length, width, depth, total depth percentage, and width of the top of the diamond.

# Set Up
We will use the traditional Azure ML workflow for this experiment. Although this is unconventional because the data is built into an R package, using Azure ML’s workflow allows for easy data substitution. Also, it’s great practice for future experiments! Here are our steps:

* Use an **Execute R Script** module to load the ggplot2 library. Save the diamonds dataset to a variable and then output it to Azure ML.
* Identify categorical attributes and cast them into categorical features using the **Metadata Editor** module. These attributes were cast into categorical values: color, clarity, cut.
* Use multiple **Execute R Script** modules, which will contain the R code for our different ggplots.

# R Graphs in Azure
A great feature of the **Execute R Script** module is its ability to render R graphics. Any graph made in the module is automatically output to the bottom right node labeled “R Device”. Graphs can be saved to your computer by right-clicking and selecting Save As.

# Using ggplot2
Graphics in ggplot2 follow a different syntax than built-in R graphics. The new syntax may seem a little complicated at first, but it is much easier and more efficient once you are used to it.

##### Basics - Histograms and Density Plots:
Every ggplot2 graphic starts with the function [ggplot()](http://docs.ggplot2.org/current/ggplot.html), which initializes a ggplot2 object. This is where we tell ggplot2 which dataset we want to use for making graphs. Here we use the diamonds dataset, so the call is ggplot(data = diamonds).
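
As a minimal sketch of the steps so far, assuming ggplot2 is installed in the R environment, the following loads the library, stores the diamonds data in a variable, and initializes an empty ggplot object. The variable names `data.set` and `p` are illustrative. The commented-out `maml.mapOutputPort()` line reflects how the classic **Execute R Script** module passes a data frame to its output port; treat it as an assumption about your environment, since that function only exists inside Azure ML.

```r
# Load ggplot2, which also makes the diamonds dataset available
library(ggplot2)

# Save the dataset to a variable (the name is illustrative)
data.set <- as.data.frame(diamonds)

# Quick sanity check on the shape of the data (rows, columns)
dim(data.set)

# Inside Azure ML's Execute R Script module, the data frame would be
# sent to the output port like this (assumption: classic Studio runtime):
# maml.mapOutputPort("data.set")

# Initialize a ggplot object bound to the dataset.
# No geom layer has been added yet, so there is nothing to draw.
p <- ggplot(data = data.set)
```

On its own, `p` renders as a blank panel; geom functions add the actual drawing layers.
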
In order to make a histogram of the “carat” column, simply add the function geom_histogram(aes(x = carat)) to the ggplot() object, where [aes()](http://docs.ggplot2.org/0.9.3/aes.html) specifies which column of the dataset to use for the histogram. The final command is

> ggplot(data = diamonds) + geom_histogram(aes(x = carat))

The histogram will be rendered at the bottom right node of the **Execute R Script** module.

The ease of ggplot2 is illustrated when we want to create a density plot. Simply replace geom_histogram() with geom_density(). There you have it! Change carat to price in order to see the density of diamond prices.

> ggplot(data = diamonds) + geom_density(aes(x = price))

##### Intermediate Concepts:
We will use [box plots](http://en.wikipedia.org/wiki/Box_plot) and [violin plots](http://en.wikipedia.org/wiki/Violin_plot) to dive a bit deeper into the ggplot2 functionality.

Creating a single box plot is similar to making a histogram or density plot, except we use geom_boxplot() and specify both an x and a y variable in our [aes()](http://docs.ggplot2.org/0.9.3/aes.html) function. The y variable is used to construct the box plot, and the x variable splits the data into groups, with one box per group. Setting x = 1 puts every diamond in the same group, which gives a single box plot for the carats of all diamonds.

Make multiple box plots by setting the x variable to a categorical variable. In this example, we create separate box plots for each type of cut. Since we will only be using the carat and cut columns from the diamonds dataset, we can move aes(y = carat, x = cut) inside of our [ggplot()](http://docs.ggplot2.org/current/ggplot.html) command and save the result to a variable:

> g <- ggplot(diamonds, aes(y = carat, x = cut))

Now, we use g in place of our long [ggplot()](http://docs.ggplot2.org/current/ggplot.html) command. In order to make our [box plots](http://en.wikipedia.org/wiki/Box_plot), type g + geom_boxplot().
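
The box plot steps described above could be sketched as follows, assuming ggplot2 is installed. The variable name `single_box` is illustrative, and the explicit print() calls are a cautious choice for scripted environments, where plots are not always printed automatically.

```r
library(ggplot2)

# A single box plot: x is fixed at 1, so all diamonds fall into one group
single_box <- ggplot(data = diamonds) +
  geom_boxplot(aes(x = 1, y = carat))
print(single_box)

# Store the shared dataset and aesthetics once, then reuse them
g <- ggplot(diamonds, aes(y = carat, x = cut))

# One box per level of cut
print(g + geom_boxplot())
```

Because `g` carries the dataset and the aesthetic mapping, each additional plot type only needs its geom function.
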
For [violin plots](http://en.wikipedia.org/wiki/Violin_plot), type g + geom_violin().

Our final concept is layering, which is a serious reason to use ggplot2. In order to add points to the violin chart, simply add geom_point() before geom_violin().

> g + geom_point() + geom_violin()

Layers are drawn in the order they are added, so the points sit beneath the violins. Now the graph will show where the outliers are.

##### Conclusion
You can now use ggplot2 inside of Azure ML to make great graphics. This tutorial only scratched the surface of ggplot2, so stay tuned for tutorials on more advanced topics. In the meantime, here are some related resources.

# Resources
1. [Tutorial: Base R Graphics in AzureML](https://gallery.azureml.net/Details/0a715b439b5c43b2aa104a92f215624a)
2. [Tutorial: Creating a random forest regression model in R and using it for scoring](https://gallery.azureml.net/Details/b729c21014a34955b20fa94dc13390e5)
3. [Tutorial: Building a classification model in Azure ML](https://gallery.azureml.net/Details/01b2765fa75147ce99679e18482d280f)