In this course, you’ll gain hands-on experience with Microsoft R and HDInsight Spark for scalable data science and machine learning. You will learn about the fundamentals of functional programming, parallel external memory algorithms, Spark on HDInsight, and distributed systems.
# About the Course

In this course we see how open source R, Microsoft R and SQL Server R Services can work together to build data science solutions. We run hands-on exercises and learn best practices for R programming. We see how the RevoScaleR package and its data-processing and analytics functions not only allow our analytics to scale to large datasets, but also let us deploy them inside a production environment like SQL Server, all from the comfort of our R IDE. We also learn about the specifics of working with R inside SQL Server, such as how to store R artifacts (such as model objects) and retrieve them later from R stored procedures (for scoring new data with an R model) or SQL Server Reporting Services (for rendering an R plot in a report).

This course plays on the interaction between a data analysis problem and the tools of the trade to keep participants engaged and busy learning. Like any data science project, we start with a problem (and data). In the process of cleaning and exploring the data we learn about the strengths and shortcomings of our tools, and how to leverage the strengths and get around the shortcomings. We then slowly make our way from exploratory data analysis (EDA) to model building, discussing machine learning examples and pitfalls while exploring ways we can improve and iterate. We end by talking about the challenge of deploying a model in a SQL Server production environment and calling it from an application.

The course is intended to generate a lot of discussion about data science as a process and teach you how to think like a data scientist. After completing it, students will have a deeper understanding of the data science process and will have learned how R, Microsoft R and SQL Server R Services can be used to develop a data science pipeline and which best practices to follow.
# Prerequisites

There are a few things you will need in order to properly follow the course materials:

* Participants are expected to be familiar with R basics, especially the basic data types and structures in R, how to subset each, and the similarities and differences between them. Familiarity with data analysis using R and a basic knowledge of statistics and the data science process is helpful, as is some experience with SQL Server.

# Agenda

* We give an overview of RevoScaleR and show you how to access it by downloading and installing the Microsoft R Client on a Windows workstation. We then get the NYC Taxi data used during the course. Finally, we install the required R packages we will be using throughout the course.
* We talk about two different ways that RevoScaleR can handle the data and the trade-offs involved.
* We examine the data and ask how we can clean it and then make it richer and more useful to the analysis problem. In the process, we learn how to use RevoScaleR to perform data transformations and how third-party packages can be leveraged.
* We examine the data visually and through various summaries to see what does and does not mesh with our understanding of the problem. We look at sampling as a way to examine outliers.
* We examine ways of visualizing our results and getting a feel for the data. In the process, we learn how RevoScaleR interacts with various visualization tools.
* We look at k-means clustering as our first RevoScaleR analytics function and look at how we can improve its performance when the dataset is large.
* We build a few predictive models and show how we can examine the predictions and compare the models. We see how our choice of model can have performance implications.
* We talk about RevoScaleR's write-once-deploy-anywhere philosophy and what we mean by a compute context. We then put this into practice by deploying our code into SQL Server and talk about architectural differences.
* We go into SSMS and see how we can call R via stored procedures, retrieve a model object and score new data with it. We also examine how we can retrieve plots and serve them in SSRS.

# Technologies Covered

* Microsoft R Server