Fraud Detection with Azure HDInsight Spark Clusters

By for August 6, 2017

Report Abuse
This solution demonstrates how to build and deploy a machine learning model with Microsoft R Server on Azure HDInsight Spark clusters for online retailers to detect fraudulent purchase transactions. This solution enables efficient handling of big data on Spark with Microsoft R Server.
> **Note:** If you have already deployed this solution, click [here](https://start.cortanaintelligence.com/Deployments?type=frauddetectionhdi) to view your deployment. ### Estimated Provisioning Time: 25 Minutes > This solution will create an HDInisght Spark cluster with Microsoft R Server. This cluster will contain 2 head nodes, 2 worker nodes, and 1 edge node with a total of 32 cores. The approximate cost for this HDInsight Spark cluster is 8.29USD/hour. Billing starts once a cluster is created and stops when the cluster is deleted. Billing is pro-rated per minute, so you should always **delete your cluster** when it is no longer in use. Use the Deployments page to delete the entire solution once you are done. ## Overview Fraud detection is one of the earliest industrial applications of data mining and machine learning. This solution shows how to build and deploy a machine learning model for online retailers to detect fraudulent purchase transactions. ## Business Perspective This solution shows how to preprocess data, create new features, train R models, and perform predictions in-database. The final Hive table provides a predicted value for each transaction. This predicted value, which can be interpreted as a probability of fraud, can help you determine whether you wish to try to interrupt the transaction. We provide a PowerBI dashboard which shows the predicted scores of the data in the Test set - this is data for which we know whether the transaction was fraudulent, but was not used to build the model itself. Use the "Try It Now" button to view the PowerBI Dashboard. Note in the top table that we predicted fraud in 14,525 cases which were not fraudulent. This is something to keep in mind when deploying the model. Rather than reject a transaction that is predicted to be fraud, we might instead want to add a step to the purchase that would discourage an actual fraudulent transaction while still allowing a valid transaction to occur. ## Data Scientist Perspective This solution demonstrates the end-to-end process of how to develop and deploy machine learning models for detecting fraud in online transactions. It contains sample data, R code for each step of building the model (including data processing, feature engineering, training and evaluating models along with sample data), deploying the model as a web service (on the edge node) and consuming the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters. Data scientists who are testing this solution can work with the provided R code from the browser-based Open Source Edition of RStudio Server that runs on the Edge Node of the Azure HDInsight Spark cluster. By [setting the compute context](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-compute-contexts) the user can decide where the computation will be performed: locally on the edge node, or distributed across the nodes in the Spark cluster. All the R code can also be found in public Github repository. Have fun! ## Disclaimer ©2017 Microsoft Corporation. All rights reserved. This information is provided "as-is" and may change without notice. Microsoft makes no warranties, express or implied, with respect to the information provided here. Third party data was used to generate the Solution. You are responsible for respecting the rights of others, including procuring and complying with relevant licenses in order to create similar datasets.