Loan Credit Risk with Azure HDInsight Spark Clusters

June 29, 2017

This solution demonstrates how to build and deploy a machine learning model with Microsoft R Server on Azure HDInsight Spark clusters. A lending institution can use the resulting predictive analytics to reduce the number of loans it offers to the borrowers most likely to default, increasing the profitability of its loan portfolio. Running Microsoft R Server on Spark lets the solution handle big data efficiently.
> **Note:** If you have already deployed this solution, click [here](https://start.cortanaintelligence.com/Deployments?type=loancreditriskhdi) to view your deployment.

### Estimated Provisioning Time: 25 Minutes

> This solution will create an HDInsight Spark cluster with Microsoft R Server. This cluster will contain 2 head nodes, 2 worker nodes, and 1 edge node with a total of 32 cores. The approximate cost for this HDInsight Spark cluster is 8.29 USD/hour. Billing starts once a cluster is created and stops when the cluster is deleted. Billing is pro-rated per minute, so you should always **delete your cluster** when it is no longer in use. Use the Deployments page to delete the entire solution once you are done.

## Overview

If we had a crystal ball, we would only loan money to someone we knew would pay us back. A lending institution can make use of predictive analytics to reduce the number of loans it offers to the borrowers most likely to default, increasing the profitability of its loan portfolio. This solution uses simulated data for a small personal-loan financial institution to build a model that helps predict whether a borrower will default on a loan.

## Business Perspective

The business user uses the predicted scores to help determine whether to grant a loan. Microsoft R Server on HDInsight Spark clusters provides distributed and scalable machine learning capabilities for big data, leveraging the combined power of R Server and Apache Spark. This solution demonstrates how to develop machine learning models for loan credit risk prediction (including data processing, feature engineering, training and evaluating models), deploy the models as a web service (on the edge node), and consume the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters. The final predictions are saved to a Hive table, and this data is then visualized in Power BI.

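As a rough illustration of how predicted scores can feed a lending decision, the sketch below summarizes what rejecting every loan scored at or above a chosen cutoff would have meant for a scored testing sample. It is written in Python purely for illustration and is not the solution's R code. The `ScoredLoan` fields, the `summarize_cutoff` helper, the sample records, and the definition of "amount saved" (the total amount of rejected loans that actually defaulted) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ScoredLoan:
    # All field names are hypothetical; they are not taken from the solution's schema.
    loan_id: str
    amount: float           # loan amount in dollars
    predicted_score: float  # model's predicted default score, between 0 and 1
    defaulted: bool         # observed outcome in the testing sample


def summarize_cutoff(loans, cutoff):
    """Summarize the effect of rejecting every loan scored at or above `cutoff`."""
    rejected = [loan for loan in loans if loan.predicted_score >= cutoff]
    avoided = [loan for loan in rejected if loan.defaulted]
    return {
        "rejected_loans": len(rejected),
        "defaults_avoided": len(avoided),
        # Assumption: "saved" means the amount of rejected loans that defaulted.
        "amount_saved": sum(loan.amount for loan in avoided),
    }


# Made-up records for illustration only; not the solution's data.
sample = [
    ScoredLoan("A", 5000.0, 0.92, True),
    ScoredLoan("B", 12000.0, 0.15, False),
    ScoredLoan("C", 8000.0, 0.67, True),
    ScoredLoan("D", 3000.0, 0.71, False),
]

for cutoff in (0.9, 0.6):
    print(cutoff, summarize_cutoff(sample, cutoff))
```

Lowering the cutoff avoids more defaults but also rejects loans that would have been repaid (loan "D" here), which is the trade-off a business user has to weigh when choosing an acceptance threshold.
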
The business user can fine-tune their predictions by using the Power BI dashboard to see the number of loans and the total dollar amount that could be saved under different scenarios. The dashboard includes a filter based on percentiles of the predicted scores. When all the values are selected, the user views all the loans in the testing sample and can inspect how many of them defaulted. By checking just the top percentile (99), the user drills down to the loans with a predicted score in the top 1%. Checking multiple consecutive boxes allows the user to find a cutoff point they are comfortable using as a future loan acceptance criterion. You can try out this dashboard by clicking the **Try it Now** button to the right.

## Data Scientist Perspective

This solution demonstrates the end-to-end process of developing and deploying machine learning models for loan credit risk prediction. It contains sample data and R code for each step: building the model (data processing, feature engineering, training and evaluating models), deploying the model as a web service (on the edge node), and consuming the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters.

Data scientists who are testing this solution can work with the provided R code from the browser-based Open Source Edition of RStudio Server that runs on the edge node of the Azure HDInsight Spark cluster. By [setting the compute context](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-compute-contexts), the user can decide where the computation will be performed: locally on the edge node, or distributed across the nodes in the Spark cluster. All the R code can also be found in the public GitHub repository. Have fun!

## Disclaimer

©2017 Microsoft Corporation. All rights reserved. This information is provided "as-is" and may change without notice.
Microsoft makes no warranties, express or implied, with respect to the information provided here. Third party data was used to generate the Solution. You are responsible for respecting the rights of others, including procuring and complying with relevant licenses in order to create similar datasets.