Data Warehousing and Data Science with SQL Data Warehouse and Spark

September 9, 2016

Most enterprises require a centralized data warehouse for advanced analytics and reporting. Traditionally, this required IT organizations to manually set up a number of systems and services and to build pipelines that use them to ingest, process and store data. This solution sets up an end-to-end data ingestion, wrangling and warehousing pipeline using Apache Spark, Azure SQL Data Warehouse and Azure Data Factory within a few minutes. It also demonstrates how to use a Jupyter notebook and Power BI to explore and visualize the raw data and processed results residing in Azure HDInsight and Azure SQL Data Warehouse.
> **Note:** If you have already deployed this solution, click [here](https://start.cortanaintelligence.com/Deployments?type=edwspark) to view your deployment.

### Estimated Provisioning Time: 25 Minutes

## Overview

This solution uses the Million Song Dataset as a sample to create a data ingestion and processing pipeline.

[![Solution Diagram](https://caqsres.blob.core.windows.net/edwspark/EDWSparkDiagram.JPG)](https://caqsres.blob.core.windows.net/edwspark/EDWSparkDiagram.JPG)

The solution performs the following steps:

* Creates the aforementioned Azure resources in the user's Azure subscription.
* Copies the Million Song Dataset from a public storage location to a newly created Azure Storage container.
* Sanitizes and aggregates the data using Apache Spark on Azure HDInsight.
* Transfers the processed results from the Azure HDInsight storage location into Azure SQL Data Warehouse using PolyBase load queries.
* Explores the raw data and processed results using a Jupyter notebook and Power BI.

All of the above steps (except data exploration) are orchestrated by Azure Data Factory.

***At the time of deployment, please select a region that supports creation of a V12 SQL Server instance for Azure SQL Data Warehouse.***

Dataset courtesy of: Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
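
To make the sanitization and aggregation step of the pipeline more concrete, here is a minimal plain-Python sketch of the idea. It is not the solution's actual Spark job. The field names (`title`, `artist`, `year`, `duration`), the sample rows, the rule that a year of `0` means "unknown", and the choice of a per-year aggregate are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; the field names are assumptions, not the solution's schema.
SAMPLE_SONGS = [
    {"title": "Song A", "artist": "Artist 1", "year": 2001, "duration": 200.0},
    {"title": "Song B", "artist": "Artist 1", "year": 0, "duration": 180.0},     # unknown year
    {"title": "Song C", "artist": "Artist 2", "year": 2001, "duration": 240.0},
    {"title": "Song D", "artist": "Artist 2", "year": 1999, "duration": None},   # missing duration
    {"title": "Song E", "artist": "Artist 3", "year": 1999, "duration": 300.0},
]


def sanitize(rows):
    """Keep only rows with a positive integer year and a positive numeric duration."""
    clean = []
    for row in rows:
        year = row.get("year")
        duration = row.get("duration")
        if (
            isinstance(year, int)
            and year > 0
            and isinstance(duration, (int, float))
            and duration > 0
        ):
            clean.append(row)
    return clean


def aggregate_by_year(rows):
    """Return song count and average duration per year, ordered by year."""
    totals = defaultdict(lambda: [0, 0.0])  # year -> [count, total duration]
    for row in rows:
        entry = totals[row["year"]]
        entry[0] += 1
        entry[1] += row["duration"]
    return {
        year: {"song_count": count, "avg_duration": total / count}
        for year, (count, total) in sorted(totals.items())
    }


if __name__ == "__main__":
    summary = aggregate_by_year(sanitize(SAMPLE_SONGS))
    for year, stats in summary.items():
        print(year, stats["song_count"], stats["avg_duration"])
```

In the deployed solution this kind of logic would run on Spark against the full dataset, and the aggregated output would then be loaded into Azure SQL Data Warehouse. The sketch only shows the shape of the transformation on a handful of in-memory rows.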