# Data Warehousing and Modern BI on Azure

April 17, 2017

This one-click-deploy tutorial creates a fact and dimension generation pipeline, fronted by tabular models, to showcase data warehousing on the cloud using the AdventureWorks dataset.
## Overview

With the advent of Big Data infrastructure and the cloud, enterprises have started to collect, persist, process, and analyze large amounts of structured and unstructured data from various internal and external data sources. Traditional appliance-based, on-premises Enterprise Data Warehouse (EDW) systems that served as the backbone for these analyses are becoming obsolete; they are being ported, in whole or in part, onto the cloud for more efficient and cost-effective processing.

Historically, data warehouse modelling conforms to one of two approaches: i) Kimball modelling and ii) Inmon modelling. Kimball-based data warehouse modelling adopts a bottom-up approach in which each department within an enterprise starts out with its own data mart. These data marts tend to be de-normalized and typically adhere to dimensional models such as the star or snowflake schema. The data marts are then assimilated to form a unified enterprise data warehouse. Inmon-based data warehouse modelling, on the other hand, emphasizes that the enterprise data warehouse should serve as the "single source of truth"; its tables are normalized (typically to 3NF). Each department then constructs its own data mart from the centralized enterprise data warehouse, and these data marts serve as the gateway for reporting and analysis. There are also hybrid approaches that combine the two.

## Problem Statement

We seek to deploy a cloud-backed OLTP-to-OLAP transformation pipeline that accommodates:

* Hosting multiple data marts, each holding a subset of the OLAP dataset.
* Batch and incremental load pipelines to support the initial one-time load and trickle loading of both the central data lake and the data marts.
* Monitoring and alerting to track Fact/Dimension update failures, both in the data lake and in its subsequent propagation into the data marts.

## Pricing

This solution will cost **$131.49/day** to run. **[Refer here for details.](https://azure.github.io/Azure-CortanaIntelligence-SolutionAuthoringWorkspace/solution-prices.html#data-warehousing-and-modern-bi-on-azure)**

## Architecture

[![Solution Diagram](https://caqsres.blob.core.windows.net/avhivedw/Architecture.png)](https://caqsres.blob.core.windows.net/avhivedw/Architecture.png)

In this solution, we demonstrate how a hybrid EDW scenario can be implemented on Azure using:

1. **Azure SQL Data Warehouse** as a data mart to vend business-line-specific data.
2. **Azure Analysis Services** as an analytical engine to drive reporting.
3. **Azure Blob Storage** as a data lake to store raw data in its native format, in a flat architecture, until needed.
4. **Azure HDInsight** as a processing layer to transform, sanitize, and load raw data into a de-normalized format suitable for analytics.
5. **Azure Data Factory** as the orchestration engine to move, transform, monitor, and process data in a scheduled, time-slice-based manner.

Our scenario follows an Extract-Load-and-Transform (ELT) model. First, we extract data from an operational OLTP data source into Azure Blob Storage, which acts as the landing zone for the initially loaded data. We then transform the data to generate facts and dimensions using Hive on Azure HDInsight as the processing engine. The processed data is moved into Azure SQL Data Warehouse, which acts as the data mart for reporting and analysis. We then show how this data can be visualized in tools such as Power BI.
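To make the transform step concrete, below is a minimal, illustrative sketch (not the exact scripts deployed by this solution) of how Hive on HDInsight might declare an external table over the OLTP extract landed in Blob storage and generate a de-normalized customer dimension from it. The host name, storage paths, and table/column names are assumptions based on the AdventureWorks tables listed later in this document; in the deployed solution the equivalent statements run as Hive activities inside Azure Data Factory rather than from a standalone script.

```python
# A minimal sketch of the Hive transform step, assuming a reachable HiveServer2
# endpoint on the HDInsight cluster (host/credentials below are placeholders).
from pyhive import hive

conn = hive.connect(host="my-hdinsight-cluster", port=10000, username="admin")
cursor = conn.cursor()

# External table over the raw Sales.Customer extract landed in the Blob-storage data lake.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging_customer (
        CustomerID   INT,
        PersonID     INT,
        StoreID      INT,
        TerritoryID  INT,
        ModifiedDate TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'wasb://data@mystorageaccount.blob.core.windows.net/adventureworks/sales_customer/'
""")

# Target dimension stored as ORC (simplified; the deployed solution uses Hive
# transactional tables). A staging_person external table over the Person.Person
# extract is assumed to have been declared the same way as staging_customer.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        CustomerID  INT,
        FirstName   STRING,
        LastName    STRING,
        TerritoryID INT
    )
    STORED AS ORC
""")

# De-normalize Customer and Person into the dimension.
cursor.execute("""
    INSERT OVERWRITE TABLE dim_customer
    SELECT c.CustomerID, p.FirstName, p.LastName, c.TerritoryID
    FROM staging_customer c
    JOIN staging_person   p ON c.PersonID = p.BusinessEntityID
""")
```

The resulting dimension table is what steps **[6->7/8->9]** in the Data Flow below export to CSV for the PolyBase load into the data mart.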
Importantly, we also show how this entire architecture can be orchestrated and monitored through Azure Data Factory. To demonstrate this, we deploy both a batch pipeline to showcase the initial bulk data load and an incremental pipeline to instrument change data capture for incoming data slices.

## Data Flow

[![Solution Diagram](https://caqsres.blob.core.windows.net/avhivedw/DataFlowChart.png)](https://caqsres.blob.core.windows.net/avhivedw/DataFlowChart.png)

The following steps are performed, as outlined in the chart above:

* **[1->2]** Normalized OLTP data is cloned to Azure Blob Storage every hour. The copied data is partitioned by time slice at a 1-hour granularity.
* **[3->4->5]** Hive external tables are created against the cloned source OLTP data and used to generate dimensions, which are written back to a Hive transactional table. Refer [here](#batch-loads) for details of the transforms applied. In the incremental pipeline, deltas are reconciled using the procedure outlined [here](#incremental-loads).
* **[5->6->7]** The generated dimensions and the source OLTP data are used to generate Hive transactional Fact tables.
* **[6->7/8->9]** Fact and Dimension tables are written out to CSV files in Azure Blob Storage to enable a PolyBase load into the data mart (Azure SQL Data Warehouse). Stored procedure activities in Azure Data Factory are kicked off to load the external tables and perform the subsequent inserts into the Fact and Dimension tables (a simplified sketch of this load appears after the Dataset section below). In the incremental pipeline, deltas are reconciled in a manner similar to the procedure outlined [here](#incremental-loads).
* **[10]** Data sourced from the data mart is used to generate tabular/multi-dimensional models in Azure Analysis Services.
* **[11]** Data cached in Azure Analysis Services is sourced to visualize dashboards and reports.

## Dataset

The data used as our OLTP source models a fictitious company named 'Adventure Works Cycles': a large, multinational manufacturing company that manufactures and sells metal and composite bicycles to North American, European, and Asian commercial markets. Refer [here](https://technet.microsoft.com/en-us/library/ms124825(v=sql.100).aspx) for a deeper look at the various business scenarios addressed by this dataset.

We use the following transactional tables from this dataset:

* **Sales.SalesOrderHeader (32850 rows)**: Contains the general, or parent, sales order information. The specific products associated with a sales order are stored in the SalesOrderDetail table.
* **Sales.SalesOrderDetail (122702 rows)**: Contains the individual products associated with a specific sales order. A sales order may include orders for multiple products. The general, or parent, information for each sales order is stored in SalesOrderHeader; each product that is ordered, or child, is stored in SalesOrderDetail.
* **Sales.SalesPerson (17 rows)**: Contains current sales information for the sales representatives.
* **Sales.Currency (105 rows)**: Contains International Organization for Standardization (ISO) currency descriptions.
* **Sales.CurrencyRate (13532 rows)**: Contains currency exchange rates.
* **Sales.Customer (19820 rows)**: Contains current customer information. Customers are categorized by type: individual consumer or retail store.
* **Sales.SalesOrderHeaderSalesReason (27647 rows)**: A cross-reference table that maps sales orders to sales-reason codes.
* **Sales.SalesPersonQuotaHistory (163 rows)**: Contains the historical sales information for the sales representatives.
* **HumanResources.Department (16 rows)**: Contains the departments within the Adventure Works Cycles company.
* **HumanResources.Employee (290 rows)**: Contains employee information such as national identification number, work title, and vacation and sick leave hours. Employee names are stored in the Person table.
* **HumanResources.EmployeeDepartmentHistory (296 rows)**: Contains current and historical data about employees and the departments in which they work. The department in which an employee currently works is identified by a null value in the EndDate column.
* **HumanResources.EmployeePayHistory (316 rows)**: Contains current and historical salary information for employees.
* **Person.BusinessEntityAddress (19614 rows)**: Maps persons to their address or addresses. For example, a customer may have separate billing and shipping addresses.
* **Person.EmailAddress (19972 rows)**: Maps persons to their email address(es).
* **Person.Person (19972 rows)**: Contains details about a person, e.g. an employee or a customer.
* **Person.PersonPhone (19972 rows)**: Maps persons to their phone number(s).
* **Production.Product (504 rows)**: Contains the products sold or used in the manufacturing of sold products.
* **Production.ProductCostHistory (395 rows)**: Contains the cost of each product over time.
* **Production.ProductDescription (762 rows)**: Contains product descriptions in several languages.
* **Production.ProductListPriceHistory (395 rows)**: Contains the changes to the list price of each product over time.

These base tables are used to create the dimensions and facts that power tabular models, providing insight into KPIs for the Adventure Works Cycles company.
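As referenced in the Data Flow section above (steps **[6->7/8->9]**), the Fact and Dimension CSVs written to Blob storage are pulled into Azure SQL Data Warehouse through PolyBase external tables. The sketch below shows what that load might look like when driven from Python with pyodbc. The server, database, credential, `ext` schema, and `DimCustomer` names are all illustrative assumptions, not the solution's actual object names, and the deployed solution issues the equivalent statements from stored procedures invoked by Data Factory stored-procedure activities.

```python
# A minimal, hypothetical sketch of the PolyBase load into the data mart.
# Server, database, credential, schema, and table names are placeholders;
# the database-scoped credential for Blob storage and the internal table
# dbo.DimCustomer are assumed to exist already.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mydwserver.database.windows.net;DATABASE=mydatamart;"
    "UID=sqladmin;PWD=<password>",
    autocommit=True,  # issue DDL outside an explicit transaction
)
cursor = conn.cursor()

# External data source and file format pointing at the CSVs exported by the Hive step.
cursor.execute("""
    CREATE EXTERNAL DATA SOURCE AzureBlobStage
    WITH (TYPE = HADOOP,
          LOCATION = 'wasbs://data@mystorageaccount.blob.core.windows.net',
          CREDENTIAL = BlobStorageCredential)
""")
cursor.execute("""
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ','))
""")

# External table over the exported dimension, then an insert into the internal table.
cursor.execute("""
    CREATE EXTERNAL TABLE ext.DimCustomer (
        CustomerID  INT,
        FirstName   NVARCHAR(50),
        LastName    NVARCHAR(50),
        TerritoryID INT
    )
    WITH (LOCATION = '/adventureworks/dim_customer/',
          DATA_SOURCE = AzureBlobStage,
          FILE_FORMAT = CsvFormat)
""")
cursor.execute("INSERT INTO dbo.DimCustomer SELECT * FROM ext.DimCustomer")
```

For large loads, a CTAS (`CREATE TABLE AS SELECT`) against the external table is generally preferred over `INSERT ... SELECT` in SQL Data Warehouse; the simpler form is shown here for readability.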