Real-Time Featurization and Prediction using DocumentDB and AzureML

January 19, 2017

Featurize Real-Time Stream of Events to Make Real-Time Predictions
> **Note:** If you have already deployed this solution, click [here]( to view your deployment.

### Estimated Provisioning Time: 10 Minutes

## Overview

In many prediction scenarios, using recent data is as important as using historical data. One such scenario is online advertising, where the best offer needs to be selected from a pool of offers. The recent performance of offers is very valuable in making that selection. However, in many available solutions, that recent information cannot be used because of long batch processing delays. This pattern is common to many scenarios, such as choosing which article, video, or product to show a user.

This solution demonstrates how to use real-time featurization and prediction in Azure to optimize the overall click-through rate (CTR) of a simple online advertising campaign. The unique characteristics of this solution are:

1. Real-time featurization of ongoing click/no-click events using **Azure DocumentDB**.
2. Using the real-time feature data to predict the best offer using **Azure Machine Learning** (AML).

The 'Deploy' button will launch a workflow that will deploy an instance of the solution within a Resource Group in your Azure subscription. Once the deployment is finished, you'll see some helpful instructions about digging into some of the deployed services. Also, please review the Details section below to learn more about the solution.

## Details

In this solution, we'll implement a simple online advertising campaign where, at any given time, the system has to pick the best offer from a pool of 5 offers. The best offer is shown to the user, who either clicks it or does not. The goal is to maximize the overall CTR of the campaign using some offer selection policy.

The solution will implement and compare the two policies below:

1. In traditional campaigns, the better offers are handpicked by humans and the system shows them in a round-robin fashion. We call this the Round Robin (RR) policy.
2. One simple alternative to the RR policy is the [epsilon-greedy multi-armed bandit]( (MAB) policy, which uses an explore-exploit method to maximize the expected campaign CTR. During "exploration", MAB gathers more information about the expected CTR of the offers. During "exploitation", MAB picks the offer with the maximum CTR so far.

The image below shows the architecture of the solution:

![Solution Diagram](

Let's examine the different components in detail.

### Traffic Simulator

In real-life scenarios, this component will be replaced by an actual ad-serving platform. But for the sake of an end-to-end demonstration, we've included this traffic simulator to create a synthetic campaign. Since the simulator has high throughput, it is set to stop generating traffic after 15 minutes to limit the cost of trying this solution.

This component simulates offer impressions according to the RR and MAB policies. It also simulates the clicks according to a pre-set static CTR for each offer. This results in a stream of events where each event describes which offer was shown and whether or not it was clicked. A sample sequence of a few events looks like this:

    {action: 'offer2', clicks: 0},
    {action: 'offer5', clicks: 1},
    {action: 'offer1', clicks: 0}

These events are sent to the featurizer in real time using the logging API. As you can see, the events are designed to be very simple; the complexity is pushed inside the featurizer, hidden from the traffic simulator. To get the best offer for the MAB policy, the traffic simulator calls the prediction API (described in the next section).

### Real-Time Featurization and Prediction System

At the heart of the solution is the real-time featurization and prediction system, which consists of an event featurizer and a predictor.

#### Featurizer

The events sent to the logging API go into the featurizer, implemented using DocumentDB (Azure's NoSQL database).
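
The flow described so far (simulated click events, per-offer counters, and an epsilon-greedy choice between exploring and exploiting) can be sketched in a few lines of Python. This is a minimal illustration, not the solution's code: the CTR values in `TRUE_CTR`, the `epsilon` default, and all function names are assumptions made for the example, and the counters live in a plain dictionary rather than in DocumentDB.

```python
import random

# Hypothetical static click-through rates per offer (not taken from the solution).
TRUE_CTR = {"offer1": 0.02, "offer2": 0.04, "offer3": 0.06,
            "offer4": 0.08, "offer5": 0.10}


def simulate_event(offer, rng):
    """Return one event in the same shape as the sample events above."""
    clicked = 1 if rng.random() < TRUE_CTR[offer] else 0
    return {"action": offer, "clicks": clicked}


def update_features(features, event):
    """Accumulate impressions and clicks for the offer named in the event."""
    stats = features[event["action"]]
    stats["impressions"] += 1
    stats["clicks"] += event["clicks"]


def observed_ctr(stats):
    """CTR observed so far; 0.0 for an offer that has not been shown yet."""
    if stats["impressions"] == 0:
        return 0.0
    return stats["clicks"] / stats["impressions"]


def pick_offer(features, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    offers = sorted(features)
    if rng.random() < epsilon:
        return rng.choice(offers)  # exploration: any offer, uniformly
    # Exploitation: the offer with the highest observed CTR so far.
    return max(offers, key=lambda offer: observed_ctr(features[offer]))


def run_campaign(n_events, epsilon=0.1, seed=0):
    """Simulate n_events impressions under the epsilon-greedy policy."""
    rng = random.Random(seed)
    features = {offer: {"impressions": 0, "clicks": 0} for offer in TRUE_CTR}
    for _ in range(n_events):
        offer = pick_offer(features, epsilon, rng)
        update_features(features, simulate_event(offer, rng))
    return features
```

Calling `run_campaign(10000)` returns the per-offer impression and click counts. With a small `epsilon`, impressions tend to concentrate on the offers with the higher simulated CTRs, though the exact split depends on the random seed.
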
The featurization logic runs as a stored procedure inside DocumentDB, which is server-side JavaScript code that processes the incoming events and updates the stored features in real time. You can find more about stored procedures [here](

Three simple features are calculated for each offer: total impressions, total clicks, and CTR. The features are time-decayed to gradually discount older data: a click last week is likely not as important as a click today.

We've used simple features for this solution. However, the stored procedure can be modified to calculate a more complex set of features. We've covered some options in the blog post "[RFM: A Simple and Powerful Approach to Event Modeling](".

#### Predictor

The predictor finds the best offer based on the MAB policy. The policy runs as a Python script inside an Azure Machine Learning (AML) experiment. We use an AML module to read from DocumentDB and retrieve the latest CTR for each offer to be used by MAB. The AML experiment is deployed as a web service that powers the prediction API. The traffic simulator calls this API to get the MAB predictions.

### Real-time Stats Dashboard

The final component is the dashboard that shows the following stats in real time:

1. CTR for MAB and RR policies
2. MAB lift over RR
3. CTR, clicks, and impressions per offer

Looking at this dashboard, you can see how the MAB policy outperforms the RR policy.

### Latency and Throughput

For low throughput (~1 event/sec), the average latency for getting MAB predictions is 230ms. For higher throughput (~50 events/sec), the average latency is 400ms. Latency and throughput could be improved by adding more AML endpoints to the predictor.

##### Disclaimer

©2017 Microsoft Corporation. All rights reserved. This information is provided "as-is" and may change without notice. Microsoft makes no warranties, express or implied, with respect to the information provided here. Third party data was used to generate the solution. You are responsible for respecting the rights of others, including procuring and complying with relevant licenses in order to create similar datasets.