Forecasting workload on servers is a common business need for technology companies that manage their own infrastructure. A key challenge in forecasting server workload is the sheer volume of data. In this scenario, we use 1 TB of synthesized data to demonstrate how data scientists can use Azure ML Workbench to develop solutions that require big data. We show how a user of Azure ML Workbench can follow a happy path: start from a sample of a large dataset, iterate through data preparation, feature engineering, and machine learning, and then extend the process to the entire dataset.
The detailed documentation for this real-world scenario includes a step-by-step walkthrough:

https://docs.microsoft.com/azure/machine-learning/preview/scenario-big-data

For code samples, click the "View Project" icon on the right and visit the project GitHub repo.

Key components needed to run this scenario:

* An [Azure account](https://azure.microsoft.com/free/) (free trials are available).
* An installed copy of [Azure Machine Learning Workbench] and a workspace.
* A Data Science Virtual Machine (DSVM) for Linux (Ubuntu). We recommend a virtual machine with at least 8 cores and 32 GB of memory. You need the DSVM IP address, user name, and password to try out this example. Alternatively, you can use any virtual machine (VM) with Ubuntu and [Docker Engine](https://docs.docker.com/engine/) installed.
* An HDInsight Spark cluster with HDP version 3.6 and Spark version 2.1.x. We recommend a three-worker cluster with each worker having 16 cores and 112 GB of memory. Alternatively, choose VM type `D12 V2` for the head node and `D14 V2` for the worker nodes. Deploying the cluster takes around 20 minutes. You need the cluster name, SSH user name, and password to try out this example.
* An Azure Storage account. You can follow the [instructions](https://docs.microsoft.com/azure/storage/common/storage-create-storage-account) to create one. In this storage account, also create two private Blob containers named `fullmodel` and `onemonthmodel`. The storage account is used to save intermediate compute results and machine learning models. You need the storage account name and access key to try out this example.
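
The two Blob containers from the last prerequisite can also be created from a script rather than through the portal. The following is a minimal sketch, not part of the original project: it assumes the `azure-storage-blob` (v12) Python package is installed, and the helper names `connection_string`, `create_containers`, and `main`, as well as the environment variable names `AZURE_STORAGE_ACCOUNT` and `AZURE_STORAGE_KEY`, are hypothetical choices made for illustration.

```python
import os

# Container names required by this scenario.
CONTAINERS = ("fullmodel", "onemonthmodel")


def connection_string(account_name, account_key):
    """Build a storage connection string from the account name and access key."""
    return (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};"
        f"AccountKey={account_key};"
        "EndpointSuffix=core.windows.net"
    )


def create_containers(service_client, names=CONTAINERS):
    """Create each container that does not exist yet.

    `service_client` is expected to behave like
    azure.storage.blob.BlobServiceClient. Containers are created without a
    public access level, which keeps them private.
    Returns the list of names that were newly created.
    """
    created = []
    for name in names:
        container = service_client.get_container_client(name)
        if not container.exists():
            container.create_container()
            created.append(name)
    return created


def main():
    # Imported lazily so the helpers above work without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    # Assumption: credentials are supplied through these environment variables.
    account = os.environ["AZURE_STORAGE_ACCOUNT"]
    key = os.environ["AZURE_STORAGE_KEY"]
    client = BlobServiceClient.from_connection_string(
        connection_string(account, key)
    )
    print("Created:", create_containers(client))
```

Calling `main()` with the two environment variables set creates whichever of the two containers is missing. Because existing containers are skipped, running it a second time changes nothing.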