Many machine learning algorithms have one or more knobs, called hyperparameters. These knobs allow tuning of algorithms to optimize their performance over future data, measured according to user-specified metrics (for example, accuracy, AUC, RMSE). Data scientist needs to provide values of hyperparameters when building a model over training data and before seeing the future test data. How based on the known training data can we set up the values of hyperparameters so that the model has a good performance over the unknown test data? A popular technique for tuning hyperparameters is a grid search combined with cross-validation. Cross-validation is a technique that assesses how well a model, trained on a training set, predicts over the test set. Using this technique, initially we divide the dataset into K folds and then train the algorithm K times, in a round-robin fashion, on all but one of the folds, called held-out fold. We compute the average value of the metrics of K models over K held-out folds. This average value, called cross-validated performance estimate, depends on the values of hyperparameters used when creating K models. When tuning hyperparameters, we search through the space of candidate hyperparameter values to find the ones that optimize cross-validation performance estimate. Grid search is a common search technique, where the space of candidate values of multiple hyperparameters is a cross-product of sets of candidate values of individual hyperparameters. Grid search using cross-validation can be time-consuming. If an algorithm has 5 hyperparameters, each with 5 candidate values and we use K=5 folds, then to complete a grid search we need to train 5*5*5*5*5*6=15625 models. Fortunately, grid-search using cross-validation is an embarrassingly parallel procedure and all these models can be trained in parallel. This scenario shows how to use Azure Machine Learning Workbench to scale out tuning of hyperparameters of machine learning algorithms that implement scikit-learn API. We show how to configure and use a remote Docker container and Spark cluster as an execution backend for tuning hyperparameters.
The **detailed documentation** for this real world scenario includes the step-by-step walkthrough: https://docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-distributed-tuning-of-hyperparameters For code samples, click the "**View Project**" icon on the right and visit the project GitHub repo. Key components needed to run this scenario: 1. Ubuntu Data Science Virtual Machine. We recommend using a virtual machine with at least 8 cores and 28 Gb of memory. 2. Spark HDInsight cluster. We recommend having a cluster with at least 4 worker nodes and at least 28 Gb of memory in each node. 3. Azure storage account.