Develop a model that uses various network features to detect which network activities are part of an intrusion/attack.
# Binary Classification: Network Intrusion Detection

In this experiment we use various network features to detect which network activities are part of an intrusion/attack.

## Dataset

We used a modified [dataset](http://nsl.cs.unb.ca/NSL-KDD/) from the KDD Cup 1999. The dataset includes both training and test sets. Each row contains features describing a network activity and a label giving the type of activity. All activities except one (with the value 'normal') indicate a network intrusion.

The training set has approximately 126K examples. It has 41 feature columns, a label column and an auxiliary 'diff-level' column that estimates the difficulty of correctly classifying a given example (see the reference below for a detailed description of this column). The feature columns are mostly numeric, with a few string/categorical features. The test set has approximately 22.5K examples, with the same 43 columns as the training set.

We upload the training and test sets into [Azure blob storage](http://azure.microsoft.com/en-us/services/storage/) using the following PowerShell commands:

```powershell
Add-AzureAccount
$key = Get-AzureStorageKey -StorageAccountName <your storage account name>
$ctxt = New-AzureStorageContext -StorageAccountName $key.StorageAccountName -StorageAccountKey $key.Primary
Set-AzureStorageBlobContent -Container <container name in your storage account> -File "network_intrusion_detection.csv" -Context $ctxt
Set-AzureStorageBlobContent -Container <container name in your storage account> -File "network_intrusion_detection_test.csv" -Context $ctxt
```

For the purpose of this sample experiment we uploaded the files to the 'datasets' public container of the 'azuremlsampleexperiments' storage account.

## Data preprocessing

We import the training set into Studio using the **Reader** module with the following parameters:

![Reader][reader]

Note that to read from public blob storage we choose the authentication type 'PublicOrSAS'. The test set is imported in a similar way.
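Outside Studio, the uploaded CSV can be inspected with a few lines of Python. Below is a minimal sketch using pandas on a toy excerpt; the column names shown are assumptions based on the NSL-KDD schema, not taken from this sample's files.

```python
import io
import pandas as pd

# Tiny stand-in for network_intrusion_detection.csv: a handful of
# NSL-KDD-style columns (names are assumed, not from this sample).
csv_text = """duration,protocol_type,src_bytes,class,diff-level
0,tcp,491,normal,20
0,udp,146,snmpguess,15
0,tcp,0,neptune,19
"""

df = pd.read_csv(io.StringIO(csv_text))

# The real training set has 43 columns: 41 features + 'class' + 'diff-level'.
print(df.shape)              # (3, 5) for this toy excerpt
print(df['class'].unique())  # every value except 'normal' is an attack
```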
The original label column, called 'class', has many values of string type. Each string value corresponds to a different attack. Some attacks have few examples in the training set, and the test set contains new attacks. To simplify this sample experiment we build a model that does not distinguish between different types of attacks. For this purpose we replace the 'class' column with a binary column that is 1 if an activity is normal and 0 if it is an attack.

Studio provides built-in modules to ease this preprocessing step. The binarization of the 'class' column is achieved by using **Metadata Editor** to change the type of the 'class' column to categorical, obtaining a binary column with the **Indicator Values** module, and selecting the 'class-normal' column with the **Project Columns** module. This sequence of steps is shown below:

![Label processing][label_processing]

We apply this transformation to both the training and test sets.

## Comparison of classifiers

We compare 2 machine learning algorithms: **Two-Class Logistic Regression** and **Two-Class Boosted Decision Tree**. We also compare two different training sets: the first with the original 41 features and the second with the 15 most important features found by the **Filter Based Feature Selection** module. The parameters of this module are shown below:

![Feature selection][feature_selection]

For every combination of learning algorithm and training set we train a model and generate predictions using the following sequence of steps, illustrated below:

![Training][training]

1. Split the training set into 5 folds. This is done using the **Partition and Sample** module with the 'Partition or Sample mode' option set to 'Assign to Folds'.
2. Do 5-fold cross-validation over the training set. This and the next 2 steps are done by the **Sweep Parameters** module. We connect the partitioned training set to the 'training dataset' input of **Sweep Parameters**. Since we use **Sweep Parameters** in cross-validation mode, we leave the module's right output unconnected.
3.
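The label binarization performed by the Studio modules can be sketched in Python with pandas, for illustration only; `get_dummies` plays the role of **Indicator Values**, and keeping just the 'class-normal' column plays the role of **Project Columns**.

```python
import pandas as pd

# Toy 'class' column; every value other than 'normal' is an attack.
df = pd.DataFrame({'class': ['normal', 'neptune', 'smurf', 'normal']})

# One-hot encode 'class' (analogous to Indicator Values), then keep only
# the 'class-normal' indicator: 1 for normal activity, 0 for an attack.
indicators = pd.get_dummies(df['class'], prefix='class', prefix_sep='-')
df['class-normal'] = indicators['class-normal'].astype(int)
df = df.drop(columns=['class'])

print(df['class-normal'].tolist())  # [1, 0, 0, 1]
```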
Find the best hyperparameters of the learning algorithm on the given training set. We would like to use the AUC metric to evaluate the performance of our model, so we set the 'Metric for measuring performance for classification' option of **Sweep Parameters** to 'AUC'.
4. Train the learning algorithm on the training set using the best hyperparameter values from the previous step.
5. Score the test set using the **Score Model** module.
6. Compute AUC over the test set. We use **Evaluate Model** to compute various metrics and **Project Columns** to extract the AUC values.

Having computed AUCs for all 4 combinations of learning algorithm and training set, we use an **Execute R Script** module to generate a table that summarizes all results. This module has the following R code:

```r
dataset1 <- maml.mapInputPort(1)
dataset2 <- maml.mapInputPort(2)
data.set <- data.frame(c("Logistic Regression, all features",
                         "Boosted Decision Tree, all features",
                         "Logistic Regression, 15 features",
                         "Boosted Decision Tree, 15 features"),
                       rbind(dataset1, dataset2))
names(data.set) <- c("Algorithm, features", "AUC")
maml.mapOutputPort("data.set")
```

## Results

The final output of the experiment is the left output of the last **Execute R Script** module:

![Results][results]

We conclude that Boosted Decision Tree, trained with all available features, achieves the best AUC.

## References

M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, "A Detailed Analysis of the KDD CUP 99 Data Set," Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.
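As an aside, the per-combination pipeline described in this experiment (cross-validated hyperparameter sweep selecting by AUC, refit on the full training set, score, evaluate) can be approximated outside Studio with scikit-learn. This is a sketch on synthetic data, not the Studio implementation; `GradientBoostingClassifier` merely stands in for **Two-Class Boosted Decision Tree**, and the hyperparameter grids are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the NSL-KDD data: 41 numeric features,
# binary label (1 = normal, 0 = attack).
X, y = make_classification(n_samples=2000, n_features=41, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

learners = {
    'Logistic Regression': (LogisticRegression(max_iter=1000),
                            {'C': [0.01, 0.1, 1.0, 10.0]}),
    'Boosted Decision Tree': (GradientBoostingClassifier(random_state=0),
                              {'n_estimators': [50, 100]}),
}

results = {}
for name, (estimator, grid) in learners.items():
    # Steps 1-4: 5-fold cross-validated sweep selecting hyperparameters
    # by AUC, then refit on the full training set with the best values.
    sweep = GridSearchCV(estimator, grid, cv=5, scoring='roc_auc')
    sweep.fit(X_train, y_train)
    # Steps 5-6: score the held-out set and compute AUC.
    results[name] = roc_auc_score(y_test, sweep.predict_proba(X_test)[:, 1])

for name, auc in results.items():
    print(f'{name}: test AUC = {auc:.3f}')
```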
<!-- Images -->
[reader]:http://az712634.vo.msecnd.net/samplesimg/v1/3/reader.PNG
[experiment]:http://az712634.vo.msecnd.net/samplesimg/v1/3/experiment.PNG
[label_processing]:http://az712634.vo.msecnd.net/samplesimg/v1/3/label_processing.PNG
[feature_selection]:http://az712634.vo.msecnd.net/samplesimg/v1/3/feature_selection.PNG
[training]:http://az712634.vo.msecnd.net/samplesimg/v1/3/training.PNG
[results]:http://az712634.vo.msecnd.net/samplesimg/v1/3/results.PNG