Stacking Ensemble - Titanic Kaggle Dataset

February 10, 2017
A machine learning model built with the stacking technique of ensembles, popularly called a stacking ensemble, on the Kaggle Titanic dataset.
I am going to show my Azure ML experiment on the Titanic: Machine Learning from Disaster dataset from Kaggle. The objective is to create a machine learning model that predicts the survival of passengers of the RMS Titanic, whose sinking is one of the most infamous events in history. Of the many branches of machine learning, the one I will work on is supervised learning, specifically classification, where we assign each record to one of a predetermined set of categories. So, the goal here is to correctly classify which passengers survived and which did not. The dependent variable, or variable of interest, is “Survived”. Its possible values are 0 and 1: 0 is the negative class and indicates that the passenger did not survive, and 1 is the positive class and indicates that the passenger survived the Titanic disaster.

Based on the given data, my goal is to build an ensemble classifier using the stacking technique, popularly called a stacking ensemble, and analyze how it performs compared to the individual classification algorithms. In stacking, we use the outputs of several models as features and train another model on those outputs to predict the target class. In this case study, I am going to use the class probabilities from each individual classifier and fit a logistic regression on those probabilities. The class probabilities will come from individual classifiers such as Two-class Bayes Point Machine, Two-class Boosted Decision Tree, Two-class Averaged Perceptron and Two-class Decision Forest. I am going to preprocess and clean the data using the built-in capabilities of Microsoft Azure ML Studio and Python.
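As a rough illustration of the Python side of the preprocessing, the sketch below extracts a title from each name, adds a family size, and fills missing ages with the median of the matching Sex/Pclass/Title group. These are steps from the preprocessing list in this post. It is a minimal standard-library sketch, not the code from the experiment: the sample rows, the regular expression, the helper name extract_title and the fallback to the overall median when a group has no known ages are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from statistics import median

# Hypothetical sample rows (not real passengers); keys follow the Kaggle column names.
passengers = [
    {"Name": "Doe, Mr. John", "Sex": "male", "Pclass": 3, "Age": 30.0, "SibSp": 0, "Parch": 0},
    {"Name": "Doe, Mr. James", "Sex": "male", "Pclass": 3, "Age": 40.0, "SibSp": 1, "Parch": 0},
    {"Name": "Roe, Mr. Richard", "Sex": "male", "Pclass": 3, "Age": None, "SibSp": 0, "Parch": 0},
    {"Name": "Roe, Mrs. Jane", "Sex": "female", "Pclass": 1, "Age": 50.0, "SibSp": 1, "Parch": 2},
    {"Name": "Poe, Miss. Mary", "Sex": "female", "Pclass": 2, "Age": None, "SibSp": 0, "Parch": 1},
]

# Assumed pattern: the title sits between the comma and the first period.
TITLE_RE = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title found in a 'Surname, Title. Given names' string."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else "Unknown"


# New fields: Title (from the name) and Family_Size (SibSp + Parch).
for p in passengers:
    p["Title"] = extract_title(p["Name"])
    p["Family_Size"] = p["SibSp"] + p["Parch"]

# Collect the known ages of every (Sex, Pclass, Title) group.
ages_by_group = defaultdict(list)
for p in passengers:
    if p["Age"] is not None:
        ages_by_group[(p["Sex"], p["Pclass"], p["Title"])].append(p["Age"])

group_median = {key: median(ages) for key, ages in ages_by_group.items()}
# Fallback (an assumption of this sketch): overall median when a group has no known age.
overall_median = median([age for ages in ages_by_group.values() for age in ages])

# Fill each missing age with the median of its group.
for p in passengers:
    if p["Age"] is None:
        key = (p["Sex"], p["Pclass"], p["Title"])
        p["Age"] = group_median.get(key, overall_median)

for p in passengers:
    print(p["Name"], p["Title"], p["Family_Size"], p["Age"])
```

With pandas, the same idea is usually written as a group-by followed by a per-group median fill; the plain-Python version is used here only to keep the sketch free of dependencies.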
I am going to use cross validation over a range of parameters to tune the hyperparameters of the algorithms, and I will also use cross validation to report accuracies and their confidence intervals. I will use precision and recall along with accuracy to compare the models.

Data Description

Dataset: Titanic: Machine Learning from Disaster
Dataset Format: Comma Separated Values (csv)
Source: Kaggle
Link to dataset:

Description:
PassengerId: ID of the embarked passenger on the RMS Titanic
Pclass: Passenger class. Values: 1 = 1st, 2 = 2nd, 3 = 3rd
Name: Name of passenger
Sex: Sex of passenger. Values: Male, Female
Age: Age of passenger
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation. Values: C = Cherbourg, Q = Queenstown, S = Southampton
Survived: Whether the passenger survived. Values: 0 = No, 1 = Yes

Data cleaning, preprocessing and feature engineering steps:
1) Fill the missing values in the “Embarked” column with the mode, “S” (Southampton).
2) Fill the missing values in the “Fare” column with the median.
3) The “Cabin” column has about 1014 missing values, so I recode it with just two values based on whether a value is present: 0 = missing, 1 = present.
4) The names contain a lot of titles, so I process the name field with a regular expression and save the title in a separate new field, “Title”.
5) Fill the missing values in the “Age” column based on “Sex”, “Pclass” and “Title”: I group the data by these three fields and fill each missing age with the median age of the group that record belongs to.
The reason for grouping before taking the median is that “Sex”, “Pclass” and “Title” together give a better estimate of age than simply taking the overall median, or grouping by only one of these fields and then taking the median.
6) Create a new variable, “Family_Size”, which is the sum of the values in “SibSp” and “Parch” for each record.
7) One-hot (dummy) encode the variables “Embarked”, “Title”, “Pclass”, “Family_Size” and “Sex”, which are categorical in nature.
8) Apply min-max normalization to the numeric/continuous variables “Age” and “Fare”.
9) Remove “PassengerId” and “Name”, as they provide no information for the classification task.

After data preprocessing and feature engineering, I separate out the labeled training samples. The Kaggle test data does not have labels, so it cannot be used for modelling or for checking the accuracy of the models.

Moving forward, I use 80% of the labeled data for training and 20% for testing. Of the 80% training data, 80% is used for tuning the model hyperparameters with cross validation and the remaining 20% serves as a validation set for that tuning. I then measure the accuracy of each model over the whole training dataset (80%) from the first split and over the testing dataset (20%) from the first split.

The machine learning algorithms I used for this case study are:
1) Two-class boosted decision tree
2) Two-class logistic regression
3) Two-class Bayes point machine
4) Two-class averaged perceptron

In the stacking ensemble technique, I use the “Scored Probabilities” from each model and build a logistic regression model over the “Scored Probabilities” from the four models to predict the class label and measure its accuracy. I also tune the hyperparameters of the ensemble model.
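The stacking setup described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Azure ML experiment: the data is synthetic, a small gradient-descent logistic regression stands in for every learner, and the two base models (each seeing a single feature) are hypothetical stand-ins for the four Azure ML classifiers. What it shows is the mechanism: base models produce class probabilities, and a logistic regression is trained on those probabilities.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logreg(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent. Returns [bias, w1, ...]."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, X):
    """Probability of the positive class for each row of X."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


def select(X, cols):
    """Keep only the given feature columns of X."""
    return [[row[c] for c in cols] for row in X]


# Synthetic data (assumption): two features in [0, 1], label 1 when their sum exceeds 1.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(300)]
y = [1 if a + b > 1.0 else 0 for a, b in X]

# 80/20 train/test split, mirroring the first split in the post.
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Hypothetical base models: each one sees only a single feature column.
feature_sets = [[0], [1]]
base_models = [fit_logreg(select(X_train, cols), y_train) for cols in feature_sets]


def meta_features(X):
    """One column of scored probabilities per base model."""
    columns = [predict_proba(w, select(X, cols))
               for w, cols in zip(base_models, feature_sets)]
    return [list(row) for row in zip(*columns)]


meta_train = meta_features(X_train)
meta_test = meta_features(X_test)

# Meta-learner: logistic regression on the base models' scored probabilities.
meta_model = fit_logreg(meta_train, y_train, lr=1.0, epochs=500)
stack_proba = predict_proba(meta_model, meta_test)
stack_pred = [1 if p >= 0.5 else 0 for p in stack_proba]

accuracy = sum(p == t for p, t in zip(stack_pred, y_test)) / len(y_test)
print(f"stacked accuracy on the held-out 20%: {accuracy:.3f}")
```

The sketch trains the meta-model on probabilities scored on the training data itself. A common variant builds the meta-features from out-of-fold predictions instead, which reduces the chance that the meta-model simply learns how well the base models fit their own training data.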
The details of the algorithms, the range of parameters used, the tuned hyperparameters and the experimental results for each of the algorithms are:

1) Two-class Boosted Decision Tree
A two-class boosted decision tree is an ensemble of decision trees built by boosting: the second tree corrects the errors of the first tree, the third tree corrects the errors of the first and second trees, and so on. Trees are added in this fashion until a stopping criterion is reached, and predictions are based on the entire ensemble of trees. The two-class boosted decision tree implementation in Microsoft Azure ML Studio is memory-intensive, because the whole tree is held in memory. It would create memory bottlenecks if it were not being executed on the Azure cloud platform, and it might not be able to handle very large datasets that linear learners can. Visualization can also be a problem for two-class boosted decision trees in Azure ML Studio when the trees are large or numerous. In our case, I was not able to visualize the best selected model and its parameters, so I am not reporting them.

2) Two-class Logistic Regression
This creates a logistic regression model, which predicts the probability of occurrence of an event by using a logistic function. Details of the implementation can be found here:

3) Two-class Bayes Point Machine
The Bayes Point Machine algorithm in Microsoft Azure ML Studio is a Bayesian approach to classification. It approximates the theoretically optimal Bayesian average of linear classifiers (in terms of generalization performance) by choosing one "average" classifier, the Bayes Point. Because the Bayes Point Machine is a Bayesian classification model, it is not prone to overfitting to the training data. It doesn’t require parameter tuning and doesn’t need the data to be normalized.
Number of training iterations: 30

4) Two-class Averaged Perceptron
The two-class averaged perceptron in Microsoft Azure ML Studio is a simple version of a neural network.
In this method, inputs to the algorithm are classified into various possible outputs based on a linear function, and then they are combined with a set of weights that are derived from the feature vector, hence the name. Perceptrons are fast and can learn linearly separable patterns.

5) Stacking Ensembles
Here I implement the stacking technique for building an ensemble of classifiers. I take the “Scored Probabilities” for the training and testing datasets from each of the scored models and then model them with a Two-class Logistic Regression, tuning its hyperparameters with the same method I used for the individual classifiers. So, the logistic regression takes four features as input:
Add(Scored Probabilities_$0) comes from Two-class Boosted Decision Tree
Add(Scored Probabilities (2)_$0) comes from Two-class Logistic Regression
Add(Scored Probabilities (3)_$0) comes from Two-class Bayes Point Machine
Add(Scored Probabilities (2) (2)_$0) comes from Two-class Averaged Perceptron

Conclusion

The aim of the experiment was to learn how to implement an ensemble technique. The one I showcased here is stacking, in which we take the scored probabilities from each of the individual classifiers and use them as features for a logistic regression model, and we see increased performance from the ensemble classifier.

So, looking at the performance metrics for each of the algorithms and for the ensemble, together with the experimental analysis, we can say that the stacking ensemble technique does improve classification accuracy as well as precision and recall, and generalizes well to the testing dataset. The same approach can be applied to bigger and more complicated datasets as well.