Step 1. Convert Dataset to VW Format

September 15, 2016

Convert the Adult Income dataset into Vowpal Wabbit format, split it into training and validation sets, and write them to Azure blob.
This is step 1 in the [Vowpal Wabbit Samples Collections](https://gallery.cortanaintelligence.com/Collection/Vowpal-Wabbit-Samples-2). VW expects a [special file format](https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format). This example shows how to take the beloved Adult Census Income dataset and convert it to the VW format using Python scripts. It then splits the resulting file into a training set and a validation set, and finally stores them as two blobs in Azure blob storage.

![screenshot](http://az754797.vo.msecnd.net/docs/vw/vw_convert.png)

**NOTE**: This approach works well for smaller datasets. However, VW, being an out-of-core learning system, supports online training over very large datasets, well beyond the 10 GB limit allowed by most Azure ML modules. If you have large datasets (say, over 5 GB), it is a good idea to do the conversion outside of Azure ML and make the converted datasets available in Azure blob storage for training and/or scoring. Otherwise the Python script module is likely to run out of memory. Here are some scripts to help you with that, using [Perl](https://github.com/JohnLangford/vowpal_wabbit/blob/master/utl/csv2vw) and [Python](https://github.com/zygmuntz/phraug2/blob/master/csv2vw.py).

Once the data is converted and written to Azure blob storage, we can proceed to step 2, [training and evaluating a VW model](https://gallery.cortanaintelligence.com/Experiment/Train-and-Evaluate-a-VW-Model-1).

Note that the storage account key has been cleared from this experiment. If you want to open it in Studio and try it, you need to configure the Export Data modules with a valid storage account name and key.

The following is the Python code in the Execute Python Script module that converts the input dataframe to VW format. Note that for binary classification, the labels must be -1 and 1.

    # convert a dataframe into VW format
    import pandas as pd
    import numpy as np

    def azureml_main(inputDF):
        labelColName = 'income'
        trueLabel = '>50K'
        colsToExclude = ['workclass', 'occupation', 'native-country']
        numericCols = ['fnlwgt']
        output = convertDataFrameToVWFormat(inputDF, labelColName, trueLabel, colsToExclude, numericCols)
        return output

    def convertDataFrameToVWFormat(inputDF, labelColName, trueLabel, colsToExclude, numericCols):
        # remove whitespace, '|' and ':' ('|' and ':' are special characters in VW)
        def clean(s):
            return "".join(s.split()).replace("|", "").replace(":", "")

        def parseRow(row):
            line = []
            # convert labels to 1s and -1s and add to the beginning of the line
            line.append("{} |".format('1' if row[labelColName] == trueLabel else '-1'))
            for colName in featureCols:
                if (colName in numericCols):
                    # format numeric features as name:value
                    line.append("{}:{}".format(colName, row[colName]))
                else:
                    # format string features as a single cleaned token
                    line.append(clean(str(row[colName])))
            vw_line = " ".join(line)
            return vw_line

        # drop columns we don't need; drop() returns a new dataframe,
        # so the result must be assigned back
        inputDF = inputDF.drop(colsToExclude, axis = 1)
        # select feature columns (everything except the label)
        featureCols = [c for c in inputDF.columns if c != labelColName]
        # parse each row into one VW-formatted line
        output = inputDF.apply(parseRow, axis = 1).to_frame()
        return output

Here are the [two converted files](http://azuremluxcdnprod001.blob.core.windows.net/docs/vw/income.zip) if you just want to download them.

At this point, you are ready to proceed to the next step, [train and evaluate a VW model using these datasets](https://gallery.cortanaintelligence.com/Experiment/Train-a-VW-Model-with-Small-Dataset-1).
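For datasets too large for the Execute Python Script module, the same conversion and split can be done row by row outside Azure ML. The following is a minimal sketch of that idea, not the code used in this experiment. The names `row_to_vw` and `convert_and_split` are hypothetical. The 25% validation fraction and the fixed random seed are assumptions, since the split ratio is not stated here. Stripping whitespace around the label is also an assumption, made because raw CSV exports often pad values.

```python
import csv
import io
import random


def clean(value):
    # Remove whitespace plus '|' and ':', which are special characters in VW.
    return "".join(str(value).split()).replace("|", "").replace(":", "")


def row_to_vw(row, label_col, true_label, numeric_cols=(), exclude_cols=()):
    # row maps column name -> raw string value, in column order.
    # Assumption: labels may carry padding, so strip before comparing.
    label = "1" if row[label_col].strip() == true_label else "-1"
    parts = [label + " |"]
    for col, value in row.items():
        if col == label_col or col in exclude_cols:
            continue
        if col in numeric_cols:
            # Numeric features are written as name:value.
            parts.append("{}:{}".format(clean(col), clean(value)))
        else:
            # String features are written as a single cleaned token.
            parts.append(clean(value))
    return " ".join(parts)


def convert_and_split(src, train_out, valid_out, label_col, true_label,
                      numeric_cols=(), exclude_cols=(),
                      valid_fraction=0.25, seed=0):
    # Stream rows from src (an open CSV file with a header row) and write
    # each converted line to either the training or the validation output,
    # so the whole dataset never has to fit in memory.
    rng = random.Random(seed)
    counts = {"train": 0, "valid": 0}
    for row in csv.DictReader(src):
        line = row_to_vw(row, label_col, true_label, numeric_cols, exclude_cols)
        if rng.random() < valid_fraction:
            valid_out.write(line + "\n")
            counts["valid"] += 1
        else:
            train_out.write(line + "\n")
            counts["train"] += 1
    return counts


# Small in-memory demonstration with a made-up row in the Adult dataset's shape.
example = {"age": "39", "workclass": "State-gov", "fnlwgt": "77516",
           "education": "Bachelors", "income": "<=50K"}
print(row_to_vw(example, "income", ">50K",
                numeric_cols={"fnlwgt"}, exclude_cols={"workclass"}))
# prints: -1 | 39 fnlwgt:77516 Bachelors
```

With real files, `src`, `train_out`, and `valid_out` would be file objects from `open(...)`. The two output files can then be uploaded to Azure blob storage with whatever tool you normally use.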