Data Simulator For Machine Learning

January 4, 2017


Simple data simulator for machine learning applications. For now only for binary classification.
# Table of contents

1. Introduction
2. Parameters of the Custom R Module for Data Simulation
3. Contribute
4. Implementation
5. Examples

# 1. Introduction

Simple data simulator for machine learning applications. It supports only binary classification and only numeric columns (or NA, see the Examples discussion below).

The generated dataset has feature dimensionality equal to (__noOfIrrelevantFeatures__ + __noOfRelevantFeatures__), and the total number of columns is (__noOfIrrelevantFeatures__ + __noOfRelevantFeatures__ + 1), i.e. the total number of features plus the label column (see the parameter descriptions below). (__noOfIrrelevantFeatures__ + __noOfRelevantFeatures__) should be larger than or equal to 1, so that the dataset has at least one feature column.

Relevant features are univariately correlated with the label column. Correlation directionality (i.e. a positive or negative correlation coefficient) is controlled by the __correlationDirectionality__ parameter(s). All __noOfIrrelevantFeatures__ irrelevant features are random draws generated independently of the label column (the implementation in section 4 uses __rnorm__ for them). __simulatedDataLabel__ column values are based on a separate __runif__ call, thresholded at the percentile value driven by the __positivesWeight__ parameter. Relevant features are rescaled to the (0,1)-like range of the values returned by the __runif__ function.

# 2. Parameters of the Custom R Module for Data Simulation

All arguments are optional:

- __noOfObservations__ (integer > 0, defaults to 100) sets the number of observations/rows.
- __noOfIrrelevantFeatures__ (integer >= 0, defaults to 1) sets the number of irrelevant features (named V1, V2 and so on).
- __noOfRelevantFeatures__ (integer >= 0, defaults to 1) sets the number of relevant features (named RelevCol1, RelevCol2 and so on).
- __simulatedDataLabelName__ (non-empty string, default="dataLabel"; the R function in section 4 uses "simulatedDataLabel" as its default) sets the name of the label column.
- __positivesWeight__ (float > 0, defaults to 20). If in the (0,1) range, it is treated as a ratio (e.g. .15 means 15% of __noOfObservations__ are from the positive class).
  Values >= 1 are coerced to integer (and also capped at __noOfObservations__-1) and indicate the number of positive class examples (e.g. 15 means the data has 15 positive class samples and the other __noOfObservations__-15 are negative class samples). The meaning of the negative and positive classes is arbitrary. To match a failure prediction/detection scenario, positive samples have "FALSE" label values and negative samples have "TRUE" label values.
- __noiseAmplitude__ (default="0.03"), comma-separated floats or ints. It can be a scalar or an array that will be recycled to match __noOfRelevantFeatures__. This controls how correlated the relevant features are with the label column (smaller noise means larger correlation), and how correlated the relevant features are among themselves (the larger the noise, the more dissimilar the relevant features are from each other).
- __correlationDirectionality__ (default="1"), comma-separated floats or ints. It can be a scalar or an array that will be recycled to match __noOfRelevantFeatures__. This controls the correlation directionality (i.e. positive or negative) between the relevant features and the label column. It defaults to 1, i.e. all relevant features will be positively correlated with the label values. Absolute amplitude values are ignored; only the signs are used (0 means positive correlation).
- __seedValue__ (default=1000), integer used for reproducibility of results.

# 3. Contribute

- Right now only binary classification is implemented; multiclass and regression capability should be added.
- Maybe add clustering properties similar to sklearn.datasets.make_classification.
- All feature columns are numeric. Categorical features should also be added, but defining their joint histogram with the label column is not easy to implement in a user-friendly way.
- Time column(s) may also be useful for time series analysis. Simulator complexity increases a lot, since time trends would have to be defined (and implemented).

# 4. Implementation

```{r main function for data simulator custom R module, include = TRUE}
getSimulatedData <- function(
    noOfObservations = 100,
    noOfIrrelevantFeatures = 1,
    noOfRelevantFeatures = 1,
    simulatedDataLabelName = "simulatedDataLabel",
    positivesWeight = 20,
    noiseAmplitudes = "0.03",
    correlationDirectionality = "1",
    seedValue = 1000) {
  library("data.table")  # provides setnames()
  set.seed(seedValue)

  # positivesWeight is either a ratio (if < 1) or the absolute number of positives
  desiredNumberOfPositives <- positivesWeight
  if (positivesWeight < 1) {
    desiredNumberOfPositives <- round(abs(positivesWeight) * noOfObservations, 0)
  } else if (positivesWeight >= noOfObservations) {
    desiredNumberOfPositives <- noOfObservations - 1
  }

  # string to numeric for noiseAmplitudes and correlationDirectionality
  noiseAmplitude <- as.numeric(unlist(strsplit(gsub(" ", "", noiseAmplitudes), split = ",")))
  correlationDirectionality <- as.numeric(unlist(strsplit(gsub(" ", "", correlationDirectionality), split = ",")))

  # expand (recycle) both arrays to noOfRelevantFeatures long arrays
  # NOTE: these two calls are garbled in the source listing; rep_len() is a reconstruction
  noiseAmplitudes <- rep_len(noiseAmplitude, noOfRelevantFeatures)
  correlationDirectionality <- rep_len(sign(correlationDirectionality), noOfRelevantFeatures)
  correlationDirectionality[correlationDirectionality == 0] <- 1

  # numeric array used to generate the correlated relevant features and,
  # through thresholding, the label column
  labelColumnNumeric <- runif(noOfObservations)

  # label column: thresholded labelColumnNumeric (the numeric values are then discarded)
  labelThreshold <- quantile(labelColumnNumeric,
                             probs = desiredNumberOfPositives / noOfObservations)
  labelColumn <- setnames(
    data.frame(labelcol = as.factor(labelColumnNumeric > labelThreshold)),
    simulatedDataLabelName)

  # relevant features
  # NOTE: the loop header is garbled in the source listing; do.call/lapply is a reconstruction
  relevantFeatures <- do.call(cbind, lapply(
    seq_len(noOfRelevantFeatures),
    function(crtRelevantColumnCounter) {
      crtNoise <- runif(noOfObservations)
      desiredRange <- range(crtNoise)  # will reuse this range for the current feature
      # add noise to labelColumnNumeric
      crtRelevantColValues <-
        correlationDirectionality[crtRelevantColumnCounter] * labelColumnNumeric +
        noiseAmplitudes[crtRelevantColumnCounter] * (crtNoise - .5)
      # reuse the (0,1)-ish range of the original noise for the simulated values
      oldRange <- range(crtRelevantColValues)
      crtRelevantColValues <- (crtRelevantColValues - oldRange[1]) / (oldRange[2] - oldRange[1]) *
        (desiredRange[2] - desiredRange[1]) + desiredRange[1]
      # return the relevant column
      setnames(data.frame(someCol = crtRelevantColValues),
               paste0("RelevCol", crtRelevantColumnCounter))
    }))

  # irrelevant features; as.data.frame() names the columns V1, V2, ...
  irrelevantFeatures <- as.data.frame(matrix(
    rnorm(noOfIrrelevantFeatures * noOfObservations),
    noOfObservations, noOfIrrelevantFeatures))

  trainingData <- cbind(labelColumn, relevantFeatures, irrelevantFeatures)
  print(summary(trainingData))
  trainingData
}
```

# 5. Examples

## Simulated Data properties

The following experiment setup:

<sub> Data simulator in Azure ML Experiment </sub>

is equivalent to this code:

    source("DataSimulatorForMachineLearning.R")
    seedValue <- 1000
    noOfObservations <- 200
    noOfIrrelevantFeatures <- 2
    noOfRelevantFeatures <- 7
    simulatedDataLabelName <- "class label"
    desiredNumberOfPositives <- 70
    noiseAmplitudes <- "0 , .1,10"
    correlationDirectionality <- "1,-1,2,-56"
    simulatedData <- getSimulatedData(
      noOfObservations, noOfIrrelevantFeatures, noOfRelevantFeatures,
      simulatedDataLabelName = simulatedDataLabelName,
      positivesWeight = desiredNumberOfPositives,
      noiseAmplitudes = noiseAmplitudes,
      correlationDirectionality = correlationDirectionality,
      seedValue = seedValue)

## Module log output

The custom R module output log file shows the class distribution of the label column (as part of the training data summary):

    FALSE: 70
    TRUE :130

<sub> Data simulator Log file location in Azure ML Experiment </sub>

## Visualization of Simulated data

Simulated data can be visualized by right-clicking the output of the custom R module:

<sub> Visualization of Simulated Data in Azure ML Experiment, showing the good correlation between label column and RelevCol1 feature.
</sub>

While the output of the data simulator can be used for machine learning tasks like training a Classification module, we can use correlation analysis to understand the relationship between the label and feature columns. A simple way to do this in Azure Machine Learning (AML) Studio is to use a "Filter Based Feature Selection" module. This module is easy to configure: the user just has to choose the feature-label similarity metric (__Feature scoring method__: Pearson Correlation) and the __Target column__ (using the GUI column selector) to match the name used in the simulated data. The __Number of desired features__ field can be left at 1, since we are not doing feature selection but will look at all feature correlation values in the right-side module output.

The "Filter Based Feature Selection" module analysis is equivalent (up to correlation sign) to these R command lines:

    corrResults <- cor(
      as.matrix(as.numeric(simulatedData[, c(simulatedDataLabelName)])),
      as.matrix(simulatedData[, !names(simulatedData) %in% c(simulatedDataLabelName)]))
    corrResults

    > corrResults
         RelevCol1  RelevCol2 RelevCol3  RelevCol4 RelevCol5 RelevCol6 RelevCol7          V1         V2
    [1,] 0.8496581 -0.8512635 0.1168883 -0.8496581 0.8470608 0.0794509 0.8496581 -0.04198693 0.04798775

The AML correlation analysis results can be seen by right-clicking the right output of the "Filter Based Feature Selection" module:

<sub> Visualization of Correlation Analysis of Simulated Data in Azure ML Experiment, showing the good correlation between label column and RelevCol1 feature.
</sub>

## Simulated data properties

Correlation analysis shows the properties of the simulated data:

- RelevCol1, RelevCol4, and RelevCol7 have identical (absolute value) correlation values with the label column. The three-value __noiseAmplitude__ array parameter of the data simulator is recycled across all 7 relevant columns, so its first value (0) is reused for relevant columns 1, 4, and 7. __NOTE__: Missing values in the __noiseAmplitude__ array parameter (e.g. the second value in the four-value sequence "0, , 2.1, 7") will create full NA columns in the dataset.
- Unlike RelevCol1 and RelevCol7, RelevCol4 is anti-correlated with the label column, since the __label-features correlation directionality__ array parameter of the data simulator has four values and the value for feature 4 is negative (-56). This result is not visible in the "Filter Based Feature Selection" module, since the correlation sign is irrelevant for feature importance analysis.
- RelevCol2 and RelevCol5 have correlation values that differ from, but are similar to, those of RelevCol1, RelevCol4, and RelevCol7, because of their small noise value (.1) in the __noiseAmplitude__ array parameter.
- RelevCol3 and RelevCol6 have correlation values similar to those of the irrelevant V1 and V2 columns, because of their large noise value (10) in the __noiseAmplitude__ array parameter.
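
The recycling behaviour described in the bullets above can be sketched in a few lines of R. This is a minimal illustration, not the module's code: the helper name `recycleParameter` is hypothetical, and the use of `rep_len` for the recycling step is an assumption. The parsing step (`gsub`, `strsplit`, `as.numeric`) follows the implementation in section 4.

```r
# Hypothetical helper (not part of the module): parse a comma-separated
# parameter string and recycle it to one value per relevant feature.
recycleParameter <- function(csvString, noOfRelevantFeatures) {
  values <- as.numeric(unlist(strsplit(gsub(" ", "", csvString), split = ",")))
  rep_len(values, noOfRelevantFeatures)  # assumption: recycling via rep_len()
}

# Three noise values recycled over 7 relevant features:
# columns 1, 4 and 7 all receive the first value (0).
noiseAmplitudes <- recycleParameter("0 , .1,10", 7)
print(which(noiseAmplitudes == 0))

# Four directionality values recycled over 7 features; only the sign is kept,
# so feature 4 (value -56) ends up anti-correlated with the label.
directions <- sign(recycleParameter("1,-1,2,-56", 7))
print(directions)

# An empty entry in the string parses to NA, which is what produces
# a full NA column for the corresponding relevant feature.
withMissing <- recycleParameter("0, , 2.1, 7", 4)
print(is.na(withMissing))
```

Under these assumptions, features that share a recycled noise value also share the same absolute correlation with the label, which matches the grouping seen in the correlation results (1, 4, 7; then 2, 5; then 3, 6).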