Recommend Modules for My Data

November 8, 2016


Recommends modules to clean and transform a dataset.
You can use the Recommend Modules for My Data module to find the right modules to clean and transform your dataset. As an example, consider a messy dataset. You can find the sample experiment with this dataset [here][1].

![Screenshot of messy data][2]

All columns in this dataset, except the first two, have some data quality issue: missing values, NaN or Inf values, too many unique values, free-form text fields that have to be pre-processed, or a few strings mixed in with numeric data. To get module recommendations for fixing these issues, we simply pass the data to the module:

![Dataset passed to the module][3]

By clicking the output port of the module, we can view the list of module recommendations, the columns that should be transformed, and a more detailed description of each issue and the recommended action.

![Module recommendations output][4]

The module checks the columns in the dataset for the following issues:

- Some missing values (<=50%)
- Lots of missing values (>50%)
- NaN values
- Inf values
- Large proportion of non-numeric unique values that may lead to overfitting
- Whitespace-tokenized text content
- Constant data that has no predictive value
- Few string values mixed in with otherwise numeric data

The R source code of the module is available on GitHub [here][5].

[1]: http://gallery.cortanaintelligence.com/Experiment/Recommend-Modules-for-My-Data-1
[2]: https://ctpsamples.blob.core.windows.net/customoduleimages/recommendmodules1.PNG
[3]: https://ctpsamples.blob.core.windows.net/customoduleimages/recommendmodules2.PNG
[4]: https://ctpsamples.blob.core.windows.net/customoduleimages/recommendmodules3.PNG
[5]: https://github.com/Azure/custmod/tree/master/RecommendModulesForMyData
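
To make the checks listed above more concrete, here is a minimal Python sketch of how such per-column checks could look. It is not the module's actual R implementation. The function names (`check_column`, `recommend`), the issue labels, and the `unique_ratio` and `mixed_ratio` thresholds are all assumptions made for illustration; only the 50% missing-value split comes from the description above.

```python
import math


def _is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def _as_number(value):
    """Return the value as a float if it is numeric or a numeric string, else None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return None
    return None


def check_column(values, unique_ratio=0.5, mixed_ratio=0.2):
    """Return a list of issue labels for one column.

    unique_ratio and mixed_ratio are hypothetical thresholds; the source
    only states the 50% boundary for missing values.
    """
    issues = []
    if not values:
        return issues

    present = [v for v in values if not _is_missing(v)]
    frac_missing = (len(values) - len(present)) / len(values)
    if frac_missing > 0.5:
        issues.append("lots of missing values")
    elif frac_missing > 0:
        issues.append("some missing values")

    parsed = [_as_number(v) for v in present]
    numeric = [x for x in parsed if x is not None]
    strings = [str(v) for v, x in zip(present, parsed) if x is None]

    if any(math.isnan(x) for x in numeric):
        issues.append("NaN values")
    if any(math.isinf(x) for x in numeric):
        issues.append("Inf values")

    # Constant: every present value has the same text representation.
    if len(present) > 1 and len(set(map(str, present))) == 1:
        issues.append("constant data")

    if strings:
        if numeric and len(strings) / len(present) <= mixed_ratio:
            issues.append("few strings mixed with numeric data")
        elif sum(len(s.split()) > 1 for s in strings) > len(strings) / 2:
            issues.append("whitespace-tokenized text")
        elif len(set(strings)) / len(strings) > unique_ratio:
            issues.append("many unique non-numeric values")
    return issues


def recommend(table):
    """Map each column name to its issues, skipping columns with none."""
    report = {}
    for name, values in table.items():
        issues = check_column(values)
        if issues:
            report[name] = issues
    return report
```

Calling `recommend` with a dictionary that maps column names to lists of values returns only the columns that triggered at least one check, which loosely mirrors the module's output of affected columns and issue descriptions.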