##Execute Python Script##

This experiment demonstrates how to use the **Execute Python Script** module to perform a simple natural language processing task: tokenizing the Amazon book review dataset.

##Data##

The dataset used in this experiment is a collection of book reviews and ratings from Amazon. The data is stored in TSV format in our workspace and contains two columns. The first column holds the rating, and the second column holds the review text associated with that rating.

We use the following modules to convert the dataset into a feature vector that can be used for further machine learning:

- The **Partition and Sample** module samples 2 percent of the data.
- The **Execute Python Script** module tokenizes the review text and combines the rating and the tokenized words into a feature vector.

The output is a feature vector that is ready for further processing.

##Creating the Experiment##

First, we drag the **Book Reviews from Amazon** dataset from **Saved Datasets** into the experiment.

Second, we sample the dataset using the **Partition and Sample** module. In this experiment, we use the following configuration:

1. For the **Partition and Sample** option, we select _Sampling_.
2. In the **Rate of sampling** text box, we enter _0.02_ as the sample rate.
3. In the **Random seed for sampling** text box, we enter _14726_ as the random seed.
4. For the **Stratified split for sampling** option, we select _False_ because we are not doing stratified sampling.

![image1]

Finally, we perform the tokenization in the **Execute Python Script** module. The following diagram shows the example code. (To get a copy of this sample Python code, you can create a copy of this experiment and then edit the Execute Python Script module.)

![image2]

##Data Visualization##

You can review the content of the **Book Reviews from Amazon** dataset by right-clicking the output port of the **Book Reviews from Amazon** module and selecting **Visualize**.

![image3]

![image4]

After the **Execute Python Script** module tokenizes the text in the second column of the data sampled by the **Partition and Sample** module, the modified dataset looks like the following diagram. The modified dataset is much easier to use for further machine learning.

![image5]

<!-- Images -->
[image1]:http://az712634.vo.msecnd.net/samplesimg/v1/36/sample.PNG
[image2]:http://az712634.vo.msecnd.net/samplesimg/v1/36/python_code.PNG
[image3]:http://az712634.vo.msecnd.net/samplesimg/v1/36/structure.PNG
[image4]:http://az712634.vo.msecnd.net/samplesimg/v1/36/original_data.PNG
[image5]:http://az712634.vo.msecnd.net/samplesimg/v1/36/modified_data.PNG
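
The sample code itself is only available as an image, so the following is a minimal sketch of what the tokenization step in the **Execute Python Script** module might look like, not the experiment's actual script. It assumes the rating is in the first column and the review text in the second, as described in the Data section. The regular-expression tokenizer, the lowercasing, and the choice to join tokens with spaces are all illustrative assumptions.

```python
import re

# Assumed token pattern: runs of letters, digits, or apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(str(text).lower())


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point that the Execute Python Script module calls.

    Assumes dataframe1 holds the rating in its first column and the
    review text in its second column.
    """
    result = dataframe1.copy()
    text_column = result.columns[1]
    # Replace each review with its tokens joined by single spaces.
    result[text_column] = result[text_column].map(
        lambda review: " ".join(tokenize(review))
    )
    # The module expects a sequence of data frames as the return value.
    return result,
```

Under these assumptions, `tokenize("It's a GREAT book, 5 stars!")` yields `["it's", "a", "great", "book", "5", "stars"]`. A dedicated NLP library could replace the regular expression if its tokenizer suits the data better.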