Text Classification: Step 3A of 5, n-grams TF feature extraction

March 18, 2015

Text classification aims to assign a text instance to one or more classes from a predefined set of classes.
# <a name="back-to-the-top"></a>Text Classification Template

The goal of text classification is to assign a piece of text to one or more predefined classes or categories. The piece of text could be a document, news article, search query, email, tweet, support ticket, customer feedback, or user product review. Applications of text classification include categorizing newspaper articles and newswire content into topics, organizing web pages into hierarchical categories, filtering spam email, sentiment analysis, predicting user intent from search queries, routing support tickets, and analyzing customer feedback.

As part of the _Azure Machine Learning_ offering, _Microsoft_ provides a template to help data scientists easily build and deploy a text classification solution. In this document, you will learn how to use and customize the template through a demo use case.

## Use Case Description

Elena works for an Internet-based retailer that sells DVDs, software, video games, toys, electronics, and furniture. The company shows customer feedback at the product level. Her task is to build a pipeline that automatically analyzes customer feedback and Twitter messages to provide the overall sentiment for each product. The aim is to help consumers who want to understand public opinion before purchasing a product.

### Use Case Data

The data used in this use case is the [Sentiment140 dataset](http://help.sentiment140.com/), a publicly available data set created by three graduate students at Stanford University: Alec Go, Richa Bhayani, and Lei Huang. The data comprises approximately 1,600,000 automatically annotated tweets, collected by using the _Twitter_ Search API and keyword search. The automatic annotation process works as follows: any tweet containing a positive emoticon such as `:)`, `:-)`, `:D`, or `=D` was assumed to bear positive sentiment, and any tweet containing a negative emoticon such as `:<`, `:-(`, or `:(` was assumed to bear negative polarity.
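This distant-supervision rule can be sketched in a few lines. The sketch below is illustrative (the function name is made up); the emoticon lists come from the description above, and the 0/4 label coding follows the dataset's field description:

```python
# Illustrative sketch of the emoticon-based distant-supervision rule
# described above; not code from the template itself.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=D"}
NEGATIVE_EMOTICONS = {":<", ":-(", ":("}

def auto_label(tweet):
    """Return 4 (positive), 0 (negative), or None (tweet is discarded)."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        return None  # no emoticon, or mixed emoticons: not auto-labeled
    return 4 if has_pos else 0
```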
Tweets containing both positive and negative emoticons were removed. Additional information about this data and the automatic annotation process can be found in the 2009 technical report by Alec Go, Richa Bhayani, and Lei Huang, *Twitter Sentiment Classification using Distant Supervision*. We sampled only 10% of the data and shared it as a blob in a public Windows Azure Storage account. You can use this shared data to follow the steps in this template, or you can get the full data set from the Sentiment140 dataset home page.

Each instance in the data set has 6 fields:

* sentiment_label - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
* tweet_id - the id of the tweet
* time_stamp - the date of the tweet (e.g., Sat May 16 23:58:44 UTC 2009)
* target - the query (e.g., lyx); if there is no query, this value is NO_QUERY
* user_id - the user who posted the tweet
* tweet_text - the text of the tweet

We have uploaded to the experiment only the two fields that are required for training, as shown below:

![][image-data-view]

## Workflow

The following graphic presents the workflow of the template. Each step in the workflow corresponds to an Azure ML experiment. The experiments must run in order, because the output of one experiment is the input to the next.

As a data scientist, Elena knows that the bag-of-words vector representation is commonly used for text classification. In this method, the frequency of occurrence of each word, or term frequency (TF), is multiplied by the inverse document frequency, and the resulting TF-IDF scores are used as feature values for training a classifier. She has also heard that the n-gram model is another common vector representation, but knows that there is no conclusive answer as to which one works best; it really depends on the data. This template provides her with a framework to quickly try out different vector representations and choose the best one for building a web service.
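To make the candidate representations concrete, here is a minimal, generic sketch of n-gram term-frequency extraction. The template itself computes this with Azure ML modules in Step 3; the function name here is illustrative:

```python
from collections import Counter

def ngram_tf(text, n_max=2):
    """Term-frequency counts over word n-grams of size 1..n_max."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

With `n_max=1` this is the plain bag-of-words TF representation; with `n_max=2` it also counts bigrams such as `"good movie"`.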
![][image-overall-pipeline]

Here are the links to each step (experiment) of the template:

### [Text Classification: Step 1 of 5, data preprocessing](http://gallery.azureml.net/Details/f43e79f47d8a4219bf8613d271ea2c45)
### [Text Classification: Step 2 of 5, text preprocessing](http://gallery.azureml.net/Details/464eb78e197d4440a332a129d8d523eb)
### [Text Classification: Step 3A of 5, n-grams TF feature extraction](http://gallery.azureml.net/Details/cf65bf129fee4190b6f48a53e599a755)
### [Text Classification: Step 3B of 5, unigrams TF-IDF feature extraction](http://gallery.azureml.net/Details/7a5a38a13fa34e2b847b519629da1a59)
### [Text Classification: Step 4 of 5, train and evaluate models](http://gallery.azureml.net/Details/28437611ee1a42df8efb8f4b12a7aa88)
### [Text Classification: Step 5A of 5, deploy web service with n-grams TF model](http://gallery.azureml.net/Details/ecaa60e30c19443e9313f53155bc4367)
### [Text Classification: Step 5B of 5, deploy web service with unigrams TF-IDF model](http://gallery.azureml.net/Details/e98cccbbec1c4739b3691848d01b1b56)

[Back to the Top]

[Back to the Top]:#back-to-the-top

## Data Pipeline

In _Azure Machine Learning_, users can upload a dataset from a local file, or they can connect to an online data source, such as the web, an Azure SQL database, an Azure table, or [Windows Azure BLOB storage](http://azure.microsoft.com/en-us/documentation/services/storage/), by using the **Reader** module or [Azure Data Factory](http://azure.microsoft.com/en-us/services/data-factory/). For simplicity, this template uses pre-loaded sample datasets. However, users are encouraged to explore the use of online data, because it enables real-time updates in an end-to-end solution. A tutorial for setting up an Azure SQL database can be found here: [Getting started with Microsoft Azure SQL Database](http://azure.microsoft.com/en-us/documentation/articles/sql-database-get-started/).

Steps 1-4 of the template represent the text classification model training phase.
In this phase, text instances are loaded into the Azure ML experiment, and the text is cleaned and filtered. Different types of numerical features are extracted from the cleaned text, and models are trained on the different feature types. Finally, the performance of the trained models is evaluated on unseen text instances, and the best model is selected based on a number of evaluation criteria.

In steps 5A and 5B, the most accurate model is deployed as a published web service, using either RRS (Request-Response Service) or BES (Batch Execution Service). With RRS, one text instance is classified at a time; with BES, a batch of text instances can be sent for classification in a single request. By using these web services, you can perform classification in parallel, using either an external worker or Azure Data Factory, for greatly enhanced efficiency.

[Back to the Top]

[Back to the Top]:#back-to-the-top

## Step Description

The main steps of the template are:

- [Step 1 of 5: Data preparation]
- [Step 2 of 5: Text preprocessing]
- [Step 3 of 5: Feature engineering]
- [Step 4 of 5: Train and evaluate models]
- [Step 5 of 5: Deploy trained models as web services]

[Step 1 of 5: Data preparation]:#step-1-data-preparation
[Step 2 of 5: Text preprocessing]:#step-2-text-preprocessing
[Step 3 of 5: Feature engineering]:#step-3-feature-engineering
[Step 4 of 5: Train and evaluate models]:#step-4-train-and-evaluate-models
[Step 5 of 5: Deploy trained models as web services]:#step-5-deploy-trained-models-as-web-services

### <a name="step-1-data-preparation"></a>Step 1 of 5: Data preparation

![][image-step1-pipeline]
![][image-step1-exp]

**1.1.** Load the dataset that contains the text column into the experiment, using one of the following methods:

- **Option 1:** Click **New**, select **DATASET**, and then select **FROM LOCAL FILE**.
- **Option 2:** Add a **Reader** module to the experiment if you need to load data from sources such as the web, a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

**1.2.** Use the **Project Columns** module to select the label and text columns.

**1.3.** Use the **Metadata Editor** module to rename the label and text columns to `label_column` and `text_column`, respectively.

**1.4.** Use the **Clean Missing Values** module to remove the records with missing text values.

**1.5.** Use the **Partition and Sample** module to select the top record in the dataset.

**1.6.** Use the **Project Columns** module to select the text column.

**1.7.** Run the experiment.

**1.8.** After the experiment finishes successfully, save the data using one of the following options:

- **Option 1:** Click the left output port of the **Clean Missing Values** module and select **Save as Dataset**. Name the dataset `Text - Input Training Data`.
- **Option 2:** Add a **Writer** module to the experiment and write the output dataset to a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

**1.9.** Click the output port of the **Project Columns** module, select **Save as Dataset**, and name the dataset `Text - SingleInstanceDataset`.

In this template, we read the use case dataset from public Azure BLOB storage using the **Reader** module.

[Back to the Top]

[Back to the Top]:#back-to-the-top

### <a name="step-2-text-preprocessing"></a>Step 2 of 5: Text preprocessing

![][image-step2-pipeline]
![][image-step2-exp]

Unstructured text such as tweets, product reviews, or search queries usually requires some preprocessing before it can be analyzed. This experiment includes a number of optional text preprocessing and text cleaning steps, such as replacing special characters and punctuation marks with spaces, normalizing case, removing duplicate characters, removing user-defined or built-in stop-words, and word stemming.
These steps are implemented using the R programming language.

**2.1.** Add the **Reader** module to the experiment to load the data, if you previously used the **Writer** module to save the data to the web, an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

![][image-step2-Rmodule]

**2.2.** Use the **Execute R Script** module to specify the required text preprocessing steps:

- **Option 1:** Set the parameter `replace_special_chars` to `TRUE` if you need to replace special characters with spaces.
- **Option 2:** Set the parameter `remove_duplicate_chars` to `TRUE` if you need to remove duplicate characters.
- **Option 3:** Set the parameter `replace_numbers` to `TRUE` if you need to replace numbers with spaces. Note that for some text classification tasks, numbers can serve as discriminating features and should be kept for training.
- **Option 4:** Set the parameter `convert_to_lower_case` to `TRUE` if you need to convert the text to lower case.
- **Option 5:** Set the parameter `remove_default_stopWords` to `TRUE` if you need to remove stop-words from the text using a predefined list of common English words.
- **Option 6:** Set the parameter `remove_given_stopWords` to `TRUE` if you need to remove stop-words from the text using your own list of common words. Please note that stop-words are application-dependent: a word can be a frequent, non-discriminating feature in one application and a key feature in another. For instance, words such as **good**, **bad**, and **great** are key features for expressing sentiment but stop-words for news article categorization. To provide your own list:
    - **a:** Load the dataset that contains the common words (see Step 1.1).
    - **b:** Attach the loaded dataset to the second port of the **Execute R Script** module.
- **Option 7:** Set the parameter `stem_words` to `TRUE` if you need to stem the words. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
For instance, the words "connect", "connects", "connected", "connection", and "connecting" will all be mapped to "connect".

**2.3.** Run the experiment.

**2.4.** After the experiment finishes successfully, save the results using one of the following options:

- **Option 1:** Click the output port of the last **Add Rows** module and select **Save as Dataset**. Name the dataset `Text - Preprocessed Input`.
- **Option 2:** Add a **Writer** module to the experiment, attach it to the output port of the last **Add Rows** module, and write the preprocessed text dataset to a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

**2.5.** Go to the second output port of the last **Execute R Script** module, named `Draw Word Cloud`, and select **Visualize** if you want to see the most frequent words for each class. In this use case, the first word cloud shows the top positive sentiment-bearing words and the second word cloud shows the most frequent negative sentiment-bearing words in the input training corpus.

![][image-step2-wordcloud-1]
![][image-step2-wordcloud-2]

[Back to the Top]

[Back to the Top]:#back-to-the-top

### <a name="step-3-feature-engineering"></a>Step 3 of 5: Feature engineering

#### Step 3A of 5: N-grams TF feature extraction

![][image-step3A-pipeline]
![][image-step3A-exp]

An n-gram is a contiguous sequence of _n_ terms from a given sequence of text. An n-gram of size 1 is referred to as a _unigram_; an n-gram of size 2 is a _bigram_; an n-gram of size 3 is a _trigram_. N-grams of larger sizes are sometimes referred to by the value of _n_, for instance, "four-gram", "five-gram", and so on.

##### Feature hashing

The **Feature Hashing** module can be used to convert variable-length text documents to equal-length numeric feature vectors, using the 32-bit MurmurHash v3 hashing method provided by the _Vowpal Wabbit_ library.
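The mechanics of the hashing trick can be sketched as follows. This is an approximation for illustration only: it uses the standard library's MD5 in place of MurmurHash3, so it reproduces the idea, not the module's exact output:

```python
import hashlib
from collections import Counter

def hashed_features(tokens, n_bits=15, n_max=2):
    """Map word n-grams (n <= n_max) into a 2**n_bits-dim sparse TF vector.

    The Feature Hashing module uses 32-bit MurmurHash3; MD5 stands in
    here so the sketch needs only the standard library.
    """
    dim = 1 << n_bits  # 2^15 = 32,768 buckets
    vec = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
            vec[h % dim] += 1  # colliding n-grams simply add up
    return vec
```

Distinct n-grams may collide into the same bucket; a larger `n_bits` reduces collisions at the cost of a bigger feature space.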
The objective of feature hashing is _dimensionality reduction_. Feature hashing also makes the lookup of feature weights faster at classification time, because it uses hash-value comparison instead of string comparison.

In the sample experiment, we set the number of hashing bits to 15 and the number of n-grams to 2. With these settings, the hash table can hold 2^15, or 32,768, entries, in which each hashing feature represents one or more n-gram features and its value represents the occurrence frequency of that n-gram in the text instance. For many problems a hash table of this size is more than adequate, but in some cases more space might be needed to avoid collisions. Please evaluate the performance of your machine learning solution using different numbers of bits.

![][image-feat-hash]

##### Feature selection (dimensionality reduction)

The classification time and complexity of a trained model depend on the number of features (the dimensionality of the input space). For a linear model, such as a support vector machine, the complexity is linear with respect to the number of features. For text classification tasks, the number of features resulting from feature extraction is high, because each word in the vocabulary and each n-gram is mapped to a feature.

To select a more compact feature subset from the exhaustive list of extracted hashing features, we used the **Filter Based Feature Selection** module. The aim is to avoid the effects of the curse of dimensionality and to reduce the computational complexity without harming classification accuracy. To get the top 5,000 features most relevant to the sentiment label out of the 2^15 extracted features, we used the Chi-squared score function to rank the hashing features in descending order.
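The Chi-squared score measures the dependence between a feature and the class label; features are ranked by this score and the top ones kept. A minimal, illustrative sketch for a single binary feature (not the module's implementation):

```python
def chi_squared(n_obs):
    """Chi-squared statistic for a 2x2 contingency table.

    n_obs[i][j] = number of documents with feature presence i (1/0)
    and class j (1/0). Higher scores indicate stronger dependence
    between the feature and the label.
    """
    total = sum(sum(row) for row in n_obs)
    score = 0.0
    for i in range(2):
        for j in range(2):
            row_sum = sum(n_obs[i])
            col_sum = n_obs[0][j] + n_obs[1][j]
            expected = row_sum * col_sum / total
            score += (n_obs[i][j] - expected) ** 2 / expected
    return score
```

A feature distributed evenly across the classes scores 0; a feature concentrated in one class scores high and is therefore retained by the selection step.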
![][image-feature-selection]

**3A.1.** Add the **Reader** module to the experiment to load the data, if you previously used (in Step 2) the **Writer** module to save the results to a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

**3A.2.** Click the **Feature Hashing** module, specify the number of hashing bits using the `Hashing bitsize` parameter, and specify the n-gram size using the `N-grams` parameter. In the sample experiment, we specify a bit-size of 15 to extract 2^15 = 32,768 hashing features. You may increase the bit-size to get better classification performance.

**3A.3.** Click the **Filter Based Feature Selection** module, and specify the feature scoring method and the number of desired features. In the sample experiment, we selected the top 5,000 most relevant features. You may increase the number of desired features to get better classification performance.

**3A.4.** Run the experiment.

**3A.5.** After the experiment finishes successfully, save the results using one of the following options:

- **Option 1:** Click the left output port of the **Filter Based Feature Selection** module and select **Save as Dataset**. Name the dataset `Text - Extracted N-grams TF`.
- **Option 2:** Add a **Writer** module to the experiment, attach it to the left output port of the **Filter Based Feature Selection** module, and write the extracted features to a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

#### Step 3B of 5: Unigrams TF-IDF feature extraction

![][image-step3B-pipeline]
![][image-step3B-exp]

##### Create the word dictionary

First, extract the set of unigrams (words) that will be used to train the text model. In addition to the unigrams, the number of documents in the text corpus in which each word appears is counted (DF).
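The dictionary-plus-DF bookkeeping, and the TF-IDF weighting defined in the next subsection, can be sketched as follows. The code is illustrative rather than the template's R implementation, and `min_df`/`max_df` are stand-ins for the template's `minDF`/`maxDF` parameters:

```python
import math
from collections import Counter

def build_dictionary(corpus, min_df=1, max_df=None):
    """Word -> document frequency (DF) over a tokenized corpus."""
    if max_df is None:
        max_df = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each word at most once per document
    return {w: c for w, c in df.items() if min_df <= c <= max_df}

def tf_idf(doc, dictionary, n_docs):
    """TF-IDF weights for one tokenized document: tf * log(N / df)."""
    tf = Counter(w for w in doc if w in dictionary)
    return {w: c * math.log(n_docs / dictionary[w]) for w, c in tf.items()}
```

Words that occur in many documents get a small IDF and thus a small weight, which is exactly the down-weighting of stop-word-like terms described below.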
It is not necessary to create the dictionary from the same labeled data used to train the text model: any large corpus that fairly represents the frequency of words in the target domain of classification can be used, even if it is not annotated with the target categories.

##### TF-IDF calculation

When the raw word frequency of occurrence (TF) in a document is used as a feature value, higher weights tend to be assigned to words that appear frequently throughout the corpus (such as stop-words). The inverse document frequency (IDF) is a better metric, because it assigns lower weights to such frequent words. IDF is calculated as the log of the ratio of the number of documents in the training corpus to the number of documents containing the given word. Combining the two in a single metric (TF x IDF) places greater importance on words that are frequent in the document but rare in the corpus. This reasoning applies not only to unigrams but also to bigrams, trigrams, etc. This experiment converts unstructured text data into equal-length numeric feature vectors, where each feature represents the TF-IDF of a unigram in a text instance.

##### Feature selection (dimensionality reduction)

We used the Chi-squared score function to rank the unigram features in descending order, and returned the top 5,000 features most relevant to the sentiment label, out of all the extracted unigrams.

**3B.1.** Add the **Reader** module to the experiment to load the preprocessed training data, if you previously used (in Step 2) the **Writer** module to save the results to a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

**3B.2.** In the first **Execute R Script** module in the graph, `Create the Dictionary`, specify the following parameters that determine whether a word from the input dataset is included in the dictionary:

**a.** the minimum (`minWordLen`) and maximum (`maxWordLen`) length of a word;
**b.** the minimum (`minDF`) and maximum (`maxDF`) document frequency of a word.

![][image-step3B-Rmodule-dictionary]

**3B.3.** In the second **Execute R Script** module, `TF-IDF Calculation`, make sure to specify the same values for `minWordLen` and `maxWordLen` as defined in Step 3B.2.

![][image-step3B-Rmodule-TFIDF]

**3B.4.** In the **Filter Based Feature Selection** module, specify the feature scoring method and the number of desired features. In the sample experiment, we selected the top 5,000 most relevant features. You may increase the number of desired features to get better classification performance.

![][image-feature-selection]

**3B.5.** Run the experiment.

**3B.6.** After the experiment finishes successfully, click the left output port of the `Create the Dictionary` **Execute R Script** module and select **Save as Dataset**. Name the dataset `Text - Unigrams Dictionary`.

**3B.7.** Save the output of feature selection using one of the following options:

- **Option 1:** Click the left output port of the **Filter Based Feature Selection** module and select **Save as Dataset**. Name the dataset `Text - Extracted Unigrams TF-IDF`.
- **Option 2:** Add a **Writer** module to the experiment, attach it to the left output port of the **Filter Based Feature Selection** module, and write the extracted features dataset to a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

[Back to the Top]

[Back to the Top]:#back-to-the-top

### <a name="step-4-train-and-evaluate-models"></a>Step 4 of 5: Train and evaluate models

![][image-step4-pipeline]
![][image-step4-exp]

**4.1.** Add the **Reader** module to the experiment to load the data, if you previously used (in Step 2) the **Writer** module to save the results to a table in an Azure SQL database, Windows Azure table or BLOB storage, or a Hive table.

**4.2.** Use the first **Split** module to split the data into two subsets.
The first subset will be used to train the model, and the second subset will be split in the next step into a development/validation set and a test set. In the sample experiment, we split the data 70%/30%.

**4.3.** Use the second **Split** module to split the data into two subsets. The first subset will be used later by the **Sweep Parameters** module. The second subset is used as the test set to evaluate the performance of the trained model. In the sample experiment, we split the 30% sample into two halves, so the development set and the test set each represent 15% of the input data.

**4.4.** Use the **Sweep Parameters** module to get the optimal values for the parameters of the underlying learning algorithm. In the sample experiment, the parameter sweeping mode is set to **Random sweep**, where the module conducts a number of training runs (specified by the parameter `Maximum number of runs on random sweep`) over the parameter ranges. Use the **Entire grid** sweeping mode to explore all possible values for each parameter as specified in the learning algorithm module, such as the **Two-Class Logistic Regression** module. In the sample experiment, `AUC` is specified as the `Metric for measuring performance for classification`. Other performance criteria, such as precision, recall, and F-score, can also be used for model selection.

**4.5.** For binary classification tasks, you can either keep the **Two-Class Logistic Regression** module, or replace it with another binary classification trainer, such as **Two-Class Support Vector Machine** or **Two-Class Boosted Decision Tree**.

**4.6.** For multiclass classification tasks, you have to replace the **Two-Class Logistic Regression** module with a multiclass classification trainer, such as **One-vs-All Multiclass**, **Multiclass Logistic Regression**, or **Multiclass Decision Forest**.

**4.7.** Run the experiment.

**4.8.** After the experiment finishes successfully, save the results as follows:

**a.** Click the right output port of the left **Sweep Parameters** module and select **Save as Trained Model**. Name the model `Text - Trained N-grams model`.

**b.** Click the right output port of the right **Sweep Parameters** module and select **Save as Trained Model**. Name the model `Text - Trained Unigrams model`.

**4.9.** Click the output port of the **Evaluate Model** module and visualize the comparison between the two trained models: the n-grams model (in blue in the graphs below) and the unigrams model (in red). Based on this evaluation, use Step 5 to deploy the better trained model as a web service. Note that no single learning algorithm is best for all classification tasks, so evaluate the different state-of-the-art learning algorithms offered by _Azure Machine Learning_ to get the best model for your task.

#### ROC curve

![][image-ROC]

#### Precision/Recall curve

![][image-PR-Curve]

Use the horizontal threshold slider to get the corresponding precision, recall, and F1 score at each threshold value.

#### Lift curve

![][image-Lift]

[Back to the Top]

[Back to the Top]:#back-to-the-top

### <a name="step-5-deploy-trained-models-as-web-services"></a>Step 5 of 5: Deploy trained models as web services

The web service can be consumed in two modes: RRS (Request-Response Service) and BES (Batch Execution Service). Sample code (C#/Python/R) for calling the web services is provided in the web service itself (click the "API help page" link on the web service page). For more information about how to publish a web service, see [this tutorial](http://azure.microsoft.com/en-us/documentation/articles/machine-learning-walkthrough-5-publish-web-service/).
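For orientation, an RRS call is an HTTP POST with your API key in an `Authorization: Bearer` header and a JSON body. The shape below is a sketch only: the exact schema, endpoint URL, and key come from your own service's API help page and dashboard, and the column name shown matches this template's input:

```json
{
  "Inputs": {
    "input1": {
      "ColumnNames": ["text_column"],
      "Values": [["the movie was absolutely wonderful"]]
    }
  },
  "GlobalParameters": {}
}
```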
#### Step 5A of 5: Deploy the n-grams TF trained model as a web service

![][image-step5A-pipeline]
![][image-step5A-exp]

A key feature of _Azure Machine Learning_ is the ability to easily publish models as web services on Windows Azure. This experiment deploys the n-grams TF text model trained in Step 4 as a web service. Web service entry and exit points are defined using the special web service modules. Note that the **Web service input** module is attached to the node in the experiment where input data would enter.

**5A.1.** In the `text preprocessing` **Execute R Script** module, specify the required text preprocessing steps, using the same parameters as defined in Step 2.2.

**5A.2.** In the **Feature Hashing** module, specify the same number of bits, using the `Hashing bitsize` parameter, and the same n-gram size, using the `N-grams` parameter, as defined in Step 3A.2.

**5A.3.** Set the web service entry point.

**5A.4.** Set the web service exit point.

**5A.5.** Run the experiment.

**5A.6.** After the experiment finishes successfully, select **Publish Web Service** at the bottom of the experiment canvas.

**5A.7.** From the API help page, use the sample code to invoke the published web service.

**5A.8.** From the Dashboard page, copy the API key and paste it into the sample code.

#### Step 5B of 5: Deploy the unigrams TF-IDF trained model as a web service

![][image-step5B-pipeline]
![][image-step5B-exp]

This experiment deploys the unigrams TF-IDF text model trained in Step 4 as a web service. Web service entry and exit points are defined using the special web service modules. Note that the **Web service input** module is attached to the node in the experiment where input data would enter.

**5B.1.** In the `text preprocessing` **Execute R Script** module, specify the required text preprocessing steps, using the same parameters as defined in Step 2.2.

**5B.2.** In the `TF-IDF Calculation` **Execute R Script** module, use the same parameters as defined in Step 3B.3.
**5B.3.** Run the experiment.

**5B.4.** After the experiment finishes successfully, select **Publish Web Service** at the bottom of the experiment canvas.

**5B.5.** From the API help page, use the sample code to invoke the published web service.

![][image-step5A-service]

## Summary

Microsoft Azure ML provides a cloud-based machine learning platform for data scientists to easily build and deploy machine learning applications. The text classification template, based on word and n-gram occurrence frequencies, can be adapted to different text categorization scenarios. This template, along with other templates published by Microsoft, further enables users to perform fast prototyping and deployment of machine learning solutions.

[Back to the Top]

[Back to the Top]:#back-to-the-top

<!-- Images -->
[image-data-reader]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/data-reader.PNG
[image-data-view]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/data-view.PNG
[image-overall-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/overall-pipeline.png
[image-step1-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step1-pipeline.png
[image-step1-exp]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step1-exp.png
[image-step2-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step2-pipeline.png
[image-step2-exp]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step2-exp-v2.png
[image-step2-Rmodule]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step2-Rmodule.png
[image-step2-wordcloud-1]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step2-wordcloud-1.png
[image-step2-wordcloud-2]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step2-wordcloud-2.png
[image-step3A-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step3A-pipeline.png
[image-step3A-exp]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step3A-exp.png
[image-step3B-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step3B-pipeline.png
[image-step3B-exp]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step3B-exp.png
[image-step3B-Rmodule-dictionary]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step3B-Rmodule-dictionary.png
[image-step3B-Rmodule-TFIDF]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step3B-Rmodule-TFIDF.png
[image-step4-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step4-pipeline.png
[image-step4-exp]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step4-exp.png
[image-step5A-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step5A-pipeline.png
[image-step5A-exp]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step5A-exp.png
[image-step5B-pipeline]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step5B-pipeline.png
[image-step5B-exp]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step5B-exp.png
[image-feat-hash]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/feature-hashing.PNG
[image-feature-selection]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/feature-selection.PNG
[image-ROC]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step4-ROC.PNG
[image-PR-Curve]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step4-PR-Curve.PNG
[image-Lift]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step4-Lift.PNG
[image-step5A-service]:https://az712634.vo.msecnd.net/samplesimg/v1/T2/step5A-service.png