BBC News Classifier

September 6, 2016


Algorithms

This sample demonstrates how to use multiclass classifiers and feature hashing in Azure ML Studio to classify the BBC news dataset.
This sample demonstrates how to use multiclass classifiers and feature hashing in Azure ML Studio to classify news articles into appropriate categories.

**Data**

The 2004-2005 BBC news dataset was used for this experiment. The dataset consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005. The news is classified into five classes: **Business, Entertainment, Politics, Sports and Tech.** The original dataset, downloaded from “Insight Resources” ([http://mlg.ucd.ie/datasets/bbc.html][1]), consisted of 5 directories, each containing text files with the news articles of a particular category. The data was converted to a CSV file by running a C# console application. ([The code of the application can be found here][2].) The names of the categories were used as the class label, that is, the attribute to predict. The CSV file was uploaded to Azure ML Studio for use in the experiment.

**Data Preparation**

The dummy column headings were replaced with meaningful column names using Metadata Editor. Missing values were cleared. Term frequency–inverse document frequency (TF-IDF) of each unigram was calculated. A bit size of 15 was specified to extract 2^15 = 32,768 hashing features. The top 5000 related features were selected for this experiment.

**Feature Engineering**

We used the Feature Hashing module to convert the plain text of the articles to integers and used the integer values as input features to the model.

**Model**

![BBC News Classifier][3]

The Multiclass Neural Network module with default parameters was used for training the model. The parameters were tuned using the “Tune Model Hyperparameters” module.

**Results**

All accuracy values were computed using the Evaluate Model module. The confusion table is shown below.

![Confusion table][4]

Citation: D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
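
To make the feature hashing step more concrete, here is a minimal Python sketch of the general hashing trick with a 15-bit hash space, as quoted above. It is not the Feature Hashing module itself: the tokenizer, the function names, and the use of MD5 as the hash are all assumptions chosen for illustration, and the module's own hash function and preprocessing may differ.

```python
import hashlib
import re
from collections import Counter

HASH_BITS = 15                 # bit size quoted in the text
NUM_BUCKETS = 1 << HASH_BITS   # 2^15 = 32,768 hashed features


def tokenize(text):
    """Lowercase the text and split it into unigrams (single words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bucket_for(token):
    """Map a token to a bucket index in [0, NUM_BUCKETS).

    MD5 is only a stand-in for a stable hash here; it is an assumption,
    not the hash function Azure ML Studio uses.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def hash_features(text):
    """Return a sparse {bucket index: count} vector for one document."""
    counts = Counter()
    for token in tokenize(text):
        counts[bucket_for(token)] += 1
    return dict(counts)


# Hypothetical example sentence, not taken from the dataset.
features = hash_features(
    "Shares rise as the bank reports record profits and shares"
)
```

Every document maps to a vector of the same fixed width regardless of vocabulary size, which is the main appeal of hashing. The trade-off is that two different words can collide in one bucket, and a larger bit size makes such collisions less likely.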
[1]: http://mlg.ucd.ie/datasets/bbc.html
[2]: https://naadispeaks.wordpress.com/2016/09/06/building-a-news-classifier-with-azure-ml/
[3]: https://naadispeaks.files.wordpress.com/2016/09/bbc-classifier-model.png
[4]: https://naadispeaks.files.wordpress.com/2016/09/5cfa71bcddf14589a7693b8edf8b1194.png
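
The conversion from per-category directories to a single CSV file, described under Data, was done with a C# console application (linked above). As a rough illustration of the same idea, a Python sketch might look like the following. The function name, the `category` and `text` column names, the `*.txt` file pattern, and the UTF-8 encoding are assumptions for this example, and the demo runs on a tiny made-up directory tree rather than the real dataset.

```python
import csv
import tempfile
from pathlib import Path


def folders_to_csv(root, out_path):
    """Write one CSV row per article: (category, text).

    Assumes `root` holds one sub-directory per category, each containing
    plain-text article files. Returns the number of articles written.
    """
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["category", "text"])  # hypothetical column names
        for category_dir in sorted(Path(root).iterdir()):
            if not category_dir.is_dir():
                continue
            for article in sorted(category_dir.glob("*.txt")):
                # errors="replace" substitutes undecodable bytes
                text = article.read_text(encoding="utf-8", errors="replace")
                # Collapse whitespace so each article sits on one line;
                # the directory name becomes the class label.
                writer.writerow([category_dir.name, " ".join(text.split())])
                rows += 1
    return rows


# Demo on a small made-up tree (not the real BBC files).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "bbc"
    samples = [
        ("business", "Profits rose.\nShares climbed."),
        ("tech", "A new phone launched."),
    ]
    for category, body in samples:
        (root / category).mkdir(parents=True)
        (root / category / "001.txt").write_text(body, encoding="utf-8")

    out_file = Path(tmp) / "bbc.csv"
    row_count = folders_to_csv(root, out_file)
    with open(out_file, newline="", encoding="utf-8") as handle:
        demo_rows = list(csv.reader(handle))
```

Using the directory name as the label mirrors the description above, where the category names serve as the attribute to predict.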