This experiment clusters similar companies into the same group based on their Wikipedia articles. The trained model can also be used to assign a cluster to a new company.
#Clustering: Find similar companies

This experiment demonstrates how to use the K-Means clustering algorithm to segment companies from the Standard & Poor's (S&P) 500 index, based on the text of the Wikipedia article about each company.

![enter image description here]

#Data

The articles from Wikipedia were pre-processed outside Azure ML Studio to extract and partially clean the text content related to each company. The processing included:

- Removing wiki formatting
- Removing non-alphanumeric characters
- Converting all text to lowercase
- Adding company categories, where known

For some companies, no article could be found, so the number of records is less than 500.

#Model

First, the contents of each Wikipedia article were passed to the Feature Hashing module, which tokenizes the text string and then transforms the data into a series of numbers based on the hash value of each token.

Even after this transformation, the data is too high-dimensional and too sparse to be used directly by the K-Means clustering algorithm. Therefore, Principal Component Analysis (PCA) was applied using a custom R script in the Execute R Script module to reduce the dimensionality to 10 variables. You can review the result of PCA by double-clicking the right-hand output of the Execute R Script module.

By trial and error, we learned that the first variable in the PCA-transformed data had the highest variance and appeared to have a detrimental effect on clustering. Therefore, we removed it from the feature set using Project Columns.

Once the data was prepared, we created a K-Means Clustering module and trained models on the text data. Finally, we used Metadata Editor to change the cluster labels into categorical values.

![enter image description here]

Thanks to Microsoft - Brandon Rohrer.
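To make the hashing and clustering steps described under Model more concrete, here is a minimal, dependency-free Python sketch. It is not the Azure ML Studio implementation: the MD5-based bucket assignment, the 32-bucket vector size, the plain Lloyd-style K-Means loop, and the toy company names and texts are all assumptions chosen for illustration. The PCA step is left out to keep the sketch short.

```python
import hashlib
import random


def hash_features(text, n_buckets=32):
    """Hashing trick: map whitespace-separated tokens to a fixed-length count vector."""
    vec = [0.0] * n_buckets
    for token in text.lower().split():
        # Assumption: MD5 is used only as a stable hash; Azure ML uses its own hashing.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vec[bucket] += 1.0
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, n_iter=20, seed=0):
    """Basic K-Means: alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Recompute labels so they agree with the final centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical toy "articles"; the real experiment uses Wikipedia text per company.
articles = {
    "OilCo": "oil gas drilling energy pipeline oil",
    "PetroCorp": "energy oil refinery gas pipeline",
    "BankOne": "bank loans credit deposits finance",
    "FinTrust": "finance bank credit mortgage loans",
}
names = list(articles)
points = [hash_features(articles[n]) for n in names]
centroids, labels = kmeans(points, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)

# Assigning a cluster to a new, unseen company description.
new_label = nearest(hash_features("gas pipeline energy"), centroids)
print("NewEnergyCo -> cluster", new_label)
```

The last two lines show how a new company can be assigned to an existing cluster: its text is hashed with the same function and matched to the nearest trained centroid. Hash collisions can place unrelated tokens in the same bucket, so a real run would use far more buckets and then reduce dimensionality, as the experiment does with PCA.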
https://gallery.cortanaintelligence.com/Experiment/Clustering-Find-similar-companies-23
<br><br>

----------

> This ML experiment is for [Microsoft Azure Machine Learning Course].<br>
> For the complete experiment list [Click here].<br>
> Laploy | firstname.lastname@example.org | 084 007 5544 | [www.laploy.com]<br>
> ![enter image description here]

----------

: https://notebooks.azure.com/laploy/libraries/loyml/html/00001%20Sessions%20summary.ipynb
: https://gallery.cortanaintelligence.com/Home/Author?authorId=81E333F747E3429B55A3445E6714C36F60B397C13B4D0B07F34DEF1421F64D73
: http://laploy.com
: https://raw.githubusercontent.com/laploy/mli/master//loy-small.jpg
: https://raw.githubusercontent.com/laploy/mli/master//12520-000.PNG
: https://raw.githubusercontent.com/laploy/mli/master//12520-099.PNG
: https://raw.githubusercontent.com/laploy/mli/master//loy-small.jpg