Document classification experiment using Multiclass Decision Forests, with data taken from the CFPB to model support ticket classification.
# Summary

This experiment uses multiclass decision forests to categorize documents. The data comes from a publicly available dataset of consumer complaints about financial products, a close analogue to the common task of customer support ticket classification.

# Description

**Data**

Data is taken from the Consumer Financial Protection Bureau's online database of consumer complaints about different financial products. The overall dataset contains over 600,000 complaints, over 90,000 of which include a "Consumer complaint narrative" -- open-ended text describing the problem the consumer is reporting. These complaints span 11 different types of financial products. Because the number of complaints varies substantially across product types, we limit ourselves here to the 5 most frequent product types:

- Bank accounts
- Credit cards
- Credit reporting
- Debt collection
- Mortgages

Additionally, we restrict ourselves to complaints with narratives at least 500 characters long, to ensure a sufficiently rich training set. This yields a dataset of 36,958 consumer complaints. Besides the columns containing the product type, complaint narrative, and complaint ID, the dataset includes numerous other metadata columns that are not used here.

**Pipeline**

1. The data is subjected to a series of transformation and cleaning steps:
   - Converting the "Product" column to a categorical variable
   - Setting the "Complaint ID" column to be ignored as a predictive feature
   - Subsetting the data to only the Complaint ID, Product, and Consumer Complaint Narrative columns
   - Using an R script to convert narrative text to lower case
2. Feature extraction uses AML's native Feature Hashing module, set to fairly conservative parameters: unigram features and 12-bit hashing.
3. A multiclass decision forest model is trained. This is one of several multiclass models available directly in AML.
4. In addition to training the model, cross-validation is included (5-fold by default). Summary statistics for cross-validation can be viewed directly via the output port of the Evaluate Model node, and predictions from the cross-validation run (collapsed across folds) are also exported to CSV for inspection of model predictions.

**Model Performance**

Inspection of the cross-validation results shows this model achieves about 82% accuracy on this data. Particularly interesting is the nature of the classification errors. As the confusion matrix displayed by the model evaluation step shows, errors are concentrated in three highly plausible areas:

- Bank accounts are mistaken for credit cards
- Credit cards are mistaken for credit reporting
- Debt collection is mistaken for credit reporting

All three error types involve topic areas that overlap in real life: many credit cards are bank-issued, credit reporting is an integral step in securing a card, and credit reporting is almost always directly implicated in the debt collection process. These errors may well indicate that the labels applied to this data are not entirely mutually exclusive -- an important consideration when doing classification. It is also interesting that the errors are not symmetrical, which may point to further structural properties of the data.
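For readers working outside AML, the modeling pipeline above can be sketched with scikit-learn: `HashingVectorizer` stands in for AML's Feature Hashing module (unigrams, `2**12` = 4096 features for 12-bit hashing, lowercasing in place of the R script), `RandomForestClassifier` stands in for the Multiclass Decision Forest, and `cross_val_score` performs the 5-fold cross-validation. This is a minimal illustrative sketch, not the experiment itself; the toy corpus below is invented for demonstration and is not CFPB data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for complaint narratives and their product labels
# (hypothetical data, used only to make the sketch runnable).
docs, labels = [], []
for product, words in {
    "Credit card": "card charge interest statement annual fee",
    "Mortgage": "mortgage escrow loan servicer foreclosure payment",
    "Debt collection": "collector debt repeated calls validation notice",
}.items():
    for i in range(10):
        docs.append(f"complaint {i} regarding my {words}")
        labels.append(product)

pipeline = make_pipeline(
    # lowercase=True mirrors the R lowercasing step; unigrams and a
    # 12-bit hash space mirror the Feature Hashing settings above.
    HashingVectorizer(lowercase=True, ngram_range=(1, 1),
                      n_features=2**12, alternate_sign=False),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# 5-fold cross-validation, analogous to the Cross Validate Model step.
scores = cross_val_score(pipeline, docs, labels, cv=5)
print(f"5-fold mean accuracy: {scores.mean():.2f}")
```

On the real 36,958-complaint dataset one would of course load the CFPB export, filter to the 5 product types and narratives of at least 500 characters, and pass the narrative column in place of `docs`.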