Friday, March 13, 2015

Analytics Projects

Text Analytics:

  • Why so serious...??? This is a project that I worked on with Ling Jin (@ljin8118) and Peter Schmidt (@pjschmidt007) as part of our Text Analytics course at Northwestern University. The aim was to analyze the script of a movie or a play and compare it with reality. We chose the Dark Knight script and text mined it to observe a few trends. Read more to see whether we came out right or whether the entire effort was far from reality. Cheers :)
  • Text Classification Using Naive Bayes Algorithm This was a 3-day assignment that I worked on while I was in the Analytics program at Northwestern University. It is an implementation of the Multinomial Naive Bayes Algorithm in Java. Text Analytics was by far my favorite course in the program and I thoroughly enjoyed working on this one. Hope you guys like it and find it helpful. Suggestions/comments/criticism are welcome!
         

Text Classification using Naive Bayes algorithm

This was a 3-day assignment that I worked on while I was in the Analytics program at Northwestern University. It is an implementation of the Multinomial Naive Bayes Algorithm in Java. Text Analytics was by far my favorite course in the program and I thoroughly enjoyed working on this one. Hope you guys like it and find it helpful. Suggestions/comments/criticism are welcome!

Problem: Classify books into pre-defined categories based on their title, author name, and content. The categories were:
AMERICANHISTORY
BIOLOGY
COMPUTERSCIENCE
CRIMINOLOGY
ENGLISH
MANAGEMENT
MARKETING
NURSING
SOCIOLOGY

Input data format:
First line contains N, H where N = number of training data records and
H = list of headers.  N lines of training data will follow this. Each
field in N lines is tab separated. The next line will have M, H where
M = number of test data records and H = list of headers. M lines of
test data will follow this, each field in a line is tab separated.

Training data has following columns:
categoryLabel, bookID, bookTitle, bookAuthor

Test data has following columns:
bookID, bookTitle, bookAuthor

Example of training data:

N=3 H=[categoryLabel, bookID, bookTitle, bookAuthor]
AMERICAN HISTORY b9418230 American History Survey Brinkley, Alan
SOCIOLOGY b16316063 Life In Society Henslin, James M.
ENGLISH b14731993 Reading for Results Flemming, Laraine E.
M=2 H=[bookID, bookTitle, bookAuthor]
b15140145 Efficient and Flexible Reading McWhorter
b15857527 These United States Unger, Irwin
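
The header-plus-records layout above can be read with a short parsing routine. The sketch below is illustrative rather than the original code; it assumes tab-separated fields and header lines of the "N=3 H=[...]" form shown in the example (the class and method names are my own).

```java
import java.util.ArrayList;
import java.util.List;

class InputParser {
    // Reads a header line like "N=3 H=[categoryLabel, bookID, ...]" and
    // returns the record count (assumes the count sits between '=' and the
    // first space, as in the example above).
    static int recordCount(String headerLine) {
        String n = headerLine.substring(headerLine.indexOf('=') + 1,
                                        headerLine.indexOf(' '));
        return Integer.parseInt(n);
    }

    // Splits `count` record lines (starting at `start`) into their
    // tab-separated fields; for training rows the first field is the label.
    static List<String[]> parseBlock(List<String> lines, int start, int count) {
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add(lines.get(start + i).split("\t"));
        }
        return records;
    }
}
```

Calling `recordCount` on the first header gives N, so the next N lines go to the training set; the same is then repeated with M for the test set.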


Output:
A list of all books from the Test dataset with their Book Ids and their Predicted Category.

Solution introduction:
For the given document classification problem, I decided to implement the Multinomial Naive Bayes model. Classification rule: classify(feat1,...,featN) = argmax over cat of P(cat)*PROD(P(feat(i)|cat)). I implemented this in Java (using Eclipse). Here the features are words.
  • Multinomial model (a document is represented by a feature vector of integer elements whose values are the frequencies of the corresponding words in the document) preferred over Bernoulli (a feature vector of binary elements taking value 1 if the corresponding word is present in the document and 0 if it is not)
  • Laplace’s law of succession (add-one smoothing) included to eliminate the possibility of zero probabilities
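As a concrete illustration of the smoothed multinomial estimate, P(word|cat) = (count(word, cat) + 1) / (totalWords(cat) + |V|), where |V| is the vocabulary size. This sketch is illustrative, not the original code:

```java
import java.util.Map;

class Smoothing {
    // Laplace (add-one) smoothed multinomial estimate:
    // P(word | category) = (count(word) + 1) / (totalWords + vocabularySize)
    // Unseen words get a small but nonzero probability of 1/(total + |V|),
    // so a single unknown word can never zero out a whole product.
    static double wordProbability(Map<String, Integer> counts,
                                  String word, int vocabularySize) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        int count = counts.getOrDefault(word, 0);
        return (count + 1.0) / (total + vocabularySize);
    }
}
```

For example, with category counts {history: 3, survey: 1} and a vocabulary of 5 words, P("history"|cat) = (3+1)/(4+5) = 4/9, and any unseen word gets 1/9.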
Design:

  1. Read the input data and split it into Training dataset and Test dataset
  2. Built the Multinomial Naïve Bayes Classifier using the Training dataset
    • I first started by considering just the ‘title’ field to build and classify documents (excluding stop words and normalizing the remaining words)
    • Next, I tried using the ‘title’ and ‘categoryName’ fields to build and classify documents
    • Then, I tried the ‘title’, ‘categoryName’ and ‘author’ fields
    • Lastly, I tried combinations of the ‘title’, ‘author’ and ‘contents’ fields
    • I also experimented with excluding the category prior probability from the final computation
    • The results of each of these are summarized below
  3. Classified the documents in Test dataset using this classifier
Code:
https://github.com/aniketd006/NaiveBayes

Details:
Class ‘Category’:

  • Attributes:
    • categoryName – Name of the Category
    • categoryProbability – Prior Probability of the Category
    • wordProbability – HashMap of (word-probability) pair of the ‘title’ field of the documents in the category where probability is the category conditional probability of that word
    • authWordProb – HashMap of (word-probability) pair of the ‘author’ field of the documents in the category where probability is the category conditional probability of that word
    • contentWordProb – HashMap of (word-probability) pair of the ‘contents’ (table of contents from input2.txt) field of the documents in the category where probability is the category conditional probability of that word
  • Methods:
    • Methods to ‘set’ and ‘get’ each of these attributes
    • probabilityCalculation - Calculates P(feat(i)|C) - the probability of feat(i) occurring in that document class
Class ‘BayesClassifier’:


  • readData – to read in data from both input files and split the first into training and test datasets
  • buildClassifier – Builds the classifier. Creates an array of Category objects, each of which holds the vocabulary of features (feat) and P(feat(i)|C) – the probability of feat(i) occurring in that document class (computed in the Category class method) – for 'title', 'author' and 'table of contents'. Also calculates the category prior probabilities
  • createWordList – takes a String of words as input and returns them tokenized and normalized using an English Analyzer
  • classifyDocuments – Classifies each document into the category with the highest class conditional probability. Words are selected from the 'title', 'author' and 'table of contents' columns separately (only words from 'title' were used in the final implementation) for each document in the test dataset, and the corresponding P(cat)*PROD(P(feat(i)|cat)) values are calculated. Finally, classify(feat1,...,featN) = argmax(P(cat)*PROD(P(feat(i)|cat)))
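
The argmax step above can be sketched as follows. One caveat: this sketch sums log probabilities rather than multiplying raw probabilities as described above; the argmax is unchanged, but log space avoids numerical underflow when documents have many words. The names and the fallback probability are illustrative, not from the original code, and it assumes smoothed per-category word probabilities have already been computed.

```java
import java.util.List;
import java.util.Map;

class Classifier {
    // Picks argmax over categories of log P(cat) + sum(log P(word|cat)).
    // Summing logs gives the same argmax as multiplying probabilities but
    // avoids underflow from products of many small numbers.
    static String classify(List<String> words,
                           Map<String, Double> priors,
                           Map<String, Map<String, Double>> wordProbs) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : priors.keySet()) {
            double score = Math.log(priors.get(cat));
            for (String w : words) {
                // With Laplace smoothing every word should already have mass;
                // the tiny fallback just guards against a missing entry.
                score += Math.log(wordProbs.get(cat).getOrDefault(w, 1e-9));
            }
            if (score > bestScore) {
                bestScore = score;
                best = cat;
            }
        }
        return best;
    }
}
```

Dropping the `Math.log(priors.get(cat))` term reproduces the prior-free variant that, as noted in the results below, worked best on this dataset.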

Results:

Category        | Actual | Predicted | Difference
----------------|--------|-----------|-----------
AMERICANHISTORY |      8 |        10 |          2
BIOLOGY         |      7 |         4 |         -3
COMPUTERSCIENCE |      4 |         4 |          0
CRIMINOLOGY     |      6 |         6 |          0
ENGLISH         |     12 |        11 |         -1
MANAGEMENT      |      4 |         5 |          1
MARKETING       |      6 |         7 |          1
NURSING         |      7 |         6 |         -1
SOCIOLOGY       |      6 |         7 |          1


  • The final implementation achieved 86.67% accuracy (52/60)
  • This final model considered only the ‘title’ field to build and classify the documents
  • The prior category probabilities were not included in the P(C(i)|D(k)) calculation
  • I also started using the contents of the books, but that wasn't too helpful in improving the accuracy (more time and appropriate tweaking of the model might improve it)
Trials:

Category        | Documents | Prior Probability
----------------|-----------|------------------
AMERICANHISTORY |        19 |                7%
BIOLOGY         |        22 |                8%
COMPUTERSCIENCE |        18 |                7%
CRIMINOLOGY     |        28 |               10%
ENGLISH         |        55 |               20%
MANAGEMENT      |        22 |                8%
MARKETING       |        19 |                7%
NURSING         |        24 |                9%
SOCIOLOGY       |        63 |               23%
The above table is compiled from the training dataset
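
The priors above are simply each category's document count divided by the training total (270 documents here); for example, SOCIOLOGY has 63/270 ≈ 23%. A minimal sketch (class and method names are illustrative):

```java
import java.util.Map;

class Priors {
    // Prior probability of a category = its document count / total documents.
    static double prior(Map<String, Integer> docCounts, String category) {
        int total = docCounts.values().stream().mapToInt(Integer::intValue).sum();
        return (double) docCounts.get(category) / total;
    }
}
```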

Trial # | Title | Author | Category Name | Prior Probabilities | Words Tokenized | Numeric values in fields | Accuracy
--------|-------|--------|---------------|---------------------|-----------------|--------------------------|---------
1       | Yes   | No     | No            | Yes                 | Yes             | Yes                      | 75%
2       | Yes   | No     | Yes           | Yes                 | Yes             | Yes                      | 75%
3       | Yes   | No     | No            | Yes                 | No              | Yes                      | 67%
4       | Yes   | Yes    | No            | Yes                 | Yes             | Yes                      | 63%
5       | Yes   | Yes    | No            | Yes                 | Yes             | No                       | 60%
6       | Yes   | No     | No            | No                  | Yes             | Yes                      | 87%

(The ‘Title’, ‘Author’ and ‘Category Name’ columns indicate the fields included in each trial.)

Findings:

  • The first trial was based only on ‘Title’, using the standard formula of the Naïve Bayes model. When I observed a few misclassifications, I found that some documents contained the word “Historical” and yet weren't categorized as “American History”. So I thought I could include the category names as part of the Vocabulary (Trial #2)
  • As we see, there wasn't any significant change in the overall model accuracy, hence I decided not to use it
  • Also, in trial #1, I had normalized (and excluded stop words from) all the words appearing in the document titles in the training dataset, both while building the model and while classifying the test data. When I instead tried retaining the words as they were, the accuracy dipped; hence normalization helped
  • Including the ‘Author’ field didn't improve the results; in fact, it deteriorated them further
  • Excluding numeric occurrences in the fields doesn't improve accuracy either (numeric years help in prediction)
  • A few documents were being narrowly misclassified because of the prior category probabilities (one prior was far greater than another, and without the priors those documents would have been classified correctly). Hence I decided to try excluding the prior category probabilities, and the accuracy improved considerably; that was the best I could get from these experiments (87%)