This was a 3-day assignment that I worked on while I was in the Analytics program at Northwestern University. It is an implementation of the Multinomial Naive Bayes algorithm in Java. Text Analytics was by far my favorite course in the program and I thoroughly enjoyed working on this one. I hope you find it useful. Suggestions/comments/criticism are welcome!
Problem: Classify books into pre-defined categories based on their title, author name, and content. The categories were:
AMERICANHISTORY | BIOLOGY | COMPUTERSCIENCE | CRIMINOLOGY | ENGLISH | MANAGEMENT | MARKETING | NURSING | SOCIOLOGY
Input data format:
The first line contains N and H, where N = number of training data records and H = the list of headers. N lines of training data follow, with tab-separated fields. The next line contains M and H, where M = number of test data records and H = the list of headers. M lines of test data follow, also with tab-separated fields.
Training data has the following columns:
categoryLabel, bookID, bookTitle, bookAuthor
Test data has the following columns:
bookID, bookTitle, bookAuthor
Example of training data:
N=3 H=[categoryLabel, bookID, bookTitle, bookAuthor]
AMERICAN HISTORY	b9418230	American History Survey	Brinkley, Alan
SOCIOLOGY	b16316063	Life In Society	Henslin, James M.
ENGLISH	b14731993	Reading for Results	Flemming, Laraine E.
M=2 H=[bookID, bookTitle, bookAuthor]
b15140145	Efficient and Flexible Reading	McWhorter
b15857527	These United States	Unger, Irwin
Output:
A list of all books from the test dataset with their book IDs and predicted categories.
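To make the input format above concrete, here is a minimal parsing sketch in Java. It assumes the header lines look exactly like the example (N=3 H=[...]) and that fields are tab-separated; the Book holder class and the overall structure are illustrative assumptions, not the code in the repository.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record holder; the actual project may organize fields differently.
class Book {
    String categoryLabel;   // empty for test records
    String bookID;
    String bookTitle;
    String bookAuthor;
}

public class InputReader {

    // Reads a "N=.. H=[..]" (or "M=.. H=[..]") header followed by tab-separated records.
    static List<Book> readBlock(BufferedReader in, boolean hasLabel) throws IOException {
        String header = in.readLine();   // e.g. "N=3 H=[categoryLabel, bookID, bookTitle, bookAuthor]"
        int count = Integer.parseInt(header.substring(header.indexOf('=') + 1, header.indexOf(' ')).trim());
        List<Book> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String[] fields = in.readLine().split("\t");
            Book b = new Book();
            int f = 0;
            if (hasLabel) b.categoryLabel = fields[f++];
            b.bookID = fields[f++];
            b.bookTitle = fields[f++];
            b.bookAuthor = fields[f];
            records.add(b);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            List<Book> training = readBlock(in, true);   // N training records
            List<Book> test = readBlock(in, false);      // M test records
            System.out.println(training.size() + " training, " + test.size() + " test records read");
        }
    }
}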
Solution introduction:
For the given document classification problem, I decided to implement the Multinomial Naive Bayes model. Classification rule: classify(feat_1, ..., feat_N) = argmax over cat of P(cat) * PROD_i P(feat_i | cat). I implemented this in Java (using Eclipse). Here the features are words.
- Multinomial model (a document is represented by a feature vector with integer elements whose values are the frequencies of the words in the document) was preferred over the Bernoulli model (a document is represented by a feature vector with binary elements taking value 1 if the corresponding word is present in the document and 0 if it is not)
- Laplace's law of succession (add-one smoothing) was included to eliminate the possibility of zero probabilities; a minimal sketch of the smoothed classification rule follows below
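The sketch below shows the scoring rule with add-one smoothing in plain Java. The class, method, and parameter names are illustrative, not taken from the repository.

import java.util.List;
import java.util.Map;

// Minimal sketch of the Multinomial Naive Bayes scoring rule with add-one (Laplace) smoothing.
public class MnbSketch {

    // P(word | cat) with add-one smoothing:
    // (count of word in cat + 1) / (total words in cat + vocabulary size)
    static double wordProbability(Map<String, Integer> categoryCounts, String word,
                                  int totalWordsInCategory, int vocabularySize) {
        int count = categoryCounts.getOrDefault(word, 0);
        return (count + 1.0) / (totalWordsInCategory + vocabularySize);
    }

    // classify(feat_1, ..., feat_N) = argmax over cat of P(cat) * PROD_i P(feat_i | cat)
    static String classify(List<String> words,
                           Map<String, Map<String, Integer>> countsPerCategory,
                           Map<String, Double> priorPerCategory,
                           int vocabularySize) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : countsPerCategory.keySet()) {
            int totalWords = countsPerCategory.get(cat).values().stream().mapToInt(Integer::intValue).sum();
            double score = priorPerCategory.get(cat);
            for (String w : words) {
                score *= wordProbability(countsPerCategory.get(cat), w, totalWords, vocabularySize);
            }
            if (score > bestScore) {
                bestScore = score;
                best = cat;
            }
        }
        return best;
    }
}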
Design:
- Read the input data and split it into training and test datasets
- Built the Multinomial Naive Bayes classifier using the training dataset
- I first started by considering just the 'title' field to build the model and classify documents (excluding stop words and normalizing the remaining words)
- Next, I tried using the 'title' and 'categoryName' fields to build the model and classify documents
- Then, I tried using the 'title', 'categoryName' and 'author' fields to build the model and classify documents
- Lastly, I tried combinations of the 'title', 'author' and 'contents' fields to build the model and classify documents
- I also experimented with excluding the category prior probability ('categoryPriorProbability') from the final computation
- The results of each of these trials are summarized below, and a high-level outline of the pipeline follows this list
- Classified the documents in the test dataset using this classifier
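As a rough outline of how these steps fit together, here is a hypothetical shell of the pipeline. The method names follow the design steps above (readData, buildClassifier, classifyDocuments), but the signatures, file names, and bodies are placeholders, not the repository's actual code.

// Hypothetical outline of the pipeline; bodies are placeholders describing each step.
public class PipelineOutline {

    void readData(String trainTestFile, String contentsFile) {
        // Parse "N=.. H=[..]", read N tab-separated training records,
        // then "M=.. H=[..]" and M test records; keep them in separate lists.
    }

    void buildClassifier() {
        // For each category: compute the prior P(cat) from document counts and build
        // smoothed word probabilities from 'title' (and optionally 'author'/'contents').
    }

    void classifyDocuments() {
        // For each test record: tokenize the title, score every category with
        // P(cat) * PROD(P(word|cat)) (the prior was dropped in the final model), output the argmax.
    }

    public static void main(String[] args) {
        PipelineOutline pipeline = new PipelineOutline();
        pipeline.readData("input1.txt", "input2.txt");  // file names assumed; input2.txt holds tables of contents
        pipeline.buildClassifier();
        pipeline.classifyDocuments();
    }
}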
Code:
https://github.com/aniketd006/NaiveBayes
Details:
Class ‘Category’:
- Attributes:
- categoryName – Name of the Category
- categoryProbability – Prior Probability of the Category
- wordProbability – HashMap of (word, probability) pairs built from the 'title' field of the documents in the category, where the probability is the category-conditional probability of that word
- authWordProb – HashMap of (word, probability) pairs built from the 'author' field of the documents in the category, where the probability is the category-conditional probability of that word
- contentWordProb – HashMap of (word, probability) pairs built from the 'contents' field (table of contents from input2.txt) of the documents in the category, where the probability is the category-conditional probability of that word
- Methods:
- Methods to ‘set’ and ‘get’ each of these attributes
- probabilityCalculation – calculates P(feat(i)|C), the probability of feat(i) occurring in that document class; a minimal sketch of this class is shown below
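The following is a minimal sketch of what such a Category class could look like, assuming add-one smoothing over 'title' words as described earlier. Field names follow the attribute list above, but the getters/setters shown and the body of probabilityCalculation are illustrative, not the repository code.

import java.util.HashMap;
import java.util.List;

// Minimal sketch of the Category class described above (illustrative, not the repository code).
public class Category {
    private String categoryName;                                          // name of the category
    private double categoryProbability;                                   // prior probability P(cat)
    private HashMap<String, Double> wordProbability = new HashMap<>();    // P(word|cat) from 'title'
    private HashMap<String, Double> authWordProb = new HashMap<>();       // P(word|cat) from 'author'
    private HashMap<String, Double> contentWordProb = new HashMap<>();    // P(word|cat) from 'contents'

    public String getCategoryName() { return categoryName; }
    public void setCategoryName(String name) { this.categoryName = name; }
    public double getCategoryProbability() { return categoryProbability; }
    public void setCategoryProbability(double p) { this.categoryProbability = p; }
    public HashMap<String, Double> getWordProbability() { return wordProbability; }

    // Fills wordProbability with P(word|cat) using add-one smoothing:
    // (count of word in this category's titles + 1) / (total title words in category + vocabulary size)
    public void probabilityCalculation(List<String> titleWords, int vocabularySize) {
        HashMap<String, Integer> counts = new HashMap<>();
        for (String w : titleWords) {
            counts.merge(w, 1, Integer::sum);   // raw word counts for this category
        }
        int totalWords = titleWords.size();
        for (String w : counts.keySet()) {
            wordProbability.put(w, (counts.get(w) + 1.0) / (totalWords + vocabularySize));
        }
    }
}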
Class ‘BayesClassifier’:
- readData – reads the data from both input files and splits the first into training and test datasets
- buildClassifier – builds the classifier. Creates an array of Category objects, each of which holds the vocabulary of features (feat) and P(feat(i)|C), the probability of feat(i) occurring in that document class (computed in the Category class), for 'title', 'author' and 'table of contents'. Also calculates the category prior probabilities
- createWordList – takes a String of words as input; the words are tokenized and normalized using an English analyzer
- classifyDocuments – classifies each document into the category with the highest class-conditional score. Words are taken from the 'title', 'author' and 'table of contents' columns separately (only words from 'title' were used in the final implementation) for each document in the test dataset, and the corresponding P(cat) * PROD(P(feat(i)|cat)) is computed. Finally, classify(feat_1, ..., feat_N) = argmax over cat of P(cat) * PROD(P(feat(i)|cat)). A sketch of the tokenization and classification steps follows this list
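For illustration, here is a minimal sketch of the two most involved methods. It assumes Lucene's EnglishAnalyzer for the tokenization/normalization step (the write-up only says "an English analyzer") and reuses the Category sketch shown earlier; the names, signatures, and the fallback probability for unseen words are assumptions, not the repository code.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative sketch of createWordList and classifyDocuments (not the repository code).
public class BayesClassifierSketch {

    // Tokenizes and normalizes a string (lowercasing, stop-word removal, stemming)
    // using Lucene's EnglishAnalyzer, assumed here as the "English analyzer".
    static List<String> createWordList(String text) throws IOException {
        List<String> words = new ArrayList<>();
        try (EnglishAnalyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("field", new StringReader(text))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                words.add(term.toString());
            }
            stream.end();
        }
        return words;
    }

    // classify(feat_1, ..., feat_N) = argmax over cat of P(cat) * PROD_i P(feat_i | cat).
    // In the final model the prior P(cat) was dropped, leaving only the product of word probabilities.
    static String classifyDocument(String title, Category[] categories, boolean usePrior) throws IOException {
        List<String> words = createWordList(title);
        String best = null;
        double bestScore = -1.0;
        for (Category cat : categories) {
            double score = usePrior ? cat.getCategoryProbability() : 1.0;
            for (String w : words) {
                // Unseen words fall back to a small smoothed probability (assumed value).
                score *= cat.getWordProbability().getOrDefault(w, 1e-6);
            }
            if (score > bestScore) {
                bestScore = score;
                best = cat.getCategoryName();
            }
        }
        return best;
    }
}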
Results:
Category | Actual | Predicted | Difference
AMERICANHISTORY | 8 | 10 | 2
BIOLOGY | 7 | 4 | -3
COMPUTERSCIENCE | 4 | 4 | 0
CRIMINOLOGY | 6 | 6 | 0
ENGLISH | 12 | 11 | -1
MANAGEMENT | 4 | 5 | 1
MARKETING | 6 | 7 | 1
NURSING | 7 | 6 | -1
SOCIOLOGY | 6 | 7 | 1
- The final implementation achieved an 86.67% accuracy (52/60)
- This final model considered only the 'title' field to build the model and classify the documents
- The prior category probabilities were not included in the P(C(i)|D(k)) calculation
- I also started using the contents of the books, but it wasn't very helpful in improving accuracy (more time and appropriate tweaking of the model might improve it)
Trials:
Category | Documents | Prior Probability
AMERICANHISTORY | 19 | 7%
BIOLOGY | 22 | 8%
COMPUTERSCIENCE | 18 | 7%
CRIMINOLOGY | 28 | 10%
ENGLISH | 55 | 20%
MANAGEMENT | 22 | 8%
MARKETING | 19 | 7%
NURSING | 24 | 9%
SOCIOLOGY | 63 | 23%
The above table is compiled from the training dataset
Trial # | Title | Author | Category Name | Prior Probabilities | Words Tokenized | Numeric values in fields | Accuracy
1 | Yes | No | No | Yes | Yes | Yes | 75%
2 | Yes | No | Yes | Yes | Yes | Yes | 75%
3 | Yes | No | No | Yes | No | Yes | 67%
4 | Yes | Yes | No | Yes | Yes | Yes | 63%
5 | Yes | Yes | No | Yes | Yes | No | 60%
6 | Yes | No | No | No | Yes | Yes | 87%
(The Yes/No columns indicate which fields were included and which settings were applied in each trial.)
Findings:
- The first trial was based only on 'Title', using the standard formula of the Naive Bayes model. When I looked into a few misclassifications, I found that some documents contained the word "Historical" and yet weren't categorized as "American History". So I thought I could include the category names as part of the vocabulary (Trial #2)
- As seen above, there wasn't any significant change in the overall model accuracy, so I decided not to use it
- Also, in Trial #1 I had normalized (excluding stop words) all the words appearing in the titles of the training documents, both while building the model and while classifying the test data. I then tried retaining the words as they were (Trial #3) and the accuracy dipped, so normalization helped
- Including the 'Author' field didn't improve the results; in fact, it made them worse
- Excluding numeric values from the fields doesn't improve accuracy either (numeric years help in prediction)
- There were a few documents that were narrowly misclassified because of the prior category probabilities (one prior was far greater than the other, and without the prior the classification would have been correct). Hence I decided to try excluding the prior category probabilities; the accuracy improved considerably, and that was the best I could get from these experiments (87%)