Friday, March 13, 2015

Analytics Projects

Text Analytics:

  • Why so serious...??? This is a project that I worked on with Ling Jin (@ljin8118) and Peter Schmidt (@pjschmidt007) as part of our Text Analytics course at Northwestern University. The aim was to analyze the script of a movie or a play and compare it with reality. We chose the Dark Knight script and text mined it to observe a few trends. Read more to see whether we came out right or whether the entire effort was far from reality. Cheers :)
  • Text Classification Using Naive Bayes Algorithm This was a 3-day assignment that I worked on while I was in the Analytics program at Northwestern University. It is an implementation of the Multinomial Naive Bayes Algorithm in Java. Text Analytics was by far my favorite course in the program and I thoroughly enjoyed working on this one. Hope you guys like it and find it helpful. Suggestions/comments/criticism are welcome!
         

Text Classification using Naive Bayes algorithm

This was a 3-day assignment that I worked on while I was in the Analytics program at Northwestern University. It is an implementation of the Multinomial Naive Bayes Algorithm in Java. Text Analytics was by far my favorite course in the program and I thoroughly enjoyed working on this one. Hope you guys like it and find it helpful. Suggestions/comments/criticism are welcome!

Problem: Classify books into pre-defined categories based on their title, author name, and content. The categories were:
AMERICANHISTORY
BIOLOGY
COMPUTERSCIENCE
CRIMINOLOGY
ENGLISH
MANAGEMENT
MARKETING
NURSING
SOCIOLOGY

Input data format:
First line contains N, H where N = number of training data records and
H = list of headers.  N lines of training data will follow this. Each
field in N lines is tab separated. The next line will have M, H where
M = number of test data records and H = list of headers. M lines of
test data will follow this, each field in a line is tab separated.

Training data has following columns:
categoryLabel, bookID, bookTitle, bookAuthor

Test data has following columns:
bookID, bookTitle, bookAuthor

Example of training data:

N=3 H=[categoryLabel, bookID, bookTitle, bookAuthor]
AMERICAN HISTORY b9418230 American History Survey Brinkley, Alan
SOCIOLOGY b16316063 Life In Society Henslin, James M.
ENGLISH b14731993 Reading for Results Flemming, Laraine E.
M=2 H=[bookID, bookTitle, bookAuthor]
b15140145 Efficient and Flexible Reading McWhorter
b15857527 These United States Unger, Irwin
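
The header-plus-records layout above can be read with a short parsing routine. The sketch below is illustrative rather than the original code; it assumes tab-separated fields and header lines of the "N=3 H=[...]" form shown in the example (the class and method names are my own).

```java
import java.util.ArrayList;
import java.util.List;

class InputParser {
    // Reads a header line like "N=3 H=[categoryLabel, bookID, ...]" and
    // returns the record count (assumes the count sits between '=' and the
    // first space, as in the example above).
    static int recordCount(String headerLine) {
        String n = headerLine.substring(headerLine.indexOf('=') + 1,
                                        headerLine.indexOf(' '));
        return Integer.parseInt(n);
    }

    // Splits `count` record lines (starting at `start`) into their
    // tab-separated fields; for training rows the first field is the label.
    static List<String[]> parseBlock(List<String> lines, int start, int count) {
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add(lines.get(start + i).split("\t"));
        }
        return records;
    }
}
```

Calling `recordCount` on the first header gives N, so the next N lines go to the training set; the same is then repeated with M for the test set.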


Output:
A list of all books from the Test dataset with their Book Ids and their Predicted Category.

Solution introduction:
For the given document classification problem, I decided to implement the Multinomial Naive Bayes model. Classification rule: classify(feat1,...,featN) = argmax over cat of P(cat)*PROD(P(feat(i)|cat)). I implemented this in Java (using Eclipse). Here the features are words.
  • Multinomial model (a document is represented by a feature vector of integer elements whose values are the frequencies of the corresponding words in the document) preferred over Bernoulli (a feature vector of binary elements taking value 1 if the corresponding word is present in the document and 0 if it is not)
  • Laplace’s law of succession (add-one smoothing) included to eliminate the possibility of zero probabilities
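As a concrete illustration of the smoothed multinomial estimate, P(word|cat) = (count(word, cat) + 1) / (totalWords(cat) + |V|), where |V| is the vocabulary size. This sketch is illustrative, not the original code:

```java
import java.util.Map;

class Smoothing {
    // Laplace (add-one) smoothed multinomial estimate:
    // P(word | category) = (count(word) + 1) / (totalWords + vocabularySize)
    // Unseen words get a small but nonzero probability of 1/(total + |V|),
    // so a single unknown word can never zero out a whole product.
    static double wordProbability(Map<String, Integer> counts,
                                  String word, int vocabularySize) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        int count = counts.getOrDefault(word, 0);
        return (count + 1.0) / (total + vocabularySize);
    }
}
```

For example, with category counts {history: 3, survey: 1} and a vocabulary of 5 words, P("history"|cat) = (3+1)/(4+5) = 4/9, and any unseen word gets 1/9.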
Design:

  1. Read the input data and split it into Training dataset and Test dataset
  2. Built the Multinomial Naïve Bayes Classifier using the Training dataset
    • I first started by considering just the ‘title’ field to build and classify documents (excluding stop words and normalizing the remaining words)
    • Next, I tried using the ‘title’ and ‘categoryName’ fields to build and classify documents
    • Then, I tried the ‘title’, ‘categoryName’ and ‘author’ fields
    • Lastly, I tried combinations of the ‘title’, ‘author’ and ‘contents’ fields
    • I also experimented with excluding the category prior probability from the final computation
    • The results of each of these are summarized below
  3. Classified the documents in Test dataset using this classifier
Code:
https://github.com/aniketd006/NaiveBayes

Details:
Class ‘Category’:

  • Attributes:
    • categoryName – Name of the Category
    • categoryProbability – Prior Probability of the Category
    • wordProbability – HashMap of (word-probability) pair of the ‘title’ field of the documents in the category where probability is the category conditional probability of that word
    • authWordProb – HashMap of (word-probability) pair of the ‘author’ field of the documents in the category where probability is the category conditional probability of that word
    • contentWordProb – HashMap of (word-probability) pair of the ‘contents’ (table of contents from input2.txt) field of the documents in the category where probability is the category conditional probability of that word
  • Methods:
    • Methods to ‘set’ and ‘get’ each of these attributes
    • probabilityCalculation - Calculates P(feat(i)|C) - the probability of feat(i) occurring in that document class
Class ‘BayesClassifier’:


  • readData – to read in data from both input files and split the first into training and test datasets
  • buildClassifier – Builds the classifier. Creates an array of Category objects, each of which holds the vocabulary of features (feat) and P(feat(i)|C) – the probability of feat(i) occurring in that document class (computed in the Category class method) – for 'title', 'author' and 'table of contents'. Also calculates the category prior probabilities
  • createWordList – takes a String of words as input and returns them tokenized and normalized using an English Analyzer
  • classifyDocuments – Classifies each document into the category with the highest class conditional probability. Words are selected from the 'title', 'author' and 'table of contents' columns separately (only words from 'title' were used in the final implementation) for each document in the test dataset, and the corresponding P(cat)*PROD(P(feat(i)|cat)) values are calculated. Finally, classify(feat1,...,featN) = argmax(P(cat)*PROD(P(feat(i)|cat)))
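
The argmax step above can be sketched as follows. One caveat: this sketch sums log probabilities rather than multiplying raw probabilities as described above; the argmax is unchanged, but log space avoids numerical underflow when documents have many words. The names and the fallback probability are illustrative, not from the original code, and it assumes smoothed per-category word probabilities have already been computed.

```java
import java.util.List;
import java.util.Map;

class Classifier {
    // Picks argmax over categories of log P(cat) + sum(log P(word|cat)).
    // Summing logs gives the same argmax as multiplying probabilities but
    // avoids underflow from products of many small numbers.
    static String classify(List<String> words,
                           Map<String, Double> priors,
                           Map<String, Map<String, Double>> wordProbs) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : priors.keySet()) {
            double score = Math.log(priors.get(cat));
            for (String w : words) {
                // With Laplace smoothing every word should already have mass;
                // the tiny fallback just guards against a missing entry.
                score += Math.log(wordProbs.get(cat).getOrDefault(w, 1e-9));
            }
            if (score > bestScore) {
                bestScore = score;
                best = cat;
            }
        }
        return best;
    }
}
```

Dropping the `Math.log(priors.get(cat))` term reproduces the prior-free variant that, as noted in the results below, worked best on this dataset.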

Results:

Category        | Actual | Predicted | Difference
----------------|--------|-----------|-----------
AMERICANHISTORY |      8 |        10 |          2
BIOLOGY         |      7 |         4 |         -3
COMPUTERSCIENCE |      4 |         4 |          0
CRIMINOLOGY     |      6 |         6 |          0
ENGLISH         |     12 |        11 |         -1
MANAGEMENT      |      4 |         5 |          1
MARKETING       |      6 |         7 |          1
NURSING         |      7 |         6 |         -1
SOCIOLOGY       |      6 |         7 |          1


  • The final implementation achieved 86.67% accuracy (52/60)
  • This final model considered only the ‘title’ field to build and classify the documents
  • The prior category probabilities were not included in the P(C(i)|D(k)) calculation
  • I also started using the contents of the books, but that wasn't too helpful in improving the accuracy (more time and appropriate tweaking of the model might improve it)
Trials:

Category        | Documents | Prior Probability
----------------|-----------|------------------
AMERICANHISTORY |        19 |                7%
BIOLOGY         |        22 |                8%
COMPUTERSCIENCE |        18 |                7%
CRIMINOLOGY     |        28 |               10%
ENGLISH         |        55 |               20%
MANAGEMENT      |        22 |                8%
MARKETING       |        19 |                7%
NURSING         |        24 |                9%
SOCIOLOGY       |        63 |               23%
The above table is compiled from the training dataset
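
The priors above are simply each category's document count divided by the training total (270 documents here); for example, SOCIOLOGY has 63/270 ≈ 23%. A minimal sketch (class and method names are illustrative):

```java
import java.util.Map;

class Priors {
    // Prior probability of a category = its document count / total documents.
    static double prior(Map<String, Integer> docCounts, String category) {
        int total = docCounts.values().stream().mapToInt(Integer::intValue).sum();
        return (double) docCounts.get(category) / total;
    }
}
```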

Trial # | Title | Author | Category Name | Prior Probabilities | Words Tokenized | Numeric values in fields | Accuracy
--------|-------|--------|---------------|---------------------|-----------------|--------------------------|---------
1       | Yes   | No     | No            | Yes                 | Yes             | Yes                      | 75%
2       | Yes   | No     | Yes           | Yes                 | Yes             | Yes                      | 75%
3       | Yes   | No     | No            | Yes                 | No              | Yes                      | 67%
4       | Yes   | Yes    | No            | Yes                 | Yes             | Yes                      | 63%
5       | Yes   | Yes    | No            | Yes                 | Yes             | No                       | 60%
6       | Yes   | No     | No            | No                  | Yes             | Yes                      | 87%

(The ‘Title’, ‘Author’ and ‘Category Name’ columns indicate the fields included in each trial.)

Findings:

  • The first trial was based only on ‘Title’, using the standard formula of the Naïve Bayes model. When I observed a few misclassifications, I found that some documents contained the word “Historical” and yet weren't categorized as “American History”. So I thought I could include the category names as part of the Vocabulary (Trial #2)
  • As we see, there wasn't any significant change in the overall model accuracy, hence I decided not to use it
  • Also, in trial #1, I had normalized (and excluded stop words from) all the words appearing in the document titles in the training dataset, both while building the model and while classifying the test data. When I instead tried retaining the words as they were, the accuracy dipped; hence normalization helped
  • Including the ‘Author’ field didn't improve the results; in fact, it deteriorated them further
  • Excluding numeric occurrences in the fields doesn't improve accuracy either (numeric years help in prediction)
  • A few documents were being narrowly misclassified because of the prior category probabilities (one prior was far greater than another, and without the priors those documents would have been classified correctly). Hence I decided to try excluding the prior category probabilities, and the accuracy improved considerably; that was the best I could get from these experiments (87%)