Monday, August 4, 2014

Information: FAQs about the MSiA program

I have put together this post to serve as a source of information about Northwestern University's MS in Analytics (MSiA) program beyond what is provided on the official university website: http://www.analytics.northwestern.edu/. If you are interested in this program, I recommend going through the official website thoroughly before reading this post.
I hope this post answers the questions on your mind. If there are still any questions not touched upon here, please do leave a comment and I shall try to answer them for you.

NOTE: ALL THE ANSWERS MENTIONED BELOW ARE STRICTLY PERSONAL BASED ON MY UNDERSTANDING, EXPERIENCE AND DISCUSSIONS WITH THE ADMINISTRATORS AND STUDENTS HERE.

1)         Background and skills relevant to this program
·          A background (education/work) in any field like Economics/Econometrics, Math, Stats, Computer/Information Science/Technology, Business Administration, etc. is relevant to this program. This is because Analytics is primarily a combination of Business, Math and Technology. To evaluate whether you are the right candidate for this program, you could ask yourself the following questions and research the answers:

  • What do I know about Analytics?
  • Why Analytics for me?
  • Where is it applied?
  • Is it aligned to my career goals?
  • Where do I see myself in future after this program?
  • What skill-set do I already possess and what would I need to develop to be successful in this field?
2)         What are the acceptance criteria?
·            According to me, the following are the criteria on which the Admission Committee would base their decision:
  • Education or courses taken in undergrad in a relevant field (Economics, Math, Stats, Computer/Information Science/Technology, Business Administration); the class distribution can be found here: http://www.analytics.northwestern.edu/current-students/index.html
  • Performance in undergrad will add a lot of weight to your case (Good acads*, Good University*)
  • Relevant work experience – if any (any work-ex related to working with data, technology or business management)
  • Some prior exposure to computer programming is needed (for applicants from a non-computer-science background)
  • GRE/GMAT test scores (the 2015-16 batch will be the first for which standardized test scores are considered, so it is hard to provide a benchmark at this stage)
  • Needless to say, your SOP has to be top notch and very convincing; your resume very professional and of really high standards; and your recommendations aligned with your case, from reliable and credible supervisors/colleagues
*see next question

3)         But do they really focus on undergraduate grades/university for international students?
·            Since there is no direct conversion from international grades to a US 4-point GPA, I do not think they would focus too much on your undergrad grades as long as you are above the 3.0/4.0 cutoff or its equivalent (there are various avenues where you can get this conversion done if you are an international student, and I think if you have more than 50% from any of the Indian universities then you should be good). Their main focus is your suitability for the program through your undergrad courses and/or your work experience, and how your career goals align with this field.

4)         How important is work experience to get into this program?
·            There are students in both cohorts so far who joined this program right out of undergrad with minimal work experience (internships). But all of them had the relevant educational background required for this program. So work experience is not necessary, but relevant experience would definitely add a lot of weight to your application.

5)         Pre-requisite courses or preparation required before joining the MSiA program.
·            There are no pre-requisites as such, since students come from really diverse backgrounds. But knowledge of elementary stats, probability and calculus is pretty important, so brush up on these whenever you get time. More information can be found here: http://www.analytics.northwestern.edu/prospective-students/index.html

6)         Placements at MSiA
·            Job prospects are excellent after graduating from MSiA. Most of the students from the first cohort of 2012-13 got multiple good offers across industries like finance, technology, retail and insurance. If you have any specific questions regarding jobs, you can try getting in touch with them or with the program director/asst. director. This page will provide more information: http://www.analytics.northwestern.edu/current-students/career%20placement.html

7)         Program expenses (entire duration excluding internship period)
·            $60-61k (tuition fees) + $3.4k (health insurance) + $10k (off-campus expenses) + an additional $2k (textbooks, travel, partying, etc.) – rough estimates that vary from person to person. Tuition is $15,038 per quarter for 2013-14 (3 quarters). You can expect it to increase by maybe another 5% for the Fall quarter of next year (can't bet on this!), so all together it will be around $60-61k in tuition fees. Additional costs would be the $3.4k annual health insurance for international students and living expenses (staying off-campus is cheaper; average rent per month per person could be anywhere between $400 and $600). One can stay on-campus, which is generally more expensive than staying off-campus (http://www.northwestern.edu/living/). Most textbooks have either an e-version or are available in the library, so you can expect textbook purchases to be minimal. For more information: http://www.analytics.northwestern.edu/prospective-students/tuition-and-fees.html.

8)         Paid internships, on-campus jobs and assistantships.
·            There are on-campus opportunities which you can research on the University website. Also, a few students work part-time for companies (paid). As far as I know, there are no assistantships available, but you can always mail Lindsay (lindsaymontanari@northwestern.edu) and enquire about it. There are also 7 scholarships (50% tuition waiver) provided by NU for this program, but I am not sure on what basis they award these (I personally feel it is for the early applicants). For more information: http://www.analytics.northwestern.edu/prospective-students/tuition-and-fees.html

9)         Industry exposure and relevance in this program
·            MSiA is a professional program instituted to cater to the growing industry demand for skilled Analytics professionals - the dearth of which is plaguing small and large businesses alike. Hence this program is structured so that industry exposure is maximized, enabling students to directly apply the concepts learned in the classroom out in the real world. The following link gives more information: http://www.analytics.northwestern.edu/prospective-students/index.html

10)      Is it eligible for OPT?
·            Yes, it is eligible for OPT since it comes under the STEM category.

11)     Acceptance rate at MSiA?
·            I do not have an exact figure to quote but all I can say is that there are very few established pure-analytics programs in this country (and probably the world), though this count is increasing year over year, and MSiA is one among them. But at the same time it’s a new field and yet to reach its peak in terms of popularity among candidates unlike, say, a computer science program. So if you have a great case for yourself on why you should be admitted to the program and if you satisfy the necessary pre-requisites mentioned above, then you can definitely get through.

12)      Which are the other universities that offer Analytics degree programs?
·         As mentioned earlier, there is a huge surge in Analytics-related programs in the US. A few years back there were only a handful that offered a full-fledged degree concentrated purely on Analytics, and candidates didn't have a lot of options to get a degree in this field. But this is changing at a rapid pace. The following link gives you a great overview of all the analytics degree programs in this country:

Wednesday, January 29, 2014

Why so serious....????

This is one of the projects that I worked on with Ling Jin (@ljin8118) and Peter Schmidt (@pjschmidt007) as a part of our Text Analytics course at Northwestern University. It was one of the most exciting projects to have worked on, and in the process we learned the latest cutting-edge techniques used in the field of Text Analytics and Text Mining. Hope you will enjoy it as much as we enjoyed doing it! Cheers :)

Goal



Provide a textual analysis of the movie script of The Dark Knight, which was robbed of the Best Picture Oscar at the 81st Annual Academy Awards on February 22, 2009. All project team members are still bitter about this fact. This assignment hopes to resurrect the greatness that is The Dark Knight.


More seriously though, given a script, the text analytics conducted in this assignment would be able to produce insights into the genre, mood, plot, theme and characters. Ideally, the analysis is intended to understand and answer the who, what, when, where and why of a movie.

Objectives 



Specifically, the objectives of the textual analysis of The Dark Knight will cover:

  • Determine the major characters in the script 
  • Show the character to character interaction 
  • Provide insights into sentiment by character 
  • Show how sentiment changes over time 
  • Determine major themes/topics of the script

Approach





Processing Steps

  1. Acquire the movie script of choice 
  2. Parse the script into lines by scenes and lines by character 
  3. Tokenize, normalize, stem the lines of dialogue as appropriate 
  4. Build an index based on available components for subsequent queries 
  5. Perform part of speech (POS) Tagging on the lines of dialogue 
  6. From the POS Tagging, perform sentiment scoring on the lines of dialogue 
  7. Perform named entity recognition 
  8. Perform co-reference 
  9. Perform topic modeling 
  10. Analyze results 
  11. Produce visualizations

Results

Character Identification







The two visuals above carry the same information about the important characters in the movie, just in two different representations. The first visual is a bubble chart where the size of the bubble is proportional to the number of lines spoken by the character.

The second one is a heat map where, again, the area represents the quantity of lines of dialogue across scenes by character. These two visuals help us identify the major characters of the movie. One can see that Harvey Dent (aka Two-Face), Gordon, The Joker, Bruce Wayne, Batman, Rachel, Fox, and Alfred were easily the major characters of the movie, with Lau, Chechen, Maroni and Ramirez all playing supporting roles. It is interesting to note that in the script, Two-Face is never named as a separate character, unlike Bruce Wayne and Batman. Combining Bruce Wayne's and Batman's lines would have made him the most prominent character over Harvey Dent.


Character Interactions





Now that the major characters are established, the next obvious step would be to identify how these characters interact with each other.

The above visual gives us insight into this. Each node is a character, and each edge tells us that the two characters it connects interacted at least once in the movie. Our definition of interaction is when two or more characters speak in a single scene. Hence, the more interactions a character has with distinct characters, the bigger the size of its node.
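Under the assumption that each scene maps to the set of characters who speak in it, building such an interaction graph can be sketched as follows (the scene data below is hypothetical, not the real parsed script):

```python
from itertools import combinations
from collections import defaultdict

def interaction_graph(scenes):
    """Build an undirected interaction graph: characters are nodes, and
    an edge connects two characters who speak in the same scene.
    `scenes` maps a scene id to the set of speaking characters."""
    neighbors = defaultdict(set)
    for speakers in scenes.values():
        for a, b in combinations(sorted(speakers), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)

# Hypothetical scene-to-speakers data.
scenes = {
    1: {"BATMAN", "GORDON"},
    2: {"JOKER", "BATMAN"},
    3: {"ALFRED", "BRUCE WAYNE"},
}
graph = interaction_graph(scenes)
# Node "size" is the number of distinct characters interacted with.
print(len(graph["BATMAN"]))  # → 2 (GORDON and JOKER)
```

In Gephi, the same idea corresponds to an edge list of co-occurring character pairs, with node degree driving the node size.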

The nodes (characters) marked in red are the central characters. Most of the characters who have a lot of dialogue also have more interactions with distinct characters. But are there exceptions, i.e. characters who have a lot of lines but fewer interactions (maybe someone like Alfred – having watched the movie), or vice versa? Let's look further to see what was observed.


Sentiment over time

Below is a visual description of the sentiment of the scenes over time. The methodology for calculating the sentiment of each scene was to first split the scene into dialogues by individual. Each dialogue was then run through the design process explained above, producing a senti score per dialogue, and the average of these scores gave us the senti score of the scene. As we can see, this was a dark, dark movie.
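The scene-level aggregation described above amounts to a simple mean of the per-dialogue scores; a minimal sketch (the scores here are placeholders, not output of the actual SentiWordNet pipeline):

```python
def scene_sentiment(dialogue_scores):
    """Average the senti scores of all dialogues in a scene.
    `dialogue_scores` is a list of per-dialogue sentiment scores."""
    if not dialogue_scores:
        return 0.0  # a scene with no scored dialogue counts as neutral
    return sum(dialogue_scores) / len(dialogue_scores)

# Hypothetical per-dialogue scores for one scene.
print(scene_sentiment([-0.5, -0.25, 0.0]))  # → -0.25
```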



We also looked at how character sentiment varied over time. Again, the methodology was similar to the one above, but by character rather than by scene.





BATMAN vs. JOKER

What does the Batman say?



As the superhero in this movie, Batman does not talk that much (consistent with the identification of important characters above, and with the real movie). He does mention his opponents, all the killings, and of course the word "hero".

What does the Joker say?


The Joker talks quite often, which was confirmed earlier. He talks about his scars/the smile, his childhood and the whole plan. He also mentions all the names quite often.


Design and Implementation Challenges

Script

One of the first tasks was to find an appropriate script, which turned out to be a little harder than expected – sort of like finding a needle in a haystack. But after some perseverance, The Dark Knight script was found at:
http://www.pages.drexel.edu/~ina22/splaylib/Screenplay-Dark_Knight.HTM
There were 8,704 actual lines in the above script that needed to be parsed and fit together.

Parsing

There were several nuances to take into consideration for the actual script parsing. First of all, the script that was found was not broken down into multiple HTML tags representing the different portions of the script. Instead, the entire script was basically under one tag, which meant parsing an entire block of unstructured text. Hence we had to carefully find patterns and parse the script accordingly.
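As a rough illustration of that kind of pattern-based parsing (the format below is a simplified stand-in, not the actual layout of the Drexel page): treat an all-caps line as a character cue and attach the following lines to that character.

```python
import re

# Simplified screenplay-style text; the real script's layout differed
# and required more careful pattern hunting.
raw = """BATMAN
I'm whatever Gotham needs me to be.
JOKER
Why so serious?
"""

def parse_dialogue(text):
    """Split a block of text into (character, line) pairs, assuming a
    character cue is a line consisting only of capitals, spaces, or dots."""
    pairs, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.fullmatch(r"[A-Z][A-Z .']*", line):
            current = line  # new speaker
        elif current:
            pairs.append((current, line))
    return pairs

print(parse_dialogue(raw))
```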

Tokenization and Lemmatization

The Standard Analyzer in Lucene was chosen to handle the tokenization and any normalization required. It also provided lower-casing and stop-word filtering. As it was decided that stemming was not necessary for the analysis to be conducted, the Standard Analyzer was chosen over the English Analyzer, since the aggressive stemming performed by the PorterStemFilter was not needed to support the other downstream pipeline processes. The Standard Analyzer was then used consistently across the pipeline to prevent any inconsistency concerns.
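In spirit, that analyzer tokenizes, lower-cases, and drops stop words without stemming; a rough Python equivalent (the stop-word list here is a tiny illustrative subset, not Lucene's actual list):

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}  # illustrative subset

def analyze(text):
    """Tokenize on non-letter boundaries, lower-case, and drop stop
    words. No stemming is applied, mirroring the choice of the
    Standard Analyzer over the English Analyzer."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(analyze("Why so serious? The night is darkest before the dawn."))
```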


POS and Sentiment 

There were a couple of options available to perform sentiment mining on the dialogue in the script. 

Option 1:

The initial selection was SentiWordNet (http://sentiwordnet.isti.cnr.it). SentiWordNet is a lexical resource based on WordNet 3.0 (http://wordnet.princeton.edu) that is used for opinion mining. SentiWordNet assigns a score to each synset (a set of one or more synonyms) of a word for a particular part of speech. The parts of speech in SentiWordNet are defined as:

a = adjective


n = noun


r = adverb


v = verb

Obtaining the parts of speech from the Stanford NLP part-of-speech annotator would then require mapping from the parts of speech defined in the Penn Treebank tag set, http://www.computing.dcu.ie/~acahill/tagset.html, to the parts of speech defined in SentiWordNet so that a sentiment score could be produced.
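That mapping can be done off the Penn Treebank tag prefixes; a sketch covering only the common tag families (anything else maps to no SentiWordNet POS):

```python
def penn_to_swn(penn_tag):
    """Map a Penn Treebank POS tag to a SentiWordNet POS letter:
    a = adjective, n = noun, r = adverb, v = verb; None otherwise."""
    if penn_tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if penn_tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "n"
    if penn_tag.startswith("RB"):   # RB, RBR, RBS
        return "r"
    if penn_tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    return None

print(penn_to_swn("VBD"))  # → v
```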

The SentiWordNet resource is constructed as follows:

POS = part of speech


ID = along with POS, uniquely identifies a WordNet (3.0) synset.


PosScore = the positivity score assigned by SentiWordNet to the synset.


NegScore = the negativity score assigned by SentiWordNet to the synset.


SynsetTerms = terms, including the sense number, belonging to the synset


Gloss = glossary

Note: The objectivity score can be calculated as: ObjScore = 1 - (PosScore + NegScore)
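A SentiWordNet 3.0 entry is a tab-separated line with those six fields; parsing one and recovering the objectivity score looks roughly like this (the example line is abridged, with a shortened gloss, rather than copied verbatim from the resource):

```python
def parse_swn_line(line):
    """Parse one tab-separated SentiWordNet entry into a dict and
    derive ObjScore = 1 - (PosScore + NegScore)."""
    pos, syn_id, pos_score, neg_score, terms, gloss = line.split("\t")
    pos_score, neg_score = float(pos_score), float(neg_score)
    return {
        "pos": pos,
        "id": syn_id,
        "pos_score": pos_score,
        "neg_score": neg_score,
        "obj_score": 1.0 - (pos_score + neg_score),
        "terms": terms.split(),  # e.g. ["able#1"]
        "gloss": gloss,
    }

entry = parse_swn_line("a\t00001740\t0.125\t0\table#1\thaving the skills needed")
print(entry["obj_score"])  # → 0.875
```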

Option 2

Another option instead of SentiWordNet was to use the sentiment annotator in the Stanford NLP pipeline. The team discovered this new addition to the Stanford NLP during the course of the project; it is recent "bleeding edge" sentiment technology that Stanford is now including in the Stanford NLP. As their website explains, most sentiment prediction systems work just by looking at words in isolation, giving positive points for positive words and negative points for negative words and then summing up these points – which, by the way, is in essence what we were doing. That way, the order of words is ignored and important information is lost. In contrast, the sentiment annotator in the Stanford NLP uses a new deep learning model that builds up a representation of whole sentences based on the sentence structure, computing the sentiment based on how words compose the meaning of longer phrases. There are 5 classes of sentiment classification: very negative, negative, neutral, positive, and very positive.


Sentiment Scoring

There were several methods available for calculating the sentiment of character lines in a given scene. This is because the actual "sense" of a word was not known when it was passed to the parser for sentiment scoring. So if "good" had n senses in the lexical resource, it was not known which sense was used in the dialogue.

Method 1

The first method was to sum all the senti scores within the body of text then divide by the sum of all scores. This was the method that was provided in the demo on the SentiWordNet web site.

Method 2

The second method was to sum all the senti scores within the body of text then divide by the count of all scores.

Method 3

The third method was to simply sum all the senti scores within the body of text. Although we implemented both options for obtaining sentiment (i.e. SentiWordNet and the Stanford NLP sentiment annotator), the option chosen to score the dialogue was SentiWordNet; and although each scoring method was explored, the one finally used in our sentiment analysis of the script was Method 2 defined above.
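Methods 2 and 3 can be sketched directly (Method 1's ratio follows the SentiWordNet demo and is not reproduced here; the scores below are hypothetical per-token senti scores, not real pipeline output):

```python
def senti_score_mean(scores):
    """Method 2: sum of the senti scores divided by their count (the mean).
    This is the aggregation ultimately used for the script."""
    return sum(scores) / len(scores) if scores else 0.0

def senti_score_sum(scores):
    """Method 3: the raw sum of the senti scores."""
    return sum(scores)

scores = [0.25, -0.5, 0.0, -0.75]  # hypothetical per-token scores
print(senti_score_mean(scores))  # → -0.25
print(senti_score_sum(scores))   # → -1.0
```

Method 2 has the practical advantage that a long dialogue and a short one land on the same scale, which matters when averaging again at the scene level.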

Concluding Remarks

Text analytics is quite the involved process. As with most data analysis activities, a major portion of the time is spent identifying, acquiring and cleansing the source data. The field of text analytics is quite broad, with many best-of-breed components. However, text analytics does not have well-integrated toolsets, as you can observe from the solution crafted here, which had to leverage several technologies (Java, R, Excel, Gephi, Tableau), different libraries (Stanford NLP, Lucene) and various other packages to perform specific functions within the data pipeline. All in all, though, it has been shown that with some blood, sweat and tears (over 1,200 lines of code were written for this assignment), and by all means time, a text analytics tool can be built to analyze movie scripts with a pretty accurate view when compared to the overall reality of the movie. And lastly, it should be mentioned that the inherent complexities in the dialogue and the richness of the script should have guaranteed the Oscar for The Dark Knight!


WHY SO SERIOUS?