Xin Jin MSc Dissertation 2015/16

School of Computer Science

Automatic Sentiment Analysis of Chinese Social Media

Supervised by M. Hepple

Abstract

The arrival of the "Big Data era" has produced large volumes of data, presenting people with both opportunities and challenges. This is one of the reasons why automatic sentiment analysis has attracted increasing attention in recent years.

People share ideas and attitudes on social media such as Facebook and Twitter. Consequently, indicators of public mood can be extracted through sentiment analysis.

Overall sentiment classification has been investigated before. For example, Pang and Lee performed sentiment classification of English movie reviews using machine learning techniques, determining whether each review is positive or negative.

Sentiment classification of Chinese movie reviews, by contrast, remains a new and challenging area.

Lexicon-based approaches are considered first to establish a baseline. Machine learning algorithms, including Naive Bayes, support vector machines (SVMs) and maximum entropy, are then applied.
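
As a rough illustration of what a lexicon-based baseline can look like, the sketch below counts how many tokens of a review appear in a positive and in a negative word list and labels the review by the larger count. The two miniature lexicons, the function name lexicon_classify and the tie-breaking rule are assumptions made for illustration; they are not the resources or the exact procedure used in this project.

```python
# Hypothetical miniature sentiment lexicons (illustrative only,
# not the lexicons used in the dissertation).
POSITIVE = {"好看", "精彩", "喜欢", "感动"}
NEGATIVE = {"难看", "无聊", "失望", "糟糕"}


def lexicon_classify(tokens):
    """Label a tokenised review by counting lexicon hits.

    Ties are resolved as positive, which is an arbitrary assumption.
    """
    pos_hits = sum(1 for t in tokens if t in POSITIVE)
    neg_hits = sum(1 for t in tokens if t in NEGATIVE)
    return "positive" if pos_hits >= neg_hits else "negative"


print(lexicon_classify(["剧情", "精彩", "演员", "好看"]))
print(lexicon_classify(["剧情", "无聊", "结局", "失望"]))
```

A baseline of this kind needs no training data, which is why it is a natural reference point for the learned classifiers.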

A stoplist and n-gram methods are also considered in this project to improve accuracy.
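
The following minimal sketch shows one way stoplist filtering and n-gram feature extraction could be combined over an already tokenised review. The stoplist contents and the helper names remove_stopwords and ngrams are hypothetical and chosen only for this example.

```python
# Hypothetical stoplist; a real one would be far longer.
STOPLIST = {"的", "了", "是", "我"}


def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop tokens that appear in the stoplist."""
    return [t for t in tokens if t not in stoplist]


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = remove_stopwords(["我", "喜欢", "这部", "电影", "的", "结局"])
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
features = unigrams + bigrams  # combined unigram and bigram features
print(features)
```

Filtering before extracting n-grams means that bigrams can span a removed stopword; whether that is desirable is a design choice this sketch does not settle.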

Unlike other Chinese classifiers, this project first performs classification without Chinese word segmentation. More specifically, the models are trained directly on individual Chinese characters. Compared with the results obtained with segmentation, the accuracies without segmentation are somewhat lower but remain reasonable at over 70%. It is worth noting that the bigram method always performs worse than the other n-gram methods applied in this project when segmentation is used, but works better without segmentation, because character bigrams partly substitute for Chinese word segmentation when no segmentation is performed.
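
To make the segmentation-free setting concrete, the sketch below extracts character n-grams directly from a raw Chinese string. The function name char_ngrams and the example sentence are assumptions for illustration and are not taken from the project's code or data.

```python
def char_ngrams(text, n):
    """Return contiguous character n-grams of an unsegmented string.

    Whitespace is discarded; no word segmentation is performed.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


review = "电影很好看"
print(char_ngrams(review, 1))  # single characters
print(char_ngrams(review, 2))  # character pairs
```

For this sentence the character pairs include 电影 and 好看, which are genuine two-character words, alongside spurious pairs such as 影很. This suggests why pairs of adjacent characters can recover part of what a word segmenter would provide.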

In this project the best accuracy, 84.44%, is achieved by the maximum entropy approach with a stoplist and a combined use of unigrams and bigrams.

In addition to accuracy, recall and precision are also used to assess the models built with the different algorithms.
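
For reference, the three measures can be computed from gold and predicted labels as in the sketch below. The function name evaluate, the label strings and the choice of treating "positive" as the target class are assumptions for illustration; the toy labels are not results from this project.

```python
def evaluate(gold, predicted, positive_label="positive"):
    """Return (accuracy, precision, recall) for a binary labelling.

    Precision and recall are computed with respect to positive_label.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels, purely illustrative.
gold = ["positive", "positive", "positive", "negative"]
pred = ["positive", "positive", "negative", "negative"]
print(evaluate(gold, pred))
```

Reporting precision and recall alongside accuracy shows whether a classifier's errors are concentrated in one class, which accuracy alone can hide.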