Social media platforms such as Twitter and the Internet Movie Database (IMDb) contain a vast amount of data which have applications in predictive sentiment analysis for movie sales, stock market fluctuations, brand opinion, and current events. Using a dataset taken from IMDb by Stanford, we identify some of the most significant phrases for identifying sentiment in a wide variety of movie reviews. Data from Twitter are especially attractive due to Twitter’s real-time nature through its streaming API. Effectively analyzing this data in a streaming fashion requires efficient models, which may be improved by reducing the dimensionality of input vectors. One way this has been done in the past is by using emoticons; we propose a method for further reducing these features by identifying common structure in emoticons with similar sentiment. We also examine the gender distribution of emoticon usage, finding that the tendency toward certain emoticons differs disproportionately between males and females. Despite the roughly equal gender distribution on Twitter, emoticon usage is predominantly female. Furthermore, we find that distributed vector representations, such as those produced by Word2Vec, may be reduced through feature selection. This analysis was done on a manually labeled sample of 1000 tweets from a new dataset, the Large Emoticon Corpus, which consists of about 8.5 million tweets containing emoticons and was collected over a five-day period in May 2015. Additionally, using the common structure of similar emoticons, we are able to characterize positive and negative emoticons using two regular expressions which account for over 90% of emoticon usage in the Large Emoticon Corpus.
In the past few years, the growth of social media platforms has brought with it a wealth of unprocessed data, containing interesting information that may be inferred from the interactions of users. The astronomical amount of information renders manual human analysis infeasible, and has thus stimulated the development of machine learning algorithms to aid in gleaning information from this data. These algorithms may run either in a batch processing environment, where a data set is stored and analyzed, or in a streaming environment, responding to changes in the source content. There are many different types of information that may be extracted from social media datasets; one type is sentiment. When a user makes a post, it often contains an opinion or feeling that he or she has about something; this opinion may be either explicit or implicit.
One of the most prominent social media platforms for sentiment analysis is Twitter. Over time, Twitter has become a vast source of data for different machine learning analyses, including natural language processing and sentiment classification. Enormous quantities of tweets are processed by Twitter every second, many of which contain users’ feelings towards products, brands, events, media, and a wide array of other things. The main difficulty with using these tweets to extract sentiment is that they lack direct labels which convey the overall feeling or leaning of the text. This has led to many different ways to classify the tweets, involving both supervised and unsupervised machine learning. Analyzing Twitter using natural language processing also introduces interesting problems; each tweet is limited to a maximum of 140 characters, meaning that normal grammatical and syntactical conventions are mostly ignored or circumvented. The consequence of this is that models of vocabulary and sentence structure must be rebuilt according to the unique style which has developed on Twitter.
In addition to Twitter, another large source of sentiment information is the Internet Movie Database (IMDb). IMDb allows reviewers with an account to post reviews of different movies, each of which includes both text and a quantitative rating. The net result is a large data source for sentiment analysis, where each review has a manual label from the review’s author according to the overall sentiment of the review; the advantage of such data is that the quality of these labels allows for isolation of prediction inaccuracy, as the labels themselves are already extremely accurate. The type of text in this data source is significantly different from that on Twitter; the users are not limited to a certain character count, meaning that the text is more syntactically and contextually coherent. The natural consequence of this is that words with similar denotations or connotations generally occur in the same context, which oftentimes will improve the word model built from a dataset.
This paper starts with a discussion of relevant literature on Twitter analysis, and then moves into a discussion of the datasets used in this study, which are derived from data taken from either Twitter or IMDb. After that, the methods utilized to obtain an abstract representation of a body of text as a vector are discussed in detail, as well as operations on this representation. A section presenting the results of these operations is next, followed by concluding remarks and possible avenues of future research.
There have been several studies demonstrating the value of data that can be obtained through Twitter. Sentiment analysis of Twitter data has interesting applications such as analyzing movie sales [
Another valuable feature of Twitter data is the popularity of user annotations on their posts. This includes both hashtags (e.g. #smile) and emoticons (e.g. =D). Previous studies have used emoticons as self-labels for labeling tweets based on sentiment [
As previously stated, there are many benefits to be gained in treating Twitter data as a live stream, including real-time analysis of events, but there are also many challenges. Some of these include time restrictions, unbalanced classes, informal structure, and highly compressed meaning [
In this study, we used three different datasets, two of which contain tweets (the Sentiment Analysis Dataset and the Large Emoticon Corpus) and another which contains movie reviews (the Stanford Movie Review Dataset). The Large Emoticon Corpus is a new dataset introduced with this study. For this reason, we discuss the collection and cleaning methods for this dataset in greater detail below.
The Twitter Sentiment Analysis Training Corpus consists of about 1.5 million labeled tweets, and is a compilation of the Sentiment 140 [
The Large Emoticon Corpus, introduced in this paper, is a dataset of tweets which contain one or more emoticons from a unique set of 115 emoticons. It contains roughly 8.5 million tweets, collected over the interval May 13 - 18, 2015 using the Twitter 4J interface to the Twitter Streaming API. For analysis, mentions and retweet tags were removed from the data. A sample of 1000 tweets from this dataset was manually labeled with binary sentiment corresponding to positive and negative emotion. Additionally, a subset of 500 tweets from this sample was manually labeled with user gender.
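The cleaning step described above can be sketched with simple regular expressions. The exact patterns used in the study are not specified, so the ones below are illustrative assumptions:

```python
import re

def clean_tweet(text):
    """Strip retweet tags and @mentions from a tweet (illustrative patterns)."""
    text = re.sub(r"\bRT\b:?\s*", "", text)  # remove retweet tags
    text = re.sub(r"@\w+:?\s*", "", text)    # remove @mentions
    return text.strip()

print(clean_tweet("RT @user: great movie :)"))  # -> "great movie :)"
```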
The tweets in the Large Emoticon Corpus have time and date information associated with them, allowing time-based analytics to be performed. One such inquiry, which we present later, is the frequency of a certain emoticon in tweet text over time. All times are in UTC, so they are not matched with local time, thus preventing time of day analysis. It does, however, open an opportunity for analysis of overall global emoticon usage over time. Additionally, a subset of about 38,000 tweets is tagged with geographic coordinates, which would enable time of day analysis on the subset.
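The emoticon-frequency-over-time inquiry can be sketched by bucketing tweets into UTC hours and counting occurrences of a given emoticon. The storage format assumed below, an iterable of (timestamp, text) pairs, is a hypothetical simplification of the corpus:

```python
from collections import Counter
from datetime import datetime

def emoticon_counts_per_hour(tweets, emoticon):
    """Count occurrences of an emoticon in tweet text, bucketed by UTC hour.

    `tweets` is assumed to be an iterable of (timestamp, text) pairs;
    the actual storage format of the Large Emoticon Corpus may differ.
    """
    counts = Counter()
    for ts, text in tweets:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[hour] += text.count(emoticon)
    return counts

tweets = [
    (datetime(2015, 5, 13, 12, 5), "loved it :)"),
    (datetime(2015, 5, 13, 12, 40), "so fun :) :)"),
    (datetime(2015, 5, 13, 13, 10), "meh"),
]
print(emoticon_counts_per_hour(tweets, ":)"))
```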
The Large Movie Review Dataset from Stanford consists of fifty thousand labeled movie reviews, and an equal number of unlabeled reviews, taken from IMDb. The labeled reviews combine the text of movie reviews with their associated numerical rating. Reviews with a rating of seven or more out of ten are considered positive, while those with a rating of four or lower are considered negative. No more than 30 reviews are allowed for a given movie; additionally, a majority of the reviews are unprofessional. During analysis, stop words (taken from the NLTK Stopwords Corpus [
The problem of representing natural language text in vector form for machine learning analysis has been around for decades, with early models using the Bag-of-Words technique. Textual representation can have a significant impact on the effectiveness of machine learning models as it determines what information is extracted from the original text. The goal is to retain as much pertinent information as possible. For this reason Bag-of-Words models, which lose all of the contextual and structural information of the text, are being replaced by distributed vector models such as Word2Vec and Sent2Vec; these are trained to understand the multifaceted relationships that exist between various words. The following sections highlight the different techniques we used to convert text into vectors.
Word2Vec [
The Bag-of-Words model of a text relates the frequency of a word or phrase in the text to an element in the resulting representative vector. This leads to sparse vectors in high dimensional space. For this study, phrases with anywhere from 2 to 5 words were used to represent the text in the Stanford Movie Review Dataset. Unlike Word2Vec and Sent2Vec, the Bag-of-Words model does not take into account the similarity in meaning of closely related words. Likewise, the n-gram representation of a text is an alternate vector representation which is based on the number of occurrences of a “gram” (a sequence of characters). We selected the most commonly occurring grams from the dataset; these were then used to convert each text to a vector by counting the number of occurrences of each gram in the given text. Each position in the resulting vector corresponds to a gram selected from the original text.
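The gram-based representation described above can be sketched as follows, assuming character grams of a fixed length (the study’s actual gram lengths and vocabulary size are not reproduced here):

```python
from collections import Counter

def char_ngrams(text, n):
    """All character n-grams ("grams") of length n in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_vocab(texts, n, top_k):
    """Select the top_k most commonly occurring grams across the dataset."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    return [g for g, _ in counts.most_common(top_k)]

def to_vector(text, vocab, n):
    """Map a text to a count vector, one position per selected gram."""
    counts = Counter(char_ngrams(text, n))
    return [counts[g] for g in vocab]

texts = ["bad movie", "good movie", "bad acting"]
vocab = build_vocab(texts, 3, 5)
print(to_vector("bad movie", vocab, 3))
```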
Feature selection is a type of dimensionality reduction which reduces the dimensionality of a vector representation of a dataset by removing redundant features, thereby reducing overfitting. Generally, the classification accuracy remains constant or is improved as a result of this process for sparse vectors. The Correlation Based Feature Selection [
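The cited Correlation Based Feature Selection method is more involved than what follows; as a simplified, hypothetical sketch, the core idea of scoring each feature by its association with the label and keeping the top-scoring ones can be illustrated with a class-mean difference:

```python
def select_features(vectors, labels, k):
    """Keep the k features whose mean value differs most between the classes.

    This is a simplified stand-in for Correlation Based Feature Selection:
    a feature whose average value differs strongly between positive (1) and
    negative (0) examples is treated as informative.
    """
    dim = len(vectors[0])
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    scores = [abs(mean(pos, j) - mean(neg, j)) for j in range(dim)]
    keep = sorted(range(dim), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(keep)

vectors = [[1, 0, 5], [1, 1, 4], [0, 0, 5], [0, 1, 4]]
labels = [1, 1, 0, 0]
print(select_features(vectors, labels, 1))  # feature 0 separates the classes
```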
In addition to manually labeling the sentiment of 1000 tweets, about 500 of the sentiment-labeled tweets were also manually labeled with user gender. A survey of this data is presented in
Representative vectors of the Movie Review dataset were constructed using Word2Vec, Sent2Vec (both CDSSM and DSSM), and Bag-of-Words models. Features were then selected from each of the vector sets to create new, reduced vector sets. Naive Bayes, Random Forest, and SVM algorithms were run on both the full and reduced vector sets for each model. The results are shown in
Performing feature selection on the Bag-of-Words model yielded the phrases most correlated with review sentiment. A selection of the top 20 of these is shown in
The analysis of emoticon use over time shows small fluctuations over small periods of time (e.g. minutes or hours) but is relatively constant over longer periods of time (
| Top 1 - 10 | Top 11 - 20 |
|---|---|
| worst movie | worst movie ever |
| one best | worst movies |
| one worst | highly recommended |
| highly recommend | bad film |
| bad movie | well worth |
| worst film | looks like |
| bad acting | fast forward |
| movie bad | one worst movies |
| really bad | acting bad |
| must see | piece crap |
As shown in
In natural language processing, it is common to perform stemming on the input text. This operation usually involves removing the suffixes from different tenses of a word, leaving just the root [
The structural similarity between emoticons may be used to reduce the amount of processing required to predict the sentiment of a given tweet. For example, the input text may be preprocessed by replacing emoticons with their base element. This reduces the number of unique features which correlate to sentiment, further reducing the computational expense of classifying a given text. An advantage of using this approach is that a regular expression may be used to match positive or negative tweets; this may be computationally more efficient than storing a set of emoticons and comparing each word in the text against each emoticon in the set.
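A minimal sketch of this preprocessing step, assuming a hand-built mapping from emoticon variants to their base element (the actual emoticon set in the study is much larger):

```python
import re

# Hypothetical mapping from emoticon variants to their shared base element.
EMOTICON_BASES = {
    ":)": ")", ";)": ")", ":-)": ")", ";-)": ")", "=)": ")",
    ":(": "(", ":-(": "(", "=(": "(",
    ":D": "D", ";D": "D", "=D": "D", "XD": "D",
    ":/": "/", "=/": "/",
}

# Longest variants first so ":-)" is matched before the shorter ":)".
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_BASES, key=len, reverse=True))
)

def stem_emoticons(text):
    """Replace each known emoticon with its base element."""
    return _PATTERN.sub(lambda m: EMOTICON_BASES[m.group(0)], text)

print(stem_emoticons("great movie :-) :D"))  # -> "great movie ) D"
```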
In many cases, vectors produced using Word2Vec contain many irrelevant or redundant features, which may increase the computational expense of classification or clustering. They may additionally add noise, which decreases accuracy. One common way to reduce this is by performing feature reduction after calculating representative vectors for text. However, it may be inefficient to generate vectors with excessive dimensionality, only to reduce or discard features.
One way to remedy this is to build a model which minimizes dimensionality without sacrificing predictive quality. To test this theory, we built multiple models based on the Large Emoticon Corpus that cover the spectrum of dimensionality, and tested the predictive ability for sentiment given two datasets. Vectors were obtained using multiple Word2Vec models; labels for this dataset were obtained by using a Random Forest model [
| Base | Positive Emoticons | Negative Emoticons |
|---|---|---|
| ) | :) ;) :-) ;-) =) | |
| ( | (: | :( :-( =( |
| / | | :/ =/ |
| D | XD :D ;D =D | |
| ] | :] =] | |
| Regular Expression | `([;:\-=X]{1,2}[)\]D])\|([(/D][;:\-=X]{1,2})` | `[;:\-=X]{1,2}[(/o]` |
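The two regular expressions above can be applied directly to classify a tweet by its emoticons alone; a brief sketch follows. Note that substrings such as the `://` in URLs would also match the negative pattern, so URLs would need to be stripped first:

```python
import re

# The positive and negative emoticon patterns from the table above.
POSITIVE = re.compile(r"([;:\-=X]{1,2}[)\]D])|([(/D][;:\-=X]{1,2})")
NEGATIVE = re.compile(r"[;:\-=X]{1,2}[(/o]")

def emoticon_sentiment(text):
    """Return 'positive', 'negative', or None based on emoticon matches alone."""
    if POSITIVE.search(text):
        return "positive"
    if NEGATIVE.search(text):
        return "negative"
    return None

print(emoticon_sentiment("loved it :-)"))     # -> positive
print(emoticon_sentiment("that was bad :("))  # -> negative
```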
Depending on the data and model chosen, dimensionality can be significantly reduced without adversely affecting the predictive quality. If the dataset is balanced, it may be possible to reduce the dimensions further than if the dataset is biased toward one label. If the dataset is biased, an SVM model, which is designed to optimize accuracy, may be unstable with too few dimensions; it may begin to base its predictions entirely on prior probability. In this case, more features may be necessary. This depends on model type, though, as the Naïve Bayes model appears to be much more stable at high bias and low dimensions, for metrics other than accuracy.
This study introduced the Large Emoticon Corpus, which has provided insights into emoticon usage on Twitter, both in the frequency of individual emoticon use and in usage based on user gender. We have also shown that positive and negative emoticons may be deconstructed into more basic forms and potentially used to classify positive and negative sentiment with simple regular expressions. Additionally, we found a bias in emoticon usage toward female Twitter users, despite the relatively balanced gender partitioning of the platform as a whole. We also ranked the most influential phrases in the Large Movie Review Dataset, finding that shorter phrases can better represent a set of longer phrases with similar meaning. Finally, we found dimension reduction to be an effective technique on distributed vectors; feature selection may be used to reduce the dimensionality of models such as Word2Vec in order to improve computational efficiency when performing real-time analysis with data streams.
A possible area for future research is a deeper investigation into gender bias on Twitter with a larger dataset; by using a larger set of gender-labeled data, it would be possible to determine to what extent male and female Twitter users utilize different emoticons. Another possible research topic is the optimization of dimension reduction for other labeled datasets, including sets with continuous inputs, compared to the binary inputs used in this study.
We would like to thank the Summer Research Institute at Houghton College for providing financial support for this study.
Dickinson, B., Ganger, M. and Hu, W. (2015) Dimensionality Reduction of Distributed Vector Word Representations and Emoticon Stemming for Sentiment Analysis. Journal of Data Analysis and Information Processing, 3, 153-162. doi: 10.4236/jdaip.2015.34010