In the following exercise we will perform sentiment analysis using the sentimentr package. As an example, we introduce three short sentences that are stored as mytext. As always, we start by loading the required packages (p_load is provided by the pacman package).
p_load(tidyverse, textclean, textstem, sentimentr, lexicon)
mytext <- c(
'Do you like analytics? But I really hate programming.',
'Google is my best friend.',
'Do you really like data analytics? I\'m a huge fan'
)
mytext %>% get_sentences() %>% sentiment()
element_id sentence_id word_count sentiment
1: 1 1 4 0.2500000
2: 1 2 5 -1.0230011
3: 2 1 5 0.5813777
4: 3 1 6 0.3674235
5: 3 2 4 0.0000000
The sentiment() function outputs the sentiment per sentence, together with the word count. The sentiment score is based on the lexicon::hash_sentiment_jockers_rinker polarity table, whose word scores range from -2 (very negative) to 1 (very positive). The first sentence of the example contains 4 words and has a sentiment of 0.25.
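To see where these scores come from, the polarity table can be inspected directly: it is a data.table with the words in column x and their scores in column y. A quick sketch, assuming the lexicon package is installed:

```r
library(lexicon)

# The polarity table used by sentiment(): words in x, scores in y
pol <- hash_sentiment_jockers_rinker

# Look up the scores of some words from our example sentences
pol[x %in% c('like', 'hate', 'friend')]

# Confirm the range of the word scores
range(pol$y)
```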
To aggregate by row (document), we use the sentiment_by function. You can also specify your own 'by' variable.
mytext %>% get_sentences() %>% sentiment_by()
element_id word_count sd ave_sentiment
1: 1 9 0.9001477 -0.3865005
2: 2 5 NA 0.5813777
3: 3 10 0.2598076 0.2004980
By default, the sentiment_by function downweights zeros when averaging, because you don't want the neutral sentences to have a strong influence on the result. Other options are average_weighted_mixed_sentiment, which upweights the negatives and downweights the neutrals, and average_mean.
mytext %>% get_sentences() %>% sentiment_by(averaging.function = average_weighted_mixed_sentiment)
element_id word_count sd ave_sentiment
1: 1 9 0.9001477 -1.9210022
2: 2 5 NA 0.5813777
3: 3 10 0.2598076 0.2004980
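The first row can be reproduced by hand: average_weighted_mixed_sentiment multiplies the negative sentence scores by a weight before averaging (4 by default, set via its mixed.less.than.zero.weight argument):

```r
# Sentence scores of the first document, taken from the sentiment() output above
scores <- c(0.2500000, -1.0230011)

# Upweight the negative score by the default factor of 4, then take the mean
weighted <- ifelse(scores < 0, 4 * scores, scores)
mean(weighted)
# approximately -1.9210022, the ave_sentiment reported above
```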
mytext %>% get_sentences() %>% sentiment_by(averaging.function = average_mean)
element_id word_count sd ave_sentiment
1: 1 9 0.9001477 -0.3865005
2: 2 5 NA 0.5813777
3: 3 10 0.2598076 0.1837117
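average_mean, by contrast, is the plain arithmetic mean of the sentence scores, zeros included. For the third document:

```r
# Sentence scores of the third document, taken from the sentiment() output above
mean(c(0.3674235, 0.0000000))
# approximately 0.1837117, the ave_sentiment reported above
```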
Let's add some emoticons, word elongations, and exclamation marks to mytext, and store the result as mytext2.
mytext2 <- c(
'Do you like analytics? But I really hate programming :(.',
'Google is my beeeeeeeest friend!',
'Do you really like data analytics? I\'m a huge fan.'
)
Notice that the emoticon is not detected and the elongated word beeeeeeeest is not seen as an intensifier.
mytext2 %>% get_sentences() %>% sentiment_by()
element_id word_count sd ave_sentiment
1: 1 9 0.9001477 -0.3865005
2: 2 5 NA 0.3577709
3: 3 10 0.2598076 0.2004980
Emoticons and word elongations have to be replaced first.
mytext2 %>% replace_emoticon()
[1] "Do you like analytics? But I really hate programming frown ." "Google is my beeeeeeeest friend!"
[3] "Do you really like data analytics? I'm a huge fan."
mytext2 %>% replace_word_elongation()
[1] "Do you like analytics? But I really hate programming :(." "Google is my best friend!"
[3] "Do you really like data analytics? I'm a huge fan."
We extract sentiment of the adapted text.
mytext2 %>% replace_emoticon() %>% replace_word_elongation() %>% get_sentences() %>% sentiment_by()
element_id word_count sd ave_sentiment
1: 1 10 1.2773506 -0.6532233
2: 2 5 NA 0.5813777
3: 3 10 0.2598076 0.2004980
The exclamation mark is still not detected by the sentiment function. Therefore, we replace '!' with its corresponding meaning, 'exclamation', and add this word to the valence shifter table. To do so, we update hash_valence_shifters with the new amplifier. The update_valence_shifter_table function takes a two-column data.frame (named x and y), with the first column being character and containing the words, and the second column being integer corresponding to: 1 = negator, 2 = amplifier (intensifier), 3 = de-amplifier (downtoner), and 4 = adversative conjunction.
valence_shifters_updated <- update_valence_shifter_table(key = hash_valence_shifters,
                                                         x = data.frame(x = 'exclamation', y = 2))
Now let’s see whether this is added to our sentiment function.
mytext2 %>%
str_replace_all('!', ' exclamation') %>%
replace_emoticon() %>%
replace_word_elongation() %>%
get_sentences() %>%
sentiment_by(valence_shifters_dt = valence_shifters_updated)
element_id word_count sd ave_sentiment
1: 1 10 1.2773506 -0.6532233
2: 2 6 NA 0.9553010
3: 3 10 0.2598076 0.2004980
We see that the sentiment of the second element gets more positive with our updated valence shifter table. Compare this to the result with the default table:
mytext2 %>% str_replace_all('!', ' exclamation') %>%
replace_emoticon() %>%
replace_word_elongation() %>%
get_sentences() %>%
sentiment_by()
element_id word_count sd ave_sentiment
1: 1 10 1.2773506 -0.6532233
2: 2 6 NA 0.5307228
3: 3 10 0.2598076 0.2004980
Here the word 'exclamation' has no amplifying effect, because the default valence shifter table does not contain it.
Let's work with three tweets about fruit. Compute the sentiment per sentence, the average sentiment per row using the average_weighted_mixed_sentiment function, and the mean of the sentiment per row using the average_mean function, and store the results as sentiment_sentence, sentiment_row_avg, and sentiment_row_mean, respectively. Remove the emoticons and word elongations from the tweets and compute the sentiment per sentence. Store it as sentiment_sentence_cleaned.
Assume that: