Next, we read the dictionary file with the read_csv function from the readr package.
Recall that this is an optimized, faster alternative to the base R read.csv function.
dictionary <- read_csv("dictionary.csv")
dictionary
# A tibble: 13,915 x 4
Word VALENCE AROUSAL DOMINANCE
<chr> <dbl> <dbl> <dbl>
1 aardvark 6.26 2.41 4.27
2 abalone 5.3 2.65 4.95
3 abandon 2.84 3.73 3.32
4 abandonment 2.63 4.95 2.64
5 abbey 5.85 2.2 5
6 abdomen 5.43 3.68 5.15
7 abdominal 4.48 3.5 5.32
8 abduct 2.42 5.9 2.75
9 abduction 2.05 5.33 3.02
10 abide 5.52 3.26 5.33
# ... with 13,905 more rows
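By default, read_csv guesses the column types from the file. The following is a minimal sketch of declaring them explicitly instead. It assumes a comma-separated file with the four columns shown above. The inline text passed through I() is only a stand-in for dictionary.csv so that the snippet runs without the file (literal data via I() needs readr 2.0 or later), and the name toy_dictionary is made up for this illustration.

```r
library(readr)

# Stand-in for dictionary.csv: the first three rows shown above (illustration only).
csv_text <- "Word,VALENCE,AROUSAL,DOMINANCE
aardvark,6.26,2.41,4.27
abalone,5.3,2.65,4.95
abandon,2.84,3.73,3.32
"

# Declare the column types instead of letting read_csv guess them:
# Word is text, every other column is numeric.
toy_dictionary <- read_csv(
  I(csv_text),
  col_types = cols(Word = col_character(), .default = col_double())
)

toy_dictionary
```

With the real file, the same col_types argument can be passed to read_csv("dictionary.csv", ...). Declaring the types also suppresses the column specification message that read_csv prints otherwise.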
The dictionary consists of a list of words, together with three characteristics of each word: valence, arousal, and dominance.
First, we will recode all numeric columns by subtracting 5 from each value, so that 0 = neutral, -4 = negative, and 4 = positive.
dictionary <- dictionary %>% mutate(across(where(is.numeric), function(x) x - 5))
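The same recoding can also be done using the map_df function from the purrr package. Use only one of the two approaches: running both would subtract 5 twice.
dictionary[, 2:4] <- dictionary[, 2:4] %>% map_df(., function(x) x - 5)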
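To see that the two approaches give the same result, here is a small sketch on a three-row toy tibble built from values shown earlier. The names toy, centered_across, and centered_map are made up for this illustration; each approach is applied to a fresh copy of toy, so 5 is subtracted only once.

```r
library(dplyr)
library(purrr)
library(tibble)

# Three rows copied from the dictionary printed above (illustration only).
toy <- tibble(
  Word      = c("aardvark", "abandon", "abbey"),
  VALENCE   = c(6.26, 2.84, 5.85),
  AROUSAL   = c(2.41, 3.73, 2.20),
  DOMINANCE = c(4.27, 3.32, 5.00)
)

# Approach 1: mutate every numeric column with across().
centered_across <- toy %>%
  mutate(across(where(is.numeric), function(x) x - 5))

# Approach 2: map over columns 2 to 4 and assign the result back.
centered_map <- toy
centered_map[, 2:4] <- toy[, 2:4] %>% map_df(function(x) x - 5)

# Compare the two results column by column.
same_result <- isTRUE(all.equal(
  as.data.frame(centered_across),
  as.data.frame(centered_map)
))
same_result
```

The across() version selects columns by type, so it keeps working if the column order changes. The map_df version relies on the numeric columns being in positions 2 to 4.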
Let’s have a look at the new version of the dictionary.
dictionary
# A tibble: 13,915 x 4
Word VALENCE AROUSAL DOMINANCE
<chr> <dbl> <dbl> <dbl>
1 aardvark 1.26 -2.59 -0.73
2 abalone 0.300 -2.35 -0.0500
3 abandon -2.16 -1.27 -1.68
4 abandonment -2.37 -0.0500 -2.36
5 abbey 0.850 -2.8 0
6 abdomen 0.430 -1.32 0.15
7 abdominal -0.520 -1.5 0.32
8 abduct -2.58 0.9 -2.25
9 abduction -2.95 0.33 -1.98
10 abide 0.520 -1.74 0.33
# ... with 13,905 more rows
Let’s make a nice summary of the lexicon, using the skimr package.
p_load(skimr)
skim(dictionary)
-- Data Summary ------------------------
Values
Name dictionary
Number of rows 13915
Number of columns 4
_______________________
Column type frequency:
character 1
numeric 3
________________________
Group variables None
-- Variable type: character -----------------------------------------------------------------------------------------------------------
# A tibble: 1 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 Word 0 1 2 21 0 13915 0
-- Variable type: numeric -------------------------------------------------------------------------------------------------------------
# A tibble: 3 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 VALENCE 0 1 0.0638 1.27 -3.74 -0.75 0.2 0.95 3.53 ▁▃▇▆▁
2 AROUSAL 0 1 -0.789 0.896 -3.4 -1.44 -0.890 -0.24 2.79 ▁▇▇▂▁
3 DOMINANCE 0 1 0.185 0.938 -3.32 -0.420 0.260 0.840 2.9 ▁▂▇▇▁
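The p0 and p100 columns in the summary are the minimum and maximum of each variable, so they offer a quick check that the recoded values stay within the intended range of -4 to 4. A minimal sketch of the same check with dplyr, using a toy tibble of already recoded values copied from the output above (the names recoded, ranges, and within_scale are made up for this illustration):

```r
library(dplyr)
library(tibble)

# Three recoded rows copied from the output above (illustration only).
recoded <- tibble(
  Word      = c("aardvark", "abandon", "abbey"),
  VALENCE   = c(1.26, -2.16, 0.85),
  AROUSAL   = c(-2.59, -1.27, -2.80),
  DOMINANCE = c(-0.73, -1.68, 0.00)
)

# Minimum and maximum of every numeric column
# (columns are named VALENCE_min, VALENCE_max, and so on).
ranges <- recoded %>%
  summarise(across(where(is.numeric), list(min = min, max = max)))
ranges

# TRUE for a column if all of its values lie between -4 and 4.
within_scale <- recoded %>%
  summarise(across(where(is.numeric), function(x) all(x >= -4 & x <= 4)))
within_scale
```

Applied to the full dictionary, a FALSE in within_scale would suggest that the recoding was skipped or run twice.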
Which of the first ten words of the dictionary shown above (aardvark, ..., abide) has the most negative sentiment?
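One way to check your answer in code is sketched below. It assumes that "sentiment" refers to the VALENCE column, which is an interpretation and not something the dictionary itself states. The tibble first_ten is a made-up stand-in that copies the ten recoded VALENCE values printed above.

```r
library(dplyr)
library(tibble)

# The first ten words with their recoded VALENCE, copied from the output above.
first_ten <- tibble(
  Word = c("aardvark", "abalone", "abandon", "abandonment", "abbey",
           "abdomen", "abdominal", "abduct", "abduction", "abide"),
  VALENCE = c(1.26, 0.30, -2.16, -2.37, 0.85,
              0.43, -0.52, -2.58, -2.95, 0.52)
)

# Keep the row with the lowest VALENCE (assumed here to mean most negative sentiment).
most_negative <- first_ten %>% slice_min(VALENCE, n = 1)
most_negative
```

On the full data, the same idea would be dictionary %>% slice_head(n = 10) %>% slice_min(VALENCE, n = 1).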
To download the dictionary dataset, click here¹.