Exploring the Lexicon

Next, we read the dictionary file with the read_csv function from the readr package. Recall that this is a faster alternative to the base R read.csv function and that it returns a tibble.

dictionary <- read_csv("dictionary.csv")
dictionary
# A tibble: 13,915 x 4
   Word        VALENCE AROUSAL DOMINANCE
   <chr>         <dbl>   <dbl>     <dbl>
 1 aardvark       6.26    2.41      4.27
 2 abalone        5.3     2.65      4.95
 3 abandon        2.84    3.73      3.32
 4 abandonment    2.63    4.95      2.64
 5 abbey          5.85    2.2       5   
 6 abdomen        5.43    3.68      5.15
 7 abdominal      4.48    3.5       5.32
 8 abduct         2.42    5.9       2.75
 9 abduction      2.05    5.33      3.02
10 abide          5.52    3.26      5.33
# ... with 13,905 more rows
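
If you want the column types to be explicit rather than guessed, read_csv accepts a col_types argument. The following is a minimal sketch, not part of the original workflow: it writes a three-row stand-in file to a temporary location (the rows are copied from the printout above) so that it runs without dictionary.csv.

```r
library(readr)

# Hypothetical stand-in for dictionary.csv: three rows copied from the printout
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Word,VALENCE,AROUSAL,DOMINANCE",
  "aardvark,6.26,2.41,4.27",
  "abalone,5.3,2.65,4.95",
  "abandon,2.84,3.73,3.32"
), tmp)

# Declare Word as character and every other column as double
toy_dictionary <- read_csv(
  tmp,
  col_types = cols(Word = col_character(), .default = col_double())
)

toy_dictionary
```

With the real file, the same col_types specification could be passed to read_csv("dictionary.csv", ...), assuming the column names match the printout.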

The dictionary consists of a list of words, together with three scores for each word: valence, arousal, and dominance.

Recoding the Dictionary

First, we recode all numeric columns by subtracting 5 from each score, so that 0 = neutral, -4 = negative, and 4 = positive.

dictionary <- dictionary %>% mutate(across(where(is.numeric), function(x) x - 5))

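Alternatively, the same recoding can be done with the map_df function from the purrr package. Run only one of the two versions: running both would subtract 5 twice.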

dictionary[, 2:4] <- dictionary[, 2:4] %>% map_df(function(x) x - 5)
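
To see that the two variants are interchangeable, you can apply each of them to a fresh copy of the same data and compare the results. The sketch below uses a small toy tibble (three rows copied from the printout above) instead of the full dictionary; the names toy, via_across, and via_map are made up for this illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy data: three rows copied from the dictionary printout
toy <- tibble(
  Word      = c("aardvark", "abalone", "abandon"),
  VALENCE   = c(6.26, 5.30, 2.84),
  AROUSAL   = c(2.41, 2.65, 3.73),
  DOMINANCE = c(4.27, 4.95, 3.32)
)

# Variant 1: across() recodes every numeric column
via_across <- toy %>%
  mutate(across(where(is.numeric), function(x) x - 5))

# Variant 2: map_df() recodes columns 2 to 4 of a separate copy
via_map <- toy
via_map[, 2:4] <- toy[, 2:4] %>% map_df(function(x) x - 5)

# Compare the two recoded tibbles
all.equal(via_across, via_map)
```

Each variant starts from the unrecoded toy data, so the 5 is subtracted exactly once in both cases.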

Let’s have a look at the new version of the dictionary.

dictionary
# A tibble: 13,915 x 4
   Word        VALENCE AROUSAL DOMINANCE
   <chr>         <dbl>   <dbl>     <dbl>
 1 aardvark      1.26  -2.59     -0.73  
 2 abalone       0.300 -2.35     -0.0500
 3 abandon      -2.16  -1.27     -1.68  
 4 abandonment  -2.37  -0.0500   -2.36  
 5 abbey         0.850 -2.8       0     
 6 abdomen       0.430 -1.32      0.15  
 7 abdominal    -0.520 -1.5       0.32  
 8 abduct       -2.58   0.9      -2.25  
 9 abduction    -2.95   0.33     -1.98  
10 abide         0.520 -1.74      0.33  
# ... with 13,905 more rows

Summary of the Lexicon

Let’s summarize the lexicon with the skim function from the skimr package.

p_load(skimr)
skim(dictionary)
-- Data Summary ------------------------
                           Values    
Name                       dictionary
Number of rows             13915     
Number of columns          4         
_______________________              
Column type frequency:               
  character                1         
  numeric                  3         
________________________             
Group variables            None      

-- Variable type: character -----------------------------------------------------------------------------------------------------------
# A tibble: 1 x 8
  skim_variable n_missing complete_rate   min   max empty n_unique whitespace
* <chr>             <int>         <dbl> <int> <int> <int>    <int>      <int>
1 Word                  0             1     2    21     0    13915          0

-- Variable type: numeric -------------------------------------------------------------------------------------------------------------
# A tibble: 3 x 11
  skim_variable n_missing complete_rate    mean    sd    p0    p25    p50    p75  p100 hist 
* <chr>             <int>         <dbl>   <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <chr>
1 VALENCE               0             1  0.0638 1.27  -3.74 -0.75   0.2    0.95   3.53 ▁▃▇▆▁
2 AROUSAL               0             1 -0.789  0.896 -3.4  -1.44  -0.890 -0.24   2.79 ▁▇▇▂▁
3 DOMINANCE             0             1  0.185  0.938 -3.32 -0.420  0.260  0.840  2.9  ▁▂▇▇▁

Multiple Choice

Which of the following words, taken from the first ten words of the dictionary shown above (aardvark, ..., abide), has the most positive sentiment?

  1. aardvark
  2. abduct
  3. abide
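
One way to check your answer in R is sketched below, under the assumption that "most positive sentiment" means the highest recoded VALENCE score. The tibble candidates is a made-up name holding the three answer options, with values copied from the recoded printout above.

```r
library(dplyr)
library(tibble)

# Recoded VALENCE scores of the three answer options, copied from the printout
candidates <- tibble(
  Word    = c("aardvark", "abduct", "abide"),
  VALENCE = c(1.26, -2.58, 0.52)
)

# Keep the single row with the highest VALENCE score
most_positive <- candidates %>% slice_max(VALENCE, n = 1)

most_positive
```

On the full dictionary, the same idea would be to take the first ten rows with slice_head(n = 10) before calling slice_max.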

To download the dictionary dataset, click here.