STEP 2. Removing spam

In step 2, we will investigate how much spam the clickstream data contains.

1-(nrow(all_logs_cl)/nrow(alllogs))
[1] 0.2564185

The result shows that roughly 26% of the log lines are spam. Since we do not want to analyze spam, we remove it by filtering out known spam referrers and IP addresses.

all_logs_cl_2 <- all_logs_cl %>%
  filter(!str_detect(str_to_lower(origin), "sports|dayfair|pestweb|trrescue|massage|seopower|ledzeppelin|getpocket|eurovids|kamagra|erolove")) %>%
  filter(!ip %in% c("216.151.130.179", "112.111.46.134")) # Known IP addresses associated with spam

all_logs_cl_2
# A tibble: 520,392 x 11
   ip     file_name        status_code object_size origin        user_agent      data_origin time                 diff group session.id
   <chr>  <chr>                  <int>       <int> <chr>         <chr>           <chr>       <dttm>              <dbl> <dbl>      <int>
 1 110.3~ GET /wp-login.p~         404         509 -             mozilla/5.0 (w~ mma         2017-07-31 06:25:39     0     0          1
 2 110.3~ GET / HTTP/1.1           200       11799 -             mozilla/5.0 (w~ mma         2017-07-31 06:25:40     0     0          1
 3 40.77~ GET / HTTP/1.1           200        4044 -             mozilla/5.0 (w~ mma         2017-08-01 23:10:20     0    11         83
 4 104.2~ GET /predictive~         200       12836 -             mozilla/5.0 (x~ blog        2017-07-31 06:28:54     0     0        180
 5 104.2~ GET /predictive~         200        1304 http://www.m~ mozilla/5.0 (x~ blog        2017-07-31 06:28:54     0     0        180
 6 104.2~ GET /predictive~         200        4422 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
 7 104.2~ GET /predictive~         200        7830 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
 8 104.2~ GET /predictive~         200        2558 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
 9 104.2~ GET /predictive~         200       36654 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
10 104.2~ GET /predictive~         200         971 http://www.m~ mozilla/5.0 (x~ blog        2017-07-31 06:28:54     0     0        180
# ... with 520,382 more rows
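
To make the filtering logic above easier to follow, here is a minimal sketch on a made-up four-row log. The names toy_logs, spam_pattern and spam_ips, as well as the referrers and IP addresses, are hypothetical and only serve as illustration; they are not part of the course data.

```r
library(dplyr)
library(stringr)

# Hypothetical toy log: all values are invented for illustration
# (203.0.113.x and 198.51.100.x are documentation-only address ranges).
toy_logs <- tibble(
  ip     = c("203.0.113.5", "203.0.113.9", "198.51.100.7", "198.51.100.8"),
  origin = c("-",
             "http://SEOpower.example/",
             "http://www.example.org/blog",
             "http://kamagra.example/buy")
)

spam_pattern <- "seopower|kamagra"  # alternatives are separated by |
spam_ips     <- c("203.0.113.5")    # assumed blocklist of spam IPs

toy_clean <- toy_logs %>%
  # Lower-case the referrer first so "SEOpower" matches the lower-case pattern
  filter(!str_detect(str_to_lower(origin), spam_pattern)) %>%
  # Drop every row that comes from a blocklisted IP address
  filter(!ip %in% spam_ips)

# Share of rows removed by the two filters
toy_removed <- 1 - nrow(toy_clean) / nrow(toy_logs)

toy_clean
toy_removed
```

Only the row with the example.org referrer survives, so toy_removed is 0.75. Note that str_detect matches substrings: a short keyword such as "sports" also removes any legitimate referrer whose URL happens to contain that word, so it is worth inspecting the dropped rows before trusting the keyword list.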

Exercise

How much spam does the logs data contain? Store your result as spam_percentage. Remove the spam as was done above and store the result as logs_cleaned2.

To download the all_logs_ugent dataset click here¹.

To download the logs dataset click here².

Assume that:

The logs_cleaned variable that was calculated in the previous exercise is given.
The bots1 and bots2 data are given.
The stringr and dplyr packages are loaded.