In step 2, we will investigate how much spam the clickstream data contains.
1-(nrow(all_logs_cl)/nrow(alllogs))
[1] 0.2564185
From this result, we learn that about 26% of the records in the data are spam. Since we do not want to analyze this spam, we will remove it.
all_logs_cl_2 <- all_logs_cl %>%
  filter(!str_detect(str_to_lower(origin), "sports|dayfair|pestweb|trrescue|massage|seopower|ledzeppelin|getpocket|eurovids|kamagra|erolove")) %>%
  filter(!ip %in% c("216.151.130.179","112.111.46.134")) # Known IP addresses that refer to spam
all_logs_cl_2
# A tibble: 520,392 x 11
ip file_name status_code object_size origin user_agent data_origin time diff group session.id
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int>
1 110.3~ GET /wp-login.p~ 404 509 - mozilla/5.0 (w~ mma 2017-07-31 06:25:39 0 0 1
2 110.3~ GET / HTTP/1.1 200 11799 - mozilla/5.0 (w~ mma 2017-07-31 06:25:40 0 0 1
3 40.77~ GET / HTTP/1.1 200 4044 - mozilla/5.0 (w~ mma 2017-08-01 23:10:20 0 11 83
4 104.2~ GET /predictive~ 200 12836 - mozilla/5.0 (x~ blog 2017-07-31 06:28:54 0 0 180
5 104.2~ GET /predictive~ 200 1304 http://www.m~ mozilla/5.0 (x~ blog 2017-07-31 06:28:54 0 0 180
6 104.2~ GET /predictive~ 200 4422 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
7 104.2~ GET /predictive~ 200 7830 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
8 104.2~ GET /predictive~ 200 2558 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
9 104.2~ GET /predictive~ 200 36654 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
10 104.2~ GET /predictive~ 200 971 http://www.m~ mozilla/5.0 (x~ blog 2017-07-31 06:28:54 0 0 180
# ... with 520,382 more rows
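To see how the two filters and the spam share fit together, the same steps can be tried on a small invented dataset. The following is a minimal sketch only: toy_logs, spam_pattern, spam_ips, toy_clean and toy_spam_share are hypothetical names, the rows are made up, and the pattern is shortened for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical toy log: column names mirror the output above, values are invented.
toy_logs <- tibble(
  ip     = c("10.0.0.1", "216.151.130.179", "10.0.0.2", "10.0.0.3"),
  origin = c("-", "-", "http://www.SEOpower.example/", "http://www.example.org/")
)

spam_pattern <- "seopower|kamagra"   # shortened pattern, for illustration only
spam_ips     <- c("216.151.130.179") # one of the known spam IPs listed above

toy_clean <- toy_logs %>%
  # Lower-case the referrer first so the match is case-insensitive
  filter(!str_detect(str_to_lower(origin), spam_pattern)) %>%
  # Drop requests coming from known spam IP addresses
  filter(!ip %in% spam_ips)

# Share of rows that were removed as spam
toy_spam_share <- 1 - nrow(toy_clean) / nrow(toy_logs)
toy_spam_share
```

Note that filter() also drops rows for which the condition is NA, so records with a missing origin would be removed as well. In the logs shown above a missing referrer is stored as "-", so this does not apply there.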
How much spam does the logs data contain? Store your result as spam_percentage. Remove the spam as was done above and store the result as logs_cleaned2.
To download the all_logs_ugent dataset, click here1.
To download the logs dataset, click here2.
Assume that:
The logs_cleaned variable that was calculated in the previous exercise is given.
The bots1 and bots2 data are given.