In the following exercises, we will apply Markov chains to real clickstream data. First, we load the required packages and read in the data, which contains the web logs of the MMA website. Let's take a look at the data.
if (!require("pacman")) { install.packages("pacman"); library(pacman) } # install and load pacman if needed
p_load(tidyverse, clickstream) # load (and, if necessary, install) the required packages
# load() returns the *name* of the restored object, so wrap it in get()
# to assign the data itself to alllogs
alllogs <- get(load("all_logs_ugent.Rdata"))
glimpse(alllogs)
Rows: 699,852
Columns: 11
$ ip <chr> "110.36.224.134", "110.36.224.134", "66.249.75.15", "66.249.75.15", "66.249.75.15", "66.249.75.15", "66.249.75.15~
$ file_name <chr> "GET /wp-login.php HTTP/1.1", "GET / HTTP/1.1", "GET /predictive_analytics/customer_intelligence/Blog/rss.xml HTT~
$ status_code <int> 404, 200, 200, 304, 304, 304, 200, 304, 304, 200, 304, 304, 304, 200, 304, 304, 200, 200, 304, 200, 304, 200, 304~
$ object_size <int> 509, 11799, 2915, 183, 183, 182, 2915, 183, 182, 107413, 183, 182, 183, 2915, 183, 183, 6444, 1102, 182, 134664, ~
$ origin <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "http://www.mma.ugent.be/pre~
$ user_agent <chr> "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1", "Mozilla/5.0 (Windows NT 6.1; WOW64; ~
$ data_origin <chr> "mma", "mma", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "bl~
$ time <dttm> 2017-07-31 06:25:39, 2017-07-31 06:25:40, 2017-07-31 06:26:24, 2017-07-31 06:26:40, 2017-07-31 06:35:56, 2017-07~
$ diff <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ group <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ session.id <int> 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
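As a quick sanity check on the raw data, we can count the requests and distinct sessions per data source. This is a minimal sketch using only the columns shown in the glimpse above.
alllogs %>%
  group_by(data_origin) %>%                      # "mma" website vs. "blog"
  summarise(requests = n(),                      # number of log lines
            sessions = n_distinct(session.id))   # number of distinct sessions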
To remove the bots, we load the botIP data, which contains the IP addresses of various bots, and the botlist data, which contains their user agent names.
bots1 <- read_csv("botIP.csv", col_names = FALSE)
bots1
# A tibble: 16,095 x 7
X1 X2 X3 X4 X5 X6 X7
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.comAge~ 74.81.199~ 0 UNITED STATES Louisvill~ 192.com~ 192.comAgent
2 200PleaseB~ 46.137.98~ ec2-46-137-98-159.eu-west-1.com~ (Unknown Cou~ (Unknown ~ 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
3 200PleaseB~ 50.112.12~ ec2-50-112-126-117.us-west-2.co~ (Unknown Cou~ (Unknown ~ 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
4 200PleaseB~ 54.225.23~ ec2-54-225-231-180.compute-1.am~ UNITED STATES Unknown 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
5 200PleaseB~ 54.232.10~ ec2-54-232-100-158.sa-east-1.co~ UNITED STATES Unknown 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
6 200PleaseB~ 54.249.24~ ec2-54-249-240-15.ap-northeast-~ UNITED STATES Unknown 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
7 200PleaseB~ 54.251.45~ ec2-54-251-45-250.ap-southeast-~ UNITED STATES Unknown 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
8 200PleaseB~ 54.252.97~ ec2-54-252-97-95.ap-southeast-2~ AUSTRALIA Sydney 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
9 4seohuntBot 91.237.3.4 0 (Unknown Cou~ (Unknown ~ 4seohun~ Mozilla/5.0 (compatible; 4SeoHuntBot; +ht~
10 50.nu/0.01 69.72.255~ ded1.innovateit.com UNITED STATES Montville~ 50.nu 50.nu/0.01 ( +http://50.nu/bot.html )
# ... with 16,085 more rows
bots2 <- read_csv("botlist.csv")
bots2
# A tibble: 307 x 1
user_agent
<chr>
1 /1.0
2 AcoiRobot/1.0 libwww/5.3.2
3 Acoon Robot v1.01 (www.acoon.de)
4 AgentName/0.1 libwww-perl/5.50
5 AlkalineBOT/1.4 (1.4.0326.0 RTM)
6 AltaVista Intranet*
7 AnzwersCrawl/2.0 (anzwerscrawl@anzwers.com.au; http://faq.anzwers.com.au/anzwerscrawl.html)
8 Apache-HttpAsyncClient/4.0-beta4 (java 1.5)
9 Apache-HttpClient/4.2.5 (java 1.5)
10 AppEngine-Google; (+http://code.google.com/appengine; appid: s~dcde-hrd)
# ... with 297 more rows
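Before filtering, we can gauge how much of the traffic the two lists flag. This is a minimal sketch assuming exact string matches; note that botIP.csv was read without a header, so its IP column is named X2.
alllogs %>%
  summarise(share_bot_ip = mean(ip %in% bots1$X2),                  # share of requests from listed bot IPs
            share_bot_ua = mean(user_agent %in% bots2$user_agent))  # share of requests from listed bot agents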
We clean alllogs by performing the following steps.
all_logs_cl <- alllogs %>%
  filter(!ip %in% bots1$X2) %>%                     # remove the IP addresses listed in bots1
  filter(!user_agent %in% bots2$user_agent) %>%     # remove the user agent names listed in bots2
  mutate(user_agent = str_to_lower(user_agent)) %>% # transform the agent names to lower case
  filter(!str_detect(user_agent, "bot|spider|crawler|flipboard|gnip|feedly")) %>% # remove all agents containing bot, spider, crawler, flipboard, gnip, or feedly
  filter(!str_detect(file_name, "/robots.txt"))     # remove requests for /robots.txt
all_logs_cl
# A tibble: 520,397 x 11
ip file_name status_code object_size origin user_agent data_origin time diff group session.id
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int>
1 110.3~ GET /wp-login.p~ 404 509 - mozilla/5.0 (w~ mma 2017-07-31 06:25:39 0 0 1
2 110.3~ GET / HTTP/1.1 200 11799 - mozilla/5.0 (w~ mma 2017-07-31 06:25:40 0 0 1
3 40.77~ GET / HTTP/1.1 200 4044 - mozilla/5.0 (w~ mma 2017-08-01 23:10:20 0 11 83
4 104.2~ GET /predictive~ 200 12836 - mozilla/5.0 (x~ blog 2017-07-31 06:28:54 0 0 180
5 104.2~ GET /predictive~ 200 1304 http://www.m~ mozilla/5.0 (x~ blog 2017-07-31 06:28:54 0 0 180
6 104.2~ GET /predictive~ 200 4422 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
7 104.2~ GET /predictive~ 200 7830 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
8 104.2~ GET /predictive~ 200 2558 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
9 104.2~ GET /predictive~ 200 36654 http://www.m~ mozilla/5.0 (x~ mma 2017-07-31 06:28:54 0 0 180
10 104.2~ GET /predictive~ 200 971 http://www.m~ mozilla/5.0 (x~ blog 2017-07-31 06:28:54 0 0 180
# ... with 520,387 more rows
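This cleaning is the groundwork for the actual Markov chain analysis. As a preview, the sketch below turns the cleaned sessions into a Clickstreams object and fits a first-order Markov chain with the clickstream package. It is illustrative only: defining a page as the request path (the second token of file_name) is an assumption, and on the full data you would first collapse the many distinct paths into a manageable set of states.
sessions <- all_logs_cl %>%
  mutate(page = word(file_name, 2)) %>%             # "GET /path HTTP/1.1" -> "/path"
  group_by(session.id) %>%
  summarise(clicks = str_c(page, collapse = ","))   # one comma-separated click path per session

csf <- tempfile()
writeLines(str_c(sessions$session.id, sessions$clicks, sep = ","), csf)
cls <- readClickstreams(csf, header = TRUE)         # header = TRUE: first field is the session id
mc <- fitMarkovChain(cls, order = 1)                # first-order Markov chain over the pages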
Remove the bots from the logs data with the steps that were performed above. Save your result as logs_cleaned. (A possible solution sketch follows the download links below.)
To download the all_logs_ugent dataset click here.
To download the botIP dataset click here.
To download the botlist dataset click here.
To download the logs dataset click here.
Assume that the logs, bots1, and bots2 data are given.
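A possible solution, mirroring the steps performed on alllogs above (this assumes logs has the same columns as alllogs):
logs_cleaned <- logs %>%
  filter(!ip %in% bots1$X2) %>%                     # remove the IP addresses listed in bots1
  filter(!user_agent %in% bots2$user_agent) %>%     # remove the user agent names listed in bots2
  mutate(user_agent = str_to_lower(user_agent)) %>% # lower-case the agent names
  filter(!str_detect(user_agent, "bot|spider|crawler|flipboard|gnip|feedly")) %>% # drop remaining bot-like agents
  filter(!str_detect(file_name, "/robots.txt"))     # drop requests for /robots.txt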