In the following exercises, we will apply Markov chains to real clickstream data. First, we load the required packages and read in the data, which contains the web logs of the MMA website. We then take a look at its structure.

if (!require("pacman")) install.packages("pacman") ; require("pacman")
p_load(tidyverse, clickstream)

alllogs <- get(load("all_logs_ugent.Rdata"))  #load() returns the name of the stored object, so we retrieve it with get()
glimpse(alllogs)
Rows: 699,852
Columns: 11
$ ip          <chr> "110.36.224.134", "110.36.224.134", "66.249.75.15", "66.249.75.15", "66.249.75.15", "66.249.75.15", "66.249.75.15~
$ file_name   <chr> "GET /wp-login.php HTTP/1.1", "GET / HTTP/1.1", "GET /predictive_analytics/customer_intelligence/Blog/rss.xml HTT~
$ status_code <int> 404, 200, 200, 304, 304, 304, 200, 304, 304, 200, 304, 304, 304, 200, 304, 304, 200, 200, 304, 200, 304, 200, 304~
$ object_size <int> 509, 11799, 2915, 183, 183, 182, 2915, 183, 182, 107413, 183, 182, 183, 2915, 183, 183, 6444, 1102, 182, 134664, ~
$ origin      <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "http://www.mma.ugent.be/pre~
$ user_agent  <chr> "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1", "Mozilla/5.0 (Windows NT 6.1; WOW64; ~
$ data_origin <chr> "mma", "mma", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "blog", "bl~
$ time        <dttm> 2017-07-31 06:25:39, 2017-07-31 06:25:40, 2017-07-31 06:26:24, 2017-07-31 06:26:40, 2017-07-31 06:35:56, 2017-07~
$ diff        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ group       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ session.id  <int> 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~

STEP 1. Removing bots

To remove the bots, we load the botIP data, which contains the IP addresses of known bots, and the botlist data, which contains their user agent names.

bots1 <- read_csv("botIP.csv", col_names = FALSE)
bots1
# A tibble: 16,095 x 7
   X1          X2         X3                               X4            X5         X6       X7                                        
   <chr>       <chr>      <chr>                            <chr>         <chr>      <chr>    <chr>                                     
 1 192.comAge~ 74.81.199~ 0                                UNITED STATES Louisvill~ 192.com~ 192.comAgent                              
 2 200PleaseB~ 46.137.98~ ec2-46-137-98-159.eu-west-1.com~ (Unknown Cou~ (Unknown ~ 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
 3 200PleaseB~ 50.112.12~ ec2-50-112-126-117.us-west-2.co~ (Unknown Cou~ (Unknown ~ 200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
 4 200PleaseB~ 54.225.23~ ec2-54-225-231-180.compute-1.am~ UNITED STATES Unknown    200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
 5 200PleaseB~ 54.232.10~ ec2-54-232-100-158.sa-east-1.co~ UNITED STATES Unknown    200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
 6 200PleaseB~ 54.249.24~ ec2-54-249-240-15.ap-northeast-~ UNITED STATES Unknown    200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
 7 200PleaseB~ 54.251.45~ ec2-54-251-45-250.ap-southeast-~ UNITED STATES Unknown    200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
 8 200PleaseB~ 54.252.97~ ec2-54-252-97-95.ap-southeast-2~ AUSTRALIA     Sydney     200Plea~ Mozilla/5.0 (compatible; 200PleaseBot/1.0~
 9 4seohuntBot 91.237.3.4 0                                (Unknown Cou~ (Unknown ~ 4seohun~ Mozilla/5.0 (compatible; 4SeoHuntBot; +ht~
10 50.nu/0.01  69.72.255~ ded1.innovateit.com              UNITED STATES Montville~ 50.nu    50.nu/0.01 ( +http://50.nu/bot.html )     
# ... with 16,085 more rows
bots2 <- read_csv("botlist.csv")
bots2
# A tibble: 307 x 1
   user_agent                                                                                 
   <chr>                                                                                      
 1 /1.0                                                                                       
 2 AcoiRobot/1.0 libwww/5.3.2                                                                 
 3 Acoon Robot v1.01 (www.acoon.de)                                                           
 4 AgentName/0.1 libwww-perl/5.50                                                             
 5 AlkalineBOT/1.4 (1.4.0326.0 RTM)                                                           
 6 AltaVista Intranet*                                                                        
 7 AnzwersCrawl/2.0 (anzwerscrawl@anzwers.com.au; http://faq.anzwers.com.au/anzwerscrawl.html)
 8 Apache-HttpAsyncClient/4.0-beta4 (java 1.5)                                                
 9 Apache-HttpClient/4.2.5 (java 1.5)                                                         
10 AppEngine-Google; (+http://code.google.com/appengine; appid: s~dcde-hrd)                   
# ... with 297 more rows

We clean alllogs by performing the following steps.

all_logs_cl <- alllogs %>%
  filter(!ip %in% bots1[[2]]) %>%                                                 #Remove the IP addresses listed in bots1
  filter(!user_agent %in% bots2$user_agent) %>%                                   #Remove the user agents listed in bots2
  mutate(user_agent = str_to_lower(user_agent)) %>%                               #Transform the user agents to lower case
  filter(!str_detect(user_agent, "bot|spider|crawler|flipboard|gnip|feedly")) %>% #Remove all agents containing bot|spider|crawler|flipboard|gnip|feedly
  filter(!str_detect(file_name, "/robots.txt"))                                   #Remove requests for /robots.txt

all_logs_cl
# A tibble: 520,397 x 11
   ip     file_name        status_code object_size origin        user_agent      data_origin time                 diff group session.id
   <chr>  <chr>                  <int>       <int> <chr>         <chr>           <chr>       <dttm>              <dbl> <dbl>      <int>
 1 110.3~ GET /wp-login.p~         404         509 -             mozilla/5.0 (w~ mma         2017-07-31 06:25:39     0     0          1
 2 110.3~ GET / HTTP/1.1           200       11799 -             mozilla/5.0 (w~ mma         2017-07-31 06:25:40     0     0          1
 3 40.77~ GET / HTTP/1.1           200        4044 -             mozilla/5.0 (w~ mma         2017-08-01 23:10:20     0    11         83
 4 104.2~ GET /predictive~         200       12836 -             mozilla/5.0 (x~ blog        2017-07-31 06:28:54     0     0        180
 5 104.2~ GET /predictive~         200        1304 http://www.m~ mozilla/5.0 (x~ blog        2017-07-31 06:28:54     0     0        180
 6 104.2~ GET /predictive~         200        4422 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
 7 104.2~ GET /predictive~         200        7830 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
 8 104.2~ GET /predictive~         200        2558 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
 9 104.2~ GET /predictive~         200       36654 http://www.m~ mozilla/5.0 (x~ mma         2017-07-31 06:28:54     0     0        180
10 104.2~ GET /predictive~         200         971 http://www.m~ mozilla/5.0 (x~ blog        2017-07-31 06:28:54     0     0        180
# ... with 520,387 more rows
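
As a quick check, the row counts shown above imply that the cleaning removed 699,852 - 520,397 = 179,455 requests, roughly 26% of the raw log:

# Number of requests dropped as bot traffic by the cleaning steps
nrow(alllogs) - nrow(all_logs_cl)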

Exercise

Remove the bots from the logs data using the steps performed above. Save your result as logs_cleaned.
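
One possible solution, sketched below, simply reuses the pipeline above; it assumes the logs dataset has the same columns as alllogs (ip, user_agent, file_name, ...).

# Sketch of a possible solution; assumes logs has the same column layout as alllogs
logs_cleaned <- logs %>%
  filter(!ip %in% bots1[[2]]) %>%                                                 #Remove the IP addresses listed in bots1
  filter(!user_agent %in% bots2$user_agent) %>%                                   #Remove the user agents listed in bots2
  mutate(user_agent = str_to_lower(user_agent)) %>%                               #Transform the user agents to lower case
  filter(!str_detect(user_agent, "bot|spider|crawler|flipboard|gnip|feedly")) %>% #Remove agents containing typical bot keywords
  filter(!str_detect(file_name, "/robots.txt"))                                   #Remove requests for /robots.txt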

To download the all_logs_ugent dataset, click here.

To download the botIP dataset, click here.

To download the botlist dataset, click here.

To download the logs dataset, click here.


Assume that: