STEP 4. Creating clickstream

As the first step of the fourth phase, we create a full clickstream path: the site name in data_origin followed by the requested path extracted from file_name.

all_logs_cl_3 <- all_logs_cl_3 %>% mutate(fullpath = paste0(data_origin, str_replace_all(str_replace_all(file_name,"GET ","")," .*","")))

all_logs_cl_3
# A tibble: 30,571 x 12
   ip     file_name    status_code object_size origin    user_agent   data_origin time                 diff group session.id fullpath  
   <chr>  <chr>              <int>       <int> <chr>     <chr>        <chr>       <dttm>              <dbl> <dbl>      <int> <chr>     
 1 104.2~ GET /predic~         200       12836 -         mozilla/5.0~ blog        2017-07-31 06:28:54     0     0        180 blog/pred~
 2 104.2~ GET /predic~         200       12836 -         mozilla/5.0~ blog        2017-07-31 06:37:28     0     0        180 blog/pred~
 3 52.17~ GET /predic~         200       12836 -         mozilla/5.0~ blog        2017-07-31 06:33:24     0     0        183 blog/pred~
 4 5.248~ GET /predic~         200       47401 http://w~ mozilla/5.0~ blog        2017-07-31 07:03:01     0     0        306 blog/pred~
 5 113.1~ GET /faq.ht~         200        4548 http://w~ mozilla/5.0~ mma         2017-07-31 07:04:20     0     0        308 mma/faq.h~
 6 112.6~ GET /predic~         200       14682 -         mozilla/5.0~ blog        2017-07-31 07:17:30     0     0        398 blog/pred~
 7 112.6~ GET /predic~         200       14681 -         mozilla/5.0~ blog        2017-07-31 07:17:33     0     0        398 blog/pred~
 8 112.6~ GET /predic~         200       14682 -         mozilla/5.0~ blog        2017-07-31 10:19:33     1     1        399 blog/pred~
 9 112.6~ GET /predic~         200       14681 -         mozilla/5.0~ blog        2017-07-31 10:19:35     0     1        399 blog/pred~
10 112.6~ GET /predic~         200       14682 -         mozilla/5.0~ blog        2017-07-31 15:17:24     1     2        400 blog/pred~
# ... with 30,561 more rows
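
To make the two nested str_replace_all() calls easier to follow, here is a minimal sketch on made-up request strings. The toy tibble, its values, and the " HTTP/1.1" suffix are assumptions for illustration, not rows from the real logs. The inner call strips the leading "GET "; the outer call drops everything from the first remaining space onward.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data, not taken from the real log files
toy <- tibble(
  data_origin = c("blog", "mma"),
  file_name   = c("GET /post.html HTTP/1.1", "GET /faq.htm HTTP/1.1")
)

toy <- toy %>%
  mutate(
    path     = str_replace_all(file_name, "GET ", ""),  # drop the request method
    path     = str_replace_all(path, " .*", ""),        # drop everything after the path
    fullpath = paste0(data_origin, path)                # prefix with the site name
  )

toy$fullpath
# [1] "blog/post.html" "mma/faq.htm"
```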

Next, we create another variable, full2, that collapses some of the paths into more general categories. Every path that contains blog, Blog, or About_Me gets the new label ‘blog’; all other paths are kept as they are.

all_logs_cl_3 <- all_logs_cl_3 %>% mutate(full2 = if_else(str_detect(fullpath,"blog|Blog|About_Me"), 'blog',fullpath)) 

all_logs_cl_3
# A tibble: 30,571 x 13
   ip     file_name  status_code object_size origin  user_agent  data_origin time                 diff group session.id fullpath  full2
   <chr>  <chr>            <int>       <int> <chr>   <chr>       <chr>       <dttm>              <dbl> <dbl>      <int> <chr>     <chr>
 1 104.2~ GET /pred~         200       12836 -       mozilla/5.~ blog        2017-07-31 06:28:54     0     0        180 blog/pre~ blog 
 2 104.2~ GET /pred~         200       12836 -       mozilla/5.~ blog        2017-07-31 06:37:28     0     0        180 blog/pre~ blog 
 3 52.17~ GET /pred~         200       12836 -       mozilla/5.~ blog        2017-07-31 06:33:24     0     0        183 blog/pre~ blog 
 4 5.248~ GET /pred~         200       47401 http:/~ mozilla/5.~ blog        2017-07-31 07:03:01     0     0        306 blog/pre~ blog 
 5 113.1~ GET /faq.~         200        4548 http:/~ mozilla/5.~ mma         2017-07-31 07:04:20     0     0        308 mma/faq.~ mma/~
 6 112.6~ GET /pred~         200       14682 -       mozilla/5.~ blog        2017-07-31 07:17:30     0     0        398 blog/pre~ blog 
 7 112.6~ GET /pred~         200       14681 -       mozilla/5.~ blog        2017-07-31 07:17:33     0     0        398 blog/pre~ blog 
 8 112.6~ GET /pred~         200       14682 -       mozilla/5.~ blog        2017-07-31 10:19:33     1     1        399 blog/pre~ blog 
 9 112.6~ GET /pred~         200       14681 -       mozilla/5.~ blog        2017-07-31 10:19:35     0     1        399 blog/pre~ blog 
10 112.6~ GET /pred~         200       14682 -       mozilla/5.~ blog        2017-07-31 15:17:24     1     2        400 blog/pre~ blog 
# ... with 30,561 more rows
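
The relabelling can be checked on a small vector. This is only a sketch with hypothetical paths, chosen so that each alternative in the pattern is hit once. Note that str_detect() is case-sensitive and matches anywhere in the string, which is why both blog and Blog appear in the pattern and why a path on another site can still be relabelled.

```r
library(dplyr)
library(stringr)

# Hypothetical paths, not taken from the real logs
paths <- c("blog/post.html", "mma/Blog_link.htm", "mma/About_Me.htm", "mma/faq.htm")

# Paths matching the pattern become "blog"; the others are kept unchanged
full2 <- if_else(str_detect(paths, "blog|Blog|About_Me"), "blog", paths)
full2
# [1] "blog"        "blog"        "blog"        "mma/faq.htm"
```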

For now, we will assume that we are mainly interested in the MMA website. We will look only at the clickstreams that contain MMA as an item somewhere in the stream (see further). Items that are not MMA are relabelled with a more general name (i.e., everything before the first slash). We also shorten the MMA links (for plotting) to at most 15 characters; note that str_trunc() counts the "..." it appends toward this width. Finally, we order the data by ip and time.

all_logs_cl_3 <- all_logs_cl_3 %>% mutate(full2 = if_else(str_detect(full2, "mma"),str_trunc(full2,15),str_replace_all(full2, "/.*",""))) 
all_logs_cl_3 <- all_logs_cl_3 %>% arrange(ip,time)

all_logs_cl_3
# A tibble: 30,571 x 13
   ip     file_name  status_code object_size origin  user_agent  data_origin time                 diff group session.id fullpath  full2
   <chr>  <chr>            <int>       <int> <chr>   <chr>       <chr>       <dttm>              <dbl> <dbl>      <int> <chr>     <chr>
 1 1.186~ GET /Keyb~         200        4239 http:/~ mozilla/5.~ mma         2017-12-12 04:47:59     0     0      58442 mma/Keyb~ mma/~
 2 1.186~ GET /faq.~         200        4548 http:/~ mozilla/5.~ mma         2017-10-29 21:35:02     0     0      87839 mma/faq.~ mma/~
 3 1.186~ GET /Guid~         200        7410 http:/~ mozilla/5.~ mma         2017-10-29 21:35:16     0     0      87839 mma/Guid~ mma/~
 4 1.214~ GET /pred~         200        9487 https:~ mozilla/5.~ blog        2017-11-29 07:41:34     0     0      54463 blog/pre~ blog 
 5 1.215~ GET /pred~         200        9282 http:/~ mozilla/5.~ blog        2017-08-14 06:41:18     0     0      21257 blog/pre~ blog 
 6 1.22.~ GET /pred~         200       10138 https:~ mozilla/5.~ blog        2017-11-17 08:38:00     0     0      51081 blog/pre~ blog 
 7 1.225~ GET /pred~         200        9487 https:~ mozilla/5.~ blog        2018-01-19 14:31:34     0     0      64820 blog/pre~ blog 
 8 1.34.~ GET /pred~         200        9487 https:~ mozilla/5.~ blog        2017-11-29 15:50:09     0     0      54844 blog/pre~ blog 
 9 1.36.~ GET /pred~         200        4607 https:~ mozilla/5.~ blog        2017-10-26 18:51:54     0     0      87516 blog/pre~ blog 
10 1.46.~ GET /cont~         200        4128 http:/~ mozilla/5.~ mma         2017-12-12 06:17:49     0     0      58452 mma/cont~ mma/~
# ... with 30,561 more rows
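
The effect of the two branches can be seen on a few hypothetical labels (the values below are assumptions for illustration). MMA items are truncated to a total width of 15 characters, ellipsis included, while all other items keep only the part before the first slash.

```r
library(dplyr)
library(stringr)

# Hypothetical labels, not taken from the real logs
full2 <- c("mma/faq.htm", "mma/CI_start.htm", "shop/cart/item.html", "blog")

relabelled <- if_else(
  str_detect(full2, "mma"),
  str_trunc(full2, 15),               # MMA items: at most 15 characters
  str_replace_all(full2, "/.*", "")   # other items: keep the part before the first slash
)
relabelled
# [1] "mma/faq.htm"     "mma/CI_start..." "shop"            "blog"
```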

STEP 1. Change the data table into a list

Split the data by session.id, so that each element of the resulting list holds the clicks of one session.

splits <- all_logs_cl_3 %>% group_split(session.id)

head(splits,3)
<list_of<
  tbl_df<
    ip         : character
    file_name  : character
    status_code: integer
    object_size: integer
    origin     : character
    user_agent : character
    data_origin: character
    time       : datetime<local>
    diff       : double
    group      : double
    session.id : integer
    fullpath   : character
    full2      : character
  >
>[3]>
[[1]]
# A tibble: 2 x 13
  ip     file_name    status_code object_size origin user_agent  data_origin time                 diff group session.id fullpath  full2
  <chr>  <chr>              <int>       <int> <chr>  <chr>       <chr>       <dttm>              <dbl> <dbl>      <int> <chr>     <chr>
1 104.2~ GET /predic~         200       12836 -      mozilla/5.~ blog        2017-07-31 06:28:54     0     0        180 blog/pre~ blog 
2 104.2~ GET /predic~         200       12836 -      mozilla/5.~ blog        2017-07-31 06:37:28     0     0        180 blog/pre~ blog 

[[2]]
# A tibble: 1 x 13
  ip     file_name    status_code object_size origin user_agent  data_origin time                 diff group session.id fullpath  full2
  <chr>  <chr>              <int>       <int> <chr>  <chr>       <chr>       <dttm>              <dbl> <dbl>      <int> <chr>     <chr>
1 52.17~ GET /predic~         200       12836 -      mozilla/5.~ blog        2017-07-31 06:33:24     0     0        183 blog/pre~ blog 

[[3]]
# A tibble: 1 x 13
  ip     file_name   status_code object_size origin   user_agent data_origin time                 diff group session.id fullpath  full2
  <chr>  <chr>             <int>       <int> <chr>    <chr>      <chr>       <dttm>              <dbl> <dbl>      <int> <chr>     <chr>
1 5.248~ GET /predi~         200       47401 http://~ mozilla/5~ blog        2017-07-31 07:03:01     0     0        306 blog/pre~ blog
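
As a small illustration of what group_split() returns, the sketch below uses a hypothetical four-row tibble. The result is an unnamed list with one tibble per distinct session.id, ordered by the sorted session ids rather than by the row order of the input.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with three sessions
toy <- tibble(
  session.id = c(2L, 1L, 2L, 3L),
  full2      = c("mma/faq.htm", "blog", "mma/index.htm", "blog")
)

splits_toy <- toy %>% group_split(session.id)

length(splits_toy)        # one list element per distinct session.id
# [1] 3
sapply(splits_toy, nrow)  # number of clicks in sessions 1, 2 and 3
# [1] 1 2 1
```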

STEP 2. Make one row: transpose the table and select the correct row

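Our items/categories are in the column named full2, which is the 13th column. Transposing each session's table and selecting row 13 therefore gives the sequence of items visited in that session. For a more general analysis, you could use the column data_origin instead. Since we are only interested in the clickstreams containing MMA, we keep only the streams in which at least one item matches "mma".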

clstr <- splits %>% map(., function (x) t(x)[13,])
#Or: clstr <- lapply(splits,function(x) t(x)[13,])

clstr <- clstr[map_int(clstr, function (x) sum(str_detect(x,"mma"))) > 0] 

head(clstr)
[[1]]
        full2 
"mma/faq.htm" 

[[2]]
        full2 
"mma/faq.htm" 

[[3]]
[1] "mma/CI_start.htm" "mma/index.htm"

[[4]]
        full2 
"mma/faq.htm" 

[[5]]
        full2 
"mma/faq.htm" 

[[6]]
          full2 
"mma/index.htm" 
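
Selecting row 13 of the transposed table depends on the column order. As a possible alternative (a sketch, not the approach used above), the column can be pulled by name and the list filtered with purrr's keep(). The toy data are hypothetical. Unlike t(x)[13,], pull() returns unnamed vectors, so single-click sessions do not carry the full2 name.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

# Hypothetical toy data: session 2 visits MMA, sessions 1 and 3 do not
toy <- tibble(
  session.id = c(1L, 2L, 2L, 3L),
  full2      = c("blog", "mma/faq.htm", "mma/index.htm", "blog")
)
splits_toy <- toy %>% group_split(session.id)

# Select the item column by name rather than by position
clstr_toy <- map(splits_toy, ~ pull(.x, full2))

# Keep only the streams with at least one MMA item
clstr_toy <- keep(clstr_toy, ~ any(str_detect(.x, "mma")))
clstr_toy
```

Here only the stream of session 2 remains, containing "mma/faq.htm" followed by "mma/index.htm".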

Exercise

Let's work further on the logs dataset. Create a new variable fullpath and save your result as logs_cleaned3. Next, split the data on every user_agent and store your result as splits. Finally, create the variable clstr, based on the seventh column data_origin.

To download the all_logs_ugent dataset click here1.

To download the logs dataset click here2.


Assume that: