As a first step of the fourth phase, we create the full clickstream path.
all_logs_cl_3 <- all_logs_cl_3 %>% mutate(fullpath = paste0(data_origin, str_replace_all(str_replace_all(file_name,"GET ","")," .*","")))
all_logs_cl_3
# A tibble: 30,571 x 12
ip file_name status_code object_size origin user_agent data_origin time diff group session.id fullpath
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int> <chr>
1 104.2~ GET /predic~ 200 12836 - mozilla/5.0~ blog 2017-07-31 06:28:54 0 0 180 blog/pred~
2 104.2~ GET /predic~ 200 12836 - mozilla/5.0~ blog 2017-07-31 06:37:28 0 0 180 blog/pred~
3 52.17~ GET /predic~ 200 12836 - mozilla/5.0~ blog 2017-07-31 06:33:24 0 0 183 blog/pred~
4 5.248~ GET /predic~ 200 47401 http://w~ mozilla/5.0~ blog 2017-07-31 07:03:01 0 0 306 blog/pred~
5 113.1~ GET /faq.ht~ 200 4548 http://w~ mozilla/5.0~ mma 2017-07-31 07:04:20 0 0 308 mma/faq.h~
6 112.6~ GET /predic~ 200 14682 - mozilla/5.0~ blog 2017-07-31 07:17:30 0 0 398 blog/pred~
7 112.6~ GET /predic~ 200 14681 - mozilla/5.0~ blog 2017-07-31 07:17:33 0 0 398 blog/pred~
8 112.6~ GET /predic~ 200 14682 - mozilla/5.0~ blog 2017-07-31 10:19:33 1 1 399 blog/pred~
9 112.6~ GET /predic~ 200 14681 - mozilla/5.0~ blog 2017-07-31 10:19:35 0 1 399 blog/pred~
10 112.6~ GET /predic~ 200 14682 - mozilla/5.0~ blog 2017-07-31 15:17:24 1 2 400 blog/pred~
# ... with 30,561 more rows
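To see what the two nested str_replace_all() calls do, here is a minimal sketch on two made-up request strings. The values of file_name and data_origin below are hypothetical; they only imitate the shape of the real columns.

```r
library(stringr)

# Hypothetical request strings in the same shape as the file_name column
file_name   <- c("GET /faq.htm HTTP/1.1", "GET /predictive/index.html HTTP/1.0")
data_origin <- c("mma", "blog")

# Step 1: drop the HTTP method
path <- str_replace_all(file_name, "GET ", "")
# Step 2: drop everything from the first remaining space onwards (the protocol)
path <- str_replace_all(path, " .*", "")

# Prefix the path with the site it belongs to
fullpath <- paste0(data_origin, path)
fullpath
```

Note that only GET requests are cleaned up here; a request line starting with another method would keep its method in the path.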
We create another variable, full2, that renames some of the paths.
Basically, we are creating more general categories.
Therefore, every path that contains blog, Blog or About_Me gets the single label ‘blog’; all other paths are kept as they are.
all_logs_cl_3 <- all_logs_cl_3 %>% mutate(full2 = if_else(str_detect(fullpath,"blog|Blog|About_Me"), 'blog',fullpath))
all_logs_cl_3
# A tibble: 30,571 x 13
ip file_name status_code object_size origin user_agent data_origin time diff group session.id fullpath full2
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int> <chr> <chr>
1 104.2~ GET /pred~ 200 12836 - mozilla/5.~ blog 2017-07-31 06:28:54 0 0 180 blog/pre~ blog
2 104.2~ GET /pred~ 200 12836 - mozilla/5.~ blog 2017-07-31 06:37:28 0 0 180 blog/pre~ blog
3 52.17~ GET /pred~ 200 12836 - mozilla/5.~ blog 2017-07-31 06:33:24 0 0 183 blog/pre~ blog
4 5.248~ GET /pred~ 200 47401 http:/~ mozilla/5.~ blog 2017-07-31 07:03:01 0 0 306 blog/pre~ blog
5 113.1~ GET /faq.~ 200 4548 http:/~ mozilla/5.~ mma 2017-07-31 07:04:20 0 0 308 mma/faq.~ mma/~
6 112.6~ GET /pred~ 200 14682 - mozilla/5.~ blog 2017-07-31 07:17:30 0 0 398 blog/pre~ blog
7 112.6~ GET /pred~ 200 14681 - mozilla/5.~ blog 2017-07-31 07:17:33 0 0 398 blog/pre~ blog
8 112.6~ GET /pred~ 200 14682 - mozilla/5.~ blog 2017-07-31 10:19:33 1 1 399 blog/pre~ blog
9 112.6~ GET /pred~ 200 14681 - mozilla/5.~ blog 2017-07-31 10:19:35 0 1 399 blog/pre~ blog
10 112.6~ GET /pred~ 200 14682 - mozilla/5.~ blog 2017-07-31 15:17:24 1 2 400 blog/pre~ blog
# ... with 30,561 more rows
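The relabelling can be checked on a small toy tibble first. The three paths below are invented for illustration; only the pattern "blog|Blog|About_Me" is taken from the code above.

```r
library(dplyr)
library(stringr)

# Hypothetical paths: one blog page, one MMA page, one About_Me page
toy <- tibble(fullpath = c("blog/predictive/post.html",
                           "mma/faq.htm",
                           "mma/About_Me.htm"))

# Paths matching the pattern collapse to "blog"; the rest stay unchanged
toy <- toy %>%
  mutate(full2 = if_else(str_detect(fullpath, "blog|Blog|About_Me"),
                         "blog", fullpath))
toy
```

In this sketch the About_Me page is labelled ‘blog’ even though its path starts with mma, because the pattern is matched anywhere in the string.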
For now, we assume that we are mainly interested in the MMA website, so we will look only at the clickstreams that contain MMA as an item somewhere in the stream (see below). We re-label the items that are not MMA with more general names (i.e., everything before the first slash). We also shorten the MMA links (for plotting) to at most 15 characters, including the trailing ellipsis that str_trunc adds. Finally, we order the data by IP address and time.
all_logs_cl_3 <- all_logs_cl_3 %>% mutate(full2 = if_else(str_detect(full2, "mma"),str_trunc(full2,15),str_replace_all(full2, "/.*","")))
all_logs_cl_3 <- all_logs_cl_3 %>% arrange(ip,time)
all_logs_cl_3
# A tibble: 30,571 x 13
ip file_name status_code object_size origin user_agent data_origin time diff group session.id fullpath full2
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int> <chr> <chr>
1 1.186~ GET /Keyb~ 200 4239 http:/~ mozilla/5.~ mma 2017-12-12 04:47:59 0 0 58442 mma/Keyb~ mma/~
2 1.186~ GET /faq.~ 200 4548 http:/~ mozilla/5.~ mma 2017-10-29 21:35:02 0 0 87839 mma/faq.~ mma/~
3 1.186~ GET /Guid~ 200 7410 http:/~ mozilla/5.~ mma 2017-10-29 21:35:16 0 0 87839 mma/Guid~ mma/~
4 1.214~ GET /pred~ 200 9487 https:~ mozilla/5.~ blog 2017-11-29 07:41:34 0 0 54463 blog/pre~ blog
5 1.215~ GET /pred~ 200 9282 http:/~ mozilla/5.~ blog 2017-08-14 06:41:18 0 0 21257 blog/pre~ blog
6 1.22.~ GET /pred~ 200 10138 https:~ mozilla/5.~ blog 2017-11-17 08:38:00 0 0 51081 blog/pre~ blog
7 1.225~ GET /pred~ 200 9487 https:~ mozilla/5.~ blog 2018-01-19 14:31:34 0 0 64820 blog/pre~ blog
8 1.34.~ GET /pred~ 200 9487 https:~ mozilla/5.~ blog 2017-11-29 15:50:09 0 0 54844 blog/pre~ blog
9 1.36.~ GET /pred~ 200 4607 https:~ mozilla/5.~ blog 2017-10-26 18:51:54 0 0 87516 blog/pre~ blog
10 1.46.~ GET /cont~ 200 4128 http:/~ mozilla/5.~ mma 2017-12-12 06:17:49 0 0 58452 mma/cont~ mma/~
# ... with 30,561 more rows
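A small sketch of this step on invented labels shows both branches of the if_else. The four values of full2 are hypothetical and chosen so that one MMA link is longer than 15 characters.

```r
library(dplyr)
library(stringr)

# Hypothetical labels: a long MMA link, a short MMA link, and two non-MMA items
full2 <- c("mma/Guidelines_long.htm", "mma/faq.htm", "blog", "other/page.html")

relabelled <- if_else(str_detect(full2, "mma"),
                      str_trunc(full2, 15),              # MMA: cap the width at 15
                      str_replace_all(full2, "/.*", "")) # others: keep text before first slash
relabelled
```

The long MMA link is cut so that the result, ellipsis included, is 15 characters wide; links that are already short enough are left untouched.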
Split the data on every session to see the streams per session.
splits <- all_logs_cl_3 %>% group_split(session.id)
head(splits,3)
<list_of<
tbl_df<
ip : character
file_name : character
status_code: integer
object_size: integer
origin : character
user_agent : character
data_origin: character
time : datetime<local>
diff : double
group : double
session.id : integer
fullpath : character
full2 : character
>
>[3]>
[[1]]
# A tibble: 2 x 13
ip file_name status_code object_size origin user_agent data_origin time diff group session.id fullpath full2
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int> <chr> <chr>
1 104.2~ GET /predic~ 200 12836 - mozilla/5.~ blog 2017-07-31 06:28:54 0 0 180 blog/pre~ blog
2 104.2~ GET /predic~ 200 12836 - mozilla/5.~ blog 2017-07-31 06:37:28 0 0 180 blog/pre~ blog
[[2]]
# A tibble: 1 x 13
ip file_name status_code object_size origin user_agent data_origin time diff group session.id fullpath full2
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int> <chr> <chr>
1 52.17~ GET /predic~ 200 12836 - mozilla/5.~ blog 2017-07-31 06:33:24 0 0 183 blog/pre~ blog
[[3]]
# A tibble: 1 x 13
ip file_name status_code object_size origin user_agent data_origin time diff group session.id fullpath full2
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int> <chr> <chr>
1 5.248~ GET /predi~ 200 47401 http://~ mozilla/5~ blog 2017-07-31 07:03:01 0 0 306 blog/pre~ blog
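To make the behaviour of group_split concrete, here is a minimal sketch on a toy tibble with made-up session ids and labels. It returns one tibble per session, ordered by the session id.

```r
library(dplyr)

# Hypothetical mini log: four page views spread over three sessions
toy <- tibble(session.id = c(2L, 1L, 2L, 3L),
              full2      = c("mma/faq.htm", "blog", "mma/index.htm", "blog"))

# One list element per distinct session.id
splits_toy <- toy %>% group_split(session.id)

length(splits_toy)      # number of sessions
splits_toy[[2]]$full2   # the items visited in session 2, in row order
```

Within each element the rows keep the order they had in the input, which is why the data were arranged by ip and time before splitting.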
Our items/categories are in the column named full2, which is the 13th column.
If you would like to make a more general analysis, you could use the column data_origin instead.
Since we are only interested in the clickstreams containing MMA, we keep only those streams.
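Before running this on the real data, the extraction and filtering can be tried on a toy list. The sketch below is an alternative formulation, not the code used in this tutorial: it pulls the column by name rather than by position, so it does not depend on full2 being the 13th column. The list splits_toy and its contents are hypothetical.

```r
library(purrr)
library(stringr)

# Hypothetical sessions, each holding only the full2 column
splits_toy <- list(
  data.frame(full2 = c("blog", "blog"), stringsAsFactors = FALSE),
  data.frame(full2 = c("mma/faq.htm", "blog"), stringsAsFactors = FALSE),
  data.frame(full2 = "mma/index.htm", stringsAsFactors = FALSE)
)

# Extract the full2 column of every session as a character vector
clstr_toy <- map(splits_toy, "full2")

# Keep only the sessions in which at least one item contains "mma"
clstr_toy <- keep(clstr_toy, function(x) any(str_detect(x, "mma")))
clstr_toy
```

The first toy session contains no MMA item and is dropped; the other two are kept.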
clstr <- splits %>% map(., function (x) t(x)[13,])
#Or: clstr <- lapply(splits,function(x) t(x)[13,])
clstr <- clstr[map_int(clstr, function (x) sum(str_detect(x,"mma"))) > 0]
head(clstr)
[[1]]
full2
"mma/faq.htm"
[[2]]
full2
"mma/faq.htm"
[[3]]
[1] "mma/CI_start.htm" "mma/index.htm"
[[4]]
full2
"mma/faq.htm"
[[5]]
full2
"mma/faq.htm"
[[6]]
full2
"mma/index.htm"
Let's work further on the logs dataset. Create a new variable fullpath and save your result as logs_cleaned3. Next, split the data on every user_agent and store your result as splits.
Finally, create the variable clstr, based on the seventh column, data_origin.
To download the all_logs_ugent dataset, click here1.
To download the logs dataset, click here2.
Assume that the logs_cleaned3 variable that was calculated in the previous exercise is given.