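In a third cleaning step, we keep only real page views (pageviews) and drop requests for images and other embedded resources. We do this by keeping only requests whose file name contains htm (which matches both .htm and .html) or is the root path /. PDF files could also be counted as pageviews if desired.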
all_logs_cl_3 <- all_logs_cl_2 %>% filter(str_detect(file_name, "htm") | file_name == "/")
all_logs_cl_3
# A tibble: 30,571 x 11
ip file_name status_code object_size origin user_agent data_origin time diff group session.id
<chr> <chr> <int> <int> <chr> <chr> <chr> <dttm> <dbl> <dbl> <int>
1 104.2~ GET /predictive~ 200 12836 - mozilla/5.0 (x1~ blog 2017-07-31 06:28:54 0 0 180
2 104.2~ GET /predictive~ 200 12836 - mozilla/5.0 (x1~ blog 2017-07-31 06:37:28 0 0 180
3 52.17~ GET /predictive~ 200 12836 - mozilla/5.0 (x1~ blog 2017-07-31 06:33:24 0 0 183
4 5.248~ GET /predictive~ 200 47401 http://www.~ mozilla/5.0 (wi~ blog 2017-07-31 07:03:01 0 0 306
5 113.1~ GET /faq.htm HT~ 200 4548 http://www.~ mozilla/5.0 (li~ mma 2017-07-31 07:04:20 0 0 308
6 112.6~ GET /predictive~ 200 14682 - mozilla/5.0 (ma~ blog 2017-07-31 07:17:30 0 0 398
7 112.6~ GET /predictive~ 200 14681 - mozilla/5.0 (ip~ blog 2017-07-31 07:17:33 0 0 398
8 112.6~ GET /predictive~ 200 14682 - mozilla/5.0 (ma~ blog 2017-07-31 10:19:33 1 1 399
9 112.6~ GET /predictive~ 200 14681 - mozilla/5.0 (ip~ blog 2017-07-31 10:19:35 0 1 399
10 112.6~ GET /predictive~ 200 14682 - mozilla/5.0 (ma~ blog 2017-07-31 15:17:24 1 2 400
# ... with 30,561 more rows
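If PDF documents should also count as pageviews, the pattern in the filter can be widened. The following is only a minimal sketch on a small made-up tibble: `toy_logs` and its values are hypothetical, and only the column name `file_name` mirrors the logs above.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; only the column name file_name mirrors the real logs.
toy_logs <- tibble(
  file_name = c("GET /faq.htm HTTP/1.1",
                "GET /img/logo.png HTTP/1.1",
                "GET /docs/syllabus.pdf HTTP/1.1",
                "GET /index.html HTTP/1.1")
)

# Keep requests whose file name contains "htm" (covers .htm and .html) or ".pdf".
toy_pageviews <- toy_logs %>%
  filter(str_detect(file_name, "htm|\\.pdf"))

toy_pageviews
```

The image request is dropped, while the .htm, .html and .pdf requests are kept. Whether PDFs should count as pageviews depends on the question being asked of the logs.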
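Next, we will calculate some general insights: the number of unique visitors (approximated by unique IP addresses), the average number of pageviews per visitor, and the number of views per website.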
uniqueVis <- unique(all_logs_cl_3$ip)
length(uniqueVis)
[1] 6816
nrow(all_logs_cl_3)/length(uniqueVis)
[1] 4.485182
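The average above says nothing about how views are spread over visitors. As a sketch, the views per IP address could be tabulated with `count()`; the `toy_views` data below is hypothetical and stands in for all_logs_cl_3.

```r
library(dplyr)

# Hypothetical toy data standing in for all_logs_cl_3 (only the ip column is used).
toy_views <- tibble(ip = c("a", "a", "a", "b", "c", "c"))

# One row per visitor with the number of views, most active visitors first.
views_per_ip <- toy_views %>%
  count(ip, name = "views", sort = TRUE)

# The mean of the per-visitor counts equals nrow(toy_views) / n_distinct(toy_views$ip).
avg_views <- mean(views_per_ip$views)

views_per_ip
avg_views
```

Here `n_distinct()` is the dplyr counterpart of `length(unique(...))` used above.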
Views per website, based on all requests:
viewWebpage <- all_logs_cl_2 %>% group_by(data_origin) %>% summarise(views = n())
g1 <- ggplot(data = viewWebpage, aes(x = data_origin, y = views))
g1 + geom_bar(stat = 'identity')
Views per website, with only pageviews:
viewWebpage2 <- all_logs_cl_3 %>% group_by(data_origin) %>% summarise(views = n())
g2 <- ggplot(data = viewWebpage2, aes(x = data_origin, y = views))
g2 + geom_bar(stat = 'identity')
Comparing the two plots suggests that the blog probably contains more pictures that have to be loaded: its number of requests drops more strongly once only pageviews are kept.
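The comparison between the two plots can also be made numerically by computing, per website, the share of requests that are pageviews. This is a sketch only: the counts in `toy_all` and `toy_pages` are invented for illustration and are not the real results.

```r
library(dplyr)

# Hypothetical counts, standing in for viewWebpage and viewWebpage2.
toy_all   <- tibble(data_origin = c("blog", "mma"), views = c(1000, 400))
toy_pages <- tibble(data_origin = c("blog", "mma"), views = c(200, 200))

# Join both summaries and compute the fraction of requests that are pageviews.
pageview_share <- toy_all %>%
  inner_join(toy_pages, by = "data_origin", suffix = c("_all", "_pages")) %>%
  mutate(share = views_pages / views_all)

pageview_share
```

A lower share for one website would indicate that each of its pages triggers more additional requests, such as images.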
Extract the pageviews from the logs data and store it as logs_cleaned3.
Compute the number of unique visitors, the number of page views per visitor, and the views per webpage of the pageviews and store it as unique_visitors, page_views_per_visitor, and views_per_webpage, respectively.
To download the all_logs_ugent dataset click here1.
To download the logs dataset click here2.
Assume that:
The logs_cleaned2 variable that was calculated in the previous exercise is given.
The bots1 and bots2 data are given.