In this section, we will simulate survival data using the sim.survdata() function, which is part of the coxed library. Our simulated data will represent the observed wait times (in seconds) for 2,000 customers who have phoned a call center. In this context, censoring occurs if a customer hangs up before his or her call is answered.

There are three covariates: Operators (the number of call center operators available at the time of the call, which can range from 5 to 15), Center (either A, B, or C), and Time of day (Morning, Afternoon, or Evening). We generate data for these covariates so that all possibilities are equally likely: for instance, morning, afternoon and evening calls are equally likely, and any number of operators from 5 to 15 is equally likely.

set.seed(4)
N <- 2000
Operators <- sample(5:15, N, replace = T)
Center <- sample(c("A", "B", "C"), N, replace = T)
Time <- sample(c("Morn.", "After.", "Even."), N, replace = T)
X <- model.matrix(~ Operators + Center + Time)[, -1]

It is worthwhile to take a peek at the design matrix X, so that we can be sure that we understand how the variables have been coded.

> X[1:5,]
  Operators CenterB CenterC TimeEven. TimeMorn.
1        12       1       0         0         1
2        15       0       0         0         0
3         7       0       1         1         0
4         7       0       0         0         0
5        11       0       1         0         1

Next, we specify the coefficients and the hazard function.

true.beta <- c(0.04, -0.3, 0, 0.2, -0.2)
h.fn <- function(x) return(0.00001 * x)

Here, we have set the coefficient associated with Operators to equal 0.04; in other words, each additional operator leads to a $e^{0.04} = 1.041$-fold increase in the “risk” that the call will be answered, given the Center and Time covariates. This makes sense: the greater the number of operators at hand, the shorter the wait time! The coefficient associated with Center = B is −0.3, and Center = A is treated as the baseline. This means that the risk of a call being answered at Center B is 0.74 times the risk that it will be answered at Center A; in other words, the wait times are a bit longer at Center B.

We are now ready to generate data under the Cox proportional hazards model. The sim.survdata() function allows us to specify the maximum possible failure time, which in this case corresponds to the longest possible wait time for a customer; we set this to equal 1,000 seconds.

> library(coxed)
> queuing <- sim.survdata(N = N, T = 1000, X = X, beta = true.beta, hazard.fun = h.fn)
Warning in FUN(X[[i]], ...) :
  9 additional observations right-censored because the user-supplied hazard function
                                  is nonzero at the latest timepoint. To avoid these extra censored observations, increase T
> names(queuing)
[1] "data"             "xdata"            "baseline"        
[4] "xb"               "exp.xb"           "betas"           
[7] "ind.survive"      "marg.effect"      "marg.effect.data"

The “observed” data is stored in queuing$data, with y corresponding to the event time and failed an indicator of whether the call was answered (failed = T) or the customer hung up before the call was answered (failed = F). We see that almost 90% of calls were answered.

> head(queuing$data)
  Operators CenterB CenterC TimeEven. TimeMorn.   y failed
1        12       1       0         0         1 344   TRUE
2        15       0       0         0         0 241   TRUE
3         7       0       1         1         0 187   TRUE
4         7       0       0         0         0 279   TRUE
5        11       0       1         0         1 954   TRUE
6         7       1       0         0         1 455   TRUE
> mean(queuing$data$failed)
[1] 0.89

Questions

Create two Kaplan-Meier survival curves, one stratified by Center, one stratified by Time.
MC1 (only consider what you can see visually, do not test these statements yet with statistical significance):
A) It seems that calls at Call Center B take longer to be answered than calls at Centers A and C.
B) It appears that wait times are longest in the morning and shortest in the evening hours.
- 1) Both statements are true.
- 2) Both statements are false.
- 3) A is true and B is false.
- 4) A is false and B is true.
Now statistically test whether both statements below are true by using a log-rank test.
MC2:
A) There is a significant difference between centers.
B) There is a significant difference between times of day.
- 1) Both statements are true.
- 2) Both statements are false.
- 3) A is true and B is false.
- 4) A is false and B is true.
Next fit a Cox’s proportional hazards model to the data (use all predictors).
MC3: Does the model do a good job estimating the true coefficients of the predictors (which we know since we generated the data)?
- 1) Yes, the estimated coefficients lie close to the true coefficients.
- 2) No, the estimated coefficients lie far away form the true coefficients.

Assume that:

The ISLR2, survival, and coxed libraries has been loaded