In this section, we will simulate survival data using the sim.survdata()
function, which is part of the coxed
library. Our simulated data will represent
the observed wait times (in seconds) for 2,000 customers who have
phoned a call center. In this context, censoring occurs if a customer hangs
up before his or her call is answered.
There are three covariates: Operators
(the number of call center operators
available at the time of the call, which can range from 5 to 15), Center
(either A, B, or C), and Time
of day (Morning, Afternoon, or Evening). We
generate data for these covariates so that all possibilities are equally likely:
for instance, morning, afternoon and evening calls are equally likely, and
any number of operators from 5 to 15 is equally likely.
set.seed(4)
N <- 2000
Operators <- sample(5:15, N, replace = T)
Center <- sample(c("A", "B", "C"), N, replace = T)
Time <- sample(c("Morn.", "After.", "Even."), N, replace = T)
X <- model.matrix(~ Operators + Center + Time)[, -1]
It is worthwhile to take a peek at the design matrix X
, so that we can be
sure that we understand how the variables have been coded.
> X[1:5,]
Operators CenterB CenterC TimeEven. TimeMorn.
1 12 1 0 0 1
2 15 0 0 0 0
3 7 0 1 1 0
4 7 0 0 0 0
5 11 0 1 0 1
Next, we specify the coefficients and the hazard function.
true.beta <- c(0.04, -0.3, 0, 0.2, -0.2)
h.fn <- function(x) return(0.00001 * x)
Here, we have set the coefficient associated with Operators
to equal 0.04; in
other words, each additional operator leads to a \(e^{0.04} = 1.041\)-fold increase
in the “risk” that the call will be answered, given the Center
and Time
covariates. This makes sense: the greater the number of operators at hand,
the shorter the wait time! The coefficient associated with Center = B
is
−0.3, and Center = A
is treated as the baseline. This means that the risk
of a call being answered at Center B is 0.74 times the risk that it will be
answered at Center A; in other words, the wait times are a bit longer at
Center B.
We are now ready to generate data under the Cox proportional hazards
model. The sim.survdata()
function allows us to specify the maximum
possible failure time, which in this case corresponds to the longest possible
wait time for a customer; we set this to equal 1,000 seconds.
> library(coxed)
> queuing <- sim.survdata(N = N, T = 1000, X = X, beta = true.beta, hazard.fun = h.fn)
Warning in FUN(X[[i]], ...) :
9 additional observations right-censored because the user-supplied hazard function
is nonzero at the latest timepoint. To avoid these extra censored observations, increase T
> names(queuing)
[1] "data" "xdata" "baseline"
[4] "xb" "exp.xb" "betas"
[7] "ind.survive" "marg.effect" "marg.effect.data"
The “observed” data is stored in queuing$data
, with y
corresponding to
the event time and failed
an indicator of whether the call was answered
(failed = T
) or the customer hung up before the call was answered (failed = F
).
We see that almost 90% of calls were answered.
> head(queuing$data)
Operators CenterB CenterC TimeEven. TimeMorn. y failed
1 12 1 0 0 1 344 TRUE
2 15 0 0 0 0 241 TRUE
3 7 0 1 1 0 187 TRUE
4 7 0 0 0 0 279 TRUE
5 11 0 1 0 1 954 TRUE
6 7 1 0 0 1 455 TRUE
> mean(queuing$data$failed)
[1] 0.89
Center
, one stratified by Time
.Assume that:
ISLR2
, survival
, and coxed
libraries has been loaded