
Sutradhar R and Cook RJ. A conditional frailty model for bivariate interval-truncated failure time data: an application to a study on siblings diagnosed with schizophrenia. J Med Stat Inform. 2016; 4:1. http://dx.doi.org/10.7243/2052-6237-4-1
Rinku Sutradhar1,2* and Richard J. Cook3
*Correspondence: Rinku Sutradhar rinku.sutradhar@ices.on.ca
1. Institute for Clinical Evaluative Sciences, Toronto, Ontario M4N 3M5, Canada.
2. Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario M5T 3M7, Canada.
3. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: Paired failure time data often arise in medical studies involving familial information. When each member of a pair is subject to interval-truncation, there is a lack of literature on methodologies for analyzing such data. The aim of this paper is to develop an approach for examining the association between paired failure times in the presence of interval-truncation.
Methods: A conditional frailty model is described and an expectation-maximization algorithm is developed for estimation of the model parameters. Simulation studies are conducted to examine the performance of the proposed algorithm, and an application of the methods is illustrated using data from a familial study on pairs of siblings diagnosed with schizophrenia.
Results: The results from the simulation studies show that all model parameters can be estimated with negligible bias, even under finite sample size. In the motivating dataset, our proposed model and method of estimation reveal a strong presence of dependence among times to diagnosis of schizophrenia within pairs of siblings.
Discussion: The proposed conditional frailty model is easy to implement since the expectation step of the algorithm is relatively straightforward. The maximization step also provides closed form expressions for the parameter estimates of the hazard function.
Keywords: Frailty model, interval-truncation, bivariate failure time data, Monte Carlo EM algorithm
Interval-truncated failure time data arise when only those individuals with a failure time that lies within a certain interval are observed. When a failure time falls outside this truncation interval, no information regarding the subject is available. One may view this as a screening procedure in which all observations outside the interval, known as the truncation region, are removed. It is the concept of "who is included" that distinguishes truncation from censoring. In the case of censoring, we have partial information on all observations in a sample in that subjects are known to have failure times greater (or less) than some value. However, in the case of truncation, individuals with failure times outside the truncation interval are not considered for inclusion in the study. This important difference between truncation and censoring is reflected in the construction of the data likelihood.
Approaches for dealing with censoring in the presence of interval truncation were first discussed by Turnbull [1]. He proposed a method for obtaining the nonparametric maximum likelihood estimate of the distribution function under interval truncation that also accommodates left, right, and interval censoring. Turnbull described a self-consistent approach to estimation that turns out to be a version of the EM algorithm [2]. Frydman noted that the method proposed by Turnbull was not applicable in the case of both truncation and censoring, and suggested appropriate modifications [3]. Alioum and Commenges provided a further detailed correction of Turnbull's approach and offered an extension to regression analysis [4].
We are interested in developing methodology for examining interval-truncated failure times in a bivariate setting. That is, when the available information consists of paired failure times (e.g., we consider families of size two), where, due to the observation scheme, paired failure times are known only if they both fall within their respective truncation intervals. Such data arise in settings where families are observed through family disease registries, for example, or where data are grouped by family by design. When paired data are collected subject to truncation, it is frequently of interest to investigate associations among failure times within families. The existing literature on methods for analyzing interval-truncated failure times in a bivariate setting is very limited and mostly refers to the case when only one component of the bivariate vector is subject to truncation. Exceptions to this assumption include the studies by van der Laan and Gurler [5,6], in which nonparametric estimation of the bivariate survival function is discussed where both components of the bivariate vector are randomly left- or right-truncated, but not interval-truncated. This paper proposes a model and estimation approach for examining interval-truncated bivariate failure time data. Straightforward extensions are possible to deal with the multivariate setting.
Notation
Suppose a pair of individuals is included in a study if both members' event times are observed to fall in the calendar time interval [L, R]. Let Y_{ij} represent the year of birth for the jth member of the ith family and let X_{ij} denote the year of the event for that individual. Let the age at the year of the event for the jth subject of the ith pair be denoted by T_{ij} = X_{ij} - Y_{ij}, and let B_{ij} = [L_{ij}, R_{ij}] = [L - Y_{ij}, R - Y_{ij}] be the corresponding truncation interval in terms of age, where L_{ij} is the age at calendar time L and R_{ij} is the age at calendar time R. Although we observe T_{ij} if it belongs to its corresponding truncation interval B_{ij}, most information regarding familial association is available if two or more event times within a family meet the truncation requirements.
For the ith family with two members or siblings j=1,2, under the assumption of independent truncation, the joint failure times are sampled from the conditional distribution

f(t_{i1}, t_{i2} \mid T_{i1} \in B_{i1}, T_{i2} \in B_{i2}) = \frac{f(t_{i1}, t_{i2})}{P(T_{i1} \in B_{i1}, T_{i2} \in B_{i2})}.   (1)

The likelihood for bivariate truncated failure times with observed data (t_{ij}, B_{ij}=[l_{ij}, r_{ij}]; j=1,2, i=1,\ldots,n) can be expressed simply in terms of the bivariate survivor function S(t_1, t_2) = P(T_1 > t_1, T_2 > t_2):

L = \prod_{i=1}^{n} \frac{f(t_{i1}, t_{i2})}{P_i},   (2)

where

f(t_{i1}, t_{i2}) = \left. \frac{\partial^2 S(t_1, t_2)}{\partial t_1 \, \partial t_2} \right|_{(t_{i1}, t_{i2})}

and

P_i = S(l_{i1}, l_{i2}) - S(l_{i1}, r_{i2}) - S(r_{i1}, l_{i2}) + S(r_{i1}, r_{i2})

is the probability of inclusion in the truncation rectangle B_i = B_{i1} \times B_{i2}.
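As a small illustration of how the denominator in (2) is computed, the following Python sketch evaluates the inclusion probability P_i by the inclusion-exclusion formula above for an arbitrary bivariate survivor function supplied by the user; the independence survivor function in the example is purely illustrative and is not part of the model developed below.

```python
import math

def inclusion_prob(S, rect):
    """P_i = S(l1,l2) - S(l1,r2) - S(r1,l2) + S(r1,r2) for truncation rectangle
    rect = ((l1, r1), (l2, r2)) and bivariate survivor function S(t1, t2)."""
    (l1, r1), (l2, r2) = rect
    return S(l1, l2) - S(l1, r2) - S(r1, l2) + S(r1, r2)

# Illustrative survivor function only: independent unit-exponential failure times.
S_indep = lambda t1, t2: math.exp(-t1) * math.exp(-t2)
print(inclusion_prob(S_indep, rect=((0.5, 3.0), (1.0, 4.0))))
```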
Model formulation: joint distributions using a frailty model
A conditional formulation is a common approach adopted for modeling bivariate failure time data. Joint distributions are routinely formulated by assuming responses are conditionally independent given a common scalar random effect. Integrating the conditional joint distribution with respect to the unobserved random effect provides the joint distribution of the failure times. Suppose (T_1, T_2) are bivariate failure times for a particular pair. The joint survivor function under a conditional model can be written as

S^*(t_1, t_2) = \int_0^\infty P(T_1 > t_1 \mid \alpha) \, P(T_2 > t_2 \mid \alpha) \, dG(\alpha),   (3)

and the marginal survivor function for T_j is

S_j^*(t) = \int_0^\infty P(T_j > t \mid \alpha) \, dG(\alpha).   (4)
Paired failure times T_1 and T_2 are independent given the pair-specific random effect α, which has distribution G(.) indexed by parameter φ.
Various models have been proposed in the literature for the distribution of the random effects: the one-parameter gamma distribution [8], the log-normal distribution [13], the positive stable distribution [9], and the inverse gamma distribution [10]. We assume the random effects arise from a one-parameter gamma distribution with mean 1 and variance φ^{-1}. Furthermore, consider a frailty model in which the random effects act multiplicatively on the hazard rate for each member of a family. Conditional on the random effect, the survivor function for T_j is

P(T_j > t \mid \alpha) = \exp\{-\alpha \Lambda_{0j}^*(t)\},

where Λ^*_{0j}(t) is the cumulative baseline hazard function for T_j (j=1,2) under the conditional model. These assumptions provide closed form expressions for the joint and marginal survivor functions, respectively:
S^*(t_1, t_2) = \left\{ \frac{\phi}{\phi + \Lambda_{01}^*(t_1) + \Lambda_{02}^*(t_2)} \right\}^{\phi}   (5)

and

S_j^*(t) = \left\{ \frac{\phi}{\phi + \Lambda_{0j}^*(t)} \right\}^{\phi}.   (6)
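For completeness, here is a brief derivation of (5) and (6), and of the limiting behaviour discussed next, using the Laplace transform of the gamma frailty distribution (shape and rate both equal to φ, so that E(α)=1 and Var(α)=φ^{-1}); this is a routine calculation included only as a reading aid.

```latex
% Requires amsmath. Laplace transform of a gamma(shape=\phi, rate=\phi) frailty:
% E[e^{-s\alpha}] = \{\phi/(\phi+s)\}^{\phi} for s >= 0.
\begin{align*}
S_j^*(t) &= \int_0^\infty e^{-\alpha \Lambda_{0j}^*(t)}\, dG(\alpha)
          = \left(\frac{\phi}{\phi + \Lambda_{0j}^*(t)}\right)^{\phi}, \\
S^*(t_1,t_2) &= \int_0^\infty e^{-\alpha\{\Lambda_{01}^*(t_1)+\Lambda_{02}^*(t_2)\}}\, dG(\alpha)
          = \left(\frac{\phi}{\phi + \Lambda_{01}^*(t_1) + \Lambda_{02}^*(t_2)}\right)^{\phi}, \\
\lim_{\phi\to\infty}\left(\frac{\phi}{\phi+s}\right)^{\phi}
         &= \lim_{\phi\to\infty}\left(1+\frac{s}{\phi}\right)^{-\phi} = e^{-s}.
\end{align*}
```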
Note that the marginal survivor function S^*_j(t), under the conditional formulation, depends on both the parameter φ and the parameters of the conditional baseline hazard function Λ^*_{0j}(t). Moreover, as φ → ∞, S^*_j(t) approaches exp{-Λ^*_{0j}(t)} for j=1,2 and S^*(t_1,t_2) approaches exp{-Λ^*_{01}(t_1) - Λ^*_{02}(t_2)}. Thus, as the variance of the random effect distribution tends to zero, S^*(t_1,t_2) = S^*_1(t_1) S^*_2(t_2), indicating that T_1 and T_2 are independent.
It is important to interpret the parameters of the model appropriately. The variance parameter φ^{-1} reflects both the lack of fit of the conditional hazard function and the association among paired failure times. Large values of the variance φ^{-1} could indicate that standard survival models may be inadequate by failing to sufficiently account for variation in the data; this can arise simply from poor fit of a model. That is, a large variance may be an indication of unaccounted-for dependence, overdispersion, or other forms of model misspecification, including violations of the proportional hazards assumption. Moreover, as the variance approaches zero, failure times among individuals within a pair approach independence.
Proposed estimation approach: an EM algorithm
Methodological complications arise when both failure times in a pair are observed subject to interval truncation. The inclusion of shared random effects among individuals within a pair, in addition to the presence of interval truncation for each member of the pair, results in a non-trivial likelihood function. Direct maximization of this likelihood function requires integrating the conditional joint distribution with respect to the unobserved random effect, which is not straightforward, particularly if strong distributional assumptions are to be avoided. In addition, if a piecewise constant model is used, direct maximization becomes more challenging as the number of pieces (and hence parameters) increases. Instead of trying to directly maximize over a high dimensional space, the EM algorithm offers an alternative method of estimation that is sometimes slow to converge but avoids occasional divergence problems. We develop an EM algorithm for bivariate interval-truncated failure times under a conditional (random effect) formulation. We use the concept of "ghosts", which was first discussed by Turnbull [1] with the aim of obtaining a nonparametric estimate of a truncated distribution in the univariate setting. Here the notion of "ghost pairs" is used for the analysis of interval-truncated failure times in the bivariate case.
Let t_i = (t_{i1}, t_{i2}) denote the vector of observed failure times for the subjects of the ith pair and let B_i = B_{i1} × B_{i2} = [L_{i1}, R_{i1}] × [L_{i2}, R_{i2}] be the corresponding truncation rectangle. The observed data are represented as X = (X_1, ..., X_n), with X_i = (t_i, B_i) being the observed data for the ith pair (i=1,...,n). To define the complete data, we focus on the complement of the bivariate truncation rectangle, B_i^c. Figure 1 provides an illustration of the data for a particular pair of individuals. The truncation region is represented by the center rectangle, and the point t_i is observed because both T_{i1} ∈ B_{i1} and T_{i2} ∈ B_{i2}.
Figure 1 : Bivariate truncation rectangle for a pair of failure times.
Due to truncation, the pair giving the observations t_i = (t_{i1}, t_{i2}) may be considered a remnant of an unknown number of pairs from the same "birth cohort". Pairs are said to be in the same birth cohort if they have the same dates of birth. We define "ghost pairs" corresponding to pair i as pairs of individuals in the same birth cohort who were not observed because their failure times did not fall in the truncation region B_i. Thus these "ghost pairs" would have been observed had there been no truncation.
To define the complete data, let J_i be the (unknown) number of ghost pairs for the ith pair of observed event times. Suppose u_{ik} = (u_{ik1}, u_{ik2}), k=1,...,J_i, are the corresponding unknown failure times for these ghost pairs. These are the paired times that fall outside the truncation rectangle B_i for the ith observed pair, and are hence considered to arise from the unobserved ghosts. Times within an observed pair or ghost pair are considered independent given their corresponding random effect, which arises from the distribution G(.) indexed by parameter φ. Here the complete data are given by Y = (Y_1, ..., Y_n), with Y_i = (X_i, α_i, J_i, {u_{ik}, α_{ik}; k=1,...,J_i}) being the complete data for the ith pair. Notice that we consider the times of the events for the ghost pairs, as well as the random effects for the observed pair and corresponding ghost pairs, as part of the complete data. The reason for this is that the logarithm of the resulting complete data likelihood is relatively convenient to work with and, as will be seen, maximization is particularly straightforward.
The complete data likelihood and corresponding log-likelihood for the ith pair of individuals are constructed as

L_i^c = \Big\{ g(\alpha_i; \phi) \prod_{j=1}^{2} f_j(t_{ij} \mid \alpha_i) \Big\} \prod_{k=1}^{J_i} \Big\{ g(\alpha_{ik}; \phi) \prod_{j=1}^{2} f_j(u_{ikj} \mid \alpha_{ik}) \Big\}   (7)

and

\ell_i^c = \log g(\alpha_i; \phi) + \sum_{j=1}^{2} \log f_j(t_{ij} \mid \alpha_i) + \sum_{k=1}^{J_i} \Big[ \log g(\alpha_{ik}; \phi) + \sum_{j=1}^{2} \log f_j(u_{ikj} \mid \alpha_{ik}) \Big],   (8)

respectively, where

f_j(t \mid \alpha) = \alpha \, \lambda_{0j}^*(t) \exp\{-\alpha \Lambda_{0j}^*(t)\}   (9)

is the conditional density of T_j given the random effect α, with λ^*_{0j}(t) the conditional baseline hazard function and g(.; φ) the density of the random effect distribution.
We further assume a gamma distribution for the random effects, with p.d.f. given as

g(\alpha; \phi) = \frac{\phi^{\phi} \alpha^{\phi - 1} e^{-\phi \alpha}}{\Gamma(\phi)}, \quad \alpha > 0,
where E(α)=1 and Var(α)=φ^{-1}. Based on these assumptions, the complete data log-likelihood can now be re-expressed; the parameters of this function consist of the parameters of the baseline hazard functions and the parameter of the random effects distribution. These parameters will be estimated via an expectation-maximization algorithm, where a weakly parametric approach will be taken by assuming a piecewise constant model for the baseline hazard function. The mathematical details of this algorithm are provided in the Appendix. We show in the maximization step that the parameter estimates of the baseline hazard function have a simple closed form and the parameter estimate of the random effects distribution can be easily obtained using standard Newton-Raphson procedures.
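To make the structure of (7)-(9) concrete, the following minimal Python sketch (not the authors' implementation) evaluates the complete-data log-likelihood contribution for one pair, given hypothetical ghost-pair times and random effects, assuming for illustration a common Weibull cumulative baseline hazard (ψt)^γ for both members; all names and numerical values are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_gamma_pdf(a, phi):
    """log g(alpha; phi) for the gamma frailty with mean 1 and variance 1/phi."""
    return phi * np.log(phi) + (phi - 1.0) * np.log(a) - phi * a - gammaln(phi)

def log_f_cond(t, a, psi, gam):
    """log f(t | alpha) under a Weibull cumulative baseline hazard (psi*t)^gam:
    f(t | alpha) = alpha * lambda0(t) * exp(-alpha * Lambda0(t))."""
    Lam = (psi * t) ** gam
    lam = gam * psi * (psi * t) ** (gam - 1.0)
    return np.log(a * lam) - a * Lam

def complete_loglik_pair(obs_times, alpha_obs, ghost_times, ghost_alphas, phi, psi, gam):
    """Complete-data log-likelihood contribution (8) for one observed pair and its ghost pairs."""
    ll = log_gamma_pdf(alpha_obs, phi) + sum(log_f_cond(t, alpha_obs, psi, gam) for t in obs_times)
    for u, a in zip(ghost_times, ghost_alphas):
        ll += log_gamma_pdf(a, phi) + sum(log_f_cond(t, a, psi, gam) for t in u)
    return ll

# Hypothetical illustration: one observed pair and two ghost pairs.
print(complete_loglik_pair(obs_times=(25.0, 32.0), alpha_obs=1.1,
                           ghost_times=[(55.0, 60.0), (12.0, 70.0)],
                           ghost_alphas=[0.8, 1.4],
                           phi=4.0, psi=1/45.0, gam=1.2))
```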
Evaluating performance of proposed methodology via simulations
We conducted simulations to investigate the performance of our proposed EM algorithm. We generated random effects α_1, ..., α_n from a gamma distribution with p.d.f.

g(\alpha; \phi) = \frac{\phi^{\phi} \alpha^{\phi - 1} e^{-\phi \alpha}}{\Gamma(\phi)}, \quad \alpha > 0,
where E(α)=1 and Var(α)=φ^{-1}. Bivariate failure times (T_1, T_2) were simulated from the joint distribution

S^*(t_1, t_2) = \left\{ \frac{\phi}{\phi + \Lambda_0^*(t_1) + \Lambda_0^*(t_2)} \right\}^{\phi}

with marginal survivor function

S^*(t) = \left\{ \frac{\phi}{\phi + \Lambda_0^*(t)} \right\}^{\phi},

where the cumulative baseline hazard function Λ^*_0(t) = (ψt)^γ had a Weibull form with shape parameter γ and scale parameter ψ^{-1}. We generated n=1000 pairs of failure times. The median of the marginal distribution was specified at 40. Note that T_1 and T_2 were taken to have the same marginal distribution. The true value of the parameter φ was assumed to be 1/0.25 and 1/0.75 (log φ=1.386 and 0.288, respectively), and the true value of the shape parameter γ was taken to be 1.0 and 1.2. Each failure time was truncated over the same interval B=[L,R]. The left endpoint L was taken to be 0.0 and the right endpoint R was calculated at the 95th and 90th percentiles of the true marginal distribution. In summary, we generated data under 8 scenarios, each of which was simulated M=100 times.
The value of ψ was expressed in terms of φ and γ. It was obtained from the marginal survival function by setting the median to 40, that is,

\left\{ \frac{\phi}{\phi + (40 \psi)^{\gamma}} \right\}^{\phi} = 0.5,

which gave

\psi = \frac{1}{40} \left\{ \phi \left( 0.5^{-1/\phi} - 1 \right) \right\}^{1/\gamma}.

The truncation percentiles were calculated in a similar manner. They depended on both parameters φ and γ:

R = \frac{1}{\psi} \left\{ \phi \left( (1-p)^{-1/\phi} - 1 \right) \right\}^{1/\gamma},

where P(T ≤ R) = p for p=0.95 and 0.90. Upon generating uniform random variables U ~ U(0,1) and V ~ U(0,1), the bivariate failure times were derived from the conditional survivor function

P(T_j > t \mid \alpha) = \exp\{-\alpha (\psi t)^{\gamma}\}

as

t_1 = \frac{1}{\psi} \left( -\frac{\log U}{\alpha} \right)^{1/\gamma}

and

t_2 = \frac{1}{\psi} \left( -\frac{\log V}{\alpha} \right)^{1/\gamma}.
To accommodate bivariate interval truncation, (t_1, t_2) was retained only if it belonged to the rectangle B_1 × B_2, that is, if both times satisfied their truncation requirements. This procedure was continued until n=1000 pairs were generated from the truncation region.
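The data-generation scheme just described can be sketched in Python as follows. This is an illustrative reimplementation under the stated assumptions (gamma frailty with mean 1 and variance 1/φ, common Weibull cumulative baseline hazard (ψt)^γ with marginal median 40, and a common truncation interval [0, R]); it is not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(2016)

def simulate_truncated_pairs(n=1000, phi=1/0.25, gamma=1.0, p_trunc=0.95, median=40.0):
    """Generate n pairs of gamma-frailty Weibull failure times, retained only if both
    members fall inside the common truncation interval [0, R]."""
    # Scale parameter from the marginal median: S*(median) = 0.5.
    psi = (phi * (0.5 ** (-1.0 / phi) - 1.0)) ** (1.0 / gamma) / median
    # Right truncation endpoint at the p_trunc percentile of the marginal distribution.
    R = (phi * ((1.0 - p_trunc) ** (-1.0 / phi) - 1.0)) ** (1.0 / gamma) / psi
    pairs = []
    while len(pairs) < n:
        alpha = rng.gamma(shape=phi, scale=1.0 / phi)      # E(alpha)=1, Var(alpha)=1/phi
        u, v = rng.uniform(size=2)
        t1 = (-np.log(u) / alpha) ** (1.0 / gamma) / psi    # invert exp{-alpha*(psi*t)^gamma}
        t2 = (-np.log(v) / alpha) ** (1.0 / gamma) / psi
        if t1 <= R and t2 <= R:                             # both members inside [0, R]
            pairs.append((t1, t2, alpha))
    return np.array(pairs), R

data, R = simulate_truncated_pairs()
print(R, data[:3])
```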
The generated data of paired failure times were fit assuming a gamma distribution for the random effects and a cumulative baseline hazard function of the following forms: (i) a 1-piece or exponential model (the true model when γ=1), (ii) a 2-piece model, and (iii) a 4-piece model.
The cut points were determined based on the percentiles of the true marginal distribution. That is,

q = \frac{1}{\psi} \left\{ \phi \left( (1-p)^{-1/\phi} - 1 \right) \right\}^{1/\gamma},

where P(T ≤ q) = p. For a 2-piece model, the cut point was computed for p=0.50, and for a 4-piece model, the cut points were computed for p=0.25, 0.50, and 0.75. Parameter estimates under the bivariate truncation conditions were obtained using the proposed EM algorithm with ghost pairs discussed above. Note that the crude rates of events during each of the piecewise constant intervals from our simulated data were taken to be the initial values of the piecewise constant rate parameters for the EM algorithm, and the tolerance level for convergence was specified at 1e-04.
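As an aside, one natural way to compute such crude event rates for use as starting values is sketched below (events in each piece divided by the total exposure time contributed to that piece); this is an illustrative assumption about the computation, not a description of the authors' exact choice.

```python
import numpy as np

def crude_piecewise_rates(times, cuts):
    """Crude event rate in each piecewise-constant interval: number of events falling
    in the piece divided by the total time contributed to the piece."""
    times = np.asarray(times, dtype=float)
    edges = np.concatenate(([0.0], np.asarray(cuts, dtype=float), [np.inf]))
    rates = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        events = np.sum((times > lo) & (times <= hi))
        exposure = np.sum(np.clip(np.minimum(times, hi) - lo, 0.0, None))
        rates.append(events / exposure if exposure > 0 else 0.0)
    return np.array(rates)

# Hypothetical example: simulated onset ages with a single cut point at the median (40).
print(crude_piecewise_rates([12.0, 25.0, 38.0, 44.0, 57.0, 63.0], cuts=[40.0]))
```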
Application of proposed methodology to data on siblings with schizophrenia
L.S. Penrose (1945) discussed a survey on cases of familial mental illness, conducted in 1944 by the Division of Psychiatric Research for the Ontario Department of Health. An investigation was made of all families in which two or more members had been admitted to a mental hospital in Ontario. The records date back to 1926, and thus a period of 18 years was covered. Of these families, only those in which two siblings were admitted for a diagnosis of schizophrenia were included in our study. The age at diagnosis of schizophrenia was taken to be the age at first admission to a mental hospital for schizophrenia.
The primary objective was to determine whether there was an association within pairs in the times to diagnosis among schizophrenics. Estimating the cumulative hazard function and the marginal survival function of the event times was also of interest. Various forms for the baseline hazard function were examined, including: (i) Weibull, with shape parameter γ and scale parameter ψ^{-1}, i.e.,

\Lambda_0^*(t) = (\psi t)^{\gamma},

and (ii) piecewise constant models with two, three, and four pieces, i.e.,

\Lambda_0^*(t) = \sum_{r=1}^{R} \rho_r w_r(t),

where ρ_r is the rth unknown rate parameter corresponding to the rth piecewise constant interval, and w_r(t) is the length of the intersection of the interval (0,t) and the rth piecewise constant interval. When aiming for a flexible weakly parametric model using piecewise constant rates, the selection of cut points should be considered carefully. The cut points a_1, ..., a_{R-1} are typically determined either (i) a priori (for example, if there is prior evidence indicating the event rate may begin to change after a specific time) or (ii) based on percentiles from the estimated marginal survivor function of the data. It may also be useful to begin with a small number of cut points (say 2) and then examine the effect on parameter estimates as the number of cut points increases. In our data, the breakpoints for the piecewise constant models were selected based on the estimated marginal survivor function under the Weibull fit. The two-piece model had a breakpoint at t=20; the three-piece model had breakpoints at t=20 and 40; and the four-piece model had breakpoints at t=16, 28, and 48. Upon constructing the complete data log-likelihood, the parameters of the model were estimated using the EM algorithm with ghost pairs under the conditional formulation, as proposed above; an illustrative calculation of the piecewise-constant cumulative hazard and the implied marginal survivor function is sketched below.
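The sketch below evaluates w_r(t), the piecewise-constant cumulative hazard, and the implied marginal survivor function (6); the breakpoints, rates, and φ value shown are hypothetical placeholders rather than the fitted estimates reported in Table 4.

```python
import numpy as np

def piece_lengths(t, cuts):
    """w_r(t): length of the intersection of (0, t) with each piecewise-constant interval."""
    edges = np.concatenate(([0.0], np.asarray(cuts, dtype=float), [np.inf]))
    return np.clip(np.minimum(t, edges[1:]) - edges[:-1], 0.0, None)

def cum_baseline_hazard(t, cuts, rates):
    """Lambda0*(t) = sum_r rho_r * w_r(t) for a piecewise-constant conditional hazard."""
    return float(np.sum(np.asarray(rates) * piece_lengths(t, cuts)))

def marginal_survivor(t, cuts, rates, phi):
    """S_j*(t) = {phi / (phi + Lambda0*(t))}^phi under the gamma frailty model, equation (6)."""
    return (phi / (phi + cum_baseline_hazard(t, cuts, rates))) ** phi

# Hypothetical three-piece specification with breakpoints at ages 20 and 40.
cuts, rates, phi = [20.0, 40.0], [0.005, 0.04, 0.02], 2.0
for age in (20.0, 40.0, 60.0):
    print(age, round(marginal_survivor(age, cuts, rates, phi), 3))
```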
Results from simulations
Results from the simulation study under the true values γ=1 and γ=1.2 are provided in Tables 1 and 2, respectively. The mean and simulation-based standard error are given for the estimates of the model parameters. Interest lay in examining how the average bias and standard error for φ and ρ vary with respect to (i) the number of pieces, (ii) the amount of truncation, and (iii) the true value of the variance of the random effect distribution. Recall that as the variance φ^{-1} approaches zero, failure times within a pair become independent.
Table 1 : Summary of analysis of interval-truncated bivariate failure time data under a conditional model with marginal survivor function S^*(t) = {φ/(φ + (ψt)^γ)}^φ; data generated with true shape parameter γ=1.
Table 2 : Summary of analysis of interval-truncated bivariate failure time data under a conditional model with marginal survivor function S^*(t) = {φ/(φ + (ψt)^γ)}^φ; data generated with true shape parameter γ=1.2.
From Table 1, when increasing the number of pieces in the model of the baseline hazard function, there was no notable effect on the bias of the estimates; however, the standard error increased. The same occurred as the degree of truncation increased. Furthermore, a larger value for the variance of the random effect distribution resulted in an increase in both the bias and the standard error of the estimate for φ. Table 2 showed a strong decreasing trend in the bias of the estimate for φ as the number of pieces increased. Furthermore, a rise in the degree of truncation resulted in a notable increase in the standard error of the estimate for φ. Note that the magnitude of the bias of the estimates for all parameters in Tables 1 and 2 was in fact very small, irrespective of any change in conditions, and the standardized bias also indicated that the bias was negligible throughout.
Results from application to data on siblings with schizophrenia
The data comprised 173 pairs of siblings, in which each member of a pair was hospitalized with schizophrenia between the years 1926 and 1944. The age at first hospitalization was considered the age of diagnosis (the event time). This observation was left and right truncated at the age in 1926 and the age in 1944, respectively. The ages at first hospitalization for schizophrenia for the first-born and second-born siblings are represented by T_1 and T_2, respectively. The ages in 1926 for the first-born and second-born siblings are denoted by L_1 and L_2, respectively, and the ages in 1944 for the first-born and second-born siblings are represented by R_1 and R_2, respectively. Note that in this particular case, the length of the truncation interval is the same across individuals, and thus the area of the truncation square is the same across pairs. Data for the first 20 pairs of siblings can be found in Table 3. The data are further illustrated in Figure 2. The x-axis provides the age of the first-born sibling and the y-axis represents the age of the second-born sibling. The truncation intervals for individuals in a pair create a truncation square, and their event times are given by the corresponding point, which is only observed because it falls within their truncation square.
Table 3 : Schizophrenia Data (first 20 sibling pairs).
Figure 2 : Truncation rectangles for sibling pairs with schizophrenia.
Table 4 provides the results from our analysis. The estimates and their standard errors are given for the parameters of the Weibull, 2-piece, 3-piece, and 4-piece models, along with the corresponding observed log-likelihood at the maximum. Once these estimates were obtained, the 25th, 50th, and 75th percentiles under each model were computed by solving

\hat{S}^*(t_q) = \left\{ \frac{\hat{\phi}}{\hat{\phi} + \hat{\Lambda}_0^*(t_q)} \right\}^{\hat{\phi}} = 1 - q

for t_q, where q=0.25, 0.50, 0.75. Furthermore, the natural log of the odds ratio and its standard error were computed at t=21.0 under each model. The focus on the age of 21 was determined by our clinical collaborators, who define this age to be an indicator of early versus late schizophrenia onset; it is also believed that genetic factors are strongly associated with onset of schizophrenia prior to the age of 21 [14]. Note that the natural log of the odds ratio at time t was calculated as

\log OR(t) = \log \left\{ \frac{S^*(t,t)\,\big[ 1 - S_1^*(t) - S_2^*(t) + S^*(t,t) \big]}{\big[ S_1^*(t) - S^*(t,t) \big]\big[ S_2^*(t) - S^*(t,t) \big]} \right\},

which is simply a function of the marginal and joint survival functions. The estimate of the variance of the log(OR(t)) estimator was computed using a sandwich estimator. The results for the odds ratio under all models led to similar conclusions. Specifically, under a Weibull baseline hazard model, the 95% confidence interval for log(OR(t=21)) was (0.080, 3.780), indicating a strong presence of association among diagnosis times for schizophrenia within pairs of siblings.
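To illustrate how log OR(t) is computed from the fitted quantities, the following Python sketch evaluates the expression above under the gamma-frailty model with a common Weibull baseline; the parameter values in the example are hypothetical placeholders and are not the estimates reported in Table 4.

```python
import numpy as np

def weib_cum_hazard(t, psi, gam):
    """Weibull cumulative baseline hazard Lambda0*(t) = (psi*t)^gam."""
    return (psi * t) ** gam

def marginal_surv(t, phi, psi, gam):
    """S_j*(t) under the gamma frailty model, equation (6)."""
    return (phi / (phi + weib_cum_hazard(t, psi, gam))) ** phi

def joint_surv(t1, t2, phi, psi, gam):
    """S*(t1, t2) under the gamma frailty model, equation (5), with a common baseline."""
    return (phi / (phi + weib_cum_hazard(t1, psi, gam) + weib_cum_hazard(t2, psi, gam))) ** phi

def log_odds_ratio(t, phi, psi, gam):
    """log OR(t) based on the 2x2 cross-classification of (T1 <= t) and (T2 <= t)."""
    s1 = s2 = marginal_surv(t, phi, psi, gam)
    s12 = joint_surv(t, t, phi, psi, gam)
    both_late, both_early = s12, 1.0 - s1 - s2 + s12
    discordant = (s1 - s12) * (s2 - s12)
    return np.log(both_late * both_early / discordant)

# Hypothetical parameter values (NOT the Table 4 estimates), for illustration only.
print(log_odds_ratio(21.0, phi=1.5, psi=1/40.0, gam=1.8))
```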
Table 4 : Results of the schizophrenia data analysis under a conditional formulation.
The results in Table 4 were based on the joint modeling of paired failure times. For comparison purposes, however, similar measures may be obtained based on independence assumptions among event times within a pair. With a naive matched-pairs analysis under a binary response (age of onset ≤21 or age of onset >21), the estimate of log(OR(t=21)) was 0.788 with standard error 0.381, and the 95% confidence interval for log(OR(t=21)) was (0.035, 1.525). Furthermore, McNemar's test of marginal association for such matched-pair data gave a chi-square test statistic of 3.781 with a resulting p-value of 0.052. These naive results agreed with our more formal analysis, indicating a notable presence of association among diagnosis times for schizophrenia within pairs of siblings.
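For readers wishing to reproduce this kind of naive comparison, a minimal sketch of the matched-pairs log odds ratio and (uncorrected) McNemar statistic computed from the discordant-pair counts of a paired 2x2 table is given below; the counts shown are hypothetical, since the underlying 2x2 table is not reproduced here.

```python
import numpy as np
from scipy.stats import chi2

def naive_matched_pairs(b, c):
    """Conditional (matched-pairs) log odds ratio and McNemar's test from the
    discordant-pair counts b and c of a paired 2x2 table."""
    log_or = np.log(b / c)                 # conditional log OR estimate
    se = np.sqrt(1.0 / b + 1.0 / c)        # its large-sample standard error
    chisq = (b - c) ** 2 / (b + c)         # McNemar's chi-square statistic (1 df, uncorrected)
    pval = chi2.sf(chisq, df=1)
    return log_or, se, chisq, pval

# Hypothetical discordant counts for illustration only.
print(naive_matched_pairs(b=31, c=14))
```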
An illustration of the performance of the piecewise constant models can be seen from the plot of the estimated marginal survivor functions given in Figure 3. The 95% confidence bands are also provided under the Weibull fit at years 20, 40, and 60. The results from the piecewise models agree with those from the Weibull model. However, during the first two decades, in which very few events are observed to occur, the four-piece model provides a more suitable fit than the two-piece or three-piece model.
Figure 3 : Estimated marginal survivor functions under various fits to the schizophrenia data based on a conditional model.
Although the piecewise model fit could be carried out using direct maximization, the EM algorithm was used to avoid divergence problems often arising when there is a large number of pieces. In the case of a 2-piece model, the computing time using a standard maximization function was faster than implementing an EM approach. However, if one wanted to fit an 8-piece or 10-piece model, the EM algorithm was attractive since it did not require obtaining the derivative functions, and thus avoided possible problems with convergence.
This paper was concerned with inference for bivariate failure time data that are interval-truncated. A conditional (random effect) formulation was discussed for modeling the marginal and joint distributions of the failure times, where particular interest lay in measuring the association within pairs. Under the conditional model, the marginal distributions were determined by both the variance parameter of the random effect distribution and the parameters of the conditional hazard function. To minimize the need for strong distributional assumptions and facilitate a likelihood analysis based on parametric models, the baseline hazard functions were assumed to be of a piecewise constant form.
The literature offers limited approaches to estimation when both failure times in a pair are observed subject to interval truncation. This paper developed an EM approach to estimation based on a conditional formulation. The idea of "ghosts" was introduced in the bivariate setting so that the complete data log-likelihood could be easily maximized. The formulation was convenient to implement since the E-step was relatively straightforward, and it also provided closed form expressions for the parameter estimates of the piecewise constant conditional hazard function in the M-step. Results from the simulation studies indicated that all model parameters could be estimated with negligible bias in finite samples. The data simulated for Table 1 were based on an exponential baseline hazard function (as the shape parameter γ=1), which is why more pieces were not needed to improve model fit. On the other hand, as the baseline hazard function was not constant for the data simulated in Table 2, we see that increasing the number of pieces provided a better approximation to the true marginal survivor function. In the motivating schizophrenia data set, our proposed model and method of estimation revealed a strong presence of dependence among schizophrenia diagnosis times within pairs of siblings. In addition, since the rate of onset for schizophrenia was quite low in the early years, the four-piece model was more sensitive to and able to detect this shift, thus providing different results than the two-piece and three-piece models. Moreover, as the estimate of the shape parameter under the Weibull distribution was much greater than 1.0, it was reasonable that increasing the number of pieces provided a better fit, as seen in the plots of the estimated marginal survivor functions in Figure 3.
Although we considered the scenario where there was no censoring, our proposed methods could be naturally extended to handle the case where the event times were interval-censored within their respective truncation intervals. We also considered the scenario in which there were paired failure times; our conditional approach could be easily extended to handle a multivariate setting with more than 2 individuals within a group.
The authors declare that they have no competing interests.
Authors' contributions | RS | RJC |
Research concept and design | √ | √ |
Collection and/or assembly of data | √ | √ |
Data analysis and interpretation | √ | √ |
Writing the article | √ | √ |
Critical revision of the article | √ | √ |
Final approval of article | √ | √ |
Statistical analysis | √ | √ |
This work was supported by the Institute for Clinical Evaluative Sciences. Richard Cook is a Canada Research Chair in Statistical Methods for Health Research. We would also like to thank Dr. Janice Husted from the University of Waterloo, Canada, and Dr. Anne Bassett from the University of Toronto, Canada, for their valuable input and guidance regarding the data.
EICs: Jimmy Efird, East Carolina University, USA.
Max K. Bulsara, University of Notre Dame, Australia.
Received: 17-Nov-2015 Final Revised: 31-Dec-2015
Accepted: 13-Jan-2016 Published: 29-Jan-2016
Copyright © 2015 Herbert Publications Limited. All rights reserved.