Joint modeling of longitudinal and event time data: application to HIV study

Many clinical studies generate a dataset having longitudinal repeated biomarker measurement data and time to an event data, which often depend on each other. In such studies, characteristics of the pattern of a biomarker change, and the association between the primary survival endpoint and features of the longitudinal profiles are commonly of interest. Often separate analyses using a mixed effects model for the longitudinal outcome and a survival model for the time to event outcome are performed. However, separate models are overly simplified because they do not consider the association between two components of such data and so produce misleading conclusions. An alternative approach is two-stage modeling which allows a separate biomarker pathway for each patient but the parameter estimates are still biased. Joint modeling is the most sophisticated complex approach but enables the repeated biomarker measurements and survival processes to be modelled while accounting for the interrelationship between the two processes. We demonstrate the use of joint modeling in analysis of an HIV dataset with CD4+ count measurements and survival time. In the joint model, we combine a linear Gaussian random effects sub-model for the repeated CD4+ count measurements and Cox or Weibull survival sub-model, linked through their shared dependence on the latent variable. Our study showed that the hazard rate of death depended on the longitudinal progression of CD4+ counts, i.e., a patient’s baseline CD4+ count and the rate of change in CD4+ counts significantly impact on his or her survival time.


Introduction
Patients in many clinical studies are followed over a period of time and biomarkers are collected repeatedly at multiple time points. The data also indicates the time, at which an event of particular interest occurred, eg., relapse of a disease or death. For example, in HIV studies patients are measured for biomarkers such as the CD4+ lymphocyte count or the viral RNA (viral load) until specific outcomes such as AIDS or death. Other applications include cognitive performance and survival in geriatric studies, systolic blood pressure and a coronary event, prostate-specific antigen biomarker and prostate cancer recurrence, and hemoglobin level and survival in type 2 diabetes [1][2][3][4][5][6][7]. In such studies, a common primary objective is to investigate the effect of treatment on survival and/or biomarker process. But patterns of change in a biomarker or the association between the primary endpoint and features of the longitudinal profiles are also of interest. The traditional approach uses two models separately, a random effects model that describes the process of the repeated measurements over time and a survival model that examines time to the event. However, separate models are overly simplified because they do not consider the association between the two components of such data, and in failing to take into account all the available information in an integrated manner, provide invalid inferences.
One of the alternative approaches is two-stage modeling. In the first stage of two-stage modeling, the repeated biomarker measurements are modeled using a random effect model so that subject-specific values of the covariate may be estimated. The modeled value of individual prediction from the random effects model is substituted into the survival model as timevarying covariate in the second stage. This two-stage approach accounts for some of the measurement error and allows the pathway of the covariates for each patient to be estimated [8]. However, parameter estimates of the two-stage model are still biased and inefficient, even though much of the bias of the uncorrected values is reduced [9]. Since either separate models or the two-stage approach is inappropriate when the longitudinal repeated measurements and survival outcome have an association, others have proposed joint modeling and its extensions, which have obtained great attention in the literature [9][10][11][12][13].
Joint modeling is a sophisticated, complex approach but it enables both longitudinal repeated biomarker measurements and survival processes to be modelled together while taking account of association between them. By including the random effects model for longitudinal data in the survival model, the patterns of a biomarker's performance and the relationship between its progression and survival time can be characterized. In this way, joint modeling provides less biased estimates and more efficient inferences than separate models or the two-stage approach [9,12,13]. The main advantage of joint modeling is that the effect of a covariate on the longitudinal process can be separated from its effect on survival. Modeling of a covariate effect (eg., treatment) can be described and visualized as in Figure 1.

Medical Statistics and Informatics
For jointly modeling longitudinal data and event time data, many have proposed a linear or non-linear random effects model for longitudinal measurements and a semiparametric or parametric survival model for event time data, where a set of random effects is assumed to induce their interdependence [9,[14][15][16][17][18]. Wulsohn and Tsiatis [9] proposed a two-step procedure for fitting their model in HIV study (Wulsohn-Tsiatis model). First, they assumed a growth curve random components model with normal errors for true CD4+ counts using the modified EM algorithm for estimation. They then substituted these estimates into the Cox proportional hazards model to obtain estimates of the survival parameters. Other authors have also assumed piecewise constant hazards, parametric Weibull, or accelerated failure time models [17,[19][20][21]. Henderson et al., developed a more general version of the Wulsohn-Tsiatis model and proposed the use of an unobserved or latent bivariate Gaussian process to link longitudinal data and survival data, assuming that the longitudinal and event processes were conditionally independent given the one process covariates [15,22].
Our paper focuses on the application of available methodologies and on interpretation of the results when these are utilized in the analysis of an HIV dataset. In our HIV study, the main objective is to use longitudinal CD4+ measurements to improve prediction of survival prognosis. In the next section, we introduce a description of our HIV study. In Section 3, methods of joint modeling for analysis are reviewed. In Section 4, the conventional separate models and the joint models are applied to the HIV dataset to measure the relative effects of covariates. We examine whether the baseline CD4+ value and the slope of CD4+ counts reveal association with mortality risk. It should be noted, however, our intent is to illustrate the application of joint modeling rather than give universally valid estimates for HIV study outcomes. Section 5 contains a discussion about the results and conclusions.

Study description
Clinical background CD4+ counts and viral load are considered important biomarkers of HIV disease progression. CD4+ cells are an important part of the immune system, which begin to deplete as the virus infects the body; the amount of decline indicates the degree of immunosuppression. Thus CD4+ count is considered as the primary indicator for prognostic information and a guide for antiretroviral therapy in HIV infected individuals [23,24]. Successful treatment for HIV patients depends on the therapy's ability to restore immune functions. Studies have shown that disease progression among patients with HIV infection is delayed substantially when CD4+counts increase in response to treatment [24,25]. Identification of factors influencing disease progression is vital to effectively care for patients and to improve their survival and quality of life. Detection of not only clinical features and risk factors but also characteristics associated with a more rapid progression of CD4+ counts can help identify patients who may benefit from closer and more frequent clinical follow-up and earlier treatment intervention [24]. We anticipate that changes in biomarkers (CD4+ counts or viral load) will have an impact on the risk of major HIV complications.

Study population
This study was a retrospective, longitudinal case-based investigation. Individual CD4+ count profiles of 10 randomly selected patients are shown in Figure 2. Plotting of CD4+ counts was grouped according to their date of measurement. The first measurement done within 180 days after study enrollment (first clinic visit) was considered the baseline value. We created a person-period data set for each patient covering 3-month intervals for the first year and then 6-month intervals afterward as CD4+ counts are generally measured every 3 ~ 6 months in clinical practice as the standard of medical care. Our choice of the 3-month or 6-month period for CD4+ counts was supported by clinical practice guidelines and CD4+ data availability in our study. When the CD4+ count was not observed during any particular interval period, we considered the value as missing. We did not anticipate that missing CD4+ counts would depend on the nature of the unobserved CD4+ counts, and thus they are assumed missing at random. For any patients with 2 or more CD4+ counts in a given interval, the average of those CD4+ count measurements was used for that interval. Subjects were followed from the time of entry into the study until death, loss to follow-up, or the end of the study (August 31, 2011). CD4+ count measurements were limited to 12 per patient in the subsequent analysis as very few subjects had more than 12 measurements. The maximum of 12 measurements is a practical limitation of the data, rather than a methodological one. The other outcome variable considered was time to death, which was defined as time between date of entry into the cohort and date of death, date last seen, or the end of the study. In our study, a 0.05 alpha level was used for statistical significance. Analyses in this study were performed using SAS version 9.2 (The SAS Institute, Cary, NC) or R package. The study received ethical approval from the University of Saskatchewan institutional review board committee.

Models
Let T be a random variable representing survival time that has the distribution function, F(t). The survival function S(t) at time t is defined to be the probability that the survival time is greater than t, where S(t) = P(T > t) = 1 -F(t). Let T i be the time when the event occurs for the ith subject and C i be the corresponding censoring time. When T i is subject to right censoring, X i is the observed time which is a minimum of (T i , C i ), i.e., X i is equal to T i if the event was observed and is equal to C i if it was censored. Let i an indicator function and takes the value 1 when T i ≤ C i and 0 when it is otherwise.
Let Z ij be the observed measurement for the ith subject at the jth time point, where i = 1, …, n and j = 1,… , K i . Assume that individual i have K i measurements of the covariate over time and the number of measurements can be different for each individual. The covariate is measured at time A joint model is comprised of two sub-models, one for the longitudinal process Z i (u) and the other for the failure time T i . We assume that the model for the longitudinal process and the hazard for survival time are usually taken to depend jointly on shared, underlying random effects. In the first step of modeling, evolution of the longitudinal biomarker measurements are estimated separately using a random effects model of the linear/polynomial function of time. With this approach, we can estimate changes in biomarkers for each patient at the time of death or censoring. In the second step, a joint modeling approach with Cox or Weibull parametric hazards model is utilized.
The subject-specific random effects model has become a popular tool to analyse various types of longitudinal data as it adequately describes the pattern of repeated measurements over time. In this model, the average progression is described using some function of time, and subject-specific deviations from this average evolution are introduced by using random effects [26,27]. We assume that the longitudinal measurement of the covariate Z i follows a linear mixed effects model (growth curve model) with random intercept θ 0 and random slope θ 1 . So, at times t ij , the linear mixed effect has the structure, where α is the regression coefficient of covariate vector W. θU i incorporate subject-specific random effects and is sometimes called the trajectory function of the model. ε ij 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42  is a random error term that accounts for the unexplained variation in the data. ε i is assumed to be normally distributed with N(0, 2 ε σ ) and cov (ε ij , ε il ) = 0, where j ≠ l. We also assume that the error ε ij is independent of the random intercept θ 0i and random slope θ 1i . The intercept and slope, θ i = (θ 0i , θ 1i ), are typically assumed to be random and have a multivariate normal distribution as iid N(θ, Σ) Here the random effect θ i account for the heterogeneity among the subjects and describes how the profile of the ith subject deviates from the average profile. Conditional on bivariate Gaussian latent variable θ i = (θ 0i , θ 1i ) associated with subject i, the corresponding longitudinal measurements Z ij at time t ij are mutually independent and The coefficient α assesses the covariate effect (for example, treatment) on the longitudinal biomarker. Possible extensions from a linear mixed effects model allow for more complex relationships such as polynomial specifications of Z i and W i . The longitudinal and survival components of the joint model are linked through the random effects θ. Assuming the true longitudinal measurement value is given by the growth curve model with random intercept (θ 0i ), random slope (θ 1i ) and the covariate W i , the Cox proportional hazards model is where h 0 (t) is an unspecified baseline hazard and α is the direct covariate effect on survival time. The parameter β measures the association between the longitudinal biomarker and survival time. The form of β (θ 0i + θ 1i t ij ) is subject-specific covariate effects which is often called a random effect frailty model [28]. The survival and growth curve densities in this model are both conditional on the unobserved random effect. The hazard function of the survival model can also be a parametric model such as an exponential or a Weibull model. The Weibull hazard model (η, µ i ) at time t is given as , which is an at risk indicator. Note that if η < 1, the hazard is decreasing and if η > 1, the hazard is increasing in t. When η = 1, the hazard is a constant which survival time reduces to the exponential distribution.
The likelihood method is a widely used approach for the parameter estimation in the joint model [9,10,16,18,21]. Assuming that censoring and timing of longitudinal measurements are non-informative, the likelihood function L with the observed data for each subject (X i , i δ , Z i , t i ), is given by [9].
The parameters θ, V, 2 ε σ , and β are estimated using parametric likelihood and h 0 (u) using nonparametric maximum likelihood. To fit the joint models, the EM algorithm is used for likelihood inferences [29]. The EM algorithm which requires twodimensional numerical integration corresponding to the dimensionality of θ = (θ 0 , θ 1 ) can be used to estimate the parameters of interest by maximizing the likelihood of the observed data. Monte Carlo method in the E-step to approximate the conditional expectation and the Newton-Raphson method in the M-step is used to estimate the parameters [9,10]. The closed-form maximum likelihood estimates (MLE) are The complete parameter estimation of the EM algorithm was provided in Wulfsohn and Tsiatis's paper [9]. Baysian sampling method using MCMC technique was also used to estimate the parameters [20,30].

Results
A total of 321 HIV infected patients were eligible for the study ( Table 1). Of them, 165 (51.4%) were males and 216 (71%) were Aboriginal. The mean age was 35 years (SD = 10.5). At baseline, 30% had a history of incarceration, 79% had Hepatitis C antibodies (HCV), 81% had history of injection-drug use (IDU), and 58% had ever been on antiretroviral therapy (ART). The mean CD4+ count at baseline was 388 cells/mm 3 (SD= 229; median=353; interquartile range: 234-536); the mean viral load was 4.3 log 10 copies/ml (SD= 1.0; median=4.4; interquartile range: 3.9-4.9). The median follow-up time was 20 months (interquartile range: 9-36 months) and 25 (7.8%) patients died in the study cohort. The overall mean CD4+ counts measurements showed decreasing trend. The overall survival probability was 0.97 at 1 year, 0.95 at 2 years and 0.92 at 3 years, respectively. To normalize CD4+ counts, we used the square root of the CD4+ counts in the models. Figure 3 shows the observed mean CD4+ counts profile as a function of time within separate ART strata. The ART treatment group started with a lower mean CD4+ count than the ART naïve group, but the ART naïve groups had a decreasing mean response while the ART treatment group mean was sustained. The mean response towards the end of the study was unstable because of the small number of patients. At each time point, mean responses were calculated only among those patients who have not yet dropped out of the study. The correct interpretation of the two mean response profiles is subtle because of the potential interdependence between a patient's predisposition to drop-out and their associated CD4+ counts. From the univariate model for CD4+ count measurement, HCV and ART treatment were significant (p-values are 0.02 and <0.0001, respectively) while gender, age, ethnicity, clinic site, history of incarceration, and history of IDU were not. Since HCV and ART treatment were significantly associated with longitudinal CD4+ counts measurement, these covariates were included in the random effect model. The random effect model for longitudinal CD4+ count repeated measurements is:

Results of random effects model
The results of the random effect model using the unstructured covariance matrix Σ are summarized in Table 2. The estimated average regression coefficient of Time for the ART treatment naïve group is −0.088 (95% CI: -0.14, -0.04; p=0.0008), suggesting a significant decrease in CD4+ count over the   study period. Average regression coefficient for the ART treatment group is 0.001 and indicates that CD4+ did not decrease over the study period. However, the difference in slope between two groups is statistically significant (95% CI: 0.03 -0.15; p=0.005). Patients who were co-infected with HCV at study entry had significantly lower CD4+ counts than those without HCV co-infection (p=0.002). Figure 4 shows the Kaplan-Meier survival curve estimates by ART groups. The two groups are similar during the first nine months but subsequent survival in the ART treatment group is significantly higher than in the ART naïve group.    were included in the model because both were significantly associated with survival time as well as CD4+ repeated measurements. No interaction was observed between ART treatment and HCV. The Cox regression model in the absence of random effects is

Results of survival model
Under this Cox model, as expected, the ART treatment group had significantly lower hazard rates than the ART naïve group after controlling for HCV (HR=0.34; 95% CI: 0.145 -0.786; p=0.01). HCV-positive status also resulted in a worse survival than HCV-negative status in the adjusted model, but the difference is marginally significant (p=0.07). The results from the Cox model are similar to the results obtained from the Weibull model. In the Weibull model, the relative hazard rate for patients with ART treatment to ART naïve is exp(-1.109) = 0.33 as compared to 0.34 under the Cox model ( Table 2).

Results of joint model
Assuming the CD4+ longitudinal value is given by the growth curve model with random intercept and slope, the Cox proportional hazards model for subject i is where h 0 (t) is either unspecified or Weibull baseline hazard. The estimated association parameter in the joint model is -0.102 and is statistically significant in the Cox sub-model (p=0.001). This indicates that there is strong evidence of association between two sub-models. Further investigation of this association showed that both initial level and slope of CD4+ count is negatively associated with the hazard of death (the intercept and slope estimates are -0.06 and -4.45, respectively). Under this joint model, a patient's survival is associated with his/her longitudinal data pattern of the rate at which CD4+ decrease ( Table 3). The data show that both HCV and ART are associated with CD4+ count over the study period. The rates of CD4+ change between ART groups are significantly different, suggesting a significant decrease in CD4+ count for the ART naïve group compared to the ART group over the study period after adjusting for HCV (p=0.0001). This finding is clinically predictable since patients with more rapid decline in CD4+ would have poorer survival. As expected, the ART treatment was significantly associated with survival after taking into account estimated CD4+ count changes (p= 0.001).
With the Weibull sub-model, most results are similar to those determined from the Cox sub-model (

Discussion
In our study we addressed the relationship between the trajectory of CD4+ counts over time and the risk of death among HIV patients using joint modeling with a longitudinal random effects sub-model and a Cox or Weibull survival submodel. The joint model revealed that the hazard of death depended on a longitudinal process of an intercept and a slope, i.e., a patient's baseline CD4+ count and the rate of change in CD4+ counts significantly impact on his or her survival time. This observation suggested that the rate of CD4+ count change confers risks of mortality, even after adjustment for HCV co-infection as a risk factor. The parameter estimates of the joint model were different from those of the separate models; this difference might be due to the model correction accounting for the correlation between the longitudinal CD4 + counts and survival time.
Simple methods like separate modeling or two-stage modeling are useful when we explore optional models and select potential covariates. With simple methods, if survival for patients is far longer than the last follow-up time, interpretation should be cautious because extrapolation of the predicted value beyond the observed date may be risky. However, the joint model is able to account for the healthy survivor effect by exploiting these random effects, which account for the dispersion among individual times to death [10,33]. As long as there is sufficient information on the longitudinal repeated measurements, the estimates of the joint model are robust and efficient [Hsieh 2006]. Thus, even if the association with the longitudinal measurements is not of interest, under heavy censoring a joint modelling approach exploits the association to give more efficient inferences for parameters of the survival distribution [34].
As Henderson et al., [22] mentioned, joint modeling is flexible methodology for handling combined longitudinal and event history data. But when the goals of a study concern population-level inferences, and especially when the timeto-event process is not of direct interest, random effect joint modelling may be an over-elaboration [34]. When the focus of interest is the time-to-event process, joint modelling enables longitudinal measurements on each subject to be used as timevarying explanatory variables while recognizing that they may be measured with a non-negligible error. When the association parameter between the longitudinal and survival data is not significant, the joint model analysis should have the same results as would be obtained from separate analyses for each component. However, joint modelling is a valuable technique not only in its efficient use of all available data and its ability to obtain accurate inference, but it is highly recommended when survival time and longitudinal measurements have the same clinical meaning. It is especially applicable to problems involving biomarkers where the focus is on using longitudinal measurements to improve prediction of survival prognosis. If survival time and longitudinal measurements have a different clinical meaning or are not comparable, joint modeling is not appropriate because the result may lead to the conclusion that a covariate effect with worse survival is superior [35].
One of the concerns many authors have raised in making inference on the longitudinal process is that occurrence of the event may induce an informative censoring [36][37][38][39]. Subjects with advanced disease may have fewer CD4+ count measurements because they experience death earlier, potentially leading to biased estimation. Valid inference requires a framework in which potential underlying relationships between the death event and the longitudinal process are explicitly acknowledged [10]. The joint model assumes that censoring is independent of the random effects. But if the underlying relationship and censoring process that leads to drop-out is not correctly modeled, the estimate of parameters may be biased [9,20]. Much more work remains to be done for joint models with informative censoring and missing data.
A limitation of this study is the short duration of follow-up time, which might affect the estimates of the covariates. In the data, the median follow-up was 20 months and only 8% of the study patients died. When the follow-up duration is not long enough, it has an impact on the number of CD4+ measurements, possibly leading to less reliable estimation of the random effect model [40]. The second limitation is that our study was a retrospective cohort, limiting the availability and quality of data collected. We only included HIV positive individuals who attended clinics. Thus the results are likely biased towards a more stable study population and could be potentially biased against rapid disease progressors. For subjects with 2 or more CD4+ values in a given interval, the average of those CD4+ values for that interval was used for simplicity. We did not anticipate that missing CD4+ counts would depend on the unobserved CD4+ counts and thus we assumed missing at random in order to analyze the data. We used generalized linear mixed models which are valid under the assumption of missingness at random [27,41] and believe this assumption to be appropriate for our study population. In our study, a linear growth function was also assumed for CD4+ count trajectories so that the characteristics of intercept and slope could be related to the survival outcome; however, a linear trend may not be adequate to describe the time course of the CD4+ counts. If the CD4+ count trajectories were not linear, model misspecification may lead to biased results. With no general, user-friendly commercial software available as of yet for this technique, the use of joint modeling is limited in practice.
In summary, the analysis of the data using a joint model with latent variables that link the longitudinal models and the survival models together will result in unbiased and more efficient estimates. Joint model is intuitive and enable prognostic information to be collected from longitudinal measurements. With such prognostic information, clinicians make better informed decisions for specific patients based on their longitudinal biomarkers of disease. Thus we recommend that the joint modeling of CD4+ count progression and survival time should be performed to obtain a clear picture of the effect of specific covariate. This model provides useful predictive information on future CD4+ counts when the current value and the rate change of CD4+ count are known. As a flexible modeling approach, joint modeling can be used for patientspecific treatment strategies and future clinical interventions.