Joint Modeling Analysis of Multivariate Skewed-longitudinal and Time-to-event Data with Application to Primary Biliary Cirrhosis Study

Background: Many clinical and public health researches collect data including multiple longitudinal measures and time-to-event outcomes, where characteristics of the pattern of exposure change and the association between features of longitudinal biomakers and the primary survival endpoint are of interest. Methods: Many existing statistical models for longitudinal-survival data might not provide robust inference when more than one longitudinal exposures which were significantly correlated and longitudinal measurements exhibit skewness and/or heavy tails; ignoring these data features may lead to biased estimation. In this article, we offered a multivariate joint model with the skew-normal (SN) distribution with application to the Mayo clinic primary biliary cirrhosis (PBC) study to assess simultaneous effects. Results: With the multivariate joint modeling associated with the skew-normal (SN) distribution, the subject-specific baseline (HR=2.390 with 95% CI: (1.429, 4.112)) and change rate (HR=2.588 with 95% CI: (1.845, 3.967)) of Bilirubin in natural log scale were positively associated with the risk of death; the higher the subject-specific change rate (HR=0.191 with 95% CI: (0.037, 0.915)) of Albumin in natural log scale was associated with a decrease in mortality rate; the subject-specific of SGOT levels in natural log scale did not affect the risk of death for PBC patients significantly. The results of the skewness parameters of natural log-transformed Bilirubin (δ1=0.42), Albumin (δ2=−0.03) and SGOT (δ3=0.095) were estimated to be significant, indicating the skewness of three biomarkers existed. Conclusions: Our results revealed the Bilirubin and Albumin levels may be involved in predicting risk of death for PBC patients, except for SGOT. The multivariate joint modeling associated with SN distribution provides better fit to the data, gives less biased parameter estimates for those longitudinal biomarkers in comparison with its counterpart where the normal distribution is assumed (data not shown here). The introduced modeling approach is generally applicable to other situations where longitudinal measurements and time-to-event outcomes are available.


Introduction
Primary biliary cirrhosis (PBC), recently known as primary biliary cholangitis, is a relatively rare disease caused by inflammatory destruction of the small bile ducts from the liver. Eventually, this pressure build-up will harm the bile ducts leading to liver cell damage and cirrhosis. The cause of PBC is unknown, but because the body's immune system attacks its own cells, it is most likely thought to be an auto immune disease. In this disease, the bile ducts are under attack and are destroyed [16]. Women are more likely than men to have PBC, it is most often in the woman above the age of 40 [26,39].
Mathematical models based on PBC clinical study have been developed to predict disease progression. Cox proportional hazards model for survival analysis was performed to identify the two significant biomarkers, Alkaline Phosphatase and CrossMark ← Click for updates doi: 10.7243/2053-7662-9-2 Serum Bilirubin, regarding the risk of an event (death or liver trans-plantation) for patients diagnosed with PBC [26,27]. An increase in the levels of these biomarkers was positively associated with the hazards of PBC patients. Typically, serum bilirubin concentration was the best prognostic biomarker from all the other laboratory measurements. When the serum bilirubin exceeded 6.0 (mg/dl), the survival time was around 25 months [40]. Many other risk covariates such as age, sex, ascites, prothrombin were also used to prognostic models for PBC. However, there were several potential limitations of the previous approach-based survival analysis when the covariates were repeatedly measured over time. Firstly, only capture the biomarker observation until a certain time point, mean value or at particular time point was taken into account, but not all observations over time were take into account; using only one observation of the biomarker obviously discarded useful information about the biomarker and its trajectory. Secondly, the inter relationships between longitudinal and time-to-event processes were ignored. Thirdly, longitudinal growth trajectory as a time-varying covariate was not fully considered to assess the effect of longitudinal measures on the risk of the event. The longitudinal processes will affect the hazard of survival; therefore, an appropriate statistical model is needed to capture the unobservable quantities of the growth profile and overall growth trajectory for association with risk of death for PBC patients. The joint modeling had been applied to remedy the deficiencies.
The joint modeling approach can capture the rate of change in the biomarker levels, which contains the differences between patients and also the difference over time for the same patient. Joint modeling of longitudinal and survival data is an active area of scientific fields such as in biology, biomedical and clinical research, since it allows simultaneous analysis of longitudinal (repeated) measurements and time-to-event (survival) outcome [4,17,23,34,36,44,46]. For example, Allen et al. [2] inspected the relationship between five longitudinally collected cytokines (Interleukin (IL)-6, IL-8, growth-related oncogene-1 (GRO-1), vascular endothelial growth factor (VEGF), and hepatocyte growth factor (HGF)) measured from serum plasma and survival, focused on whether the values of these multiple cytokines were associated with survival. Survival outcome is always associated with multiple longitudinal outcomes. Important features in clinical studies of this type are that there might be a relatively large number of biomarkers [1,5,7,8,14,24,28,32,33,37,42,43], these biomarkers are subject to sizable measurement error due to laboratory error and biological variation and which may be significantly correlated, like the PBC study from the Mayo clinic [31], patients with PBC were followed longitudinally and multiple longitudinal biomarkers were measured. The PBC data collected at the Mayo Clinic between 1974 to 1984 [31] have been widely analyzed using joint modeling methods [1,3,9,11,18,12,13,35]; researchers that risk of death was significantly impacted by the logarithm of serum bilirubin levels. However, multiple longitudinal biomarkers were collected in the PBC study, it was important to understand the relationship among those biomakers' growth over time and the risk of PBC death. The adoption of a multivariate joint modeling (MVJM) to reassess the impact of multiple serum levels on the risk of death will reduce potential bias imposed by ignoring the correlation be-tween the longitudinal exposures pervading in the more commonly used univariate joint modeling (UVJM) approach. Joint modeling considering multiple longitudinal biomarkers of prognosis simultaneously can provide more accurate prediction of survival, modeling their interrelationship, correlation and uncertainty. Moreover, in traditional linear mixed-effects models, random errors are often under a normality assump-tion due to the mathematical tractability and computational convenience. However, normality assumption may not be realistic. Alternatively, skew-normal (SN) distribution should be more appropriate to model the skewed data [22,38]. The aim of this paper is to demonstrate an introductory overview on MVJM approach for longitudinal exposures and time-to-death in a specific application to Mayo Clinic PBC data, which enables fitting of such model have high dimensional longitudinal exposures. Here, we examined the association of the three longitudinal biomarkers serum bilirubin (SB), serum albumin (SA) and serum Glutamic-Oxaloacetic transaminase (SGOT) with skew-normal (SN) distribution (i.e., estimated bilirubin at baseline and change rate over time) with the risk of death simultaneously; the result of UVJM with SN distribution was presented in Supplemental for comparison.

Primary biliary cirrhosis (PBC) background and datasets
This dataset is from Mayo Clinic trail in PBC of the liver conducted between 1974 to 1984 [15,31]. The data were collected to examine the progress of PBC patients. A total of 424 PBC patients met the eligibility criteria for the randomized placebo controlled trail of the drug D-penicillamine, referred to Mayo Clinic during that ten-year interval. This dataset contained multiple laboratory results, but only on the first 312 patients in the dataset participated in the randomized trail and obtain large complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival.
The original clinical protocol for these patients specified visits at 6 months, 1 year, and annually thereafter. This is an ideal dataset to illustrate the various features of MVJM. The dataset collected clinical, demographic and biochemical risk factors for each patient. Demographic factors: age and sex of patients; biochemical factors: drug (D-penicillin and placebo group), ascites (accumulation of wa-ter in the doi: 10.7243/2053-7662-9-2 abdomen due to liver failure, presence of ascites 0=No, 1=Yes), hepatomogaly (liver growth status, presence of hepatomogaly 0=No, 1=Yes), edema (presence of edema, 0=No, 1=Yes: edema present without diuretics or edema despite diuretic therapy) and histological stage(≥3 is yes).
SB (mg/dl), SA (mg/dl) and SGOT (U/ml) values were taken as biochemical properties. Sample sizes and descriptive statistics for key variables at study entry were shown in Table 1. Only the patients who had three or more measurements of all the biomarkers were considered in the analysis due to the feature of our longitudinal modeling (quadratic of year was included). Thus, 259 patients who had sufficient repeated biomarkers observations were included in the analysis, where 111(42.9%) died during the study. Table 1 summarized the demographic characteristics of patients in whom died and censored. Of the 259 patients in the PBC dataset, mean age at baseline of this PBC dataset was 45.53±10.42 years and mean age of death for PBC patients was 52.50±10.30 years. Baseline value of natural logarithms of SB, SA and SGOT were 0.47±0.95(mg/dl), 1.21±0.16(mg/ dl) and 4.70±0.45(U/ml), respectively. Figure 1 displays the histogram of repeated SB, SA and SGOT measurements(in log scale) for 259 subjects enrolled in Mayo Clinic trail study [15,31]. It is seen that for this data set to be analyzed in this paper, these longitudinal data (even after logtransformation) are relatively skewed, and thus a normality assumption may be violated.

Longitudinal exposures and survival outcome
Patients with PBC had abnormalities in several blood tests, such as elevated levels of SB. Several laboratory tests had a baseline measurement and were followed longitudinally at 6 months and at yearly intervals thereafter. Data collected at each lab visit include: total SB, SA, SGOT, gender, presence of ascites and other covariates for the patients. Due to the skewness of the observed biomarkers, specially SB, we took the natural logarithm of them and used the logarithms of those biomarkers for the remainder of this analysis. Traditionally, the longitudinal measurements are modeled using the linear mixed-effects model for a continuous and normally distributed outcome [25]. The subject-specific   random-effects in this mixed-effects model are included in the relative risk model. The various extensions under this topic have recently been investigated including, but not limited to, Dai and Pan [10] introduced nonparametric random mixture effect models to solve such joint modeling problems; Huang et al. [19,20,21,30] considered the various mixed-effects models with skewed distributions-based joint analysis for skewed-longitudinal and survival data. The random-effects not only account for the association between the longitudinal and survival out-comes, but also the correlation between the repeated measurements in the longitudinal process. In the PBC dataset, multiple longitudinal outcomes had been recorded. Extending the UVJM to accommodate those multiple longitudinal exposures allows us to incorporate more information and thereby improves the prognostic ability of the modeling. In this analysis, we were interested in the association between the three longitudinal biomarkers (SB, SA, SGOT) and the risk of death.
We observed the heterogeneity of each of the biomarkers for those patients. In clinical study, longitudinal exposures of the measurements of SB, SA and SGOT were usually measured with substantial errors. Meanwhile, after taking the natural logarithms of the three longitudinal exposures, they were still skewed. Various covariate mixed-effects models were discussed in the literature, but most of them assumed normal distribution for the error term which may lack the robustness against departures from normality in practice [6,29,45]. In order to relax the normality assumption and make a robust inference, the model was assumed to follow SN distribution. We offered an MVJM with subject-specific random intercept, random slope and quadratic of year and gender of the three serum levels for the longitudinal sub-model; and those six subject-specific random quantities were served as surrogate covariates in the survival sub-model. Therefore, an appropriate statistical model was needed to capture all the different biomarkers trajectories with the risk of death.

Multivariate joint modeling
There are two basic components of a joint modeling: the longitudinal component and the time-to-event (survival) component. In this regard, longitudinal measurements and time-to-event outcomes need to be modeled simultaneously in order to account for all information and uncertainty from both components, and understand the relationship between the underlying longitudinal data and the hazard for the event. Figure 2 presented the underlying causal diagram for joint modeling mechanism.
The MVJM with multiple longitudinal exposures and SN distribution here was considered as an extension of a traditional UVJM. The MVJM consisted of a multivariate linear mixed-effects model for longitudinal SB, SA and SGOT exposures (the longitudinal sub-model) and a Cox proportional hazards model for the time of death as the outcome (the survival sub-model), which linked through the common subject-specific random-effects (intercepts and slopes) from the three functional forms to bring these two data types together into a single (multivariate joint) model, enabling better inference of the correlation, interplay and association between multiple longitudinal and time-to-event data. Moreover, in order to relax the normality assumption and make a robust inference, we assumed the multivariate longitudinal model with SN distribution. Based on both clinical significance and model selection criteria, we included the following covariates in the longitudinal sub-model: quadratic of year and gender (male as reference). In the survival sub-model, we assessed the simultaneous associations of the six subject-specific random baseline and change rate estimated from the longitudinal sub-model as time-varying covariates with risk of death. The survival sub-model was also adjusted for age at baseline, gender(male as reference), ascites(yes or no), hepatomegaly(yes or no), edema(yes or no) and histologic stage ≥3 (yes or no).
The MVJM with SN distribution was applied to assess the simultaneous effects of the three longitu-dinal biomarkers on risk of death. The fully Bayesian methodology using Markov Chain Monte Carlo (MCMC) techniques was adopted for the MVJM fitting and data analysis using the R software with as-sociated R2Winbugs [41]. The simultaneous statistical inference on all unknown population parameters θ = (β, α, γ, Σ 3 , Σ b , δ 3 ) T can capture the underlying association between the longitudinal responses and the time-to-event data. To carry out the Bayesian inference, we need to specify prior distributions with the values of the hyper-parameters for all population parameters. Due to the absence of historical data, we applied weakly informative prior distributions for the unknown population parameters in MVJM. In particular, we specified the values for the hyper-parameters in the prior distributions as follows. (i) each element of the population coefficient vectors β, α, and γ was taken to be the normal distribution N (0, 100); (ii) the priors for the variance covariance matrices Σ 3 and Σ b were taken to be inverse Wishart distribu-tions IW (diag(0.01, 0.01, 0.01), 4) and IW (diag(0.01, 0.01, 0.01, 0.01, 0.01, 0.01), 7), respectively; (iii) each of the skewness parameters δ 1 , δ 2 and δ 3 , which represent skewness of SB, SA and SGOT measurements, respectively, follows the normal distribution N (0, 100). Figure 4 presented the results of joint analysis for three longitudinal SB, SA and SGOT levels in natural log-transformed and time-to-death of PBC data based on the MVJM with SN distribution. We fitted multivariate linear mixed-effects model for each biomarkers with a patient-specific baseline value and patient-specific change rate with covariates quadratic term of year and gender. It was shown from the results summarized in the upper half of Table 2 that in the longitudinal sub-model, the estimated results indicated the skewness in SB (δ 1 =0.42), SA (δ 2 =−0.03) and SGOT (δ 3 =0.095) after taking natural logarithms were estimated to be significant. This indicated after the natural log-transformed of the three biomarkers, the skewness with lightly right tail of the SB and SGOT, and fair lightly left tail of SA was still remain. Thus, it might suggest that accounting for an MVJM with the skewness distribution assumption provided a better fit to the data which exhibit skewness and, in turn, gave more reliable estimates of the parameters in comparison with its counterpart where the normal distribution is assumed. The estimated results of fixed-effects presented in Table 2 indicated Figure 3. Causal diagram of the underlying mechanism of MVJM for the longitudinal and survival processes: Rectangles denote observed data; ellipse denotes unobserved quantities; circles denote un-observed (including subject-specific random bi) terms and unknown parameters. The arrows direct statistical dependencies. The causal chain of interest runs from longitudinal exposure y ik (k = 1, 2, 3) to hazard λ i (ti) for event, whereas y ik (k = 1, 2, 3) does not "cause"event, but is statistically related to event through its dependence on the unobserved random quantities bi.  For the parameters of the survival sub-model,  Table 2

Concluding Discussion
In this article, we have introduced the MVJM approach with SN distribution for simultaneously analyzing three longitudinal exposures and time-to-event data. Although joint modeling is widely recognized in the biostatistical literature and important in many application areas, allowing accurate inference of the dependency and association between these two types of data, this paper extended the traditonal joint modeling application to consider joint modeling of multivariate longitudinal exposures with skewness distribution and Cox proportional hazard model, accounting for multiple data features simultaneously. The principal benefits of this MVJM with SN distribution over traditional UVJM analysis is its efficient use of helping practitioners to analyze complicated multiple longitudinal exposures and time-to-event data under a wide range of considerations, which enables preciser inference. By using this advanced MVJM technique, we were able to obtain more accurate and reliable estimates by taking into consideration the highly corre-lated nature of the different serum levels and infer insights into the complex relationships among multiple longitudinal exposures and the risk of death for PBC patients.
We have reported the MVJM with SN distribution of a dataset from examining the progress of PBC in 259 patients in the Mayo Clinic to assess the dependency and association between the multiple longitudinal exposures over time and time-to-death outcome. The results from this MVJM revealed that both the baseline and change rate of SB trajectory over time were positively associated with the risk of death; both the baseline and change rate of SA trajectory over time were negatively associated with risk of death; but no significant association was found between the SGOT trajectory and death. From clinical perspective, these estimates were broadly in line, that higher SB was associated with higher risk of mortality, whereas higher SA was associated with lower risk of mortality. This study provides physicians a more flexible and dynamic model to discriminate patients using multiple biomarkers. Improving the knowledge about the course of PBC and its biomarkers is essential for the development and approval of new therapies.
To our knowledge, under the framework of Bayesian joint modeling for longitudinal and time-to-event data, no studies have examined multivariate longitudinal data based on SN distribution of biomarkers trajectory over time to predict survival event for the PBC study. In spite of these strengths, the following notes should be made in our study. First, including a second order polynomial in the longitudinal component, so we only select patients with at least 3 or more observations. As a result, the sample size was different from that in the analysis done by other researchers. Second, from the trajectory plots, the model specification in linear fashion may be inadequate to describe the time course of the repeated SA, SB and SGOT measurements, and the more complex models, such as the piecewise and nonlinear function-based models, should be considered. Finally, male patients were weighted with only light underwear on and thus their biomarkers value would be affected by lighter weight measurement, while undressing was not requested in female. This might introduce a small systematic bias. However, since 88.03% PBC patients were female in this study, which might indicate the majority of PBC patients were female.
Using an UVJM with SN distribution where only one longitudinal exposure serum was assessed as compared to MVJM with SN distribution, the estimated association of the time-varying serum levels with risk of death were biased which generally attenuates the risk estimates to the null, because it failed to take into account the uncertainty caused by the variations and correlations in the three longitudinally measured serum levels. We further explored the UVJM approach in the PBC data by fitting three UVJMs, separately, for each of the three serum (SB, SA and SGOT) levels and time to death (see Supplemental Material in detail). We found that, for the survival sub-model, the results showed considerably quantitative differences between the UVJM and MVJM analyses for most of the estimated parameters. This indicated the importance of recognizing the uncertainty caused by the additional variability and correlations in the observed values of SB, SA and SGOT levels.
In summary, the PBC data analysis using this MVJM with SN distribution has certain advantages compared to traditional joint modeling. First, the majority of joint modeling only focus on univariate longitudinal outcome associated with the time-to-event endpoint. However, in practice, multiple longitude outcomes are likely to be collected together and they may be highly correlated in clinical and observational studies. The MVJM analysis can reduce the bias and increase the efficiency in parameters estimation. The predictive capability will be improved by incorporating all sources of the data. These interesting findings have important clinical indications and lead to more informative inferences for the purpose of medical decision-making. Our results suggested that the biomarkers (SA, SB) had significant association with the risk of death. The MVJM doi: 10.7243/2053-7662-9-2 analysis is applicable in any situation where one wishes to investigate association of longitudinal outcomes and survival outcome. Secondly, since it was of importance to measure SB, SA and SGOT appropriately when they exhibited skewness and heavy tails, even though after we took the natural logarithms, this analysis considered the estimation of skewness parameters δ 1 , δ 2 and δ 3 were statistically significant for ln(SB), ln(SA) and ln(SGOT), indicating that the skewness exists in those biomarkers measurements. The MVJM approach with the SN distribution provided more efficient and accurate parameter estimation, as compared to existing joint modeling for the PBC study.
To conclude, the MVJM with skewness distribution analysis is an improvement over traditional joint modeling and survival model, because they consider all the longitudinal associated observations of covariates that are predictive to the survival outcome. This MVJM method can be applied to wide variaties of research setting to obtain better parameter estimation for future prediction in the medical research in comparison with traditioanl UVJM, since they are considered to account for individual variability. These predictions can provide relatively accurate characterizations of individual disease progression, which might be important for the timing of interventions, qualification for appropriate treatments, and additional other analysis. Although the application of MVJM is limited uptake by medical researchers, but it is useful and applicable to a broader field in clinical practice when multivariate longitudinal measurements and timeto-event data are available.