Correcting for sampling variation and measurement error in Cox regression with circulating tumour cells

Background: Circulating tumour cell counts are being used increasingly often in oncology for prognosis, stratification and the assessment of response to treatment in phase II trials. Cell counts have typically been based on visual or computer assisted counting of abnormal cells in blood samples drawn at the point of recruitment to a study. The number of cells in a given sample is subject to sampling variation and the recorded number of tumour cells in the sample is subject to measurement error. Methods: A complete data likelihood was developed for a proportional hazards regression model which recognizes these sources of variation in the observed circulating tumour cell count when it is used as a prognostic covariate. Results: The performance of the expectation-maximization algorithm was assessed in simulation studies and found to yield estimators with good frequency properties. The asymptotic bias of estimators based on the observed circulating tumour cell count was also examined. Conclusions: In the situations examined, the proposed estimator was consistent and had considerably smaller empirical bias than estimators arising from analyses based on the observed cell count. The estimated standard error gave confidence intervals with coverage probabilities compatible with the nominal level.


Introduction
There has been considerable interest in the prognostic utility of circulating tumour cell (CTC) counts in oncology research in recent years [1][2][3][4]. Early research was directed at the use of circulating tumour cell counts as a variable prognostic for disease progression, progression-free survival, or overall survival [5], but these data are being used increasingly often for stratification in clinical trials [5], as a basis for assessing response to treatment in phase II trials [6], and may have the potential to guide treatment selection [7]. Their use in these important settings warrants a careful consideration of the various scientific objectives and issues in statistical analyses.
Miller et al., [8] tabulate data on circulating tumour cell counts in cohorts of normal donors, patients with benign disease, and patients with metastatic breast, colorectal, and prostate cancer. These data reveal that many metastatic cancer patients do not have circulating tumour cells detectable in their blood samples but among those that do, there can be considerable variability in the number detected. Most recent analyses have classified individuals into groups with unfavourable or favourable prognoses depending on whether the observed count was above a specified threshold [9,10]. In metastatic breast and prostate cancer the threshold typically adopted is five so those with a circulating tumour cell count of at least 5 in a 7. 5 sample are considered to have a poorer prognosis than those with a count less than 5; in metastatic colorectal cancer the threshold is 3 [5,11,12]. More recently, others have found an association between overall survival and both a baseline CTC count used as a continuous variable and post-treatment change in CTC count [13]. Counts of circulating tumour cells are typically made by visual or computer assisted counting of abnormal cells in blood samples drawn at the point of recruitment to a study [3,14]. The number of cells in a given sample is therefore subject to sampling variation and the recorded number of tumour cells in the sample is subject to measurement error.
Typically when dealing with measurement error, we must first consider the error distribution or the relationship of the mismeasured covariates to the true unobserved covariates. We may classify approaches for measurement error problems as being parametric, semi-parametric or nonparametric by the assumptions of the measurement error distribution. A nonparametric approach was taken by Pepe and Fleming [15] when they empirically estimated the likelihood in the presence of mismeasured covariates with validation data. Huang and Wang [16] also took a nonparametric approach to deal with mismeasured covariates in the Cox model with replicate data available. Tsiatis and Davidan [17] and Kulich and Lin [18] considered semi-parametric approaches of dealing with mismeasured covariates in survival analysis. There are two fundamentally different interpretations of the unobserved true values of the covariates. In structural modelling the true covariates are regarded as random and a model for their joint distribution is assumed, whereas in functional modelling, the true covariates are considered as a sequence of fixed unknown vectors, or minimal assumptions about the distributional form of the true covariates are made in addition to fixed values. Likelihood methods and Bayesian methods are two general structural modelling methods which require distributional assumptions of the unobserved true covariates. Much literature has concentrated on functional modelling since it involves few or no assumptions regarding the error distribution. For general non-differential error problems, two widely used approaches are regression calibration (RC) and simulation extrapolation (SIMEX). Regression calibration was first proposed by Prentice [19] for the proportional hazard models in survival analysis and required the validation data or replication measurements to estimate the conditional distribution of the unknown true covariates given the observed covariates. Simulation extrapolation was first proposed by Cook and Stefanski [20].
The key idea of this method is that the effect of measurement error can be investigated and therefore adjusted for using simulation techniques. This approach is well suited to additive or multiplicative measurement errors and leads to improve estimates subject to correct model specification [21].
Corrections for the measurement error typically depend on the availability of internal or external data to estimate the error distribution, for example, the replicated measurements or validation dataset. However, in our context regarding CTCs, it is impossible to measure the true concentration; it is also costly to have multiple blood samples to assess the number of CTCs. What's more, we take not only the measurement error but also the sampling variation into consideration in correcting biases. Therefore, in this article, we focus on structural modelling with specified distribution of the true concentration and use likelihood method for correction.
We propose that the implicit goal of prognostic analyses involving circulating tumour cells is to relate the concentration of circulating tumour cells in the total blood volume to the time of interest (e.g., survival time). We develop a complete data likelihood for a proportional hazards regression model based on this concentration which recognizes the sampling variation. The measurement error in the circulating tumour cell count for a specific blood sample is characterized by a simple Bernoulli process for each cell. We consider models using both the concentration in the blood stream and a binary classification of whether this concentration is above a particular threshold. Estimation is carried out based on an expectation-maximization algorithm [22] and we describe how to do this for a parametric proportional hazards model and a semiparametric Cox model [23]. The aim of this article is to demonstrate the potential bias that results when the above sources of error are not accounted for and to describe and illustrate likelihood based methods for carrying out corrected analyses based on potentially mismeasured or misclassified CTC counts. These methods are assessed through simulation studies and found to yield estimators with good frequency properties.

Notation
Consider a sample of independent individuals where denotes the number of circulating tumour cells, denotes the total blood volume, and . When a blood sample of v m is drawn we let denote the number of tumour cells contained in the sample. Human or automated procedures for counting tumour cells in blood samples are imperfect and it is common for cells present in the sample to be missed. We therefore C i let denote the number of tumour cells observed in the v m sample and let i W denote a 1 p × vector of observed auxiliary covariates for individual , 1, 2, , i m = … . In the notation above we use "dots" to denote latent variables.
Interest primarily lies in examining the association between circulating tumour cell counts and times to events such as disease progression or death; we let i T denote the time of We make the following assumptions regarding the above random variables.

A1.
( , ) | ( , ) Assumption A1 states that given the concentration of circulating tumour cells and the auxiliary covariates, the failure time is independent of the number of cells in the sample and the number of observed cells, which in turn implies and are surrogates, or in other words, the measurement error model is nondifferential. Assumption A2 states that counting of the circulating tumour cells in the sample is conducted independently of the concentration in the body and the auxiliary covariates given the sample count. This is entirely reasonable since the circulating tumour cells in the sample are counted without knowledge of ( , ) . Assumption A3 is an assumption of conditionally independent random censoring [24] which applies to the censoring time given i W .

Distributional assumptions and model specification
We adopt a structual modeling approach [25] to deal with the issues of measurement error in this problem. Let ( | ; ) denote the distribution function for the concentration of circulating tumour cells given the auxiliary variables relevant for the study population. The circulating tumour cells are assumed to be uniformly distributed throughout the blood stream and so given i R  the number of cells in a sample is Poisson distributed with mean i vR  [26]. The probability p that a particular cell in the sample is detected is referred to as the detection efficiency. We assume that the event an individual tumour cell in the sample is detected is independent of whether other cells are detected, and so | i i C C  is binomially distributed with i C  trials and probability of success p ; we treat 0 p p = as a known feature of the measurement system based on external data published in the literature [4].
The effect of circulating tumour cells on survival is typically assessed based on a proportional hazards model involving either the observed count or an indicator ( ) i i E I C K = ≥ of whether the count is elevated above a specified threshold K . In metastatic breast cancer, a threshold of where indexes the baseline hazard 0 ( ; ) h t α , β is the parameter of interest, and . We use i vR  since current practise is to use the observed cell count which is thought to represent the concentration per 7.5 m volume.

Bias ignoring sampling variation and measurement error
Under the current formulation the standard method of fitting regression models to the observed circulating tumour cell count involves a misspecified model since it ignores the sampling variation and measurement error in the observation process. Here we consider the implications of using this approach to analysis by studying the limiting behaviour of the resultant estimator using the theory of misspecified likelihoods [27] for the Cox model [28].
We let Z i =(C i ,W i ')' denote the covariate vector based on available count and auxiliary covariate and model the effect in T i |Z i as where the parameters are denoted as and η = (dΛ 0 ' , ')'; we use different Greek letters for the parameters in the misspecified model since they represent different quantities than their counterparts in the correctly specified models. The misspecified score equations are then (3) where and we write The limiting values of the naive estimator =(dΛ 0 ' , ')' from solving (2) and (3) are consistent for the solutions to where the expectation is with respect to [28]. ψ characterizes the distribution of A i and will be explained in the next section. We define One can then obtain the limiting values of the estimator dΛ 0 (s r ) as the limiting value of is given by the solution to Therefore, it remains to evaluate The expectations above can be evaluated numerically by using Monte Carlo integration. In Figure 1 we display a plot of the limiting value of the estimator of β 1 for the exponential and Cox regression models as a function of (a) the dispersion parameter of the concentration distribution, (b) the mean parameter of the concentration distribution, and (c) the regression parameter itself, for varying degrees of the efficiency of detection. The limiting values are quite different for the parametric and Cox regression models. The plots reveal a clear dependence on the distribution of the concentration and the efficiency of detection; the plotted points represent empirical estimates which are in close agreement with the theoretical results. Among these the most important appears to be the detection efficiency.

Likelihood accommodating sampling variation and measurement error
Using the structural model, if =(θ' ,γ')' , the complete data likelihood is where in the denominator of (8) we integrate with respect to the latent concentration of circulating tumour cells. If is the complete data score equations for the log partial likelihood (6) and is the observed data score equation. This equation is best solved iteratively using the EM algorithm [22] and we do this by taking the expectation of (9) using an estimate at the k th iteration to compute which is in turn set equal to zero and solved to give This procedure is repeated iteratively until convergence.
If (1)      To evaluate these conditional expectations in (10)-(11), we use Gauss-Laguerre quadrature [30] given the conditional probability density function (8). Gauss-Laguerre quadrature involves approximating the integrals on the interval [0, ∞) by: where and are the nodes and weights that picked based on weight function x e − . Monte Carlo methods with rejection sampling could alternatively be used to approximate these expectations by simulation. The asymptotic variance of can be estimated as suggested by Louis [31] and implemented by Martinussen [29]; see the Appendix for details on the current setting. Further details and examples of using the EM algorithm to fit Cox regression models can be found in Klein [32], Lipsitz and Ibrahim [33], Herring and Ibrahim [34], and Nielsen [35].

Empirical study of finite sample behaviour
Here we present the results of a series of simulation studies conducted to assess the influence of sampling variability and mismeasurement on the behaviour of naive estimators and to examine the performance of the proposed estimator. We generate the data following the model formulation in the previous section. Specifically we generate the total circulating tumour cell count in an individual assuming i R  is gamma distributed with mean µ and variance ϕ; we consider (μ,ϕ) ' =(2,1) ' and (2,4) ' to represent a mean of 2 in the population and a coefficient of фvariation of 0.5 and 1.0 respectively. We also consider (μ,ϕ) ' =(2/3,1) ' to correspond to a smaller mean reflecting a higher proportion of zero circulating tumour cell counts but with a coefficient of variation of 1.5. The times of interest were generated from exponential regression models with We considered the case in which the failure times were uncensored and administratively censored at a point where 25% of the events would be unobserved. To reflect the sampling and observation process we generated | Analyses were carried out under an exponential regression model and Cox regression model of the form of (12) using i vR  , i C  and i C . The first of these is not possible in practise since it requires the true concentration of circulating tumour cells. The second is also not possible since it requires perfect recording of the number of circulating tumour cells in the sample, and the third analysis reflects the current practise of using the recorded counts in the samples. The fourth analysis was based on the EM algorithm described above. A second set of analyses were there carried out using the binary counterparts to the above variables constructed with a threshold of 5. Specifically we let ( 5) (Tables 1 and 2) respectively where the empirical biases, the empirical standard errors and the empirical coverage probabilities for nominal 95% confidence intervals are given.
The results for the analysis based on a continuous covariate in (Table 1) reveal negligible empirical bias and excellent empirical coverage probability for the Cox regression model based on the true concentration. Appreciable bias in the estimator is apparent for the analyses based on the sample concentration and there is a considerable impact on the empirical coverage probability. The performance of the estimators based on the observed circulating tumour cell count is also very poor in terms of bias and coverage probability. The proposed estimator has considerably smaller bias and standard errors based on the method of Louis [31] give confidence intervals which are compatible with the 95% level. This performance is roughly the same whether the event times are subject to censoring or not and there is good performance seen as well for the parametric analyses. Table 2 contains analogous results for the binary covariate. Here the measurement error results in a misclassification of the binary indicator and these misclassification probabilities are reported in Table 2. Again the proposed EM algorithm reduces the bias over the naive analyses.

Discussion
Our proposed methodology is the first we are aware of to rigorously deal with the statistical issues arising from sampling variation and measurement error in the enumeration of CTCs. The utility of circulating tumor cells was originally examined in the context of their prognostic role for progression and survival, but they are now being used increasingly often to form surrogate outcomes in phase II and III clinical trials [36] and are currently receiving consideration as biomarkers to assess treatment response [37] in patients with advanced cancer. The statistical issues of sampling variation and measurement error raised here bear on their use in these contexts as well.
However, our method is not without limitations. We have focussed on the use of circulating tumour cells in the context of parametric and semiparametric proportional hazards models. Other types of regression models are of courses possible including accelerated failure time models, additive models or Cox-Aalen types models [38]. Model diagnostics represent an important step in survival analysis to assess the plausibility of the model assumptions. Key assumptions to check include the way the covariate effects have been expressed and the proportional hazards assumption. These can be assessed through the analysis of Schoenfeld or martingale residuals [39]. In settings where empirical covariate distributions might involve extreme values, influence statistics can play an important role. Further research is warranted on the development of methods for carrying out these diagnostic tests when covariates are measured with error. An additional limitation is due to the fact that the concentration of circulating tumour cells is not observed in any patients it is difficult to assess the validity of any parametric assumptions made for the distribution of this concentration. In such settings model expansion offers one feasible approach to check for the adequacy of the specification of ; score tests are particular appealing in this setting since they do not require fitting the expanded model [40]. Power of these tests will typically be rather low since they are based on the latent true concentration. See Waagepetersen [41] and Alonso et al., [42]. Additionally, depending on the disease severity of the subjects used in a CTC study there may be a large proportion of subjects with a true concentration of zero or subjects with very high concentrations. Other approaches for dealing with the difficulty in specifying the distribution of the concentration in the sample include nonparametric estimation but this is really only feasible in the absence of auxiliary covariates or if W i is discrete. Spline-based approaches of kernel density estimation offer alternatives [43].
Many statistical methods for correcting for measurement error rely on data from validation studies. If the correct underlying value of the covariate of interest is available in a validation sub-sample then one can estimate features of the measurement process. Validation studies will not be possible in this setting, but auxiliary data in which repeated blood samples are taken and circulating tumor cells are counted could yield more precise estimates of the true concentration and enhance the precision of the corrected estimators we described [44].

Conclusions
The present study demonstrates that using observed CTC count as a prognostic covariate may lead to biased results when sources of sampling variation and measurement error are not accounted for. This could lead to false conclusions regarding the association between CTC count and patient outcomes. A likelihood based methods for carrying out corrected analyses is described and illustrated. These methods are assessed through simulation studies and found to yield estimators with good frequency properties.