

Cotton CA, Wu Y and Cook RJ. Correcting for sampling variation and measurement error in Cox regression with circulating tumour cells. J Med Stat Inform. 2014; 2:1. http://dx.doi.org/10.7243/2053-7662-2-1
Cecilia A. Cotton†*, Ying Wu† and Richard J. Cook
*Correspondence: Cecilia A. Cotton ccotton@math.uwaterloo.ca
Authors Affiliation :
†These authors contributed equally to this work.
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: Circulating tumour cell counts are being used increasingly often in oncology for prognosis, stratification and
the assessment of response to treatment in phase II trials. Cell counts have typically been based on visual or computer assisted
counting of abnormal cells in blood samples drawn at the point of recruitment to a study. The number of cells in a given sample is
subject to sampling variation and the recorded number of tumour cells in the sample is subject to measurement error.
Methods: A complete data likelihood was developed for a proportional hazards regression model which recognizes these sources
of variation in the observed circulating tumour cell count when it is used as a prognostic covariate.
Results: The performance of the expectation-maximization algorithm was assessed in simulation studies and found to yield
estimators with good frequency properties. The asymptotic bias of estimators based on the observed circulating tumour cell
count was also examined.
Conclusions: In the situations examined, the proposed estimator was consistent and had considerably smaller empirical bias than
estimators arising from analyses based on the observed cell count. The estimated standard error gave confidence intervals with
coverage probabilities compatible with the nominal level.
Keywords: Circulating tumour cells, cox regression, measurement error, sampling variation
There has been considerable interest in the prognostic utility
of circulating tumour cell (CTC) counts in oncology research
in recent years [1-4]. Early research was directed at the use
of circulating tumour cell counts as a variable prognostic
for disease progression, progression-free survival, or overall
survival [5], but these data are being used increasingly often
for stratification in clinical trials [5], as a basis for assessing
response to treatment in phase II trials [6], and may have the
potential to guide treatment selection [7]. Their use in these
important settings warrants a careful consideration of the
various scientific objectives and issues in statistical analyses.
In mixed models, independent variables are treated as known values, but in practice we encounter missing values for both dependent and independent variables. When conducting data analysis using mixed models, an entire row of the data matrix is excluded if there is only one missing value for independent variables. This results in a fitted model based on fewer subjects or time points, or a reduced data matrix, therefore excluding a lot of information from the model. Since so much information can be excluded from the model, it can be difficult to draw meaningful conclusions.
Miller et al., [8] tabulate data on circulating tumour cell counts in cohorts of normal donors, patients with benign disease, and patients with metastatic breast, colorectal, and prostate cancer. These data reveal that many metastatic cancer patients do not have circulating tumour cells detectable in their blood samples but among those that do, there can be considerable variability in the number detected. Most recent analyses have classified individuals into groups with unfavourable or favourable prognoses depending on whether the observed count was above a specified threshold [9,10]. In metastatic breast and prostate cancer the threshold typically adopted is five so those with a circulating tumour cell count of at least 5 in a 7.5 sample are considered to have a poorer prognosis than those with a count less than 5; in metastatic colorectal cancer the threshold is 3 [5,11,12]. More recently, others have found an association between overall survival and both a baseline CTC count used as a continuous variable and post-treatment change in CTC count [13].
Counts of circulating tumour cells are typically made by visual or computer assisted counting of abnormal cells in blood samples drawn at the point of recruitment to a study [3,14]. The number of cells in a given sample is therefore subject to sampling variation and the recorded number of tumour cells in the sample is subject to measurement error.
Typically when dealing with measurement error, we must first consider the error distribution or the relationship of the mismeasured covariates to the true unobserved covariates. We may classify approaches for measurement error problems as being parametric, semi-parametric or nonparametric by the assumptions of the measurement error distribution. A nonparametric approach was taken by Pepe and Fleming [15] when they empirically estimated the likelihood in the presence of mismeasured covariates with validation data. Huang and Wang [16] also took a nonparametric approach to deal with mismeasured covariates in the Cox model with replicate data available. Tsiatis and Davidan [17] and Kulich and Lin [18] considered semi-parametric approaches of dealing with mismeasured covariates in survival analysis.
There are two fundamentally different interpretations of the unobserved true values of the covariates. In structural modelling the true covariates are regarded as random and a model for their joint distribution is assumed, whereas in functional modelling, the true covariates are considered as a sequence of fixed unknown vectors, or minimal assumptions about the distributional form of the true covariates are made in addition to fixed values. Likelihood methods and Bayesian methods are two general structural modelling methods which require distributional assumptions of the unobserved true covariates. Much literature has concentrated on functional modelling since it involves few or no assumptions regarding the error distribution. For general non-differential error problems, two widely used approaches are regression calibration (RC) and simulation extrapolation (SIMEX). Regression calibration was first proposed by Prentice [19] for the proportional hazard models in survival analysis and required the validation data or replication measurements to estimate the conditional distribution of the unknown true covariates given the observed covariates. Simulation extrapolation was first proposed by Cook and Stefanski [20]. The key idea of this method is that the effect of measurement error can be investigated and therefore adjusted for using simulation techniques. This approach is well suited to additive or multiplicative measurement errors and leads to improve estimates subject to correct model specification [21].
Corrections for the measurement error typically depend on the availability of internal or external data to estimate the error distribution, for example, the replicated measurements or validation dataset. However, in our context regarding CTCs, it is impossible to measure the true concentration; it is also costly to have multiple blood samples to assess the number of CTCs. What's more, we take not only the measurement error but also the sampling variation into consideration in correcting biases. Therefore, in this article, we focus on structural modelling with specified distribution of the true concentration and use likelihood method for correction.
We propose that the implicit goal of prognostic analyses involving circulating tumour cells is to relate the concentration of circulating tumour cells in the total blood volume to the time of interest (e.g., survival time). We develop a complete data likelihood for a proportional hazards regression model based on this concentration which recognizes the sampling variation. The measurement error in the circulating tumour cell count for a specific blood sample is characterized by a simple Bernoulli process for each cell. We consider models using both the concentration in the blood stream and a binary classification of whether this concentration is above a particular threshold. Estimation is carried out based on an expectation-maximization algorithm [22] and we describe how to do this for a parametric proportional hazards model and a semiparametric Cox model [23]. The aim of this article is to demonstrate the potential bias that results when the above sources of error are not accounted for and to describe and illustrate likelihood based methods for carrying out corrected analyses based on potentially mismeasured or misclassified CTC counts. These methods are assessed through simulation studies and found to yield estimators with good frequency properties.
Notation
Consider a sample of m independent individuals where
denotes the number of circulating tumour cells,
denotes
the total blood volume, and
denotes the concentration
of circulating tumour cells in the blood stream for
individual i, i = 1,..., m. When a blood sample of vml is
drawn we let
denote the number of tumour cells contained
in the sample. Human or automated procedures for
counting tumour cells in blood samples are imperfect and it
is common for cells present in the sample to be missed. We
therefore Ci let denote the number of tumour cells observed in the vml sample and let Wi denote a px1 vector of
observed auxiliary covariates for individual i, i = 1,2,..., m.
In the notation above we use "dots" to denote latent variables.
Interest primarily lies in examining the association between
circulating tumour cell counts and times to events such as
disease progression or death; we let Ti denote the time of
the event of interest. In the context of a prospective clinical
trial the plan is to follow individuals over the period [0,τ )
where τ is an administrative censoring time. Individuals may
of course withdraw from the study early and we let τi† denote
a random time of withdrawal so that events are ultimately
observed over the interval [0, τi )
where
for
individual i, i =1,..., m. Let
and
, i =1,2,..., m . Let
indicate that individual i is uncensored and
indicate that the individual i has not yet failed. Then
indicates that individual i is at risk and
under observation at time u.
We also let
indicate that the
individual i fails at time u and let indicate
that failure occurred and was observed at time u.
We make the following assumptions regarding the above random variables.
Assumption A1 states that given the concentration of
circulating tumour cells and the auxiliary covariates, the failure
time is independent of the number of cells in the sample and
the number of observed cells, which in turn implies
and Ci are surrogates, or in other words, the measurement error
model is nondifferential. Assumption A2 states that counting
of the circulating tumour cells in the sample is conducted
independently of the concentration in the body and the
auxiliary covariates given the sample count. This is entirely
reasonable since the circulating tumour cells in the sample are
counted without knowledge of
. Assumption A3 is an
assumption of conditionally independent random censoring
[24] which applies to the censoring time given Wi .
Distributional assumptions and model specification
We adopt a structual modeling approach [25] to deal with the
issues of measurement error in this problem. Let
denote the distribution function for the concentration of
circulating tumour cells given the auxiliary variables relevant
for the study population. The circulating tumour cells are
assumed to be uniformly distributed throughout the blood
stream and so given
the number of cells in a sample
is Poisson distributed with mean
[26]. The probability
p that a particular cell in the sample is detected is referred
to as the detection efficiency. We assume that the event an
individual tumour cell in the sample is detected is independent
of whether other cells are detected, and so
is binomially
distributed with
trials and probability of success p ; we
treat p = p0 as a known feature of the measurement system
based on external data published in the literature [4].
The effect of circulating tumour cells on survival is typically
assessed based on a proportional hazards model involving
either the observed count Ci or an indicator
of
whether the count is elevated above a specified threshold K.
In metastatic breast cancer, a threshold of K = 5 has been
reported and used routinely in the literature, and if Ei =1 the individual would typically be considered high risk. The
observed cell count Ci is implicitly interpreted in the context of
the v = 7.5ml blood sample. The fact that the samples drawn
are all of the same volume implies that the prognostic value
of the concentration
is of primary interest and
can
be viewed as an estimate of the concentration of circulating
tumour cells in the blood stream. The Cox regression model
of interest is
(1)
where α indexes the baseline hazard
,
,
where β1 is the parameter of interest, and
.
We use
since current practise is to use the observed cell
count which is thought to represent the concentration per
7.5ml volume.
Bias ignoring sampling variation and measurement error
Under the current formulation the standard method of fitting
regression models to the observed circulating tumour cell
count involves a misspecified model since it ignores the
sampling variation and measurement error in the observation
process. Here we consider the implications of using this
approach to analysis by studying the limiting behaviour
of the resultant estimator using the theory of misspecified
likelihoods [27] for the Cox model [28].
We let Zi = (Ci,Wi')' denote the covariate vector based on
available count and auxiliary covariate and model the effect
in Ti |Zi as
where the parameters are denoted as
and
we use different Greek letters for the parameters
in the misspecified model since they represent different
quantities than their counterparts in the correctly specified
models. The misspecified score equations are then
(2)
(3)
where
and we write
.
The limiting values of the naive estimator
from
solving (2) and (3) are consistent for the solutions to
The expectations above can be evaluated numerically by using Monte Carlo integration. In Figure 1 we display a plot of the limiting value of the estimator of β1 for the exponential and Cox regression models as a function of (a) the dispersion parameter of the concentration distribution, (b) the mean parameter of the concentration distribution, and (c) the regression parameter itself, for varying degrees of the efficiency of detection. The limiting values are quite different for the parametric and Cox regression models. The plots reveal a clear dependence on the distribution of the concentration and the efficiency of detection; the plotted points represent empirical estimates which are in close agreement with the theoretical results. Among these the most important appears to be the detection efficiency.
Figure 1
: Asymptotic bias of naive maximum likelihood estimates for β from the exponential and Cox proportional hazards
model for p = 0.6, 0.8 and 1.0 across a range of values of (a) ϕ, (b) µ, and (c) β1. Points indicate values of the parameters used
in the simulation models of Section 3.
Likelihood accommodating sampling variation and
measurement error
Using the structural model, if
, the complete data
likelihood is
By Assumption A3,
Under Poisson sampling and Assumption A2,
where
is binomial with
trials and known probability
of success p0 , and
is Poisson with
. Since
does not
involve any unknown parameters, we can write the complete
data partial log-likelihood as
(5)
(6)
(7)
Let Di = {(Xi, δi),Ci,Wi} denote the observed data for individual
i , i =1,2,..., m and D = (D1,D2,..., Dm). Since
is unobserved
we must take the conditional expectation of (5) using
(8)
where in the denominator of (8) we integrate with respect to
the latent concentration of circulating tumour cells.
If
where
is the complete
data score equations for the log partial likelihood (6) and
, then
is the observed data score equation. This equation is best
solved iteratively using the EM algorithm [22] and we do this
by taking the expectation of (9) using an estimate
at the k th iteration to compute
,
which is in turn set equal to zero and solved to give
. This procedure is repeated iteratively until convergence.
If (1) is to be modeled using a semi-parametric Cox model,
following Martinussen [29], we let s1,s2,..., sR be the distinct
times at which
,
and
. Then for any estimate
, the expected score
vector (9) has components
conformable with (dH0, β', γ'), and
The equation
can be solved to give
(10)
which is zero except at times of failure, s1,s2,..., sR .
Letting
we
substitute (10) into
to get the estimating equation
for β as
(11)
To evaluate the estimating equation for, we propose
specification of the parametric form of
.
To evaluate these conditional expectations in (10)-(11), we
use Gauss-Laguerre quadrature [30] given the conditional
probability density function (8). Gauss-Laguerre quadrature
involves approximating the integrals on the interval [0, ∞) by:
where xs and ws are the nodes and weights that picked based
on weight function e−x . Monte Carlo methods with rejection
sampling could alternatively be used to approximate these
expectations by simulation. The asymptotic variance of
can be estimated as suggested by Louis [31] and implemented
by Martinussen [29]; see the Appendix for details on the
current setting. Further details and examples of using the
EM algorithm to fit Cox regression models can be found in
Klein [32], Lipsitz and Ibrahim [33], Herring and Ibrahim [34],
and Nielsen [35].
Empirical study of finite sample behaviour
Here we present the results of a series of simulation studies
conducted to assess the influence of sampling variability
and mismeasurement on the behaviour of naive estimators
and to examine the performance of the proposed estimator.
We generate the data following the model formulation
in the previous section. Specifically we generate the total
circulating tumour cell count in an individual assuming
is
gamma distributed with mean µ and variance ϕ we consider
(µ,ϕ)'=(2,1)' and (2,4)' to represent a mean of 2 in the
population and a coefficient of ϕ variation of 0.5 and 1.0
respectively. We also consider (µ,ϕ)'=(2/3,1)' to correspond
to a smaller mean reflecting a higher proportion of zero
circulating tumour cell counts but with a coefficient of variation
of 1.5. The times of interest were generated from exponential
regression models with
(12)
We considered the case in which the failure times were
uncensored and administratively censored at a point where
25% of the events would be unobserved. To reflect the sampling
and observation process we generated
as a Poisson
random variable with mean
and
as binomially
distributed with probability p0 = 0.80.
Analyses were carried out under an exponential regression
model and Cox regression model of the form of (12) using
,
and Ci . The first of these is not possible in practise since it
requires the true concentration of circulating tumour cells. The
second is also not possible since it requires perfect recording
of the number of circulating tumour cells in the sample, and
the third analysis reflects the current practise of using the
recorded counts in the samples. The fourth analysis was
based on the EM algorithm described above. A second set of analyses were there carried out using the binary counterparts
to the above variables constructed with a threshold of 5.
Specifically we let
and
denote
the latent variables based on the unobserved concentration
and count, Ei = I(Ci ≥5) be the binary variable based on the
observed count. Based on the parameter configurations
, and for
(µ,ϕ) = (2,1),(2, 4) and
(2/3,1) respectively. The sensitivities
and
specificities
of the observed binary variable
for the true binary variable are (0.919, 0.792), (0.899,0.907)
and for (µ,ϕ) = (2,1), (2,4) and (2/3,1) respectively. The analyses
using the continuous and binary variables are summarized in
(Table 1 and 2) respectively where the empirical biases,
the empirical standard errors and the empirical coverage
probabilities for nominal 95% confidence intervals are given.
Table 1 : Empirical performance of estimators of the regression coefficient for the continuous covariate under parametric and semi-parametric Cox regression models with and without adminstrative censoring; sample size m=500, number of simulations nsim=500.
Table 2 : Empirical performance of estimators of the regression coefficient for the binary covariate under parametric and semi-parametric Cox regression models with and without adminstrative censoring; sample size m=500, number of simulations nsim=500.
The results for the analysis based on a continuous covariate in Table 1 reveal negligible empirical bias and excellent empirical coverage probability for the Cox regression model based on the true concentration. Appreciable bias in the estimator is apparent for the analyses based on the sample concentration and there is a considerable impact on the empirical coverage probability. The performance of the estimators based on the observed circulating tumour cell count is also very poor in terms of bias and coverage probability. The proposed estimator has considerably smaller bias and standard errors based on the method of Louis [31] give confidence intervals which are compatible with the 95% level. This performance is roughly the same whether the event times are subject to censoring or not and there is good performance seen as well for the parametric analyses.
Table 2 contains analogous results for the binary covariate. Here the measurement error results in a misclassification of the binary indicator and these misclassification probabilities are reported in Table 2. Again the proposed EM algorithm reduces the bias over the naive analyses.
Our proposed methodology is the first we are aware of to rigorously deal with the statistical issues arising from sampling variation and measurement error in the enumeration of CTCs. The utility of circulating tumor cells was originally examined in the context of their prognostic role for progression and survival, but they are now being used increasingly often to form surrogate outcomes in phase II and III clinical trials [36] and are currently receiving consideration as biomarkers to assess treatment response [37] in patients with advanced cancer. The statistical issues of sampling variation and measurement error raised here bear on their use in these contexts as well.
However, our method is not without limitations. We have focussed on the use of circulating tumour cells in the context of parametric and semiparametric proportional hazards models. Other types of regression models are of courses possible including accelerated failure time models, additive models or Cox-Aalen types models [38]. Model diagnostics represent an important step in survival analysis to assess the plausibility of the model assumptions. Key assumptions to check include the way the covariate effects have been expressed and the proportional hazards assumption. These can be assessed through the analysis of Schoenfeld or martingale residuals [39]. In settings where empirical covariate distributions might involve extreme values, influence statistics can play an important role. Further research is warranted on the development of methods for carrying out these diagnostic tests when covariates are measured with error.
An additional limitation is due to the fact that the concentration of circulating tumour cells is not observed in any patients it is difficult to assess the validity of any parametric assumptions made for the distribution of this concentration. In such settings model expansion offers one feasible approach to check for the adequacy of the specification of G( . ); score tests are particular appealing in this setting since they do not require fitting the expanded model [40]. Power of these tests will typically be rather low since they are based on the latent true concentration. See Waagepetersen [41] and Alonso et al., [42]. Additionally, depending on the disease severity of the subjects used in a CTC study there may be a large proportion of subjects with a true concentration of zero or subjects with very high concentrations. Other approaches for dealing with the difficulty in specifying the distribution of the concentration in the sample include nonparametric estimation but this is really only feasible in the absence of auxiliary covariates or if Wi is discrete. Spline-based approaches of kernel density estimation offer alternatives [43].
Many statistical methods for correcting for measurement error rely on data from validation studies. If the correct underlying value of the covariate of interest is available in a validation sub-sample then one can estimate features of the measurement process. Validation studies will not be possible in this setting, but auxiliary data in which repeated blood samples are taken and circulating tumor cells are counted could yield more precise estimates of the true concentration and enhance the precision of the corrected estimators we described [44].
The present study demonstrates that using observed CTC count as a prognostic covariate may lead to biased results when sources of sampling variation and measurement error are not accounted for. This could lead to false conclusions regarding the association between CTC count and patient outcomes. A likelihood based methods for carrying out corrected analyses is described and illustrated. These methods are assessed through simulation studies and found to yield estimators with good frequency properties.
The author declares that they have competing interests.
Authors' contributions | CAC | YW | RJC |
Research concept and design | √ | √ | √ |
Collection and/or assembly of data | -- | √ | -- |
Data analysis and interpretation | √ | √ | √ |
Writing the article | √ | √ | √ |
Critical revision of the article | √ | √ | √ |
Final approval of article | √ | √ | √ |
Statistical analysis | -- | √ | -- |
The authors wish to thank Dr Alison Allan (London Regional Cancer Program, London Health Sciences Centre; Departments of Oncology and Anatomy & Cell Biology, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario) for helpful discussions and comments. This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada (RGPIN 402474 for CAC and RGPIN 155849 for RJC) and the Canadian Institutes for Health Research (FRN 13887). Ying Wu was supported by a grant from the Division of High Impact Clinical Trials of the Ontario Institute for Cancer Research. Richard Cook is a Canada Research Chair in Statistical Methods for Health Research.
EIC: Jimmy Efird, East Carolina University, USA.
Received: 26-Sep-2013 Revised: 11-Oct-2013
Accepted: 18-Oct-2013 Published: 23-Jan-2014
Cotton CA, Wu Y and Cook RJ. Correcting for sampling variation and measurement error in Cox regression with circulating tumour cells. J Med Stat Inform. 2014; 2:1. http://dx.doi.org/10.7243/2053-7662-2-1
Copyright © 2015 Herbert Publications Limited. All rights reserved.