Assessing inter-rater agreement between multiple medical instruments with heteroscedastic measurements

Background: In clinical medicine, agreement evaluation plays a major role in determining the compatibility and the accuracy of newly introduced methods with pre-existing methods. These methods may be assays, clinical observers, medical devices etc. It is vital to assess the compatibility and the accuracy of these newly introduced techniques because they deal with the measurements of the human body, such as blood pressure, cholesterol level, heart rate etc. In practice, agreement evaluation is carried out among two methods of measurements and deals with the data that are homoscedastic. The main objective of this study is to extend the standard mixed model to allow the error variances to depend on magnitude of measurement and evaluate agreement between multiple methods assuming the new model, taking the heteroscedasticity into account. Methods: In order to assess the agreement, there are two typical steps in method comparison studies. The first step is to model the data using the Heteroscedastic mixed effects model. The model fitting is carried out by using two main approaches, namely the mean method and the best linear unbiased predictor method. After fitting the model for the second step, the agreement evaluation is carried out using Concordance correlation coefficient and Total deviation index. Results: The illustrative example contained five methods of measurements and was with heteroscedastic measurements. First, the model fitting was carried out according to the two approaches and the resulting parameters were almost identical. After the model fitting, the agreement evaluation was performed. According to the values resulted from the agreement measurements, it is clear that all five methods agree sufficiently well with the reference method. Conclusions: The proposed model can be used to model the method comparison data with heteroscedastic measurements with multiple methods of measurements as well as balanced and unbalanced data designs. Under the proposed model, the agreement evaluation methodology for comparing multiple methods is also developed taking heteroscedasticity into account.


Introduction
In the field of clinical medicine, measurements of the human body take a major part of diagnostic, prognostic and therapeutic evaluations. Due to the rapid advancement technology, new methods and instruments are introduced into this field. These introduced instruments or methods might be more advance, cheaper and easier to use than the old standard instruments. Before these newly introduced methods or instruments put into use, the accuracy and precision of the measurements need to be verified. If these instruments agree sufficiently well with the already existing methods, they can be used interchangeably.
To get an understanding of these measurements, they must be compared against with a well-established technique. Then we need to assess the degree of agreement between these methods [1][2][3][4][5].
Experts in this field have already introduced many techniques to test the degree of agreement between two methods. But most of these techniques can be applied under certain assumptions. In order to assess the agreement between the assays, the first step is to model the data. The Linear mixed effects model is most commonly used in modeling the method comparison data [6][7][8][9][10][11][12][13][14][15]. Because of the flexibility in modeling of within subject dependence, linear mixed models are popular. In this model, normality is assumed for the error terms, but in practice sometimes the normality assumption is violated. In such cases, linear mixed models cannot be used to model the data [16].

Journal of Medical Statistics and Informatics
To overcome this problem [17] proposed a robust mixed model called "General skew-t mixed model (GSTMM)" that assumes a multivariate skew-t distribution for random effects and an independent multivariate t-distribution for the errors. But this general skew-t mixed model (GSTMM) and most of the other existing models available for method comparison studies are based on the assumption that the variability of the continuous measurement remains a constant throughout the range of the measurement. Though, this is not the case in some practical situations and the variability of the measurements might change with the magnitude, i.e., the 'Heteroscedasticity' of error terms [18].
A novel model to method comparison data with heteroscedastic error variances is proposed in [15], to evaluate the agreement between two methods of measurements measuring continuous data. However, this model cannot accommodate the comparison of multiple methods of heteroscedastic measurements. Therefore, the main objective of this study is to propose a heteroscedastic mixed effects model to analyze heteroscedastic method comparison data with more than two methods of measurements and to adapt the agreement evaluation methodology for multiple methods, taking heteroscedasticity into account.
The lung tumor size measurements data from [19] which motivated this work and are analyzed later in this article, provide a specific example of this phenomenon. The information was gathered from August 2000 to May 2001. In the dataset, out of 33 patients there are 40 lung tumors. These 40 lung tumors belong to 16 men and 17 women from the ages 43 to 78 years. All the lung tumors are larger than 1.5 cm in maximal diameter. Computed Tomography (CT) images of these 40 lung tumors are distributed among five radiologists. All five radiologists are with Thoracic Fellowship training and have more than four years of post-training experience. The five radiologists measured the lung tumors on the CT images by using a ruler or with calipers. Each of these measurements is performed independently and also each of these images was measured twice by each inspector. Here we will consider the measurements taken by the five radiologists as different methods of measurements. Therefore, the dataset contains 40 lung tumors (subjects), 5 readers (methods) and 2 replications for each measurement. In total 40*5*2=400 records.
The rest of this article is organized as follows. Section 2 presents the proposed methodology to deal with heteroscedastic method comparison data measured by multiple methods or raters. Section 3 discusses agreement evaluation under the proposed model using two techniques of model fitting and the last section discusses the results and conclusions of the study. The model fitting and the analysis was carried out by using the R statistical software.

Methodology
In this study, the methodology was divided into two main parts. The first step is to model the data set using an appropriate model and the second step is to measure the agreement among the methods of measurements. The most popular statistical modeling technique in method comparison data is the "Liner mixed effects model". This model is an extension of linear regression models [12] and contains both fixed effects and random effects. In practice the mixed effect models are important when there are repeated measurements. The standard mixed model in the matrix form can be represented as follows.
where, Y i is the vector of observed responses on the i th subject, X i , Z i are design matrices for fixed and random effects, β is the vector of fixed effects, b i is the vector of random effects, e i is the error vector and e is the vector of all unknown parameters in the model.
The main assumptions of this model are the random effects, random errors have normal distributions and they are mutually independent and the model has a constant error variance which depends only on the method.
where Ψ is a lxl positive definite matrix with diagonals Ψ 1 2 ,......., Ψ 1 2 , σ j 2 is the error variances of method j. After fitting the dataset with the mixed effects model, the residuals were analyzed to identify whether there are any model assumption violations.

Heteroscedastic Mixed Effects Model for multiple methods
This model is used when the assumptions (i.e., the random effects and the random errors are normally distributed, independent and has a constant variance) of the standard mixed effect model (homoscedastic model) are violated where error variance changes with the magnitude of the measurements. The heteroscedastic model is as follows.
Let Y ijk is the k th replicate measurement by the j th method on the i th subject, where k=1,…,n ij , j=1,…, and i=1,…, m.
Here n ij is the number of measurement from the j th method on the i th subject, β j is the fixed mean of the j th method, b ij is the random effect on the i th subject on the j th method, e ijk is the within subject random error, ν is the variance covariate and the ν i is values for the i th subject, z(ρ ij ) is the Fisher's z-trans- 2  2  2  2  2  1  1  2  2 , , , , , , , , , doi: 10.7243/2053-7662-5-1 formation of the correlation coefficient between methods i and lastly δ is the heteroscedastic parameter, when δ=0 this assumes homoscedasticity, μ ij =β j +b ij denotes the conditional mean of E (Y ijk |b i ) and ν i =h (μ i ) denotes that the ν i is a func- Variance covariate is a function of magnitude of measurement μ i that will be used to model the error variances. A model for conditional error variance is as follows, where g j is the variance function and δ j is the heteroscedasticity parameter of the method j. According to [12] there exists some common variance functions such as, power model: The heteroscedastic model in the matrix form is as follows.
Due to the scarcity of closed form for the likelihood functions, the exact modeling approaches will be troublesome. Therefore, the model fitting is carried out by two model approximations. First using the mean vector of the reference method as the covariate and the second is using the Best Linear Unbiased Predictor (BLUP) as the covariate. In this case the likelihood functions will be possible in a closed form. Score test and likelihood ratio tests have been carried out to assess the validity of these models. After confirming the models validity, the agreement evaluation is done by using Concordance correlation coefficient (CCC) and Total deviation index (TDI).

Model fitting
The model fitting for this heteroscedastic mixed effects models is carried out by selecting an appropriate value for the variance covariate. In this study, there are two main options for the variance covariate, namely mean of the reference method and the BLUP as the variance covariate. The power model function g(ν,δ)=| ν | δ was selected to fit the model [12].

Using observable mean measurement as variance covariate
As the first approach, we select the mean vector of the reference method as the variance covariate. In this case μ i *, an observable quantity is fixed and can fit the model by maximum likelihood method. Here the μ i * is expected to be close to μ i and μ i =Y i .

Using BLUP as the covariate
As the second approach, we use the best linear unbiased predictors (BLUP) as the variance covariate and fit the data using the heteroscedastic model. Random effects and error terms are independent in this heteroscedastic model and according to the [12] the BLUP can be written as, In order to calculate the μ i,blup any statistical software can be used. ν i * depends on the unknown parameter θ. The method of calculating the μ i,blup is an iteratively reweighted method. This method has two steps and it is continued until it converges [12]. More details of this can be found on [15]. When considering the two approaches, the method of using the true mean as the covariate is much simpler than the method of using BLUP as the covariate, because of the complexity of calculating the BLUP than calculating the mean. However, method of using BLUP is more accurate than the other [14].

Agreement evaluation under the proposed Methodology
In health science researches, agreement evaluation is a topic which has considerable interest. This is assessing the agreement between two or more methods measuring the same response [20,21]. In this Section we discuss two measurements that are used to assess the agreement in this study [22,23].

Concordance correlation coefficient (CCC)
Concordance correlation coefficient (CCC) is one of the most common measurements used in order to assess the agreement between methods of measurements. This was introduced by [22]. CCC value ranges between -1 to 1. The higher the values it gives a better agreement. Concordance correlation coefficient under the proposed heteroscedastic mixed model is follows for the lung tumor size measurements, This represents the concordance correlation coefficient between the reader1 and the reader j. For greater accuracy of the measurement CCC was first calculated with Fisher's ztransformation and then converted into the CCC [24].

Total Deviation Index (TDI)
Total deviation index (TDI) is another common measurement of evaluating agreement between two methods of measurements. TDI is the π 0 th percentile of absolute value of the differences between the methods (μ d ), for a given large probability π 0 . TDI always takes a positive value and the smaller value for TDI indicates good agreement. Under the proposed heteroscedastic model, Total deviation index (TDI) between the reader1 and the reader j is defined as follows,

Results
The initial model fitting was carried out by using the standard mixed effects model. To check the normality of the residuals for each method, the quantile-quintile plot and the Shapiro-Wilk normality test was used. Table 1 represents the results of the Shapiro-Wilk normality test, which was applied to the residuals of each reader of the mixed effect model. According to the results, it is obvious that the residuals do not follow a normal distribution as all p-values are small. Therefore, the main assumption for the standard mixed effects model (i.e., the residuals have a constant variance) is violated. Moreover, Figure 1 shows separate quantile-quantile plot for each reader when standard mixed effects is fit to the data. The circles cross the line three times indicates that the hump is not the right shape for these data to be normal. These data are therefore not exactly normal. Hence, we model the error variability using two approximations for μ i , namely mean of the reference method and the BLUP of μ i . We consider reader1 as the reference method. As the first approach, let's take the mean vector of the reader1 as the variance covariate and fit the data into a heteroscedastic model.
In the dataset there are 40 subjects and each subject has 2 replicates. Since there are 5 readers in the dataset, we need to compare 5 different methods of measurements. Therefore, as for the model stated in the methodology, i=1,2,…,40; k=1,2; j=1,2,3,4,5.
The Table 2 represents the estimates and standard errors of the fitted heteroscedastic model parameters, which was fitted by using mean of the reference method and BLUP as the variance covariate. Considering the results from the Table 2, both approaches take approximately equal values when compared with the other methods. Estimator of the reader3 has the largest men value, and the smallest estimator is from reader2. The Likelihood ratio test has been carried out to identify the most appropriate model among the standard mixed effects model and the proposed heteroscedastic model. The test rejected the null hypothesis, which implies that the reduced model is more appropriate, confirming the heteroscedastic model is the best among these two models. Furthermore, the score test is also performed on these models. The result was the same as before confirming that the heteroscedastic model is more suitable for the given dataset. Since the fitted heteroscedastic model is appropriate the next step is to assess the agreement between the methods of measurements.

Agreement evaluation under the Heteroscedastic model
The Figure 2 represents the estimates and their confidence intervals of CCC and TDI using the mean method. The solid agreement, while with the increase of the magnitude of the lung tumor diameter, the agreement decreases. Therefore, it is safe to conclude that the reader2 sufficiently agrees with the reference method (reader1). Likewise, the readings from the readers 3, 4 and 5 derived almost similar results as the reader2 by confirming that all the methods (readers) agree well with the reference method. i.e., reader1.
The Figure 3 represents the CCC and TDI values for the read-er1 and reader2, using the BLUP based method. The Figure 2 and the Figure 3 are almost identical. Therefore, reader2 is well agreed with the reader1.

Discussion
The proposed heteroscedastic mixed effects model was fitted by using two methods. One was by using true mean as the variance covariate and the other method is by using the Best Linear Unbiased Predictor (BLUP) as the variance covariate. When fitting the model by using both methods, it was confirmed that the heteroscedastic model is better than the standard mixed effect model. The results from the likelihood ratio test and the score test confirms that conclusion. When considering the means from the Table 2, we can observe that the means from both the approaches are identical if round up to 2 decimal places. This means that the mean method and the BLUP method generate almost identical estimates. CCC and TDI were used in evaluating the agreement between the reference method (reader1) and the other 4 methods (readers). Both mean based method and the BLUP based method produced almost identical (with slight differences) results for CCC and for TDI. Figures 2 and 3 both show that reader2 agrees well with reader1. Not only the reader2, but also reader3, reader4 and reader5 are also agreed well with the reader1. Moreover, Figures 2 and 3 show that the CCC and TDI values varies with the magnitude. If the data set was modeled using a homoscedastic model, constant values for CCC (0.8069) and TDI (1.6140) can be obtained. Therefore, if a dataset with heteroscedasticity is modeled with standard mixed effects model, the resulting outcome will be misleading and inaccurate. To overcome this issue, the data must be modeled with a model that accounts the heteroscedasticity.

Conclusions
We propose a method successfully in order to fit 5 methods of heteroscedastic clinical measurements and this proposed model can be easily extended to deal with any number of multiple methods which has replicated data. Our approach can accommodate balanced or unbalanced data designs and it works well with any scalar measure of agreement. Moreover, the two model fitting methods discussed give almost identical estimates to the model. The limitation of this study is that the proposed method can be applied only with replicated measurements.