Modification in inter-rater agreement statistics – a new approach

Assessing agreement between examiners, measurements and instruments is always of interest to health-care providers, as the treatment of patients depends heavily on medical reports. Several agreement statistics have been developed to date, and all of them have certain limitations. In 2002, Kilem Gwet introduced a more robust and less biased agreement statistic named "Gwet's AC1". Various researchers have shown that the AC1 statistic has the best statistical properties amongst the available agreement statistics. Although it has been reported to be the better estimator, several inconsistencies remain in this agreement statistic. In this paper, the author aims to develop a new formula that overcomes these inconsistencies and dependencies of inter-rater agreement statistics.


Introduction
In health sciences, health-care providers are always concerned about the accuracy of measurements and of multiple outcomes. Precise results are essential in the medical sciences because the lives of patients depend on them. Assessing the strength of agreement between instruments, examiners, or a combination of the two is therefore important for attaining consistent and unerring results [1]. Scoring techniques are widely used in medical science to detect certain deformities, and developing accurate scoring systems requires substantial agreement between the raters. Several methods have been developed to assess inter-rater agreement, such as Yule's Y [2], Bennett, Alpert & Goldstein's S (1954) [3], Scott's π (1955) [4], Cohen's Kappa (1960) [5], Fleiss' Kappa (1971) [6], Krippendorff's Alpha (1980), Bangdiwala's B statistic (1985) [7] and Van Eerdewegh's V [8]. Amongst all of these, Cohen's Kappa is the most popular and most commonly used statistic, even though numerous inconsistencies in the Kappa statistic have been pointed out. The interpretation of the Kappa statistic becomes cumbersome because it is affected by the marginal probabilities, trait prevalence and skewness. The Kappa statistic also produces poor results when the sample size is small.
Byrt T in 1993 highlighted two main dilemmas of the Kappa statistic: the effect of the presence of bias between the raters and the effect of the distribution of the data across the categories (prevalence). Byrt T proposed new indices, namely the "bias index", the "prevalence index" and the "prevalence-adjusted and bias-adjusted Kappa (PABAK)", and suggested using them instead of the Kappa statistic. The author further suggested reporting the bias and prevalence indices along with the Kappa statistic, as reporting the Kappa statistic alone can be misleading [9]. More recently, in 2002, Kilem Gwet introduced a new formula for chance agreement, called the resulting agreement statistic "AC1", and suggested its use [10]. Many researchers have tested its validity and shown it to be robust: less affected by trait prevalence, not sensitive to marginal homogeneity and skewness, and providing better results than the Kappa statistic [11,12]. However, the interpretation and range of both the PABAK and AC1 statistics are the same as those of the Kappa statistic. V Shankar (2014) has shown that the B-statistic is a better agreement statistic than Kappa, delta, Aickin's alpha and AC1 [13].
Kilem Gwet [10] proposed a new formula for the chance agreement that, for 2x2 cross-tables, depends solely upon the values of category 1. It takes into account neither category 2 nor the disagreement between the two raters.
In this paper, the author shows that AC1 and the other inter-rater agreement statistics are significantly affected by changes in the individual cell probabilities, by symmetry, by marginal homogeneity and by trait prevalence. In addition, the AC1 statistic reports no agreement when the observed agreement between the raters is 50%. To overcome these issues, the author introduces a new, simple and easy-to-use formula for chance agreement that takes both categories into account and adjusts for discordant values. The chance agreement and the agreement statistic are named the "minimum expected chance agreement" and the "SI statistic", respectively. Moreover, it is shown that the SI statistic is not affected by trait prevalence and is robust to changes in cell values for a fixed observed agreement when the matrix is symmetric and marginally homogeneous.

SI-Statistics
Let e_v denote the minimum expected chance agreement and p_0 the observed agreement between the two raters. The SI statistic is computed from p_0 and e_v and ranges from 0 to 1.
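As a point of reference for the comparisons made throughout this paper, the minimal sketch below shows how the observed agreement, Cohen's Kappa and Gwet's AC1 are computed from a 2x2 table. The function name agreement_stats and the example counts are purely illustrative, and the sketch does not implement the SI formula itself.

```python
import numpy as np

def agreement_stats(table):
    """Observed agreement, Cohen's Kappa and Gwet's AC1 for a 2x2 count table.
    table[i][j] = number of subjects placed in category i by rater A and j by rater B."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                      # cell probabilities
    po = np.trace(p)                     # observed agreement
    row, col = p.sum(axis=1), p.sum(axis=0)

    pe_kappa = np.sum(row * col)         # Kappa's chance agreement
    kappa = (po - pe_kappa) / (1 - pe_kappa)

    pi1 = (row[0] + col[0]) / 2          # mean marginal proportion of category 1
    pe_ac1 = 2 * pi1 * (1 - pi1)         # Gwet's chance agreement (two categories)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return po, kappa, ac1

# Illustrative 2x2 table: 19 of 24 subjects rated identically by both raters
print(agreement_stats([[9, 2], [3, 10]]))
```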

Sensitivity to change in individual cell probabilities
To assess the sensitivity of the SI statistic and of the other inter-rater agreement statistics to changes in cell values (cell probabilities), we simulated data with a sample size of 24, where the observed agreement was fixed at 0.58 and the matrix was kept symmetric and marginally homogeneous. Table 1 shows the simulated data and Table 2 compares the SI statistic with Kappa, AC1 and the other agreement statistics. The results showed that the SI statistic is unaffected by changes in the individual cell probabilities for a fixed observed agreement, whereas the other agreement statistics are inconsistent and vary considerably, except PABAK and the S-statistic. Moreover, the SI statistic provides a reasonable estimate of the actual (observed) agreement between the two raters.
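A minimal sketch of this experiment is given below, assuming the setup described above: symmetric, marginally homogeneous 2x2 tables on 24 subjects with 14 agreements (observed agreement of about 0.58) and off-diagonal counts fixed at 5. The helper name kappa_ac1 and the loop bounds are illustrative; Kappa and AC1 vary across the tables even though the observed agreement stays fixed.

```python
import numpy as np

def kappa_ac1(p):
    """Cohen's Kappa and Gwet's AC1 from a 2x2 table of cell probabilities p."""
    po = np.trace(p)
    row, col = p.sum(axis=1), p.sum(axis=0)
    pe_kappa = np.sum(row * col)
    pi1 = (row[0] + col[0]) / 2
    pe_ac1 = 2 * pi1 * (1 - pi1)
    return (po - pe_kappa) / (1 - pe_kappa), (po - pe_ac1) / (1 - pe_ac1)

# Symmetric, marginally homogeneous 2x2 tables on n = 24 subjects:
# off-diagonal cells fixed at b = c = 5, diagonal cells a + d = 14,
# so the observed agreement is 14/24 = 0.583 in every table.
n, diag, b = 24, 14, 5
for a in range(diag + 1):
    d = diag - a
    p = np.array([[a, b], [b, d]]) / n
    kappa, ac1 = kappa_ac1(p)
    print(f"a={a:2d} d={d:2d}  kappa={kappa:+.3f}  AC1={ac1:+.3f}")
```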

Sensitivity to trait prevalence
Trait prevalence is defined as the likelihood that a randomly selected subject proves to be positive [10]. To assess the effect of trait prevalence, the author fixed the sensitivity and specificity of each rater and allowed the trait prevalence to vary from 0 to 1, as suggested by Gwet K [10]. Data were simulated using the equations given below [10].
Here P_r is the trait prevalence, P_A+ and P_B+ are the probabilities that rater A and rater B, respectively, categorize a subject as positive, α_A and α_B are the sensitivities of rater A and rater B (the probabilities of correctly classifying a subject as positive), and β_A and β_B are the corresponding specificities (the probabilities of correctly classifying a subject as negative).
The expected marginal totals for classifying a subject as positive are
P_A+ = P_r α_A + (1 − P_r)(1 − β_A),   P_B+ = P_r α_B + (1 − P_r)(1 − β_B).
Assuming that the two raters classify subjects independently given the true status, the individual cell probabilities can be computed as
p_11 = P_r α_A α_B + (1 − P_r)(1 − β_A)(1 − β_B)
p_12 = P_r α_A (1 − α_B) + (1 − P_r)(1 − β_A) β_B
p_21 = P_r (1 − α_A) α_B + (1 − P_r) β_A (1 − β_B)
p_22 = P_r (1 − α_A)(1 − α_B) + (1 − P_r) β_A β_B.
In Tables 3a and 3b, the raters had a common sensitivity and specificity of 0.9 and a fixed observed agreement. The results showed that the SI statistic is unaffected by trait prevalence, whereas the other agreement statistics gave varying results, except PABAK and the S-statistic.
Table I. Interpretation of agreement statistics [14].
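The sketch below illustrates the prevalence simulation described above, under the stated assumption that the raters classify subjects independently given the true status; the function name simulate_table and the prevalence grid are illustrative. With a common sensitivity and specificity of 0.9 for both raters, the observed agreement stays constant while, for example, Kappa changes with the trait prevalence.

```python
import numpy as np

def simulate_table(prev, sens_a, spec_a, sens_b, spec_b):
    """Expected 2x2 cell probabilities for two raters, assuming they classify
    subjects independently given the true (latent) positive/negative status."""
    pos, neg = prev, 1 - prev
    return np.array([
        [pos*sens_a*sens_b         + neg*(1-spec_a)*(1-spec_b),   # A+, B+
         pos*sens_a*(1-sens_b)     + neg*(1-spec_a)*spec_b],      # A+, B-
        [pos*(1-sens_a)*sens_b     + neg*spec_a*(1-spec_b),       # A-, B+
         pos*(1-sens_a)*(1-sens_b) + neg*spec_a*spec_b],          # A-, B-
    ])

# Common sensitivity and specificity of 0.9 for both raters (as in Tables 3a/3b):
# the observed agreement is constant while Kappa varies with the prevalence.
for prev in np.arange(0.1, 1.0, 0.1):
    p = simulate_table(prev, 0.9, 0.9, 0.9, 0.9)
    po = np.trace(p)
    row, col = p.sum(axis=1), p.sum(axis=0)
    kappa = (po - np.sum(row * col)) / (1 - np.sum(row * col))
    print(f"prevalence={prev:.1f}  observed agreement={po:.3f}  kappa={kappa:.3f}")
```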

[Generic 2x2 table layout: Rater A (rows) against Rater B (columns), categories Positive (+) and Negative (−), with marginal totals.]
A negative coefficient reflects weaker agreement than expected by chance, i.e., discrepancy. In general, small negative values (0 to -0.10) can be interpreted as "no agreement", while a large negative coefficient represents considerable discrepancy among the raters [15]. In Tables 4a and 4b, the raters had a common sensitivity of 0.8 and a common specificity of 0.9. The results showed that the SI statistic provided more efficient results than the other statistics. Additionally, when the observed agreement followed a linearly decreasing trend, the SI statistic also decreased linearly, whereas Kappa, AC1 and the other statistics showed random variation, except PABAK and the S-statistic.
In Tables 5a and 5b, the raters had different sensitivities and specificities: rater A had a sensitivity of 0.8 and a specificity of 0.9, whereas rater B had a sensitivity of 0.85 and a specificity of 0.7. The SI statistic was found to be the most robust estimator amongst the statistics considered.
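A short usage example of this scenario, reusing the simulate_table sketch above with the rater characteristics just described (the prevalence grid is illustrative):

```python
import numpy as np

# Uses simulate_table() from the earlier sketch.
# Rater A: sensitivity 0.8, specificity 0.9; rater B: sensitivity 0.85, specificity 0.7.
for prev in (0.1, 0.3, 0.5, 0.7, 0.9):
    p = simulate_table(prev, 0.8, 0.9, 0.85, 0.7)
    po = np.trace(p)
    row, col = p.sum(axis=1), p.sum(axis=0)
    kappa = (po - np.sum(row * col)) / (1 - np.sum(row * col))
    print(f"prevalence={prev:.1f}  observed agreement={po:.3f}  kappa={kappa:.3f}")
```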

Sensitivity to equal cell distribution or 50% observed agreement
From Tables 5a and 5b it follows that when the cell distribution is equal and the observed agreement is fixed at 0.5, AC1, PABAK, the S-statistic and Scott's π report no agreement between the two raters. For equal observed agreement and disagreement (condition 1), all the agreement statistics estimated no agreement, whereas the SI statistic reported fair agreement. Similarly, for conditions 2 to 7, where the observed agreement was 0.5 but the observed disagreement varied, the SI statistic reported moderate agreement (0.5), Kappa estimated slight agreement (0.2), Van Eerdewegh's V showed moderate agreement (0.58), and Yule's Y estimated perfect agreement between the two raters.
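The equal-cell corner case can be checked directly. The short sketch below (the cell values are illustrative of condition 1) shows why the chance-corrected coefficients whose expected agreement equals 0.5 all return exactly zero here.

```python
import numpy as np

# All four cells equal: observed agreement = 0.5 and both marginals are (0.5, 0.5),
# so every coefficient with chance agreement 0.5 (Kappa, Scott's pi, AC1, S/PABAK) is 0.
p = np.full((2, 2), 0.25)
po = np.trace(p)                                   # observed agreement = 0.5
row, col = p.sum(axis=1), p.sum(axis=0)
kappa = (po - np.sum(row * col)) / (1 - np.sum(row * col))
pi1 = (row[0] + col[0]) / 2
ac1 = (po - 2 * pi1 * (1 - pi1)) / (1 - 2 * pi1 * (1 - pi1))
s_pabak = (po - 0.5) / (1 - 0.5)                   # S statistic / PABAK, two categories
print(po, kappa, ac1, s_pabak)                     # 0.5 0.0 0.0 0.0
```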

For nxn contingency tables
Similar experiments were run for 3x3 (Appendix A) and 4x4 (Appendix B) contingency tables; the results showed that the SI statistic again provided better results, and additional experiments likewise showed that it provided better estimates (Appendix C).
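For completeness, a minimal sketch of how Kappa and AC1 extend to k x k tables is given below; the function name kappa_ac1_k and the example 3x3 counts are illustrative, and the SI formula itself is not implemented here.

```python
import numpy as np

def kappa_ac1_k(table):
    """Cohen's Kappa and Gwet's AC1 for a k x k contingency table of counts."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    k = p.shape[0]
    po = np.trace(p)
    row, col = p.sum(axis=1), p.sum(axis=0)
    pe_kappa = np.sum(row * col)
    pi = (row + col) / 2
    pe_ac1 = np.sum(pi * (1 - pi)) / (k - 1)        # Gwet's multi-category chance agreement
    return (po - pe_kappa) / (1 - pe_kappa), (po - pe_ac1) / (1 - pe_ac1)

# Illustrative 3x3 table of counts
print(kappa_ac1_k([[10, 2, 1], [3, 8, 2], [1, 2, 7]]))
```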

Concluding remarks
In the literature, Gwet's AC1 statistic has been reported to be the most robust and least biased agreement statistic. In this article, it has been shown that the AC1 statistic is sensitive to:
1. changes in the individual cell probabilities for a fixed observed agreement;
2. trait prevalence;
3. equal cell distribution; and
4. a 50% observed agreement, for which it reports zero agreement (no agreement) between the raters.
In contrast, the SI statistic has been shown to be more stable and to provide better results than the other agreement statistics, and it can handle missing values. Furthermore, the SI statistic ranges from 0 to 1 only; it does not take negative values to indicate disagreement between the raters, so it reports only the degree of agreement.