Status of Weighted Agreement Statistics Between Two Raters in Ordinal Data Affected by Sample Size and Number of Categories



Authors

S. Erdoğan & D. H. Sucu

DOI:

https://doi.org/10.5281/zenodo.8239340

Keywords:

Brennan-Prediger, Gwet’s AC2, Krippendorff’s Alpha, Linear weighted, Quadratic weighted, Spearman correlation coefficient

Abstract

The aim of this study is to introduce the weighted inter-rater agreement statistics used with ordinal scales, to compare these statistics with the Spearman correlation coefficient, and to show how they are affected by sample size and number of categories. For this purpose, data were generated for different sample sizes and numbers of categories under scenarios in which the two raters had no relationship or a low, medium, or high relationship, and the weighted agreement statistics were calculated. Cohen’s kappa, Scott’s π, and Krippendorff’s alpha coefficients give results similar to the correlation coefficient, and the Brennan-Prediger (B-P) agreement statistic likewise takes values very close to it. Gwet’s AC2 statistic, however, especially when the number of categories is three, departs from the correlation coefficient when there is no or only a low correlation between the raters, so that a moderate level of agreement between the raters can be reported purely by chance. When investigating agreement between two raters on ordinal scales, caution is therefore recommended when the number of categories is three and Gwet’s AC2 statistic is used. In other cases, the concepts of agreement and relationship can be used interchangeably with confidence.
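To make the comparison described above concrete, the sketch below shows how the weighted agreement coefficients involved can be computed for two raters and set against the Spearman correlation. It is a minimal illustration, not the authors' simulation code: it follows the standard formulas for the weighted Cohen's kappa, the weighted Brennan-Prediger coefficient, and Gwet's AC2 as given in Gwet (2014), and the function names, the choice of linear or quadratic weights, and the simulated "no relationship" example are illustrative assumptions.

```python
# Minimal sketch, not the authors' simulation code: weighted agreement
# statistics for two raters on a k-category ordinal scale, following the
# standard formulas in Gwet (2014).
import numpy as np
from scipy.stats import spearmanr


def weight_matrix(k, kind="linear"):
    """Agreement weights w_ij in [0, 1]; 1 on the diagonal, smaller off it."""
    i, j = np.indices((k, k))
    d = np.abs(i - j) / (k - 1)
    return 1 - d if kind == "linear" else 1 - d ** 2  # "quadratic" otherwise


def weighted_agreement(r1, r2, k, kind="linear"):
    """Weighted Cohen's kappa, Brennan-Prediger coefficient and Gwet's AC2."""
    r1, r2 = np.asarray(r1), np.asarray(r2)      # ratings coded 0 .. k-1
    n = len(r1)
    p = np.zeros((k, k))                         # joint rating proportions
    np.add.at(p, (r1, r2), 1.0 / n)
    w = weight_matrix(k, kind)
    row, col = p.sum(axis=1), p.sum(axis=0)      # marginal proportions
    po = (w * p).sum()                           # observed weighted agreement
    tw = w.sum()                                 # sum of all weights

    # Cohen's weighted kappa: chance agreement from the product of the margins.
    pe_k = (w * np.outer(row, col)).sum()
    kappa_w = (po - pe_k) / (1 - pe_k)

    # Weighted Brennan-Prediger: uniform chance agreement over the k x k table.
    pe_bp = tw / k ** 2
    bp_w = (po - pe_bp) / (1 - pe_bp)

    # Gwet's AC2: chance agreement built from the average margins pi_q.
    pi = (row + col) / 2
    pe_ac2 = tw / (k * (k - 1)) * (pi * (1 - pi)).sum()
    ac2 = (po - pe_ac2) / (1 - pe_ac2)

    return {"kappa_w": kappa_w, "BP_w": bp_w, "AC2": ac2}


# Illustrative scenario: k = 3 categories, ratings drawn independently,
# i.e. no true relationship between the raters.
rng = np.random.default_rng(1)
r1 = rng.integers(0, 3, size=200)
r2 = rng.integers(0, 3, size=200)
print(weighted_agreement(r1, r2, k=3, kind="quadratic"))
rho, _ = spearmanr(r1, r2)
print("Spearman rho:", rho)
```

Repeating runs of this kind over different sample sizes, numbers of categories, and strengths of relationship between the raters gives the sort of comparison between the agreement coefficients and the Spearman correlation that the study reports.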

References

Barnhart H.X., Haber M.J. & Lin L.I. (2007). An overview on assessing agreement with continuous measurements. Journal of Biopharmaceutical Statistics 17(4):529-569. DOI: https://doi.org/10.1080/10543400701376480.

Bland J.M. & Altman D.G. (2010). Statistical methods for assessing agreement between two methods of clinical measurement. International Journal of Nursing Studies 47(8):931-936. DOI: https://doi.org/10.1016/j.ijnurstu.2009.10.001.

Brennan R.L. & Prediger D.J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement 41(3):687-699. DOI: https://doi.org/10.1177/001316448104100307.

Cicchetti D. & Allison T. (1971). A new procedure for assessing reliability of scoring EEG sleep recordings. The American Journal of EEG Technology 11(3):101-109. DOI: https://doi.org/10.1080/00029238.1971.11080840.

Cohen J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70(4):213-220. DOI: https://doi.org/10.1037/h0026256.

Fleiss J.L. & Cohen J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33(3):613–619. DOI: https://doi.org/10.1177/001316447303300309

Gwet K.L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. 4th edition, pp. 27-127, 185-302. Advanced Analytics, LLC, Gaithersburg, USA.

Gwet K.L. (2015). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement 76(4):609-637. DOI: https://doi.org/10.1177/0013164415596420.

Haber M. & Barnhart H.X. (2008). A general approach to evaluating agreement between two observers or methods of measurement. Statistical Methods in Medical Research 17(2):151-169. DOI: https://doi.org/10.1177/0962280206075527.

Haber M., Barnhart H.X., Song J. & Gruden J. (2005). Observer variability: A new approach in evaluating interobserver agreement. Journal of Data Science 3(1):69-83.

Kanık E.A., Erdogan S. & Temel G.O. (2012). Agreement statistics impacts of prevalence between the two clinicians in binary diagnostic tests. Annals of Medical Research 19(3):153-158. DOI: https://doi.org/10.7247/jiumf.19.3.5.

Kanık E.A., Orekici Temel G. & Ersöz Kaya I. (2010). Effect of sample size, the number of raters and the category levels of diagnostic test on Krippendorff alpha and the Fleiss kappa statistics for calculating inter rater agreement: A simulation study. Türkiye Klinikleri Journal of Biostatistics 2(2):74-81. DOI: https://doi.org/10.7247/jtomc.19.4.4.

Krippendorff K. (2004). Measuring the reliability of qualitative text analysis data. Quality and Quantity 38(6):787-800. DOI: https://doi.org/10.1007/s11135-004-8107-7.

Lin L. (2008). Overview of agreement statistics for medical devices. Journal of Biopharmaceutical Statistics 18(1):126-144. DOI: https://doi.org/10.1080/10543400701668290.

Lin L., Hedayat A.S. & Wu W. (2007). A unified approach for assessing agreement for continuous and categorical data. Journal of Biopharmaceutical Statistics 17(4):629-652.

Lin L., Hedayat A.S. & Wu W. (2012). Statistical tools for measuring agreement. 1st edition, pp. 1-109. Springer, New York.

Liu J., Tang W., Chen G., Lu Y., Feng C. & Tu X.M. (2016). Correlation and agreement: overview and clarification of competing concepts and measures. Shanghai Archives of Psychiatry 28(2):115-120. DOI: https://doi.org/10.11919/j.issn.1002-0829.216045.

Moradzadeh N., Ganjali M. & Baghfalaki T. (2017). Weighted kappa as a function of unweighted kappas. Communications in Statistics-Simulation and Computation 46(5):3769-3780. DOI: https://doi.org/10.1080/03610918.2015.1105975.

Nelson K.P. & Edwards D. (2018). A measure of association for ordered categorical data in population-based studies. Statistical Methods in Medical Research 27(3):812-831. DOI: https://doi.org/10.1177/0962280216643347.

de Raadt A., Warrens M.J., Bosker R. & Kiers H.A.L. (2021). A comparison of reliability coefficients for ordinal rating scales. Journal of Classification: 1-25. DOI: https://doi.org/10.1007/s00357-021-09386-5.

van Stralen K.J., Dekker F.W., Zoccali C. & Jager K.J. (2012). Measuring agreement, more complicated than it seems. Nephron Clinical Practice 120(3):c162-c167. DOI: https://doi.org/10.1159/000337798.

Tran D., Dolgun A. & Demirhan H. (2020). Weighted inter-rater agreement measures for ordinal outcomes. Communications in Statistics-Simulation and Computation 49(4):989-1003. DOI: https://doi.org/10.1080/03610918.2018.1490428.

Tran Q.D., Dolgun A. & Demirhan H. (2021). The impact of gray zones on the accuracy of agreement measures for ordinal tables. BMC Medical Research Methodology 21(1):1-9. DOI: https://doi.org/10.1186/s12874-021-01248-3.

Vanbelle S. (2016). A new interpretation of the weighted kappa coefficients. Psychometrika 81(2):399-410. DOI: https://doi.org/10.1007/s11336-014-9439-4.

Vanbelle S. & Albert A. (2009). A note on the linearly weighted kappa coefficient for ordinal scales. Statistical Methodology 6(2):157-163. DOI: https://doi.org/10.1016/j.stamet.2008.06.001.

Warrens M.J. (2012). Some paradoxical results for the quadratically weighted kappa. Psychometrika 77(2):315-323. DOI: https://doi.org/10.1007/s11336-012-9258-4.


Published

2023-07-25

How to Cite

Erdoğan, S., & Sucu, D. H. (2023). Status of Weighted Agreement Statistics Between Two Raters in Ordinal Data Affected by Sample Size and Number of Categories. Euroasia Journal of Mathematics, Engineering, Natural & Medical Sciences, 10(28), 219–231. https://doi.org/10.5281/zenodo.8239340
