Agreement statistics designed for nominal data generally handle missing data poorly, so Fleiss' kappa is not really a viable option here. Fleiss and Cohen (1973; reference below) demonstrated the equivalence of weighted kappa and the intraclass correlation coefficient (ICC), so I'd suggest the ICC instead. Your measures appear to be at least ordinal, if not interval, in nature, which justifies using the ICC rather than a weighted kappa. You'll need to decide which model to use based on the following list.

Model 1: Raters are a random sample from a specified population of raters, and each rater does not rate all subjects/objects. Therefore, each subject/object is rated by a potentially different set of raters.

Model 2: Raters are a random sample from a specified population of raters, and each rater rates each subject/object.

Model 3: Raters constitute the entire population of raters, and each rater rates each subject/object.
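
For illustration, here is a minimal Python sketch of the Model 2 case, computed from the two-way ANOVA mean squares. It assumes the single-rater, absolute-agreement form of the coefficient and a complete subjects-by-raters table with no missing ratings. The function name `icc_2_1` and the example scores are hypothetical; the sketch is meant to show the calculation, not to replace the SPSS output.

```python
def icc_2_1(ratings):
    """Two-way random effects, absolute agreement, single-rater ICC.

    ratings: list of rows, one row per subject and one column per rater.
    Assumes every subject was scored by every rater (no missing values).
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects (rows), raters (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-subjects mean square
    msc = ss_cols / (k - 1)                 # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores on a 0-4 scale: 5 subjects rated by 3 raters.
example = [
    [0, 1, 0],
    [2, 2, 3],
    [4, 3, 4],
    [1, 1, 2],
    [3, 4, 4],
]
print(round(icc_2_1(example), 3))
```

Model 1 and Model 3 use different denominators, so the same mean squares would give different coefficients under those models.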

Fleiss, J.L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613-619.

Hello,

I performed a study with 32 raters who rated severity (0-4 normal, mild, moderate severity) of several visual perceptual parameters for 4 different videos. All raters rated all parameters for all videos after being given clinical information about the patient. Raters received incorrect clinical information for 3 of 4 videos and correct info for 1 video.

I am trying to answer the following question: "Is there a statistically significant difference in rater reliability when video stimuli are paired with matched versus mismatched clinical vignettes?" My plan was to make each combination of video and clinical vignette a unique variable or a different “treatment”. A permutation test procedure will be used to assess the statistical significance of the difference between these two, Km-Kmm. This analysis determines the distribution of this difference under the null hypothesis (agreement is the same for matched and mismatched scenarios) by considering all possible reassignments (permutations) of the labels “matched” and “mismatched” to the observed data (Mielke, 2007).
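
The relabeling procedure described above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: each row is one rated item scored by every rater, each row carries a matched/mismatched flag, and the agreement statistic is a simple stand-in (the mean proportion of rater pairs giving identical scores) rather than the kappa statistic mentioned above. The names `pairwise_agreement` and `permutation_test` and the example data are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(rows):
    """Stand-in agreement statistic, not kappa: for each row, the share of
    rater pairs with identical scores, averaged over rows."""
    per_row = []
    for row in rows:
        pairs = list(combinations(row, 2))
        per_row.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_row) / len(per_row)


def permutation_test(rows, is_matched, agreement=pairwise_agreement):
    """Exact two-sided permutation test for
    agreement(matched) - agreement(mismatched).

    Requires at least one matched and one mismatched row.
    """
    n = len(rows)
    n_matched = sum(is_matched)

    def diff(matched_idx):
        chosen = set(matched_idx)
        matched = [rows[i] for i in range(n) if i in chosen]
        mismatched = [rows[i] for i in range(n) if i not in chosen]
        return agreement(matched) - agreement(mismatched)

    observed = diff([i for i, flag in enumerate(is_matched) if flag])
    # Enumerate every way of assigning the "matched" label to n_matched rows.
    null = [diff(idx) for idx in combinations(range(n), n_matched)]
    # Share of relabelings at least as extreme as the observed difference
    # (small tolerance guards against floating-point ties).
    extreme = sum(abs(d) >= abs(observed) - 1e-12 for d in null)
    return observed, extreme / len(null)


# Hypothetical data: 4 items rated by 3 raters; only the first is "matched".
rows = [[1, 1, 1], [0, 1, 2], [2, 0, 1], [3, 1, 0]]
is_matched = [True, False, False, False]
observed, p_value = permutation_test(rows, is_matched)
print(observed, p_value)
```

The smallest p-value an exact test can return is 1 divided by the number of distinct relabelings, so the number of labelled units limits what the test can detect. Full enumeration also grows quickly with the number of rows, in which case a random sample of relabelings is a common substitute.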

I do not know how to test this hypothesis in SPSS (version 24) on my Mac, and I am also getting the error message "There are too few complete cases" with the following syntax when I try to examine the inter-rater reliability of the entire group of raters.

STATS FLEISS KAPPA VARIABLES=Rater11 Rater12 Rater13 Rater14 Rater15 Rater16 Rater17 Rater18
  Rater21 Rater22 Rater23 Rater24 Rater25 Rater26 Rater27 Rater28
  Rater31 Rater32 Rater33 Rater34 Rater35 Rater36 Rater37 Rater38
  Rater41 Rater42 Rater43 Rater44 Rater45 Rater46 Rater47 Rater48
  /OPTIONS CILEVEL=95.
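
One way to see how an error like this can arise is to count how many cases have a score in every rater column. The Python sketch below assumes, as the wording of the message suggests but the source does not confirm, that a case is used only when none of the listed rater variables is missing. The function name `count_complete_cases` and the data are hypothetical.

```python
def count_complete_cases(rows, missing=None):
    """Count rows (cases) in which every rater column holds a score."""
    return sum(all(value is not missing for value in row) for row in rows)


# Hypothetical data: 3 cases rated by 4 raters; None marks a missing rating.
ratings = [
    [2, 3, 2, None],
    [1, 1, None, 1],
    [0, 1, 0, 0],
]
print(count_complete_cases(ratings), "of", len(ratings), "cases are complete")
```

With 32 rater columns, a single missing value anywhere in a row is enough to remove that case under this assumption.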

Any assistance would be greatly appreciated. Thanks again!

