Saturday, September 7, 2013

Kappa - It's Greek to Me

The Gist:  Many junior physicians use clinical decision instruments as an objective means of risk stratification or clinical decision making; however, these instruments have subjective components.  Kappa, a measure of interrater agreement, is a statistic commonly reported in the medical literature, particularly in studies of clinical decision aids.  Understanding the use, strengths, and weaknesses of kappa may help with application of decision aids and appraisal of the literature.

The Case: A 13-year-old boy presented to the Janus General ED after being struck in the head with a baseball bat.  He had a slight headache, no vomiting, normal mental status, and an unremarkable physical exam except for a hematoma over his left parietal region.
  • I presented the case as low risk by PECARN, with a <0.05% chance of a clinically significant injury.  An attending inquired as to how I determined that the mechanism was "not severe."  Would my assessment change if Mark McGwire had swung the bat that hit my patient?  Similarly, where was my threshold with the 18-month-old who fell off a bed? Did the precise number of feet matter? The truth was, probably not - not because it wasn't listed in the objective criteria of the decision aid, but because after my assessment of the patient, I had already estimated that the likelihood of a clinically significant injury was minimal. I wondered:  How did they come up with these variables (is there really a difference between a fall from 3 ft and one from 4 ft)? How frequently would other people disagree with my seemingly "objective" determinations?
I found a paper by Nigrovic et al the next day that evaluated agreement between nurses and physicians in applying PECARN to pediatric patients with minor blunt head injury.  This study demonstrates the differential level of agreement, or reliability, between elements of the PECARN predictors - with notable differences between subjective and objective components.*  For example, everyone agreed on vomiting, but anything containing the word "severe" was a little more nebulous.
  • History of vomiting - 97% agreement between nursing and physician assessment, with an outstanding kappa of 0.89 (95% CI 0.85-0.93). 
  • Severe injury mechanism - 76% agreement, kappa 0.24 (95% CI 0.13-0.35) in the age <2 years cohort and kappa 0.37 (95% CI 0.29-0.45) in the age 2-18 years group.
Wait, what is this kappa (κ) business?
  • It quantifies interrater reliability - the degree of agreement between observers beyond what would be expected by chance alone.
    • Sometimes, even in medicine, clinicians and trainees guess.  For example, when reading a radiograph and deciding between atelectasis and infiltrate, a physician may hedge and pick one.  That case may seem straightforward, but imagine a variable such as severity of headache.  Suppose one clinician has a terrific headache of their own and consequently rates the headaches encountered that day as less severe.  Some of the agreement between that clinician and another would then be due to chance rather than a shared standard.
  • Calculation: kappa = (observed agreement - agreement expected by chance)/(1 - agreement expected by chance).  The agreement expected by chance is derived from each rater's marginal frequencies, so the full calculation is more involved than the formula suggests.
  • Assesses precision/reliability
    • Using the aforementioned study, one can see that nurses and physicians reliably detected the presence of vomiting but less reliably agreed on the presence of a severe mechanism of injury or severe headache.
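The calculation above can be sketched in a few lines of Python.  The function is the standard Cohen's kappa for a two-rater contingency table; the 2x2 table of counts is hypothetical, not data from the Nigrovic study:

```python
def cohens_kappa(table):
    """Cohen's kappa for a square rater-vs-rater contingency table.

    table[i][j] = number of cases rater A scored category i
    and rater B scored category j.
    """
    n = sum(sum(row) for row in table)  # total cases
    k = len(table)
    # Observed agreement: proportion of cases on the diagonal
    p_o = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of each rater's marginal frequencies
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters assess 100 patients for vomiting
# (rows = rater A yes/no, columns = rater B yes/no)
table = [[40, 5],
         [3, 52]]
print(round(cohens_kappa(table), 2))  # → 0.84: 92% raw agreement, corrected for chance
```

Note that the 92% raw agreement shrinks to kappa 0.84 once the roughly 51% agreement expected by chance is subtracted out.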
What does the value mean?
  • -1.0 = perfect disagreement, 0 = agreement no better than chance alone, +1.0 = perfect agreement
    • Commonly cited benchmarks (Landis and Koch): 0.21-0.40 "fair," 0.41-0.60 "moderate," 0.61-0.80 "substantial," and 0.81-1.00 "almost perfect."

What are the limitations of kappa?
  • The expected agreement is affected by skewed prevalence.  In a sample where a finding is very common or very rare, raw observed agreement may be high while kappa remains low (1).  This is referred to as the kappa paradox, and there are various ways to compensate for it.
    • Rare findings - agreement between observers may appear less reliable and will be reflected by a lower kappa.  In the Nigrovic et al paper, the kappa for palpable skull fracture is an abysmal 0.00, yet physicians and nurses agreed in 98% of assessments.  This is a product of the rarity of the finding: only 1/434 physician and 7/434 nursing assessments were positive.  Similarly, agreement on signs of basilar skull fracture was only fair at kappa 0.37, with an enormous confidence interval (95% CI 0.07-0.67).
  • Generalizability. Diversity of skill/experience may affect kappa.
    • Are the raters emergency physicians? medical students? specialized radiologists?
    • This was ostensibly what Nigrovic et al sought to determine - do clinicians at various levels of expertise agree?  The answer - it depends.  
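The kappa paradox above can be made concrete with a short sketch.  The paper reports the marginals (1/434 physician positives, 7/434 nurse positives) and ~98% raw agreement for palpable skull fracture, but not the full cross-table, so the counts below are one hypothetical arrangement consistent with those numbers:

```python
# Kappa paradox sketch: palpable skull fracture, after Nigrovic et al.
# Hypothetical 2x2 table consistent with the reported marginals.
#                  nurse +  nurse -
table = [[  0,   1],   # physician +
         [  7, 426]]   # physician -
n = 434
# Observed agreement: nearly everyone agrees, because nearly
# every assessment is "no fracture"
p_o = (table[0][0] + table[1][1]) / n
# Chance agreement from the marginal totals is also ~0.98
p_e = (1 * 7 + 433 * 427) / n ** 2
kappa = (p_o - p_e) / (1 - p_e)
print(f"raw agreement {p_o:.0%}, kappa {kappa:.2f}")  # high agreement, kappa near 0
```

Because virtually all assessments are negative, chance alone already predicts ~98% agreement, leaving kappa almost no room to credit the observers - high raw agreement, kappa of essentially zero.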
What now? As a junior trainee, the way I evaluate patients and objective data differs from that of a senior clinician.  Thus, I'm armed with this knowledge to acknowledge the limitations of the clinical decision instruments I use, to understand why and how the variables are not hard and fast "rules," and to use both to better patient care.
*Note: The developers of PECARN (original study) only selected criteria with a minimum kappa of 0.5 (with a lower bound of the confidence interval of 0.40).

1.  de Vet HC, Mokkink LB, Terwee CB, Hoekstra OS, Knol DL.  Clinicians are right not to like Cohen's κ. BMJ 2013;346:f2125.
2.  Nigrovic LE, Schonfeld D, Dayan PS, Fitz BM, Mitchell SR, Kuppermann N. Nurse and Physician Agreement in the Assessment of Minor Blunt Head Trauma. Pediatrics. 2013. Accessed August 29, 2013.
