Kappa


Kappa provides a measure of the degree to which two judges concur in their respective sortings of N items into k mutually exclusive categories. A 'judge' in this context can be an individual human being, a set of individuals who sort the N items collectively, or some non-human agency, such as a computer program or diagnostic test, that performs a sorting on the basis of specified criteria.


Simple Unweighted Kappa
The original and simplest version of kappa is the unweighted kappa coefficient introduced by J. Cohen in 1960. To illustrate, suppose that our judges are two clinical tests, A and B, independently employed to sort each of N=100 subjects into one or another of k=3 diagnostic categories. Table (i) shows a cross-tabulation of the sortings actually observed, while Table (ii) shows the cell frequencies that would have been expected by mere chance, given the observed marginal totals. In both tables the cells representing concordance of the two tests fall on the main diagonal and are marked with asterisks.

(i) Observed

                 B
            1     2     3   Total
  A    1  *44*    5     1     50
       2    7  *20*     3     30
       3    9     5    *6*    20
   Total   60    30    10    100

(ii) Chance Expected

                 B
            1     2     3   Total
  A    1  *30*   15     5     50
       2   18   *9*     3     30
       3   12     6    *2*    20
   Total   60    30    10    100

Observed concordant items: count = 70, proportion = .70
Expected concordant items: count = 41, proportion = .41

In this example the observed number of concordant items is 70, the chance expected number is 41, and the excess of observed over expected is 70−41=29. Similarly, the chance expected number of non-concordant items is 100−41=59. Cohen's kappa is simply the ratio of the former to the latter: 29/59=.4915. Essentially, what it says is: Of all the items that we would have expected to be non-concordant if nothing more than chance coincidence were operating in the situation, 49.15% of them are in fact concordant.
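
To make the arithmetic concrete, here is a short Python sketch of the calculation (purely illustrative; the variable names are mine, and the chance-expected count for each diagonal cell is obtained as row total × column total / N):

    # Unweighted Cohen's kappa for the 3x3 example above.
    observed = [[44,  5, 1],
                [ 7, 20, 3],
                [ 9,  5, 6]]
    n = sum(sum(row) for row in observed)              # 100
    row_totals = [sum(row) for row in observed]        # [50, 30, 20]
    col_totals = [sum(col) for col in zip(*observed)]  # [60, 30, 10]

    # Chance-expected count for a diagonal cell: row total * column total / N.
    observed_agree = sum(observed[i][i] for i in range(3))                     # 70
    expected_agree = sum(row_totals[i] * col_totals[i] / n for i in range(3))  # 41
    kappa = (observed_agree - expected_agree) / (n - expected_agree)
    print(round(kappa, 4))  # 0.4915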

In the uncommon case where Judges A and B start out by sorting the same number of items into category 1, the same number into category 2, and so forth, the upper limit of kappa is 1.0. When the judges start out with different numbers of items in one or more of the categories, the maximum possible value of kappa will be less than 1.0. In the present example the maximum possible degree of concordance, given the observed marginal totals, would be produced by the sorting shown in the following table:
(iii) Maximum Possible

                 B
            1     2     3   Total
  A    1  *50*    0     0     50
       2    0  *30*     0     30
       3   10     0   *10*    20
   Total   60    30    10    100
This sorting would yield kappa=.8305. The ratio of our observed kappa to this maximum possible value is .4915/.8305=.5918, which is to say that the observed value is 59.18% as large as it could possibly be under the circumstances.
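
Continuing the Python sketch above (again illustrative), the maximum possible agreement is obtained by placing min(row total, column total) items in each diagonal cell:

    # Maximum possible kappa, given the observed marginals: each diagonal
    # cell can hold at most min(row total, column total) items.
    max_agree = sum(min(row_totals[i], col_totals[i]) for i in range(3))  # 90
    kappa_max = (max_agree - expected_agree) / (n - expected_agree)       # 0.8305
    print(round(kappa / kappa_max, 4))  # 0.5918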



Weighted Kappa
When the categories are merely nominal, Cohen's simple unweighted coefficient is the only form of kappa that can meaningfully be used. If the categories are ordinal (category 2 represents more of something than category 1, category 3 represents more of that same something than category 2, and so on), then it is potentially meaningful to take into account not only the absolute concordances (the diagonal cells marked with asterisks in Tables (i) and (ii) above), but also the relative concordances (the off-diagonal cells). In taking these relative concordances into account, each cell in a row of the matrix is weighted according to how near it is to the cell in that row that contains the absolutely concordant items.



Observed Frequencies (first row)

                 B
            1     2     3
  A    1   44     5     1
To illustrate the weighting process, consider the first row in the above table of observed frequencies. Suppose we had good reason to assume that the distance between categories 1 and 2 is about the same as the distance between categories 2 and 3. In this case, cell A1B2 would lie at a distance of one (relative) unit from cell A1B1, the cell in row 1 that marks absolute concordance, and cell A1B3 would fall at a distance of two (relative) units from cell A1B1. The same principle would apply to the other rows of the matrix. Thus:
Distances

                 B
            1     2     3
  A    1    0     1     2
       2    1     0     1
       3    2     1     0

With k ordinal categories and equal imputed distances between successive categories, the maximum possible distance between any two categories is k−1, which in the present example is equal to 2. The weights derived from these imputed distances can be either linear or quadratic:

If they are linear, the weight for any particular cell is

     weight = 1 − |distance| / (maximum possible distance)


And if they are quadratic, the weight for a particular cell is

     weight = 1 − (distance)^2 / (maximum possible distance)^2


With the present example, this yields the following sets of weights for the cells:

Linear Weights

                 B
            1     2     3
  A    1    1    .5     0
       2   .5     1    .5
       3    0    .5     1

Quadratic Weights

                 B
            1     2     3
  A    1    1   .75     0
       2  .75     1   .75
       3    0   .75     1
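
A continuation of the Python sketch (illustrative) that generates both weight matrices from the formulas just given:

    # Linear and quadratic weights for k ordinal categories with equal
    # imputed distances between successive categories.
    k, max_dist = 3, 2  # max_dist = k - 1
    linear    = [[1 - abs(i - j) / max_dist    for j in range(k)] for i in range(k)]
    quadratic = [[1 - (i - j)**2 / max_dist**2 for j in range(k)] for i in range(k)]
    # linear[0]    -> [1.0, 0.5, 0.0]
    # quadratic[0] -> [1.0, 0.75, 0.0]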


To give you an idea of how a weighted kappa coefficient is calculated, I show again the data of our example, only now each of the frequency values has been divided by N (in this case, N=100) to convert it into a proportion.

Proportions

(i) Observed

                 B
            1     2     3   Total
  A    1  .44   .05   .01    .50
       2  .07   .20   .03    .30
       3  .09   .05   .06    .20
   Total  .60   .30   .10   1.00

(ii) Chance Expected

                 B
            1     2     3   Total
  A    1  .30   .15   .05    .50
       2  .18   .09   .03    .30
       3  .12   .06   .02    .20
   Total  .60   .30   .10   1.00

For each of the nine cells in the "Observed" table, multiply the proportion by the linear weight corresponding to that cell, and sum the results. This sum will be

     P_observed = .80

Performing the same operation for the nine cells in the "Chance Expected" table will yield

     P_expected = .62

The kappa coefficient with linear weighting is then simply the ratio

     kappa_LW = (P_observed − P_expected) / (1 − P_expected)
              = (.80 − .62) / (1 − .62) = .4737


Performing this same procedure with the quadratic weights would yield kappa_QW=.4545.
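
Continuing the Python sketch once more (weights and tables as defined above, and again purely illustrative), both weighted coefficients come out of the same ratio:

    # Weighted kappa: weight each cell's observed and chance-expected
    # proportion, sum over all nine cells, then form the usual ratio.
    expected = [[row_totals[i] * col_totals[j] / n for j in range(3)]
                for i in range(3)]

    def weighted_kappa(weights):
        p_obs = sum(weights[i][j] * observed[i][j] / n
                    for i in range(3) for j in range(3))
        p_exp = sum(weights[i][j] * expected[i][j] / n
                    for i in range(3) for j in range(3))
        return (p_obs - p_exp) / (1 - p_exp)

    print(round(weighted_kappa(linear), 4))     # 0.4737
    print(round(weighted_kappa(quadratic), 4))  # 0.4545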

Weighted kappa coefficients are less accessible to intuitive understanding than is the simple unweighted coefficient, and they are accordingly more difficult to interpret. References are listed below for those who might wish to pursue the matter further.


Proportions of Agreement
Independently of kappa, it is also possible to measure the proportion of agreement between the two judges within each of the k categories separately. In the example (Table i), Judges A and B agreed on a total of 44 items for category 1. For Judge A there were 6 additional items in category 1 with which B did not agree, while for Judge B there were 16 additional items in category 1 with which A did not agree. The proportion of agreement for category 1 is therefore 44/(44+6+16)=.6667. Similarly, the proportion of category 1 agreement to be expected by mere chance (Table ii) is 30/(30+20+30)=.375; and the maximum possible proportion of agreement (Table iii), given the observed marginal totals, is 50/(50+0+10)=.8333.
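
A final addition to the running Python sketch (illustrative; the helper name is mine) computes this proportion for any category:

    # Proportion of agreement for one category: agreements divided by
    # agreements plus items either judge placed in the category alone.
    def proportion_agreement(table, c):
        agree = table[c][c]
        a_only = sum(table[c][j] for j in range(3)) - agree  # A's extras
        b_only = sum(table[i][c] for i in range(3)) - agree  # B's extras
        return agree / (agree + a_only + b_only)

    print(round(proportion_agreement(observed, 0), 4))  # 0.6667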



References
Agresti, A. (1996), An Introduction to Categorical Data Analysis, New York: Wiley.

Cicchetti, D.V. and Allison, T. (1971), "A New Procedure for Assessing Reliability of Scoring EEG Sleep Recordings," American Journal of EEG Technology, 11, 101-109.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37-46.

Cohen, J. (1968), "Weighted Kappa: Nominal Scale Agreement with Provision for Scaled Disagreement or Partial Credit," Psychological Bulletin, 70, 213-220.

Fleiss, J.L., Cohen, J., and Everitt, B.S. (1969), "Large-Sample Standard Errors of Kappa and Weighted Kappa," Psychological Bulletin, 72, 323-327.

Fleiss, J.L. and Cohen, J. (1973), "The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability," Educational and Psychological Measurement, 33, 613-619.

Fleiss, J.L. (1981), Statistical Methods for Rates and Proportions, 2nd Edition, New York: Wiley.




©Richard Lowry 2001-
All rights reserved.