Kaplan-Meier Survival Probability Estimates


Suppose that 100 subjects of a certain type were tracked over a period of time to determine how many survived for one year, two years, three years, and so forth. If all the subjects remained accessible throughout the entire length of the study, the estimation of year-by-year survival probabilities for subjects of this type in general would be an easy matter. The survival of 87 subjects at the end of the first year would give a one-year survival probability estimate of 87/100=0.87; the survival of 76 subjects at the end of the second year would yield a two-year estimate of 76/100=0.76; and so forth.

But in real-life longitudinal research it rarely works out this neatly. Typically there are subjects lost along the way for reasons unrelated to the focus of the study. To illustrate the complication in this sort of situation, consider the following hypothetical scenario. Of the 100 subjects who are "at risk" at the beginning of the study, 3 become unavailable during the first year and 5 are known to have died by the end of the first year. Another 3 become unavailable during the second year and another 10 are known to have died by the end of the second year. And so on for the other years shown. For the sake of numerical simplicity I am showing 3 subjects becoming unavailable in each of the five years. In real-life research the loss rate would of course not normally be so uniform as this.

Time Period	At Risk	Became Unavailable (Censored)	Died	Survived
Year 1	100	3	5	?
Year 2	?	3	10	?
Year 3	?	3	15	?
Year 4	?	3	20	?
Year 5	?	3	25	?

The question in a situation of this sort is: What shall we make of the subjects who become unavailable in a given time period? (Within the context of the Kaplan-Meier procedure, the subjects who become unavailable are spoken of as censored.) We in fact do not know whether these subjects survived or died. Yet, if we were simply to omit them from the study, we would be losing valuable information: namely, that the 3 subjects who became unavailable during Year 2 survived at least through Year 1; that the 3 who became unavailable during Year 3 survived at least through Year 2; and so on.

Kaplan and Meier, recognizing that any attempt to salvage this information would involve a certain amount of "fudging," proposed that subjects who become unavailable during a given time period be counted among those who survive through the end of that period, but then deleted from the number who are at risk for the next time period. "These conventions," they wrote,

may be paraphrased by saying that deaths recorded as of [time] t are treated as if they occurred slightly before t, and losses recorded as of [time] t are treated as occurring slightly after t. In this way the fudging is kept conceptual, systematic, and automatic. (Kaplan & Meier, 1958)

Time Period	At Risk	Became Unavailable (Censored)	Died	Survived
Year 1	100	3	5	95
Year 2	92	3	10	82
Year 3	79	3	15	64
Year 4	61	3	20	41
Year 5	38	3	25	13

The table above shows how these conventions work out for the present example. Of the 100 subjects who are at risk at the beginning of the study, 3 become unavailable during the first year and 5 die. The number surviving the first year is therefore 100-5=95 and the number at risk at the beginning of Year 2 is 100-3-5=92. Another 3 subjects become unavailable during the second year and another 10 die. So the number surviving Year 2 is 92-10=82 and the number at risk at the beginning of Year 3 is 92-3-10=79. And so on for the other years shown.
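The bookkeeping just described can be sketched in a few lines of Python. This is only an illustration of the convention, not the code behind the calculator on this page; the function name `at_risk_table` and its argument names are hypothetical.

```python
def at_risk_table(enrolled, censored, died):
    """Return one (at_risk, censored, died, survived) row per time period.

    enrolled -- number of subjects at the start of the study
    censored -- per-period counts of subjects who became unavailable
    died     -- per-period counts of subjects known to have died
    """
    rows = []
    at_risk = enrolled
    for c, d in zip(censored, died):
        # Censored subjects are counted as surviving the current period...
        survived = at_risk - d
        rows.append((at_risk, c, d, survived))
        # ...but are removed from the risk set for the next period.
        at_risk = at_risk - c - d
    return rows


rows = at_risk_table(100, [3, 3, 3, 3, 3], [5, 10, 15, 20, 25])
for year, row in enumerate(rows, start=1):
    print(f"Year {year}", *row, sep="\t")
```

Run on the example data, the rows reproduce the At Risk and Survived columns of the table: 100 and 95, 92 and 82, 79 and 64, 61 and 41, 38 and 13.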

As illustrated in the next table, the Kaplan-Meier procedure then calculates the survival probability estimate for each of the t time periods, except the first, as a compound conditional probability.

Time Period	At Risk	Became Unavailable (Censored)	Died	Survived	Kaplan-Meier Survival Probability Estimate
Year 1	100	3	5	95	(95/100)=0.95
Year 2	92	3	10	82	(95/100)x(82/92)=0.8467
Year 3	79	3	15	64	(95/100)x(82/92)x(64/79)=0.6860
Year 4	61	3	20	41	(95/100)x(82/92)x(64/79)x(41/61)=0.4611
Year 5	38	3	25	13	(95/100)x(82/92)x(64/79)x(41/61)x(13/38)=0.1577

The estimate for surviving through Year 1 is simply 95/100=0.95. And if one does survive through Year 1, the conditional probability of then surviving through Year 2 is 82/92=0.8913. The estimated probability of surviving through both Year 1 and Year 2 is therefore (95/100)x(82/92)=0.8467. Similarly, if one survives through the first two years, the conditional probability of then surviving through Year 3 is 64/79=0.8101. So the estimated probability of surviving through Year 1 and Year 2 and Year 3 is (95/100)x(82/92)x(64/79)=0.6860. And similarly for the other time periods.

This cumbersome structure is shown only to illustrate the logic of the procedure. For practical computational purposes, the same results can be obtained more efficiently by using the Kaplan-Meier product-limit estimator

$$S(t_i) = \prod_{j=1}^{i} \frac{n_j - d_j}{n_j}$$

where S(t_i) is the estimated survival probability for any particular one of the t time periods; n_i is the number of subjects at risk at the beginning of time period t_i; and d_i is the number of subjects who die during time period t_i.
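As a rough sketch, the product-limit calculation can be written as a running product in Python. The names `product_limit`, `n_i`, `c_i` and `d_i` are chosen here for illustration (with `c_i` standing for the number censored in period i), and the guard against an empty risk set is an assumption of this sketch rather than something the text specifies.

```python
def product_limit(enrolled, censored, died):
    """Return the estimated survival probability S(t_i) for each time period."""
    estimates = []
    n_i = enrolled      # subjects at risk at the beginning of the current period
    s = 1.0             # running product of (n_j - d_j) / n_j
    for c_i, d_i in zip(censored, died):
        if n_i <= 0:
            # Assumption of this sketch: stop once nobody remains at risk.
            break
        s *= (n_i - d_i) / n_i
        estimates.append(s)
        # Deaths and censored subjects both leave the risk set.
        n_i -= c_i + d_i
    return estimates


estimates = product_limit(100, [3, 3, 3, 3, 3], [5, 10, 15, 20, 25])
print([round(s, 4) for s in estimates])
# → [0.95, 0.8467, 0.686, 0.4611, 0.1577]
```

Each factor (n_i - d_i)/n_i is the conditional probability of surviving period i given survival to its start, so the running product matches the compound probabilities worked out in the table above.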

The Kaplan-Meier procedure is not limited to the measurement of survival in the narrow sense of dying or not dying. It can also be used to estimate the time-defined probabilities for the failure of an instrument or device of a certain type; or alternatively, to estimate the time-defined probabilities for some particular type of success (e.g., finding employment after becoming unemployed).

For purposes of illustration, the following Kaplan-Meier calculator is set up for 5 time periods and the values that need to be entered for the above example (total number of subjects along with the number of subjects for each time period who died or became unavailable) are already in place. To perform the analysis on the data of this example, click the «Calculate» button. To perform an analysis on a different set of data with exactly 5 time periods, click the «Clear» button, enter the relevant values into the yellow cells, and then click «Calculate». To perform an analysis with fewer or more than 5 time periods, click the «Reload» button and enter the number of time periods at the prompt. The user's own labels for the time periods can be substituted for the labels t1, t2, etc.

The lower and upper limits of the 95% confidence intervals are calculated according to the efficient-score method (corrected for continuity) described by Robert Newcombe (1998), based on the procedure outlined by E. B. Wilson (1927).
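For readers who want to see the shape of that calculation, here is a minimal Python sketch of the continuity-corrected score interval for a single proportion, as it is commonly stated from Newcombe (1998). The function name `score_interval_cc` is hypothetical. The text does not say which sample size the calculator pairs with each survival estimate, so the example below uses only the Year 1 estimate (95 of 100), where the choice of n = 100 is an assumption.

```python
import math


def score_interval_cc(p, n, z=1.959964):
    """Continuity-corrected score (Wilson) interval for a proportion p out of n.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    """
    q = 1.0 - p
    denom = 2.0 * (n + z * z)
    if p <= 0.0:
        lower = 0.0
    else:
        root = math.sqrt(z * z - 2.0 - 1.0 / n + 4.0 * p * (n * q + 1.0))
        lower = (2.0 * n * p + z * z - 1.0 - z * root) / denom
    if p >= 1.0:
        upper = 1.0
    else:
        root = math.sqrt(z * z + 2.0 - 1.0 / n + 4.0 * p * (n * q - 1.0))
        upper = (2.0 * n * p + z * z + 1.0 + z * root) / denom
    # Clamp to the valid range for a probability.
    return max(0.0, lower), min(1.0, upper)


# Year 1 of the example: estimate 0.95, assuming n = 100 subjects at risk.
lower, upper = score_interval_cc(0.95, 100)
print(f"lower = {lower:.4f}, upper = {upper:.4f}")
```

The interval is not symmetric about the estimate, which is typical of score-based intervals when the proportion is close to 0 or 1.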

References:
Kaplan, E.L. & Meier, P. "Nonparametric estimation from incomplete observations," Journal of the American Statistical Association, 53, 457-481 (1958).

Newcombe, Robert G. "Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods," Statistics in Medicine, 17, 857-872 (1998).

Wilson, E. B. "Probable Inference, the Law of Succession, and Statistical Inference," Journal of the American Statistical Association, 22, 209-212 (1927).
