Phase II Performance of P-Charts and P’-Charts

The p-chart has traditionally been used to monitor processes that yield binary data. The p’-chart, which adjusts the p-chart control limits for between-subgroup variation, was proposed in 2002 in an effort to reduce the p-chart’s false alarm rate in the presence of large subgroup sizes. As illustrated with an example using real pharmacy data, the p-chart and p’-chart often yield very different results, and it is not clear how to decide which chart is appropriate for a given situation. A simulation study was undertaken to examine the phase II performance of the p-chart and p’-chart. With large subgroup sizes or when the between-subgroup variation is high, p-charts have relatively high sensitivity to detect out-of-control shifts but exhibit a high false alarm rate when the process is in-control, while p’-charts have low sensitivity to detect out-of-control shifts but have relatively low false alarm rates. For a specific situation, Youden’s index can be used to decide whether to act on p-chart or p’-chart results while considering the relative costs of false positives and false negatives.


Introduction
The p-chart is a type of statistical process control (SPC) chart designed to monitor data arising from a binomial distribution where each individual unit is dichotomized according to whether or not it is "nonconforming". In this situation, a unit is an observed value from a Bernoulli random variable, with p being the probability of the event occurring and 1-p the probability of the event not occurring. When the process is "in-control" and the units in a random sample of size n are independent, each having probability p of the event occurring, the number of events, Y, that occur in the sample follows a binomial distribution with the probability of observing y events given by:
$$P(Y = y) = \binom{n}{y}\, p^{y} (1-p)^{n-y}, \qquad y = 0, 1, \ldots, n$$
Thus, a p-chart displays the proportion of nonconforming units after units are aggregated by subgroups (e.g., lots from a manufacturing process, or some unit of time such as months).
Use of SPC charts is traditionally divided into phase I and phase II applications. Phase I typically involves analysis of retrospective data to construct control limits once the process is deemed to be in statistical control. Phase II is then used to monitor the process so as to quickly detect any aberrations. Subgroup proportions outside of the control limits are considered to be the result of "assignable causes, usually represented by changes in the parameter(s) of a probability distribution representing the common cause variation in the process" [1]. The source of assignable cause variation should be investigated using a predetermined out-of-control action plan in order to learn about factors that exert a favorable or unfavorable influence, depending on the direction of the deviation [2].
However, it has long been noted that a large percentage of subgroup proportions fall outside the p-chart control limits when the subgroup sizes are large. Examination of equations (4) and (6) reveals that the width of the p-chart's control limits is inversely proportional to the square root of the number of observations in the subgroup, so the limits become very narrow for large subgroups.
In practice, extensive time and effort might be needed to investigate a subgroup proportion falling outside of the control limits. Thus it is desirable to only take action when true assignable causes occur and to minimize the number of "false alarms". A false alarm in this context means to interpret a subgroup's observed proportion as being the result of assignable cause variation when, in truth, it was the result of common cause variation inherent to the system.
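To make the effect of subgroup size concrete, the following Python sketch computes three-sigma p-chart limits. It assumes the standard textbook form, with the center line taken as total events divided by total units and limits of p-bar plus or minus 3*sqrt(p-bar*(1 - p-bar)/n_i). The function name `p_chart_limits` and the data are hypothetical and are not taken from the paper.

```python
import math


def p_chart_limits(events, sizes):
    """Return (center_line, [(lcl, ucl), ...]) for a three-sigma p-chart.

    Assumption: the center line is the pooled proportion and the limits
    are not truncated to the interval [0, 1].
    """
    p_bar = sum(events) / sum(sizes)  # pooled proportion across subgroups
    limits = []
    for n in sizes:
        half_width = 3 * math.sqrt(p_bar * (1 - p_bar) / n)
        limits.append((p_bar - half_width, p_bar + half_width))
    return p_bar, limits


# Hypothetical subgroups that all have an observed proportion of 0.10.
events = [10, 100, 1000]
sizes = [100, 1000, 10000]
center, limits = p_chart_limits(events, sizes)
for n, (lcl, ucl) in zip(sizes, limits):
    print(f"n={n:>5}  LCL={lcl:.4f}  UCL={ucl:.4f}  width={ucl - lcl:.4f}")
```

In this illustration a hundredfold increase in subgroup size shrinks the limit width tenfold, which is why even small between-subgroup fluctuations can fall outside the limits when n is large.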
Traditionally, X-charts (also called individuals charts) were used for samples with large subgroup sizes in order to provide wider control limits by adjusting for between-subgroup random variation [3]. Formally, X-charts are appropriate for continuous data arising from a normal distribution. For a continuous random variable x, the center line of the X-chart is simply the mean of all observations in the sample of size n, $\bar{x}$, and the usual three-sigma control limits are computed as
$$\bar{x} \pm 3\sigma$$
where
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \qquad (10)$$
When using an X-chart for binary data, the centerline is $\bar{p}$, just as described for the p-chart. However, the individual observations are now $\hat{p}_i$, where $\hat{p}_i = y_i / n_i$, which represents the proportion of units having the event in the ith subgroup.
So for binary data, $\sigma$ is computed by substituting $\hat{p}_i$ for $x_i$ and substituting $\bar{p}$ for $\bar{x}$ in equation (10) above. Now $\sigma$ quantifies the between-subgroup variability, but a drawback of using the X-chart is that the control limits are constant across subgroups even when subgroup sizes vary, unlike the p-chart control limits, which adjust for unequal subgroup sizes.
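Recognizing the aforementioned deficiencies of using the p-chart and X-chart for binary data, Laney proposed a modified p-chart to accommodate random between-subgroup variation inherent in the system [4]. Laney's p'-chart also yields control limits that vary across subgroups as dictated by the subgroup sizes. The center line for Laney's p'-chart is given by the same value of $\bar{p}$ as the traditional p-chart, but the modified control limits are computed as
$$\bar{p} \pm 3\,\hat{\sigma}_{p_i}\,\hat{\sigma}_z$$
where $\hat{\sigma}_{p_i} = \sqrt{\bar{p}(1-\bar{p})/n_i}$ is the binomial standard deviation for the ith subgroup. The subgroup proportions are converted to z-scores, $z_i = (\hat{p}_i - \bar{p})/\hat{\sigma}_{p_i}$, and the average moving range of the z-scores is
$$\overline{MR} = \frac{1}{m-1}\sum_{i=2}^{m} |z_i - z_{i-1}|$$
where m is the number of subgroups. Now $\hat{\sigma}_z$ is calculated as
$$\hat{\sigma}_z = \frac{\overline{MR}}{1.128}$$
where 1.128 is the expected value of the relative range for a sample size of 2, which is the number of values used to compute the moving range [2].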
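The p'-chart calculation can be sketched in Python as follows. This is an illustration of Laney's approach as it is commonly described (z-scores of the subgroup proportions, average moving range of the z-scores divided by 1.128, and binomial limits inflated by the resulting factor), not the author's code. The function name `laney_p_prime_limits` and the data are hypothetical.

```python
import math


def laney_p_prime_limits(events, sizes):
    """Return (p_bar, sigma_z, [(lcl, ucl), ...]) for a Laney p'-chart.

    Assumption: limits are not truncated to the interval [0, 1].
    """
    p_bar = sum(events) / sum(sizes)
    # Within-subgroup (binomial) standard deviation for each subgroup.
    sigma_p = [math.sqrt(p_bar * (1 - p_bar) / n) for n in sizes]
    # Standardize each subgroup proportion.
    z = [(y / n - p_bar) / s for y, n, s in zip(events, sizes, sigma_p)]
    # Average moving range of consecutive z-scores (m - 1 ranges).
    moving_ranges = [abs(z[i] - z[i - 1]) for i in range(1, len(z))]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    sigma_z = mr_bar / 1.128
    limits = [(p_bar - 3 * s * sigma_z, p_bar + 3 * s * sigma_z) for s in sigma_p]
    return p_bar, sigma_z, limits


# Hypothetical weekly counts with unequal subgroup sizes.
events = [82, 95, 70, 101, 88, 76]
sizes = [900, 1000, 850, 1050, 950, 800]
p_bar, sigma_z, limits = laney_p_prime_limits(events, sizes)
print(f"p_bar={p_bar:.4f}  sigma_z={sigma_z:.3f}")
for n, (lcl, ucl) in zip(sizes, limits):
    print(f"n={n:>4}  LCL={lcl:.4f}  UCL={ucl:.4f}")
```

When sigma_z is close to 1, these limits are close to the ordinary p-chart limits; values above 1 widen them in proportion to the excess between-subgroup variation.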
For a given situation, the analyst is left to decide whether to use the p-chart or p'-chart. There are rules of thumb and general guidelines to help the analyst decide. For example, Provost and Murray (p. 272) state in a book widely used to teach quality improvement concepts at healthcare organizations that "Only when subgroup sizes are above 1,000 should the [p'-chart] adjustment be even considered" [5]. Alternatively, some analysts implement a Variance Ratio Test (VRT) adapted from Jones and Govindaraju (2000) that compares the amount of observed variation in a sample of data to the amount of variation expected from a binomial distribution [6]. (The VRT is the basis for the "P Chart Diagnostic" in Minitab® Statistical Software that is used to advise the analyst about whether to use a p-chart or p'-chart for a particular sample of data.) Jones and Govindaraju pointed out that overdispersion will occur if the probability p varies "in a smooth way" across subgroups, which will give rise to an elevated false alarm rate if the standard p-chart is used [6].
doi: 10.7243/2053-7662-6-3
To apply the VRT, normalized event counts are computed for each subgroup as where $\bar{n}$ denotes the average subgroup size. Then normal scores of the normalized counts are computed as

Illustrative Example
Consider the real pharmacy data in Table 1, which displays the proportion of inpatient narcotic orders that were for hydrocodone at a large pediatric hospital [7]. On October 6, 2014, hydrocodone-containing products were "up-scheduled" by the United States Food and Drug Administration from C-III to C-II status, thereby severely restricting electronic or phone-in prescriptions. It would be important to quickly detect a reduction in orders for hydrocodone-containing products in this situation so that healthcare providers could be educated about patient risks associated with hydrocodone alternatives in a timely manner. Should a p-chart or p'-chart be used to monitor the hydrocodone medication orders summarized in Table 1? Following the recommendation of Provost and Murray, one would use a p-chart for the hydrocodone medication order data since all of the subgroup sizes are less than 1,000 [5]. On the other hand, as shown in Figure 1, Minitab's P Chart Diagnostic recommends using a p'-chart instead of a standard p-chart for the hydrocodone medication order data so as to avoid an elevated false alarm rate.
If one uses the p'-chart as recommended by Minitab's P Chart Diagnostic interpretation of the VRT, the decrease in hydrocodone orders that occurred as a result of the October 6 up-scheduling is not detected (Figure 2). On the other hand, a p-chart of the same data detects the decrease in hydrocodone orders due to up-scheduling (Figure 3). However, this p-chart also indicates a signal of increased hydrocodone orders the week of June 23, when no known source of assignable cause variation occurred, and the difference between the 94.7% proportion for that week and the 91.0% average for the time under observation would not seem to be of great clinical importance.
Comparing the p-chart and p'-chart from the example above illustrates the analyst's dilemma in balancing the desire to detect a true signal of special cause variation against expending unnecessary resources investigating and taking action due to a "false alarm". In the context of phase II SPC chart applications, the false alarm rate and the probability of failing to detect true assignable cause variation are analogous to the type I and type II error rates of hypothesis testing, respectively, and the operating-characteristic (OC) curve can be used to decide which kind of control chart should be used for a given situation [2,8]. Additionally, the Average Run Length (ARL) can be used to evaluate the performance of control charts. For p-charts and p'-charts, the ARL represents the average number of subgroups plotted before an out-of-control signal is observed. When the subgroups are independent (uncorrelated), the ARL is given by
$$ARL = \frac{1}{1-\beta} \qquad (19)$$
with β denoting the proportion of subgroups falling within the control limits. The ideal control chart will have a very low false alarm rate (false positive rate) and high ARL when the process is in-control, yet quickly generate a signal of assignable cause variation (i.e., have a very low ARL) when the process is out-of-control.
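As a small worked illustration of the ARL for independent subgroups, the Python sketch below applies the standard relationship ARL = 1/(1 - β). The function name `average_run_length` is hypothetical, and the value β = 0.9973 is the familiar probability of a normal variate falling within three standard deviations, used here only as an example.

```python
def average_run_length(beta):
    """ARL for independent subgroups, where beta is the probability that a
    plotted subgroup falls within the control limits."""
    if not 0 <= beta < 1:
        raise ValueError("beta must satisfy 0 <= beta < 1")
    return 1.0 / (1.0 - beta)


# In-control example: three-sigma limits under an exact normal model.
print(round(average_run_length(0.9973), 1))

# Out-of-control example: half of the subgroups still fall inside the limits.
print(average_run_length(0.5))  # 2.0
```

A large in-control ARL corresponds to few false alarms, while a small out-of-control ARL corresponds to quick detection of a shift.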
Comparison of p-chart and p'-chart performance is analogous to evaluation of diagnostic tests, with a process being out-of-control the equivalent of the "disease" that is screened for. In such situations, Youden's Index (sensitivity + specificity - 1) is commonly used to compare performances. More precisely, Youden's J statistic is computed as
$$J = \frac{TP}{TP + FN} + \frac{TN}{TN + FP} - 1 \qquad (20)$$
with TP, TN, FP and FN representing true positives, true negatives, false positives and false negatives, respectively. From a purely statistical perspective, one could judge the method that gives the higher value of J to be superior. The highest value of J corresponds to the point on the Receiver Operating Characteristic (ROC) curve that is farthest from the diagonal "line of chance".
It should be noted that using J as the performance metric treats FP and FN as being equally costly, which is generally not true in reality. If the analyst wishes to specify a weight for the relative importance of sensitivity and specificity, a weighted statistic, $J_w$, can be computed as
$$J_w = 2\left(w \cdot \text{sensitivity} + (1-w) \cdot \text{specificity}\right) - 1$$
with w denoting the user-defined weight where 0 ≤ w ≤ 1 [9]. When w=0.5, $J_w$ is equal to the usual unweighted J, for which sensitivity and specificity are given the same importance. When w>0.5, $J_w$ gives more weight to sensitivity than specificity. When w<0.5, $J_w$ gives more weight to specificity than sensitivity.
This simulation study will compare the phase II performance of the p-chart and p'-chart and examine the utility of the VRT as a diagnostic tool for deciding which of these two SPC charts to use. Since no previous published study has addressed the question, the results of this simulation study will help data analysts make informed decisions when choosing between a p-chart and p'-chart for a given situation.
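The two indices can be sketched in Python as follows. The weighted form is assumed to be J_w = 2(w*sensitivity + (1 - w)*specificity) - 1, which reduces to the unweighted J at w = 0.5 as described above. The function names and the counts are hypothetical.

```python
def youden_j(tp, fn, tn, fp):
    """Unweighted Youden's J from a 2x2 table of counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def weighted_youden_j(sensitivity, specificity, w=0.5):
    """Weighted Youden index; w is the weight placed on sensitivity."""
    if not 0 <= w <= 1:
        raise ValueError("w must satisfy 0 <= w <= 1")
    return 2 * (w * sensitivity + (1 - w) * specificity) - 1


# Hypothetical counts: 80 of 100 true shifts detected, and 10 false
# alarms among 100 in-control subgroups.
j = youden_j(tp=80, fn=20, tn=90, fp=10)
print(round(j, 4))

# Weighting sensitivity more heavily lowers the index here because
# sensitivity (0.8) is below specificity (0.9).
print(round(weighted_youden_j(0.8, 0.9, w=0.7), 4))
```

Choosing w amounts to stating how costly a missed shift is relative to a false alarm, so the value should be set from the practical consequences of each error rather than from the data.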

Methods
For the simulation study, 30 subgroups of size n were generated, with each observation in the subgroup representing the outcome of a Bernoulli trial with the simulated probability p of the event occurring. These 30 simulated subgroups were intended to mimic phase I of SPC chart applications with an in-control process. For the 31st subgroup, data from all 31 simulated subgroups were used to compute the centerline of the p-chart and p'-chart, but the probability of the event in the 31st subgroup was simulated as p*, where p* = p + δ. For each simulation, the values of δ were varied in increments of 0.01.
A simulation study was undertaken with a mean in-control proportion of p=0.1, with p* ranging from 0.01 to 0.3 in steps of 0.01. Another simulation study was undertaken with a mean in-control proportion of p=0.5, with p* ranging from 0.3 to 0.7 in steps of 0.1. Thus, the simulated proportion in the 31st subgroup was varied around the in-control proportion to examine the sensitivity to detect shifts and the false alarm rate when the simulated proportion was equal to the in-control proportion. A separate simulation was performed for each value of n ranging from 10 to 2000 in steps of 10. The values of n and p were allowed to vary from subgroup to subgroup by letting these parameters be generated from a truncated normal distribution with mean μ and variance σ², under the constraint that 0 < p < 1 and n > 1, with the simulated value of n rounded to the nearest integer. When n was allowed to vary, its simulated value of μ varied from 10 to 2000 in steps of 10, with the simulated standard deviation set to σ = μ/5.
When p was allowed to vary across subgroups, its simulated value of μ was either 0.1 or 0.5 as described above, with a simulated standard deviation of σ = μ/5. In order to investigate the effect of greater between-subgroup variation, an additional simulation study was performed for both in-control proportions (p=0.1 and p=0.5) using the same variability in n described previously, but with the between-subgroup standard deviation of p increased to σ = μ/3.
The 30 subgroups were simulated for 10,000 iterations for each combination of n and p values. At the conclusion of each iteration of the 30 subgroups' simulated values, the 31st subgroup was simulated 10,000 times for each value of p*. For each value of p*, the proportion of the 10,000 iterations with the 31st subgroup's proportion falling within the control limits was computed separately for the p-chart, the p'-chart and use of the VRT to decide between these two charts. The 31st subgroup mimicked phase II application of SPC charts. The ability of the three methods to detect a shift in the process in the 31st subgroup was assessed separately for each value of n and p via OC curves. The ARL was calculated using equation (19), with the observed proportion of the 10,000 simulated iterations falling within the control limits used to estimate β. All simulations were performed using R 3.2.3 for Windows [10].
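A much-reduced version of this design can be sketched in Python. This is not the authors' R code, and it makes several simplifying assumptions that should be read as such: the subgroup size n is held fixed, the 31st subgroup uses exactly p* as its event probability, the center line and sigma_z are computed from all 31 subgroups, the VRT arm is omitted, and only a few hundred iterations are run. All function names are hypothetical.

```python
import math
import random


def truncated_normal(rng, mu, sigma, low, high):
    """Rejection-sample a normal(mu, sigma) value restricted to (low, high)."""
    while True:
        x = rng.gauss(mu, sigma)
        if low < x < high:
            return x


def binomial_draw(rng, n, p):
    """Number of events in n independent Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def signal_rates(p, p_star, n, sd_p, iterations=200, seed=1):
    """Fraction of iterations in which subgroup 31 signals on each chart."""
    rng = random.Random(seed)
    p_chart_signals = 0
    p_prime_signals = 0
    for _ in range(iterations):
        # Phase I: 30 in-control subgroups whose p varies around its mean.
        probs = [truncated_normal(rng, p, sd_p, 0.0, 1.0) for _ in range(30)]
        probs.append(p_star)  # Phase II: the 31st subgroup
        events = [binomial_draw(rng, n, q) for q in probs]
        p_bar = sum(events) / (31 * n)
        sigma_p = math.sqrt(p_bar * (1 - p_bar) / n)
        z = [(y / n - p_bar) / sigma_p for y in events]
        mr_bar = sum(abs(z[i] - z[i - 1]) for i in range(1, 31)) / 30
        sigma_z = mr_bar / 1.128
        deviation = abs(events[-1] / n - p_bar)
        if deviation > 3 * sigma_p:
            p_chart_signals += 1
        if deviation > 3 * sigma_p * sigma_z:
            p_prime_signals += 1
    return p_chart_signals / iterations, p_prime_signals / iterations


# Illustrative run: in-control p = 0.1 with SD 0.1/5, shift to p* = 0.15.
p_rate, p_prime_rate = signal_rates(p=0.1, p_star=0.15, n=200, sd_p=0.02)
print(f"p-chart signal rate:  {p_rate:.3f}")
print(f"p'-chart signal rate: {p_prime_rate:.3f}")
```

Calling `signal_rates` with `p_star` equal to `p` gives a rough estimate of each chart's false alarm rate under the same simplifications, and one minus a signal rate can be used as the β estimate in the ARL formula.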

Results
When the subgroup sizes were relatively small, for example n = 20, there was little difference between the methods for the in-control proportion of 0.1, but as the subgroup sizes increased, p-charts exhibited a higher false alarm rate with a corresponding increase in sensitivity to detect out-of-control proportion shifts (Figure 4 and Table 2). When the in-control proportion was 0.5, the false alarm rate and sensitivity to detect out-of-control proportion shifts were markedly higher for p-charts than for the other two methods for all subgroup sizes (Figure 5 and Table 3). For smaller subgroup sizes the VRT yielded results between the p-chart and the p'-chart, but as the subgroup sizes increased, the VRT more closely followed the p'-chart. For both in-control proportions and all subgroup sizes, differences between the p-chart and p'-chart were greater when the variation of the in-control proportion was higher. The proportions of the 31st subgroup's 10,000 iterations falling outside of the control limits for all simulation scenarios are provided in the Appendix.

Discussion
The results of this simulation study show that neither the p-chart nor the p'-chart performs well when subgroup sizes are large or when between-subgroup variation in p is high. When n is large or when the between-subgroup variation in p is high, p-charts have relatively high sensitivity to detect shifts in p, but exhibit a high false alarm rate when the process is in-control. On the other hand, p'-charts have low sensitivity to detect out-of-control shifts but have relatively low false alarm rates when subgroup sizes are large or when the between-subgroup variation in p is high. Although the VRT provides a compromise between the two types of charts for small subgroup sizes, it does not meaningfully improve performance when n is large or between-subgroup variation in p is high, because in these situations the VRT results follow the p'-chart results very closely.
Performance of the two charts diverges with increases in the subgroup size and with higher between-subgroup variation of the in-control proportion. So it appears that the p'-chart accomplishes its intended purpose by decreasing the false alarm rate in the presence of random common cause variation inherent to the system, but with the tradeoff that smaller signals of assignable cause variation will often not be detected.
In theory and in a systematic simulation study such as this, the distinction between common cause random variation and non-random assignable cause variation (i.e., a change in the true proportion) is clear. But in real-world applications, the source of variability will generally not be readily apparent to the data analyst whose objective is simply to act when it is practical and economically beneficial to do so based on a signal from the available data [8].
The results of the simulation study that are provided in the Appendix allow the analyst to know, for a given subgroup size and standard deviation, the expected proportion of subgroups falling outside of the control limits for both types of control chart for an in-control process and for out-of-control shifts of varying magnitude. The results for p=0.1 can be used for p=0.9 since the binomial distribution is symmetrical around 0.5 (monitoring the event with probability 0.9 is equivalent to monitoring its complement with probability 0.1), and linear interpolation can be used to estimate out-of-control proportions for values of p between 0.1 and 0.5 (and from 0.5 to 0.9, applying the symmetrical property of the binomial distribution).
Returning to the pharmacy example, considering the results from May 19 through Sept 29 to be the phase I data, the average proportion of inpatient narcotic orders that were for hydrocodone-containing products was 0.913 with a standard deviation of 0.016 and an average subgroup size of 716.4. If we round these estimates to a mean proportion of 0.9, standard deviation of 0.02 and subgroup size of 720, then by exploiting the symmetrical property of the binomial distribution we can use the simulation results in the Appendix for the in-control proportion of p=0.1 with SD=0.02 and n=720. Suppose it is determined a priori that process shifts of 5 percentage points are important to detect, which would correspond to p* = p + δ = 0.1 + 0.05 = 0.15 in the simulation study. For this scenario, from the Appendix we see that the p-chart sensitivity is 0.6598, versus 0.2245 for the p'-chart, and looking at the proportion of iterations giving false signals for the in-control process reveals specificities of 1-0.1359=0.8641 and 1-0.0044=0.9956 for p-charts and p'-charts, respectively. From equation (20), these results yield J=0.5239 for the p-chart and J=0.2201 for the p'-chart. Thus, for this situation, if FN and FP are given equal weight, the p-chart would be considered superior when comparing values of J, so the analyst should interpret the p-chart results. Using the p-chart, the October 6 signal of assignable cause variation would be detected to alert pharmacy staff that there is a need for timely action (i.e., physician education
Table 3. The estimated ARL for p-charts, p'-charts and the VRT for various simulated values of p* and the mean subgroup size (n) when the in-control proportion was 0.5 with a standard deviation of p of 0.5/5 and 0.5/3.
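The pharmacy comparison can be reproduced, and extended to unequal weights, with the short Python sketch below. The sensitivities and false signal rates are the Appendix values quoted in the pharmacy example. The weighted form J_w = 2(w*sensitivity + (1 - w)*specificity) - 1 is assumed, and the helper `break_even_weight` is a hypothetical addition that solves for the weight at which the two charts have equal J_w.

```python
def weighted_youden_j(sensitivity, specificity, w=0.5):
    """Weighted Youden index; w is the weight placed on sensitivity."""
    return 2 * (w * sensitivity + (1 - w) * specificity) - 1


# Values quoted in the text for p = 0.1, SD = 0.02, n = 720, p* = 0.15.
P_CHART = {"sensitivity": 0.6598, "specificity": 1 - 0.1359}
P_PRIME_CHART = {"sensitivity": 0.2245, "specificity": 1 - 0.0044}


def break_even_weight(sensitive_chart, specific_chart):
    """Weight w at which both charts have the same J_w.

    Setting the two J_w expressions equal gives
    w * (gain in sensitivity) = (1 - w) * (gain in specificity).
    """
    d_sens = sensitive_chart["sensitivity"] - specific_chart["sensitivity"]
    d_spec = specific_chart["specificity"] - sensitive_chart["specificity"]
    return d_spec / (d_sens + d_spec)


j_p = weighted_youden_j(**P_CHART)          # equal weights, w = 0.5
j_p_prime = weighted_youden_j(**P_PRIME_CHART)
print(round(j_p, 4), round(j_p_prime, 4))   # 0.5239 0.2201

w_star = break_even_weight(P_CHART, P_PRIME_CHART)
print(round(w_star, 3))
```

Under these assumptions the p'-chart would have the larger J_w only when w falls below the break-even weight, that is, when false alarms are judged to be considerably more costly than missed shifts.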