Federated Epidemic Surveillance
Abstract
Epidemic surveillance is a challenging task, especially when crucial data is fragmented across institutions and data custodians are unable or unwilling to share it. This study aims to explore the feasibility of a simple federated surveillance approach. The idea is to conduct hypothesis tests for a rise in counts behind each custodian’s firewall and then combine p-values from these tests using techniques from meta-analysis. We propose a hypothesis testing framework to identify surges in epidemic-related data streams and conduct experiments on real and semi-synthetic data to assess the power of different p-value combination methods to detect surges without needing to combine the underlying counts. Our findings show that relatively simple combination methods achieve a high degree of fidelity and suggest that infectious disease outbreaks can be detected without needing to share even aggregate data across institutions.
1 Introduction
The prompt detection of outbreaks is critical for public health authorities to take timely and effective measures. Providing early warning regarding either the emergence of a new pathogen or a renewed wave of an existing epidemic allows for preparatory action to reduce transmission and prepare for increased load on the healthcare system. However, real-time surveillance is challenging, particularly in countries such as the United States where relevant data is typically held by many separate entities such as hospitals, laboratories, insurers and local governments. These entities are often unable or unwilling to routinely share even aggregated time series such as the total number of patients with a specific diagnosis. Even when sharing aggregates is permitted from a privacy perspective (e.g., such disclosure is often allowable under U.S. HIPAA rules), a number of other barriers can arise due to competitiveness, commercial value, reputation, and other sources of institutional reluctance. For example, absolute numbers may be viewed as propriety if they are reflective of market share, or may be thought to reveal unwanted information about the relative performance of different institutions. Accordingly, public health authorities must mandate reporting for particular conditions of interest to create effective surveillance pipelines. This process is both cumbersome and reactive: a new reporting pipeline cannot be created until well into a public health emergency.
We propose and evaluate the feasibility of an alternative approach that we refer to as federated epidemic surveillance. The core concept is that health information, including even aggregate counts, never leaves the systems of individual data custodians. Rather, each custodian shares only specified statistics of their data, for example, the p-value from a specified hypothesis test. These statistics are then aggregated to detect trends that represent potential new outbreaks. Leveraging inputs from a variety of data custodians provides significantly improved statistical power: trends that are only weakly evident in any individual dataset may be much more apparent when the evidence is pooled together. To illustrate, consider COVID-19 hospitalizations in Seattle reported by four facilities to the US Department of Health & Human Services (HHS), as shown in Figure 1. As the patterns observed at different facilities vary substantially, it would be difficult to catch the overall trend by looking at any single facility. However, if the combined data from all facilities are available, a rapid increase in hospitalizations is clearly visible starting in March. Our goal is to detect outbreaks with comparable statistical power as if the data could be pooled together, but without individual data providers disclosing their time series of counts.
Our analysis shows that federated surveillance is indeed possible, often attaining performance similar to that with fully centralized data. We analyze a simple two-step approach: first, conduct separate hypothesis tests on the occurrence of a “surge” at different sites and subsequently use a meta-analysis framework to combine the resulting p-values into a single hypothesis test for an outbreak. More elaborate approaches (e.g. based on homomorphic computation or other cryptographic techniques) could allow more sophisticated computations under strong privacy guarantees. However, our goal is to demonstrate that high-performance federated surveillance is achievable using simple, easily explainable methods, since explainability to lay audiences is crucial for engendering trust and gaining acceptance. Our results indicate that effective epidemic surveillance is possible in environments with decentralized data, suggesting federated surveillance as a potential step towards modernizing surveillance systems in preparation for current and future public health threats. The implementation of the experiments can be found at https://github.com/Rachel-Lyu/FederatedEpidemicSurveillance.
2 Results
We explore the potential for simple federated surveillance methods to detect surges in a condition of interest using a variety of real and semi-synthetic data. To start, we more formally introduce our objective. Precisely defining what constitutes a surge or outbreak is difficult. We operationalize a surge as a sufficiently large increase in the rate of new cases over a specified length of time. Formally, we model the time series of interest (e.g., cases or hospitalizations with a particular condition) as following a Poisson process for some time-varying rate parameter . At a testing time , we compare to a baseline period , defined as , and say that a surge occurs when the rate increases by at least a factor of during the testing period compared to the baseline period. For simplicity, we model counts in the baseline period as following a Poisson distribution with a constant parameter : . Similarly, during the testing period, we model for a new parameter . We say that a surge occurs when . We will analyze methods that test this hypothesis using the realized time series , effectively asking whether a rise in counts must be attributed to a rise in the rate of new cases or whether it could be explained by Poisson-distributed noise in observations instead. A concise list of all the notations employed is provided in “Parameters and notations” Section of Supplementary Material. Importantly, none of our results rely on the assumption that the data actually follows this generative process; indeed, we will evaluate using real epidemiological time series where such assumptions are not satisfied. Rather, our aim is to show that decentralized versions of this simplified hypothesis test can successfully detect surges.
Formally, we test the null hypothesis that the Poisson rate ratio is not larger than . We apply the uniformly most powerful (UMP) unbiased test for this hypothesis [RL05, Fay14] with p-value , where is a Binomial random variable . That is, to calculate the p-value of the Binomial test, we sum up the probabilities of observing more extreme values than if counts were uniformly split between the baseline and test periods. Of note, in this paper, we only discuss the unadjusted p-values. However, in practical applications, controlling the False Discovery Rate (FDR) of online multiple tests over time is crucial, while the consideration regarding p-value correcting and thresholding is a separate topic which we can refer to other articles like [RWR23]. “Poisson rate ratio test for detecting a surge” in Methods Section includes more details about the hypothesis test.
In the federated setting, each data custodian computes p-values for this hypothesis test using only their own time series counts. The p-values are then combined using methods from meta-analysis (see “Overview of meta-analysis methods” in Methods Section for more discussions). Considering are p-values obtained from independent hypothesis tests and the joint null hypothesis for the p-values is [HRD18], several commonly used statistics and their corresponding distributions can be computed accordingly. We listed some popular ones in Table 1.
Methods | Statistics | Distributions under the null |
---|---|---|
Stouffer’s | ||
Fisher’s | ||
Pearson’s | ||
Tippett’s |
2.1 Efficacy of Federated Surveillance
We start by studying the statistical power and sensitivity of federated surveillance methods compared to centralized data, i.e., whether decentralized hypothesis tests allow comparable accuracy in detecting surges compared to the (unattainable) ideal setting where all data could be pooled for a single test. We assess decentralized methods using both their theoretical expected accuracy on data drawn from our simplified generative model and on two real COVID-19 datasets.
Figure 2 shows the expected statistical power of each meta-analysis method for combining p-values on data drawn from our generative model, compared to the statistical power of a centralized version of the same hypothesis test and to a version that uses only the counts from a single data provider. We fix a threshold of for a surge. The axis varies the true rate of growth, with a higher power to detect surges when they deviate more significantly from the null. To ensure a fair comparison, we calibrate the rejection threshold for each method to match the nominal rejection rate when the true growth rate is exactly 30% (i.e., precisely satisfying the null). To perform simulations, we set a total count over both training and testing periods to be 200. The counts are binomially simulated in the testing period with probability parameter , where is the growth rate to be tested. The total and testing period counts are multinomially allocated between 2 sites (Figure 2(a)) and 8 sites (Figure 2(b)) with a uniform distribution across the sites (an assumption we will revisit later).
We find that, in this idealized setting, the top-performing federated method (Stouffer’s method) almost exactly matches the power of the centralized data test. Conversely, significant power is lost by using only the p-value from a single site, indicating that sharing information across sites is necessary for good performance. The other meta-analysis methods exhibit lower power than Stouffer’s; in later sections, we will examine the settings in which different meta-analysis methods for combining the p-values lead to better or worse performance. More experiments are shown in “Power curves and calibrations” Section of Supplementary Material.
However, this idealized setting is highly simplified based on our assumptions. Real-world epidemic data deviates from such assumptions in multiple ways, including non-stationary time series and non-uniform distribution of patients among facilities. To validate the robustness of our federated surveillance framework in the real world, we use two datasets providing a more realistic representation of the complexities and challenges, allowing us to assess the performance of the methods under more diverse conditions. The first dataset is COVID-19 hospitalization reported in the “COVID-19 Reported Patient Impact and Hospital Capacity by Facility” dataset, provided by the U.S. Department of Health & Human Services, covering the period from 2020-07-10 to 2023-03-03. The second dataset is a daily time series of the total number of outpatient insurance claims with a primary diagnosis of COVID-19 in each county, published on Delphi Epidata [RBJ+21, MBG+21, SRB+21](https://cmu-delphi.github.io/delphi-epidata/) and based on counts provided by Change Healthcare, covering the period from 2020-08-02 to 2022-07-30. See “Datasets” Section of Supplementary Material for more detailed information of the datasets.
In our analysis of these datasets, we add up the counts of facility-level hospitalization to generate county-level p-value alerts, and county-level insurance claim counts to create state-level p-value alerts. The term “alert” in this context indicates the occurrence of a significant increase. Formally, a p-value alert indicates that the confidence level of rejecting the null hypothesis is below a predetermined threshold . For more details of evaluation, see “Evaluation of the surge detection task” in Methods Section. Due to variations in reporting frequency and the number of sites, the hospitalization data tends to have larger counts distributed in fewer sites, while the claim data has smaller counts distributed in more sites. By applying our methods to these datasets, we obtain the recall-precision curves as Figure 3. The recall and precision of the combined p-values are evaluated against the ground truth as the p-value alerts on the centralized data, i.e., the total counts in that geographic region. The results demonstrate that the federated test, with the appropriate combination method, can effectively reconstruct centralized information. In our analysis of facility-level hospitalization data, Stouffer’s method achieved a recall of 0.95 when we set precision to 0.90, corresponding to a false discovery rate (FDR) of 0.10, and achieving an area under the curve (AUC) of 0.98. When examining county-level claim data, Fisher’s method attained a recall of 0.76 at a precision of 0.90, and an AUC of 0.95. Using data only from the largest single facility produced lower accuracy. Specifically, for hospitalization and claim data, when fixing the precision to 0.90, their recall is both 0.62. Their AUCs are 0.88 and 0.90. Importantly, this data is directly drawn from the real world, and need not satisfy the assumptions of our generative process. The results demonstrate the potential of federated methods for early detection of outbreaks, even for p-values combined with only the simplest meta-analysis framework.
The selection of the highest-performing meta-analysis method should depend on the characteristics of the underlying time series. In the next subsection, we will use semi-synthetic data to explore how such characteristics impact the performance of the different meta-analysis methods.
2.2 Comparative performance of the different combination methods
Based on our discussion on theoretical analysis (see Methods Section) and the experiments on the real-world data above, we have drawn preliminary conclusions regarding our proposed test. Firstly, employing a combined test using a meta-analysis framework generally yields superior performance compared to relying solely on a single entity, even if the entity contributes a significant portion of the counts. Secondly, the optimal choice of method depends on specific features of the data. If the reporting sites are comparable in size and the counts have relatively large magnitudes, Stouffer’s method is preferable. However, if the sites’ sizes are uneven, -based Fisher’s method provides superior performance.
To better understand the impact of various aspects of the data on the combination process, we can selectively modify one factor while keeping others constant. Several factors might influence the success of federated surveillance methods. Firstly, as the number of reporting entities increases, combining p-values becomes increasingly complex. This arises because each entity introduces a blend of variabilities - changes in population size, prevalence, observation noise, and potential negative correlations between sites - that complicates the separation of distinct effects in a single p-value. In that case, both multiplicative and additive effects on the p-values and their approximations are amplified with more facilities, which leads to more biased results. Secondly, the magnitude of the counts affects which p-value combination method performs best, as evident from the power formula in Equation 5 in Methods Section. Thirdly, the imbalance in the proportions of the data providers in terms of the cases in a region challenges the robustness of the combination methods.
In order to rigorously analyze these factors while effectively controlling for other variables, we employ a semi-synthetic data analysis (see “Semi-synthetic data analysis” in Methods Section) utilizing real daily COVID-related claims data. We first define the Poisson rate parameter for the simulation as the 7-day moving average of the real data, with the growth rate of the Poisson rate defined accordingly. Simulated observations are then generated by drawing Poisson-distributed samples from the rate parameter for each time step. The growth rate alert is then defined as whether the growth rate of the Poisson rate is larger than a threshold. Finally, we simulate the split of the sampled time series into a set of individual sites by drawing counts from a multinomial distribution. By varing the parameters of this distribution, we can control the degree of dispersion of data across sites. In the real world, the Poisson rates are not available even when the counts for all parties are known; thus the true growth rates and the correct growth rate alerts are unknown. For semi-synthetic data, the ground truth for each of these quantities is known, allowing us to compare how well different algorithms match it. In particular, we can distinguish between loss of accuracy due to the Poisson-distributed noise (reflected in the gap between the centralized method and ground truth) and loss due to decentralization of the data.
In Figure 4, each plot represents a comparison of the statistical powers of methods when the FDR is set at 0.10, in an analysis where one dimension is altered while all other dimensions are fixed. Figure 4(a) varies the number of sites while maintaining equal shares among them. Figure 4(b) scales the underlying Poisson rate parameter of the simulation by a multiplier, allowing us to examine performance with smaller counts where observations become sparser. The power of simply selecting the p-values from the largest site, represented by the green line, is not fully shown in the plot due to its significantly lower performance compared to the other methods. Figure 4(c) explores the case where the number of sites is fixed at , and the degree of imbalance in the shares is varied. The level of imbalance is quantified using the normalized entropy metric , where a value of 1 indicates perfectly equal shares.
Our findings indicate that federated analysis performs well compared to the centralized setting. Relying solely on a single facility, even with a relatively large share of the counts, yields poor results. Fisher’s method demonstrates the highest stability when the number of sites varies but each has an equal share of the total. Therefore, when the data is distributed among numerous sites, Fisher’s method is the preferred choice for meta-analysis. Additionally, we observe that Fisher’s method performs slightly better when the magnitudes of the counts are relatively small. In other cases, Stouffer’s method performs better, perhaps reflecting that fact that the Gaussian approximation of the Binomial parameter is more accurate when the counts are larger. Furthermore, our analysis suggests that using only the largest site can outperform naive (unweighted) combination methods only when there is one dominant site in the entire region, as indicated by a normalized entropy of less than 0.7, which corresponds to the largest site having a share of at least 65%. For example, when the shares of five sites are (0.65, 0.1, 0.1, 0.1, 0.05), the normalized entropy of site shares is 0.70.
2.3 Enhancing Federated Surveillance with Auxiliary Information
In the previous analysis of federated surveillance, we applies a meta-analysis framework which uses only p-values from the reporting sites. We might be able to use auxiliary information to further optimize the framework. Intuitively, hypothesis tests of larger sites are expected to be less noisy and should be weighed more than smaller sites. If the relative shares of the different sites are known even approximately, we can incorporate this information and assign weights to the different tests, as weighting is a common approach for integrating evidence [HY22, ZW22, VW20, Whi05].
In Methods Section, we calculate the appropriate weighting scheme for different meta-analysis methods as a function of the relative shares of the reporting sites. Our framework differs from most previous meta-analysis studies [WMLE09, ZW22, Goo55, YBPN21, Lan61] because the target can be reconstructed from the summation of decentralized counts when complete information is available. In contrast, in other meta-analysis studies, like Genome-wide association studies (GWAS), researchers combine the tests on whether a variant has an effect on a phenotype based on specific samples as the sample level information across different studies is unavailable. In such cases, the inverse of the standard error and the square root of the sample size are suggested to be used as weights [WMLE09]. In our framework, we can explore different approximations of p-values and decide how to optimally weight them. For Stouffer’s method, we show that weighting by the square roots of the shares recovers a Gaussian approximation to the centralized-data p-value. For Fisher’s method, we compare several previously proposed weighting schemes [YBPN21, Goo55, Lan61] in simulation and observe that the wFisher method [YBPN21] performs best. In addition to adding weights, we can make a further modification to Stouffer’s method when an estimate of the magnitudes of total reports of all sites is available. The modification involves incorporating a continuity correction term to make the results less conservative, which can be particularly useful when the total counts are small.
By comparing the performance of selecting the largest site with Stouffer’s and Fisher’s methods before and after incorporating weights, using our semi-synthetic data analysis framework, we observe in Figure 5 that after incorporating weights, both methods show improvements. Furthermore, the combined methods outperform selecting the largest site, even in the extreme setting where the largest site among all five accounts for 80% of the share. Specifically, the weighted Stouffer’s method closely approximates the performance of the centralized setting when the shares between sites are similar, while wFisher demonstrates consistently superior performance when the shares of sites are unbalanced.
The performance of both the naive and weighted versions of Stouffer’s and Fisher’s methods is depicted in Figure 6(a) and Figure 6(b), where the weighted versions of both methods show improvement as anticipated. For HHS hospitalizations (Figure 6(a)), after weighting, Fisher’s method’s AUC increases from 0.93 to 0.94, with recall at precision of 0.90 increases from 0.71 to 0.77; Stouffer’s method’s AUC increases from 0.98 to 0.99, with recall at precision of 0.90 increases from 0.95 to 0.99. For outpatient insurance claims Figure 6(b), after weighting, Fisher’s method’s AUC increases from 0.95 to 0.98, with recall at precision of 0.90 increases from 0.76 to 0.94; Stouffer’s method’s AUC increases from 0.87 to 0.93, with recall at precision of 0.90 increases from 0.65 to 0.84. Additionally, the inclusion of a continuity correction improves the AUC from 0.93 to 0.94 and recall at precision of 0.90 from 0.84 to 0.90 for daily-reported claim data with smaller counts. For weekly-reported hospitalization data, the counts are larger and the continuity correction has little impact.
Finally, we test whether these patterns from the semi-synthetic experiments also hold in a more realistic setting. In the real world, the real-time shares of different sites may not be readily available, requiring the estimation of weights using auxiliary data. Various approaches can be employed depending on the available data. One potential setting is where sites report their total counts infrequently (e.g., monthly or quarterly), where they cannot be combined in real-time for surveillance, but may be used to estimate weights provided that the relative shares of different sites changes more slowly than the counts themselves. We test a range of possible reporting lags of auxiliary information used in weights and find that even relatively infrequent updates (e.g., weights estimated every 12 weeks) lead to improved performance with AUC around 0.93 compared to uniform weights with AUC = 0.87. The impact of the reporting cycle and lag of the auxiliary information on real datasets is illustrated in Figure 6(c). Generally speaking, these factors have a small influence on the improvement achieved through weighted combinations. A second potential scenario is to use relatively static proxies for facility size to estimates weights, such as bed or ICU capacity. Empirically, we find that estimating the shares of providers using the bed or ICU usage generally improves performance, albeit with a marginal improvement shown here, compared to not having weights at all. However, this approach generally underperforms relative to methods that utilize specific information about COVID counts, as detailed in Figure 6(d).
3 Discussion
Timely and accurate detection of outbreaks is critical to enable decision-making and responsive policy in public health emergencies. However, this task is difficult when critical data is distributed across multiple data custodians who may be unable or reluctant to share it. This study presents a framework for federated surveillance, which addresses these challenges by performing statistical analysis behind each data custodian’s firewall followed by a meta-analysis to aggregate the evidence. Our results validate the potential for timely detection of emerging trends in population health without direct disclosure of any health data, even aggregated counts.
Our results also show the relative strengths and weaknesses of different meta-analysis methods under Poisson assumption for performing this aggregation on epidemiological data. We find that the relative performance of different p-value combination methods depends on the number of reporting sites, their relative sizes, and the expected magnitude of the counts. Stouffer’s method performs best where data is concentrated in a smaller number of sites and the magnitude of reports is relatively large. On the other hand, Fisher’s method exhibits robustness in more challenging settings characterized by a larger number of data holders and greater imbalances of shares among them. The inclusion of additional information, such as the sites’ shares and estimated total counts within a given region, enables additional improvements in performance. Across all settings and datasets that we consider, we find that at least one meta-analysis method results in statistical power that closely approximates the best attainable if all data were available for a single analysis. Our experiments primarily focus on the Poisson rate ratio test, yet the conclusions drawn are not confined to this context alone. With minor modifications to the framework, similar results can be achieved for other hypothesis tests (such as the Poisson test itself or under different data distribution assumptions, including the (Log) Normal distribution which uses Gaussian approximation of sufficiently large counts, or overdispersed Poisson and Negative Binomial distribution, which accounts for additional over-dispersion [GMS95].
Federated surveillance provides a simple, readily implementable framework for addressing the practical barriers to including already-existing health system data in public health surveillance systems. While more complex methodologies such as homomorphic encryption could provide stronger theoretical guarantees, the methods presented here are more readily understood by the lay public and hence more likely to be acceptable to health data custodians. Our work demonstrates that relatively simple meta-analysis methods can enable significantly more accurate and timely warnings of changes in population health without the creation of centralized datasets.
4 Methods
4.1 Poisson rate ratio test for detecting a surge
Our model assumes that in a short time period such as a week or a month, if there is no surge, the counts follow a Poisson distribution with a rate parameter , which can be estimated based on observations from the previous period. However, in the presence of a sudden surge, the distribution changes, and the rate parameter increases.
To identify surges, we propose conducting a hypothesis test to determine whether the increase in Poisson rates exceeds a user-defined threshold, denoted as . This threshold can be tailored to the inherent characteristics of different indicators, allowing for an adjustable tradeoff between sensitivity and specificity. Specifically, a surge is defined as a Poisson rate that increases by a factor of at least during the testing period with Poisson rate parameter , compared to the baseline period with parameter . Formally, we test the null hypothesis
(1) |
We propose utilizing the UMP unbiased test, which is a method of testing two Poisson rates ratio first proposed by Przyborowski and Wilenski [PW40]. This test is based on conditioning on the summation of the counts of the whole period including the baseline period and testing period. We consider and as independently distributed according to and , so that their joint distribution can be written as
(2) |
By Theorem 4.4.1 in [RL05], there exist UMP unbiased tests concerning the ratio . By the Theorem, the tests are performed conditionally on the integer points of the hyperplane segment, which is the total counts over both periods equals to , in the positive quadrant of the space. The conditional distribution of given total counts is the binomial distribution corresponding to trials and probability of success. Then the UMP unbiased test can be written as Equation 3.
(3) |
Essentially, the UMP test corresponds to a Binomial test that examines the indicator during the testing period conditioning on the total counts of both the baseline and testing periods. Using this test, we can easily determine the p-value through a one-tailed exact test. The formulation of the p-value is shown as Equation 4.
(4) |
where p-value is the probability of count being greater than or equal to given the total counts over the baseline period and testing period, and the summation in the second line for count from 0 to . This test is known for being exact while conservative, as the actual significance level is always below the nominal level [GNTS08].
The power of a hypothesis test is defined as the probability of rejecting the null hypothesis when the alternative hypothesis is in fact true. In the case of the Binomial test under the null hypothesis, we need to determine the critical value for at a given type I error rate . This critical value represents the minimum number of successes in the sample required to reject the null hypothesis in favor of the alternative hypothesis. Mathematically, the critical value is determined by finding that satisfies the condition . Under the alternative hypothesis, characterized by a higher growth rate of the Poisson rate , the power is computed as . The power quantifies the test’s ability to detect a surge when it truly exists.
The analytical formula for calculating power in the discrete distribution makes it difficult to see the contribution of different parameters from its form directly, as the critical value and the counts should be integers. However, in the figures of the power analysis, we used the exact power rather than the approximation. As a practical alternative, a Gaussian approximation of Binomial distribution with continuity correction can be employed. This approximation will ignore the rounding errors on the power calculation by allowing for the decimals in the values while maintaining an acceptable level of precision. With continuity correction, we can compute the power as Equation 5. The details of the proof are in “Power computation” Section of Supplementary Material.
(5) |
The expression inside the function comprises three terms, each capturing a specific aspect of the analysis. The first term quantifies the impact of the total counts’ magnitude, while the second term relates to the type I error rate. The third term corresponds to the continuity correction term, which can be ignored when the sample size is sufficiently large. This formula provides an approximation of the power and allows for a more intuitive understanding of the influence of different parameters.
4.2 Overview of meta-analysis methods
Meta-analysis is known for its ability to enhance statistical power by combining signals of moderate significance, effectively controlling false positives, and enabling comparisons and contrasts across tests and time [YBPN21]. The properties of various p-value combination methods have been extensively studied. For instance, Heard and Rubin-Delanchy [HRD18] observed that Tippett’s and Fisher’s methods are more sensitive to smaller p-values, while Pearson’s methods are more sensitive to larger p-values. They also suggested that Fisher’s and Pearson’s methods are more suitable for testing positive-valued data under the alternative hypothesis, with Fisher’s method performing better for larger values and Pearson’s method for smaller values. Additionally, Stouffer’s method is often preferred for testing real-valued data that approximates a Gaussian distribution.
Among the various combination methods, Stouffer’s and Fisher’s methods have gained significant attention in the literature due to their popularity in meta-analysis. Elston [Els91] noted that when the number of sites is very large, Fisher’s method will give a combined p-value that is close to 0 when the actual p-value is below ; and will give a combined p-value that is close to 1 when the actual p-value is above . Similarly, Stouffer’s method has the threshold point as , while Pearson’s method has . Rice [Ric90] suggested that methods like Stouffer’s are more appropriate when all tests are homogeneous and the combined p-value can be interpreted as a “consensus p-value”. On the other hand, Fisher’s method is particularly useful when testing against broad alternatives which specifically tests whether at least one component test is significant. It has shown superiority in certain scenarios, such as in GWAS where there may be significant differences in effect sizes between different populations [YBPN21]. In such cases, Stouffer’s and Lancaster’s [Lan61] methods tend to lose power where there are only a few studies show strong evidence of rejecting the hypothesis. This highlights Fisher’s method’s advantage in handling potential negative correlations between entities, which can arise due to factors like competition.
The meta-analysis methods outlined in Table 1 are based on the assumption that test statistics adhere to continuous distributions, with the joint null hypothesis for the p-values being [HRD18]. However, this conventional understanding of p-value distributions doesn’t align with the hypothesis test we introduce in our study. There are several reasons contributing to this discrepancy. First, for discrete distributions, such as the Binomial, p-values aren’t uniformly distributed under the null hypothesis due to truncation. Second, given the target as the centralized counts, the decentralized counts exhibit a negative correlation, making the dependence structure for decentralized p-values more complicated. Moreover, when the null hypothesis is rejected, the distribution of p-values undergoes a shift, necessitating strategies to mitigate the resulting decrease in power during meta-analysis. To address these issues, we present a perspective from the explicit mathematical format of the p-values in both centralized and distributed settings, followed by a discussion on how to combine them with minimal loss.
In the Binomial test of the distributed setting, the p-value for site can be expressed as Equation 6.
(6) |
where p-value for site is the probability of count being greater than or equal to given the total counts of that site over the baseline period and testing period, and the summation in the second line is calculated for count from 0 to where .
It is evident that directly combining the p-values of different sites from Equation 6 to Equation 5 without any loss is not feasible, as we need to deal with n choose r, but we have no access to the counts themselves. Therefore, our goal lies in finding better approximations and developing improved methods for combining the approximations, trying our best to use less count information. The selection among various meta-analysis methodologies mirrors the quest for a more precise approximation. We will further explore combination techniques, emphasizing Stouffer’s and Fisher’s methods, and elucidate how mathematical formulations assist in prudently integrating auxiliary information. For instance, we might consider the estimated contributions from different data providers or attribute weights to studies. Weighting stands as a prevalent strategy for assimilating evidence, while discerning the optimal weighting scheme also demands consideration. Beyond introducing weights, Stouffer’s method can be further refined given estimates of the aggregate reports from all sites . This refinement introduces a continuity correction term, tempering over-conservative results when the overall counts are small.
4.3 Stouffer’s method
Stouffer’s method is employed by utilizing the Gaussian approximation of the Binomial parameter, which is based on the central limit theorem. This approach is commonly used when analyzing the Binomial and Poisson distributions, especially when the counts are sufficiently large. To test the success probability using Stouffer’s method, the distribution of is approximated by . The z-score can then be calculated as [BCD02]. The p-value of the Binomial exact test is determined by the cumulative distribution function . Define rounding error or fluctuation term , which takes values in the interval . Thus, the true p-value in terms of the z-score with error terms can be expressed as Equation 7 [BCD02, BR10, Ess45].
(7) |
It should be noted that the denominator in the first-order error term indicates that Stouffer’s method may be unreliable for small sample sizes or when the probability is close to 0 or 1 [BCD01].
After applying the Gaussian approximations, the combination of p-values becomes the next focus. One limitation of the naive meta-analysis methods is the assumption of equal contributions across studies, which may not hold true, especially when the studies have significantly different sizes. Determining appropriate weights for different studies poses a challenging task. In the case of Stouffer’s method, some studies in GWAS suggest using the inverse of the standard error or the square root of the sample size as weights [WMLE09]. Our proposed test is a special case of meta-analysis, as the centralized counts are the summation of decentralized counts. In this case, we can combine the approximations without any loss by introducing weights. By obtaining a centralized p-value approximates using distributed p-value approximates , it turns out that the weights for aggregating the p-values is the square root of the shares of each entity, which is formulated as Equation 8.
(8) |
Furthermore, improvements can be made by incorporating a continuity correction when an estimated value for , representing the total counts of all entities, is available. Due to the discreteness of the Binomial distribution and the continuity of the normal distribution, the correction is helpful when is not sufficiently large. One commonly used correction is Yates’ correction in the Binomial test, which involves subtracting from the absolute difference between the observed count and the expected count . Considering our case where , the approximation of the p-value can be rewritten as .
Similarly, we can derive the combination formula as Equation 9.
(9) |
The additional term accounts for the effect of the continuity correction on the combined p-value. This correction becomes more necessary when the number of entities is large, but the total counts during the baseline and testing procedures are relatively small, indicating more dispersed data. In such cases, the correction term becomes more significant. If other types of coarse-grained evidence are available to estimate the counts’ magnitudes, a correction term can be added to make the combined p-value less conservative.
4.4 Fisher’s method
The statistical tests based on Fisher’s method and Pearson’s method involve taking the logarithm of the p-values and summing them. The rationale behind the log sum approaches is that the p-value is upper and lower bounded by exponential functions such as Equation 10. See “Proof for Equation 10” Section of Supplementary Material for the details of the proof.
(10) |
where represents the relative entropy (Kullback-Leibler divergence) between and , which is .
The logarithm of the p-value are upper bounded by , allowing the summation of logarithmic p-values from different sources to be meaningful. The formula indicates that the choice of and is relatively flexible for Fisher’s method, compared with the error term of Stouffer’s method which is in proportion to . Our experiments also support the idea that when the reported magnitude is small and the testing period is short, like the the “Counts of claims with confirmed COVID-19”, Fisher’s method is more reliable than Stouffer’s.
Different weighting strategies for Fisher’s method have been investigated, and various modifications have been proposed [ZW22, Goo55, Lan61, YBPN21]. However, the optimal weighting scheme remains uncertain. Some studies suggest employing adaptively weighted statistics combined with permutation tests [LT11] or using Monte Carlo algorithms to approximate the rejection region and determine optimal weights. Another approach involves constructing Good’s statistic [Goo55], which is a weighted statistic defined as with weight for site . Under the null hypothesis, it follows a chi-squared distribution with degrees of freedom (DF). Here, we use the weighting scheme , where represents the share of each site. This weighting ensures that the resulting chi-squared statistic has a total DF equal to , i.e., .
Additionally, some methods leverage the fact that the quantile of the Gamma distribution is , i.e., , where represents Fisher’s methods. For example, Lancaster’s method [Lan61] sets and transforms each to the th quantile of the Gamma distribution with . This transformation yields . By additivity, we have . In summary, Lancaster’s method generalizes Fisher’s method by assigning different weights to the DF of each source, resulting in a larger total DF compared to Fisher’s method. However, Yoon et al. [YBPN21] demonstrated that the large DF cause the individual distributions to approach the normal distribution, leading to a significant decrease in power. Yoon et al. consequently proposed the wFisher method, which employs a similar weighting scheme but shrinks the total DFs to match those of the original Fisher’s method. Specifically, they constructed the statistics . We observe that the wFisher method exhibits greater stability compared to other weighting methods. The wFisher framework can be written as Equation 11.
(11) |
4.5 Evaluation of the surge detection task
In the surge detection task, our evaluation centers on the timeliness of detecting surges, which is represented by binary sequences denoting the presence or absence of a surge. Alerts based on p-values are triggered when these values dip below a predetermined threshold, while growth rate alerts are activated when the growth rate of the Poisson rate parameter exceeds a set threshold. For real data analysis, the p-value alerts from the centralized setting serve as the ground truth. In contrast, the semi-synthetic data analysis uses the growth rate alerts as the ground truth. Once the ground truth is established, both centralized and decentralized p-value alerts are assessed by comparing them to this benchmark.
For each ground truth alert, the reconstructed alerts are considered true positives if they fall within a specified time window, e.g., no earlier than one week before and no later than two weeks after. Otherwise, the reconstructed alerts are classified as false positives. Moreover, any true alerts not matched by the constructed alerts are deemed false negatives. Following this rule, Precision and Recall metrics can be calculated. Precision represents the ratio of true positives (TP) to the summation of true positives and false positives (FP) (TP/(TP + FP)). Recall, on the other hand, denotes the ratio of true positives to the summation of true positives and false negatives (FN) (TP/(TP + FN)). These metrics are computed for different confidence level thresholds. Finally, the Precision-Recall metric is obtained, and the power (equal to Recall) is evaluated while controlling the FDR, which equals 1 - Precision, at 0.10, allowing for an assessment of method performance. It should be noted that the term “power” in this context refers to the power of the classification task, as opposed to the power associated with a Binomial test that was mentioned earlier.
4.6 Semi-synthetic data analysis
The semi-synthetic analysis is conducted under the assumption of noisy data, where the observed signal deviates from the true underlying prevalence. There are several objectives of this analysis. Firstly, it aims to investigate the effects of various factors, such as the number of sites, magnitudes of reports, and the imbalance of shares, while controlling for other dimensions. Secondly, the analysis facilitates the comparison of systematic errors arising from noise in the centralized and federated settings, as well as the assessment of the combination loss during the meta-analysis process. By utilizing real data as a starting point and employing semi-synthetic data analysis, we are able to control and examine the impact of different data features including the total count magnitudes and shares of entities.
The generation of the semi-synthetic data starts with the counts of outpatient insurance claims with a primary diagnosis of COVID-19 in each county provided by Change, covering the period from 2020-08-02 to 2022-07-30 on a daily basis. Initially, a 7-day moving average smoother is applied to the total counts of all counties in a state. We then treat the resulting smoothed values as the underlying occurrence rate confirmed COVID-19 cases each day, from which Poisson-distributed observed counts are simulated. That is, Poisson sampling is performed to generate simulated observations, assuming that the observed data is drawn from a Poisson distribution with the smoothed data serving as the rate parameter. After getting the state-level counts, we multinomially allocate them into different sites with the designed parameter, regarding them as count-level counts. Once the simulated counts are obtained, the next step involves computing the growth rate of the ground truth prevalence and the p-values for the hypothesis test at each time point. Subsequently, alerts are determined based on these computed values and the predetermined thresholds.
In contrast to the previous setting, where we established the ground truth through hypothesis testing on centralized data, we now define it based on the growth of the Poisson rate surpassing a predetermined threshold. This shift is due to the gap between the Poisson rate used for simulation and the Poisson counts in the centralized setting. The discrepancy between the growth rate alert of the Poisson rate and the centralized p-value alert can be ascribed to the inherent properties of the Poisson assumption and the introduced Poisson rate ratio test. Additionally, the distinction between centralized and decentralized p-value alerts directly results from the meta-analysis procedure. By examining and contrasting the errors arising from this bifurcated process, we demonstrate that the recombination cost exerts a comparable influence on the performance as does systematic noise. Moreover, the weighted and corrected Stouffer’s methods and the weighted Fisher’s method exhibit stability across various settings using auxiliary information, illustrating their robustness in practical scenarios.
References
- [BCD01] Lawrence D Brown, T Tony Cai, and Anirban DasGupta. Interval estimation for a binomial proportion, 2001.
- [BCD02] Lawrence D Brown, T Tony Cai, and Anirban DasGupta. Confidence intervals for a binomial proportion and asymptotic expansions, 2002.
- [BR10] Rabi N Bhattacharya and R Ranga Rao. Normal approximation and asymptotic expansions. SIAM, 2010.
- [Els91] RC Elston. On fisher’s method of combining p-values, 1991.
- [Ess45] Carl-Gustav Esseen. Fourier analysis of distribution functions. a mathematical study of the laplace-gaussian law, 1945.
- [Fay14] Michael P Fay. Testing the ratio of two poisson rates, 2014.
- [GMS95] William Gardner, Edward P Mulvey, and Esther C Shaw. Regression analyses of counts and rates: Poisson, overdispersed poisson, and negative binomial models., 1995.
- [GNTS08] Kangxia Gu, Hon Keung Tony Ng, Man Lai Tang, and William R Schucany. Testing the ratio of two poisson rates, 2008.
- [Goo55] IJ Good. On the weighted combination of significance tests, 1955.
- [HRD18] Nicholas A Heard and Patrick Rubin-Delanchy. Choosing between methods of combining-values, 2018.
- [HY22] Chia-Ding Hou and Ti-Sung Yang. Distribution of weighted lancaster’s statistic for combining independent or dependent p-values, with applications to human genetic studies, 2022.
- [Lan61] HO Lancaster. The combination of probabilities: an application of orthonormal functions, 1961.
- [LT11] Jia Li and George C Tseng. An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies, 2011.
- [MBG+21] Daniel J McDonald, Jacob Bien, Alden Green, Addison J Hu, Nat DeFries, Sangwon Hyun, Natalia L Oliveira, James Sharpnack, Jingjing Tang, Robert Tibshirani, et al. Can auxiliary indicators improve covid-19 forecasting and hotspot prediction?, 2021.
- [PW40] J Przyborowski and H Wilenski. Homogeneity of results in testing samples from poisson series: With an application to testing clover seed for dodder, 1940.
- [RBJ+21] Alex Reinhart, Logan Brooks, Maria Jahja, Aaron Rumack, Jingjing Tang, Sumit Agrawal, Wael Al Saeed, Taylor Arnold, Amartya Basu, Jacob Bien, et al. An open repository of real-time covid-19 indicators, 2021.
- [Ric90] William R Rice. A consensus combined p-value test and the family-wide significance of component tests, 1990.
- [RL05] Joseph P Romano and EL Lehmann. Testing statistical hypotheses, 2005.
- [RWR23] David S Robertson, James MS Wason, and Aaditya Ramdas. Online multiple hypothesis testing, 2023.
- [SRB+21] Joshua A Salomon, Alex Reinhart, Alyssa Bilinski, Eu Jing Chua, Wichada La Motte-Kerr, Minttu M Rönn, Marissa B Reitsma, Katherine A Morris, Sarah LaRocca, Tamer H Farag, et al. The us covid-19 trends and impact survey: Continuous real-time measurement of covid-19 symptoms, risks, protective behaviors, testing, and vaccination, 2021.
- [VW20] Vladimir Vovk and Ruodu Wang. Combining p-values via averaging, 2020.
- [Whi05] Michael C Whitlock. Combining probability from independent tests: the weighted z-method is superior to fisher’s approach, 2005.
- [WMLE09] Sungho Won, Nathan Morris, Qing Lu, and Robert C Elston. Choosing an optimal method to combine p-values, 2009.
- [YBPN21] Sora Yoon, Bukyung Baik, Taesung Park, and Dougu Nam. Powerful p-value combination methods to detect incomplete association, 2021.
- [ZW22] Hong Zhang and Zheyang Wu. The generalized fisher’s combination and accurate p-value calculation under dependence, 2022.