Federated Epidemic Surveillance

Ruiqi Lyu,1∗ Roni Rosenfeld,2 Bryan Wilder2

1Computational Biology Department, School of Computer Science, Carnegie Mellon University,
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
2Machine Learning Department, School of Computer Science, Carnegie Mellon University,
5000 Forbes Avenue, Pittsburgh, PA 15213, USA

To whom correspondence should be addressed; E-mail: [email protected]
Abstract

Epidemic surveillance is a challenging task, especially when crucial data is fragmented across institutions and data custodians are unable or unwilling to share it. This study aims to explore the feasibility of a simple federated surveillance approach. The idea is to conduct hypothesis tests for a rise in counts behind each custodian’s firewall and then combine p-values from these tests using techniques from meta-analysis. We propose a hypothesis testing framework to identify surges in epidemic-related data streams and conduct experiments on real and semi-synthetic data to assess the power of different p-value combination methods to detect surges without needing to combine the underlying counts. Our findings show that relatively simple combination methods achieve a high degree of fidelity and suggest that infectious disease outbreaks can be detected without needing to share even aggregate data across institutions.

1 Introduction

The prompt detection of outbreaks is critical for public health authorities to take timely and effective measures. Providing early warning regarding either the emergence of a new pathogen or a renewed wave of an existing epidemic allows for preparatory action to reduce transmission and prepare for increased load on the healthcare system. However, real-time surveillance is challenging, particularly in countries such as the United States where relevant data is typically held by many separate entities such as hospitals, laboratories, insurers and local governments. These entities are often unable or unwilling to routinely share even aggregated time series such as the total number of patients with a specific diagnosis. Even when sharing aggregates is permitted from a privacy perspective (e.g., such disclosure is often allowable under U.S. HIPAA rules), a number of other barriers can arise due to competitiveness, commercial value, reputation, and other sources of institutional reluctance. For example, absolute numbers may be viewed as propriety if they are reflective of market share, or may be thought to reveal unwanted information about the relative performance of different institutions. Accordingly, public health authorities must mandate reporting for particular conditions of interest to create effective surveillance pipelines. This process is both cumbersome and reactive: a new reporting pipeline cannot be created until well into a public health emergency.

We propose and evaluate the feasibility of an alternative approach that we refer to as federated epidemic surveillance. The core concept is that health information, including even aggregate counts, never leaves the systems of individual data custodians. Rather, each custodian shares only specified statistics of their data, for example, the p-value from a specified hypothesis test. These statistics are then aggregated to detect trends that represent potential new outbreaks. Leveraging inputs from a variety of data custodians provides significantly improved statistical power: trends that are only weakly evident in any individual dataset may be much more apparent when the evidence is pooled together. To illustrate, consider COVID-19 hospitalizations in Seattle reported by four facilities to the US Department of Health & Human Services (HHS), as shown in Figure 1. As the patterns observed at different facilities vary substantially, it would be difficult to catch the overall trend by looking at any single facility. However, if the combined data from all facilities are available, a rapid increase in hospitalizations is clearly visible starting in March. Our goal is to detect outbreaks with comparable statistical power as if the data could be pooled together, but without individual data providers disclosing their time series of counts.

Our analysis shows that federated surveillance is indeed possible, often attaining performance similar to that with fully centralized data. We analyze a simple two-step approach: first, conduct separate hypothesis tests on the occurrence of a “surge” at different sites and subsequently use a meta-analysis framework to combine the resulting p-values into a single hypothesis test for an outbreak. More elaborate approaches (e.g. based on homomorphic computation or other cryptographic techniques) could allow more sophisticated computations under strong privacy guarantees. However, our goal is to demonstrate that high-performance federated surveillance is achievable using simple, easily explainable methods, since explainability to lay audiences is crucial for engendering trust and gaining acceptance. Our results indicate that effective epidemic surveillance is possible in environments with decentralized data, suggesting federated surveillance as a potential step towards modernizing surveillance systems in preparation for current and future public health threats. The implementation of the experiments can be found at https://github.com/Rachel-Lyu/FederatedEpidemicSurveillance.

Refer to caption
Figure 1: Adult patients hospitalized with confirmed COVID (7 day sum) of total and four largest facilities that together account for 95.12% of the market share in Seattle.

2 Results

We explore the potential for simple federated surveillance methods to detect surges in a condition of interest using a variety of real and semi-synthetic data. To start, we more formally introduce our objective. Precisely defining what constitutes a surge or outbreak is difficult. We operationalize a surge as a sufficiently large increase in the rate of new cases over a specified length of time. Formally, we model the time series ktsubscript𝑘𝑡k_{t}italic_k start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of interest (e.g., cases or hospitalizations with a particular condition) as following a Poisson process ktPoi(λt)similar-tosubscript𝑘𝑡Poisubscript𝜆𝑡k_{t}\sim\mathrm{Poi}(\lambda_{t})italic_k start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ roman_Poi ( italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) for some time-varying rate parameter λtsubscript𝜆𝑡\lambda_{t}italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. At a testing time T𝑇Titalic_T, we compare to a baseline period B𝐵Bitalic_B, defined as {T,,T1}𝑇𝑇1\{T-\ell,\dots,T-1\}{ italic_T - roman_ℓ , … , italic_T - 1 }, and say that a surge occurs when the rate increases by at least a factor of θ𝜃\thetaitalic_θ during the testing period compared to the baseline period. For simplicity, we model counts in the baseline period as following a Poisson distribution with a constant parameter λBsubscript𝜆𝐵\lambda_{B}italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT: kBjPoi(λB),j=T,,T1formulae-sequencesimilar-tosubscript𝑘𝐵𝑗Poisubscript𝜆𝐵𝑗𝑇𝑇1k_{Bj}\sim\mathrm{Poi}(\lambda_{B}),j=T-\ell,\dots,T-1italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT ∼ roman_Poi ( italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) , italic_j = italic_T - roman_ℓ , … , italic_T - 1. Similarly, during the testing period, we model kTPoi(λT)similar-tosubscript𝑘𝑇Poisubscript𝜆𝑇k_{T}\sim\mathrm{Poi}(\lambda_{T})italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∼ roman_Poi ( italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) for a new parameter λTsubscript𝜆𝑇\lambda_{T}italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT. We say that a surge occurs when λT/λB>1+θsubscript𝜆𝑇subscript𝜆𝐵1𝜃\lambda_{T}/\lambda_{B}>1+\thetaitalic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT / italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT > 1 + italic_θ. We will analyze methods that test this hypothesis using the realized time series ktsubscript𝑘𝑡k_{t}italic_k start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, effectively asking whether a rise in counts must be attributed to a rise in the rate of new cases or whether it could be explained by Poisson-distributed noise in observations instead. A concise list of all the notations employed is provided in “Parameters and notations” Section of Supplementary Material. Importantly, none of our results rely on the assumption that the data actually follows this generative process; indeed, we will evaluate using real epidemiological time series where such assumptions are not satisfied. Rather, our aim is to show that decentralized versions of this simplified hypothesis test can successfully detect surges.

Formally, we test the null hypothesis that the Poisson rate ratio λT/λBsubscript𝜆𝑇subscript𝜆𝐵\lambda_{T}/\lambda_{B}italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT / italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT is not larger than 1+θ1𝜃1+\theta1 + italic_θ. We apply the uniformly most powerful (UMP) unbiased test for this hypothesis [RL05, Fay14] with p-value Pr[rkT]Pr𝑟subscript𝑘𝑇\Pr[r\geq k_{T}]roman_Pr [ italic_r ≥ italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ], where r𝑟ritalic_r is a Binomial random variable rBin(j=TT1kBj+kT,(1+θ)/(1+θ+))similar-to𝑟Binsuperscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑗subscript𝑘𝑇1𝜃1𝜃r\sim\mathrm{Bin}\left(\sum_{j=T-\ell}^{T-1}k_{Bj}+k_{T},(1+\theta)/(1+\theta+% \ell)\right)italic_r ∼ roman_Bin ( ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , ( 1 + italic_θ ) / ( 1 + italic_θ + roman_ℓ ) ). That is, to calculate the p-value of the Binomial test, we sum up the probabilities of observing more extreme values than kTsubscript𝑘𝑇k_{T}italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT if counts were uniformly split between the baseline and test periods. Of note, in this paper, we only discuss the unadjusted p-values. However, in practical applications, controlling the False Discovery Rate (FDR) of online multiple tests over time is crucial, while the consideration regarding p-value correcting and thresholding is a separate topic which we can refer to other articles like [RWR23]. “Poisson rate ratio test for detecting a surge” in Methods Section includes more details about the hypothesis test.

In the federated setting, each data custodian computes p-values for this hypothesis test using only their own time series counts. The p-values are then combined using methods from meta-analysis (see “Overview of meta-analysis methods” in Methods Section for more discussions). Considering p1,,pNsubscript𝑝1subscript𝑝𝑁p_{1},\dots,p_{N}italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_p start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT are p-values obtained from N𝑁Nitalic_N independent hypothesis tests and the joint null hypothesis for the p-values is H0:piU[0,1],i=1,,N:subscript𝐻0formulae-sequencesimilar-tosubscript𝑝𝑖𝑈01𝑖1𝑁H_{0}:p_{i}\sim U[0,1],i=1,\dots,Nitalic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_U [ 0 , 1 ] , italic_i = 1 , … , italic_N [HRD18], several commonly used statistics and their corresponding distributions can be computed accordingly. We listed some popular ones in Table 1.

Methods Statistics Distributions under the null
Stouffer’s i=1NΦ1(pi)superscriptsubscript𝑖1𝑁superscriptΦ1subscript𝑝𝑖\sum_{i=1}^{N}\Phi^{-1}(p_{i})∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_Φ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) N(0,N)𝑁0𝑁N(0,N)italic_N ( 0 , italic_N )
Fisher’s 2i=1Nlogpi2superscriptsubscript𝑖1𝑁subscript𝑝𝑖-2\sum_{i=1}^{N}\log p_{i}- 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT χ2N2subscriptsuperscript𝜒22𝑁\chi^{2}_{2N}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 italic_N end_POSTSUBSCRIPT
Pearson’s 2i=1Nlog(1pi)2superscriptsubscript𝑖1𝑁1subscript𝑝𝑖-2\sum_{i=1}^{N}\log(1-p_{i})- 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) χ2N2subscriptsuperscript𝜒22𝑁\chi^{2}_{2N}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 italic_N end_POSTSUBSCRIPT
Tippett’s min{p1,,pN}𝑚𝑖𝑛subscript𝑝1subscript𝑝𝑁min\{p_{1},\dots,p_{N}\}italic_m italic_i italic_n { italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_p start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT } Beta(1,N)𝐵𝑒𝑡𝑎1𝑁Beta(1,N)italic_B italic_e italic_t italic_a ( 1 , italic_N )
Table 1: Common meta-analysis methods.

2.1 Efficacy of Federated Surveillance

We start by studying the statistical power and sensitivity of federated surveillance methods compared to centralized data, i.e., whether decentralized hypothesis tests allow comparable accuracy in detecting surges compared to the (unattainable) ideal setting where all data could be pooled for a single test. We assess decentralized methods using both their theoretical expected accuracy on data drawn from our simplified generative model and on two real COVID-19 datasets.

Figure 2 shows the expected statistical power of each meta-analysis method for combining p-values on data drawn from our generative model, compared to the statistical power of a centralized version of the same hypothesis test and to a version that uses only the counts from a single data provider. We fix a threshold of θ=0.3𝜃0.3\theta=0.3italic_θ = 0.3 for a surge. The x𝑥xitalic_x axis varies the true rate of growth, with a higher power to detect surges when they deviate more significantly from the null. To ensure a fair comparison, we calibrate the rejection threshold for each method to match the nominal α=0.05𝛼0.05\alpha=0.05italic_α = 0.05 rejection rate when the true growth rate is exactly 30% (i.e., precisely satisfying the null). To perform simulations, we set a total count over both training and testing periods to be 200. The counts are binomially simulated in the testing period with probability parameter (1+θ)/(1+θ+)1superscript𝜃1superscript𝜃(1+\theta^{\prime})/(1+\theta^{\prime}+\ell)( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) / ( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + roman_ℓ ), where θsuperscript𝜃\theta^{\prime}italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is the growth rate to be tested. The total and testing period counts are multinomially allocated between 2 sites (Figure 2(a)) and 8 sites (Figure 2(b)) with a uniform distribution across the sites (an assumption we will revisit later).

We find that, in this idealized setting, the top-performing federated method (Stouffer’s method) almost exactly matches the power of the centralized data test. Conversely, significant power is lost by using only the p-value from a single site, indicating that sharing information across sites is necessary for good performance. The other meta-analysis methods exhibit lower power than Stouffer’s; in later sections, we will examine the settings in which different meta-analysis methods for combining the p-values lead to better or worse performance. More experiments are shown in “Power curves and calibrations” Section of Supplementary Material.

Refer to caption

(a) Two sites provide the p-values.

Refer to caption

(b) Eight sites provide the p-values.

Figure 2: Power analysis of federated surveillance methods.

However, this idealized setting is highly simplified based on our assumptions. Real-world epidemic data deviates from such assumptions in multiple ways, including non-stationary time series and non-uniform distribution of patients among facilities. To validate the robustness of our federated surveillance framework in the real world, we use two datasets providing a more realistic representation of the complexities and challenges, allowing us to assess the performance of the methods under more diverse conditions. The first dataset is COVID-19 hospitalization reported in the “COVID-19 Reported Patient Impact and Hospital Capacity by Facility” dataset, provided by the U.S. Department of Health & Human Services, covering the period from 2020-07-10 to 2023-03-03. The second dataset is a daily time series of the total number of outpatient insurance claims with a primary diagnosis of COVID-19 in each county, published on Delphi Epidata [RBJ+21, MBG+21, SRB+21](https://cmu-delphi.github.io/delphi-epidata/) and based on counts provided by Change Healthcare, covering the period from 2020-08-02 to 2022-07-30. See “Datasets” Section of Supplementary Material for more detailed information of the datasets.

In our analysis of these datasets, we add up the counts of facility-level hospitalization to generate county-level p-value alerts, and county-level insurance claim counts to create state-level p-value alerts. The term “alert” in this context indicates the occurrence of a significant increase. Formally, a p-value alert indicates that the confidence level of rejecting the null hypothesis is below a predetermined threshold α=0.05𝛼0.05\alpha=0.05italic_α = 0.05. For more details of evaluation, see “Evaluation of the surge detection task” in Methods Section. Due to variations in reporting frequency and the number of sites, the hospitalization data tends to have larger counts distributed in fewer sites, while the claim data has smaller counts distributed in more sites. By applying our methods to these datasets, we obtain the recall-precision curves as Figure 3. The recall and precision of the combined p-values are evaluated against the ground truth as the p-value alerts on the centralized data, i.e., the total counts in that geographic region. The results demonstrate that the federated test, with the appropriate combination method, can effectively reconstruct centralized information. In our analysis of facility-level hospitalization data, Stouffer’s method achieved a recall of 0.95 when we set precision to 0.90, corresponding to a false discovery rate (FDR) of 0.10, and achieving an area under the curve (AUC) of 0.98. When examining county-level claim data, Fisher’s method attained a recall of 0.76 at a precision of 0.90, and an AUC of 0.95. Using data only from the largest single facility produced lower accuracy. Specifically, for hospitalization and claim data, when fixing the precision to 0.90, their recall is both 0.62. Their AUCs are 0.88 and 0.90. Importantly, this data is directly drawn from the real world, and need not satisfy the assumptions of our generative process. The results demonstrate the potential of federated methods for early detection of outbreaks, even for p-values combined with only the simplest meta-analysis framework.

The selection of the highest-performing meta-analysis method should depend on the characteristics of the underlying time series. In the next subsection, we will use semi-synthetic data to explore how such characteristics impact the performance of the different meta-analysis methods.

Refer to caption

(a) Facility-level hospitalizations: p-values from weekly counts combined from multiple separate facilities to generate county-level p-value alerts.

Refer to caption

(b) County-level insurance claims: p-values from daily counts combined from multiple separate counties to generate state-level p-value alerts.

Figure 3: Real data analysis of federated surveillance methods.

2.2 Comparative performance of the different combination methods

Based on our discussion on theoretical analysis (see Methods Section) and the experiments on the real-world data above, we have drawn preliminary conclusions regarding our proposed test. Firstly, employing a combined test using a meta-analysis framework generally yields superior performance compared to relying solely on a single entity, even if the entity contributes a significant portion of the counts. Secondly, the optimal choice of method depends on specific features of the data. If the reporting sites are comparable in size and the counts have relatively large magnitudes, Stouffer’s method is preferable. However, if the sites’ sizes are uneven, logp𝑝\log proman_log italic_p-based Fisher’s method provides superior performance.

To better understand the impact of various aspects of the data on the combination process, we can selectively modify one factor while keeping others constant. Several factors might influence the success of federated surveillance methods. Firstly, as the number of reporting entities increases, combining p-values becomes increasingly complex. This arises because each entity introduces a blend of variabilities - changes in population size, prevalence, observation noise, and potential negative correlations between sites - that complicates the separation of distinct effects in a single p-value. In that case, both multiplicative and additive effects on the p-values and their approximations are amplified with more facilities, which leads to more biased results. Secondly, the magnitude of the counts affects which p-value combination method performs best, as evident from the power formula in Equation 5 in Methods Section. Thirdly, the imbalance in the proportions of the data providers in terms of the cases in a region challenges the robustness of the combination methods.

In order to rigorously analyze these factors while effectively controlling for other variables, we employ a semi-synthetic data analysis (see “Semi-synthetic data analysis” in Methods Section) utilizing real daily COVID-related claims data. We first define the Poisson rate parameter for the simulation as the 7-day moving average of the real data, with the growth rate of the Poisson rate defined accordingly. Simulated observations are then generated by drawing Poisson-distributed samples from the rate parameter for each time step. The growth rate alert is then defined as whether the growth rate of the Poisson rate is larger than a threshold. Finally, we simulate the split of the sampled time series into a set of individual sites by drawing counts from a multinomial distribution. By varing the parameters of this distribution, we can control the degree of dispersion of data across sites. In the real world, the Poisson rates are not available even when the counts for all parties are known; thus the true growth rates and the correct growth rate alerts are unknown. For semi-synthetic data, the ground truth for each of these quantities is known, allowing us to compare how well different algorithms match it. In particular, we can distinguish between loss of accuracy due to the Poisson-distributed noise (reflected in the gap between the centralized method and ground truth) and loss due to decentralization of the data.

In Figure 4, each plot represents a comparison of the statistical powers of methods when the FDR is set at 0.10, in an analysis where one dimension is altered while all other dimensions are fixed. Figure 4(a) varies the number of sites while maintaining equal shares among them. Figure 4(b) scales the underlying Poisson rate parameter of the simulation by a multiplier, allowing us to examine performance with smaller counts where observations become sparser. The power of simply selecting the p-values from the largest site, represented by the green line, is not fully shown in the plot due to its significantly lower performance compared to the other methods. Figure 4(c) explores the case where the number of sites is fixed at N=5𝑁5N=5italic_N = 5, and the degree of imbalance in the shares is varied. The level of imbalance is quantified using the normalized entropy metric S=(i=1Nsilogsi)/(logN)𝑆superscriptsubscript𝑖1𝑁subscript𝑠𝑖subscript𝑠𝑖𝑁S=(-\sum_{i=1}^{N}s_{i}\log s_{i})/(\log N)italic_S = ( - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / ( roman_log italic_N ), where a value of 1 indicates perfectly equal shares.

Our findings indicate that federated analysis performs well compared to the centralized setting. Relying solely on a single facility, even with a relatively large share of the counts, yields poor results. Fisher’s method demonstrates the highest stability when the number of sites varies but each has an equal share of the total. Therefore, when the data is distributed among numerous sites, Fisher’s method is the preferred choice for meta-analysis. Additionally, we observe that Fisher’s method performs slightly better when the magnitudes of the counts are relatively small. In other cases, Stouffer’s method performs better, perhaps reflecting that fact that the Gaussian approximation of the Binomial parameter is more accurate when the counts are larger. Furthermore, our analysis suggests that using only the largest site can outperform naive (unweighted) combination methods only when there is one dominant site in the entire region, as indicated by a normalized entropy of less than 0.7, which corresponds to the largest site having a share of at least 65%. For example, when the shares of five sites are (0.65, 0.1, 0.1, 0.1, 0.05), the normalized entropy of site shares is 0.70.

Refer to caption

(a) Numbers of reporting sites (of equal share).

Refer to caption

(b) Effect of varying the total magnitude of counts, equally split between eight sites. The x𝑥xitalic_x axis gives the magnitude relative to the real data; values smaller than 1 indicate a reduction in the expected counts.

Refer to caption

(c) Five sites with different shares. The larger the normalized entropy, the more equal the site shares are.

Figure 4: Federated surveillance methods on semi-synthetic data, varying site-level data generating process.

2.3 Enhancing Federated Surveillance with Auxiliary Information

In the previous analysis of federated surveillance, we applies a meta-analysis framework which uses only p-values from the reporting sites. We might be able to use auxiliary information to further optimize the framework. Intuitively, hypothesis tests of larger sites are expected to be less noisy and should be weighed more than smaller sites. If the relative shares of the different sites are known even approximately, we can incorporate this information and assign weights to the different tests, as weighting is a common approach for integrating evidence [HY22, ZW22, VW20, Whi05].

In Methods Section, we calculate the appropriate weighting scheme for different meta-analysis methods as a function of the relative shares of the reporting sites. Our framework differs from most previous meta-analysis studies [WMLE09, ZW22, Goo55, YBPN21, Lan61] because the target can be reconstructed from the summation of decentralized counts when complete information is available. In contrast, in other meta-analysis studies, like Genome-wide association studies (GWAS), researchers combine the tests on whether a variant has an effect on a phenotype based on specific samples as the sample level information across different studies is unavailable. In such cases, the inverse of the standard error and the square root of the sample size are suggested to be used as weights [WMLE09]. In our framework, we can explore different approximations of p-values and decide how to optimally weight them. For Stouffer’s method, we show that weighting by the square roots of the shares recovers a Gaussian approximation to the centralized-data p-value. For Fisher’s method, we compare several previously proposed weighting schemes [YBPN21, Goo55, Lan61] in simulation and observe that the wFisher method [YBPN21] performs best. In addition to adding weights, we can make a further modification to Stouffer’s method when an estimate of the magnitudes of total reports of all sites is available. The modification involves incorporating a continuity correction term to make the results less conservative, which can be particularly useful when the total counts are small.

By comparing the performance of selecting the largest site with Stouffer’s and Fisher’s methods before and after incorporating weights, using our semi-synthetic data analysis framework, we observe in Figure 5 that after incorporating weights, both methods show improvements. Furthermore, the combined methods outperform selecting the largest site, even in the extreme setting where the largest site among all five accounts for 80% of the share. Specifically, the weighted Stouffer’s method closely approximates the performance of the centralized setting when the shares between sites are similar, while wFisher demonstrates consistently superior performance when the shares of sites are unbalanced.

Refer to caption
Figure 5: Semi-synthetic analysis of the weighted methods.

The performance of both the naive and weighted versions of Stouffer’s and Fisher’s methods is depicted in Figure 6(a) and Figure 6(b), where the weighted versions of both methods show improvement as anticipated. For HHS hospitalizations (Figure 6(a)), after weighting, Fisher’s method’s AUC increases from 0.93 to 0.94, with recall at precision of 0.90 increases from 0.71 to 0.77; Stouffer’s method’s AUC increases from 0.98 to 0.99, with recall at precision of 0.90 increases from 0.95 to 0.99. For outpatient insurance claims Figure 6(b), after weighting, Fisher’s method’s AUC increases from 0.95 to 0.98, with recall at precision of 0.90 increases from 0.76 to 0.94; Stouffer’s method’s AUC increases from 0.87 to 0.93, with recall at precision of 0.90 increases from 0.65 to 0.84. Additionally, the inclusion of a continuity correction improves the AUC from 0.93 to 0.94 and recall at precision of 0.90 from 0.84 to 0.90 for daily-reported claim data with smaller counts. For weekly-reported hospitalization data, the counts are larger and the continuity correction has little impact.

Finally, we test whether these patterns from the semi-synthetic experiments also hold in a more realistic setting. In the real world, the real-time shares of different sites may not be readily available, requiring the estimation of weights using auxiliary data. Various approaches can be employed depending on the available data. One potential setting is where sites report their total counts infrequently (e.g., monthly or quarterly), where they cannot be combined in real-time for surveillance, but may be used to estimate weights provided that the relative shares of different sites changes more slowly than the counts themselves. We test a range of possible reporting lags of auxiliary information used in weights and find that even relatively infrequent updates (e.g., weights estimated every 12 weeks) lead to improved performance with AUC around 0.93 compared to uniform weights with AUC = 0.87. The impact of the reporting cycle and lag of the auxiliary information on real datasets is illustrated in Figure 6(c). Generally speaking, these factors have a small influence on the improvement achieved through weighted combinations. A second potential scenario is to use relatively static proxies for facility size to estimates weights, such as bed or ICU capacity. Empirically, we find that estimating the shares of providers using the bed or ICU usage generally improves performance, albeit with a marginal improvement shown here, compared to not having weights at all. However, this approach generally underperforms relative to methods that utilize specific information about COVID counts, as detailed in Figure 6(d).

Refer to caption

(a) Facility-level, weekly, larger counts.

Refer to caption

(b) County-level, daily, smaller counts.

Refer to caption

(c) Weighted and continuity corrected Stouffer’s method under different updating cycles (Upd.) and delays (Lag) of auxiliary information of claim data.

Refer to caption

(d) Weighted Stouffer’s method using different auxiliary information of hospitalization data.

Figure 6: Real data analysis of unweighted, weighted and continuity corrected methods.

3 Discussion

Timely and accurate detection of outbreaks is critical to enable decision-making and responsive policy in public health emergencies. However, this task is difficult when critical data is distributed across multiple data custodians who may be unable or reluctant to share it. This study presents a framework for federated surveillance, which addresses these challenges by performing statistical analysis behind each data custodian’s firewall followed by a meta-analysis to aggregate the evidence. Our results validate the potential for timely detection of emerging trends in population health without direct disclosure of any health data, even aggregated counts.

Our results also show the relative strengths and weaknesses of different meta-analysis methods under Poisson assumption for performing this aggregation on epidemiological data. We find that the relative performance of different p-value combination methods depends on the number of reporting sites, their relative sizes, and the expected magnitude of the counts. Stouffer’s method performs best where data is concentrated in a smaller number of sites and the magnitude of reports is relatively large. On the other hand, Fisher’s method exhibits robustness in more challenging settings characterized by a larger number of data holders and greater imbalances of shares among them. The inclusion of additional information, such as the sites’ shares and estimated total counts within a given region, enables additional improvements in performance. Across all settings and datasets that we consider, we find that at least one meta-analysis method results in statistical power that closely approximates the best attainable if all data were available for a single analysis. Our experiments primarily focus on the Poisson rate ratio test, yet the conclusions drawn are not confined to this context alone. With minor modifications to the framework, similar results can be achieved for other hypothesis tests (such as the Poisson test itself or under different data distribution assumptions, including the (Log) Normal distribution which uses Gaussian approximation of sufficiently large counts, or overdispersed Poisson and Negative Binomial distribution, which accounts for additional over-dispersion [GMS95].

Federated surveillance provides a simple, readily implementable framework for addressing the practical barriers to including already-existing health system data in public health surveillance systems. While more complex methodologies such as homomorphic encryption could provide stronger theoretical guarantees, the methods presented here are more readily understood by the lay public and hence more likely to be acceptable to health data custodians. Our work demonstrates that relatively simple meta-analysis methods can enable significantly more accurate and timely warnings of changes in population health without the creation of centralized datasets.

4 Methods

4.1 Poisson rate ratio test for detecting a surge

Our model assumes that in a short time period such as a week or a month, if there is no surge, the counts follow a Poisson distribution with a rate parameter λ𝜆\lambdaitalic_λ, which can be estimated based on observations from the previous period. However, in the presence of a sudden surge, the distribution changes, and the rate parameter λ𝜆\lambdaitalic_λ increases.

To identify surges, we propose conducting a hypothesis test to determine whether the increase in Poisson rates exceeds a user-defined threshold, denoted as θ𝜃\thetaitalic_θ. This threshold can be tailored to the inherent characteristics of different indicators, allowing for an adjustable tradeoff between sensitivity and specificity. Specifically, a surge is defined as a Poisson rate that increases by a factor of at least θ𝜃\thetaitalic_θ during the testing period with Poisson rate parameter λTsubscript𝜆𝑇\lambda_{T}italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, compared to the baseline period with parameter λBsubscript𝜆𝐵\lambda_{B}italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT. Formally, we test the null hypothesis

H0:λTλB1+θ.:subscript𝐻0subscript𝜆𝑇subscript𝜆𝐵1𝜃H_{0}:\frac{\lambda_{T}}{\lambda_{B}}\leq 1+\theta.italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : divide start_ARG italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_ARG ≤ 1 + italic_θ . (1)

We propose utilizing the UMP unbiased test, which is a method of testing two Poisson rates ratio first proposed by Przyborowski and Wilenski [PW40]. This test is based on conditioning on the summation of the counts of the whole period including the baseline period and testing period. We consider kBj,j=T,,T1formulae-sequencesubscript𝑘𝐵𝑗𝑗𝑇𝑇1k_{Bj},j=T-\ell,\dots,T-1italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT , italic_j = italic_T - roman_ℓ , … , italic_T - 1 and kTsubscript𝑘𝑇k_{T}italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT as independently distributed according to Poi(λB)Poisubscript𝜆𝐵\mathrm{Poi}(\lambda_{B})roman_Poi ( italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) and Poi(λT)Poisubscript𝜆𝑇\mathrm{Poi}(\lambda_{T})roman_Poi ( italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ), so that their joint distribution can be written as

P(kB,kT)=eλB+λTkB(T)!kB(T1)!kTexp[kTlogλTλB+(j=TT1kB+kT)logλB]𝑃subscript𝑘𝐵subscript𝑘𝑇superscript𝑒subscript𝜆𝐵subscript𝜆𝑇subscript𝑘𝐵𝑇subscript𝑘𝐵𝑇1subscript𝑘𝑇subscript𝑘𝑇subscript𝜆𝑇subscript𝜆𝐵superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵subscript𝑘𝑇subscript𝜆𝐵P(k_{B},k_{T})=\frac{e^{\ell\lambda_{B}+\lambda_{T}}}{k_{B(T-\ell)}!...k_{B(T-% 1)}!k_{T}}\exp{\left[k_{T}\log\frac{\lambda_{T}}{\lambda_{B}}+(\sum_{j=T-\ell}% ^{T-1}k_{B}+k_{T})\log\lambda_{B}\right]}italic_P ( italic_k start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) = divide start_ARG italic_e start_POSTSUPERSCRIPT roman_ℓ italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG start_ARG italic_k start_POSTSUBSCRIPT italic_B ( italic_T - roman_ℓ ) end_POSTSUBSCRIPT ! … italic_k start_POSTSUBSCRIPT italic_B ( italic_T - 1 ) end_POSTSUBSCRIPT ! italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG roman_exp [ italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT roman_log divide start_ARG italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_ARG + ( ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) roman_log italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ] (2)

By Theorem 4.4.1 in [RL05], there exist UMP unbiased tests concerning the ratio λT/λBsubscript𝜆𝑇subscript𝜆𝐵\lambda_{T}/\lambda_{B}italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT / italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT. By the Theorem, the tests are performed conditionally on the integer points of the hyperplane segment, which is the total counts over both periods equals to j=TT1kB+kTsuperscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵subscript𝑘𝑇\sum_{j=T-\ell}^{T-1}k_{B}+k_{T}∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, in the positive quadrant of the (kB(T),,kB(T1),kT)subscript𝑘𝐵𝑇subscript𝑘𝐵𝑇1subscript𝑘𝑇(k_{B(T-\ell)},\dots,k_{B(T-1)},k_{T})( italic_k start_POSTSUBSCRIPT italic_B ( italic_T - roman_ℓ ) end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_B ( italic_T - 1 ) end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) space. The conditional distribution of kTsubscript𝑘𝑇k_{T}italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT given total counts is the binomial distribution corresponding to j=TT1kB+kTsuperscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵subscript𝑘𝑇\sum_{j=T-\ell}^{T-1}k_{B}+k_{T}∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT trials and probability (1+θ)/(1+θ+)1𝜃1𝜃(1+\theta)/(1+\theta+\ell)( 1 + italic_θ ) / ( 1 + italic_θ + roman_ℓ ) of success. Then the UMP unbiased test can be written as Equation 3.

H0:λTλT+λB1+θ1+θ+:subscriptsuperscript𝐻0subscript𝜆𝑇subscript𝜆𝑇subscript𝜆𝐵1𝜃1𝜃H^{\prime}_{0}:\frac{\lambda_{T}}{\lambda_{T}+\ell\lambda_{B}}\leq\frac{1+% \theta}{1+\theta+\ell}italic_H start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : divide start_ARG italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT + roman_ℓ italic_λ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_ARG ≤ divide start_ARG 1 + italic_θ end_ARG start_ARG 1 + italic_θ + roman_ℓ end_ARG (3)

Essentially, the UMP test corresponds to a Binomial test that examines the indicator during the testing period conditioning on the total counts of both the baseline and testing periods. Using this test, we can easily determine the p-value through a one-tailed exact test. The formulation of the p-value is shown as Equation 4.

p=Pr(rkT|j=TT1kBj+kT,1+θ1+θ+)=r=0c(nr)(1ρ)nrρr𝑝Pr𝑟conditionalsubscript𝑘𝑇superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑗subscript𝑘𝑇1𝜃1𝜃superscriptsubscript𝑟0𝑐binomial𝑛𝑟superscript1𝜌𝑛𝑟superscript𝜌𝑟p=\Pr(r\geq k_{T}|\sum_{j=T-\ell}^{T-1}k_{Bj}+k_{T},\frac{1+\theta}{1+\theta+% \ell})=\sum_{r=0}^{c}{n\choose r}(1-\rho)^{{n}-r}\rho^{r}italic_p = roman_Pr ( italic_r ≥ italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT | ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , divide start_ARG 1 + italic_θ end_ARG start_ARG 1 + italic_θ + roman_ℓ end_ARG ) = ∑ start_POSTSUBSCRIPT italic_r = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ( binomial start_ARG italic_n end_ARG start_ARG italic_r end_ARG ) ( 1 - italic_ρ ) start_POSTSUPERSCRIPT italic_n - italic_r end_POSTSUPERSCRIPT italic_ρ start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT (4)

where p-value is the probability of count r𝑟ritalic_r being greater than or equal to kTsubscript𝑘𝑇k_{T}italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT given the total counts over the baseline period and testing period, and the summation in the second line for count r𝑟ritalic_r from 0 to c:=j=TT1kBj,n:=j=TT1kBj+kT,ρ:=/(1+θ+)formulae-sequenceassign𝑐superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑗formulae-sequenceassign𝑛superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑗subscript𝑘𝑇assign𝜌1𝜃c:=\sum_{j=T-\ell}^{T-1}k_{Bj},n:=\sum_{j=T-\ell}^{T-1}k_{Bj}+k_{T},\rho:=\ell% /(1+\theta+\ell)italic_c := ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT , italic_n := ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , italic_ρ := roman_ℓ / ( 1 + italic_θ + roman_ℓ ). This test is known for being exact while conservative, as the actual significance level is always below the nominal level [GNTS08].

The power of a hypothesis test is defined as the probability of rejecting the null hypothesis when the alternative hypothesis is in fact true. In the case of the Binomial test under the null hypothesis, we need to determine the critical value kcrsubscript𝑘𝑐𝑟k_{cr}italic_k start_POSTSUBSCRIPT italic_c italic_r end_POSTSUBSCRIPT for kTsubscript𝑘𝑇k_{T}italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT at a given type I error rate α𝛼\alphaitalic_α. This critical value represents the minimum number of successes in the sample required to reject the null hypothesis in favor of the alternative hypothesis. Mathematically, the critical value is determined by finding kcrsubscript𝑘𝑐𝑟k_{cr}italic_k start_POSTSUBSCRIPT italic_c italic_r end_POSTSUBSCRIPT that satisfies the condition Pr(rkcr|n=j=TT1kBj+kT,(1+θ)/(1+θ+)α\Pr(r\geq k_{cr}|n=\sum_{j=T-\ell}^{T-1}k_{Bj}+k_{T},(1+\theta)/(1+\theta+\ell% )\leq\alpharoman_Pr ( italic_r ≥ italic_k start_POSTSUBSCRIPT italic_c italic_r end_POSTSUBSCRIPT | italic_n = ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , ( 1 + italic_θ ) / ( 1 + italic_θ + roman_ℓ ) ≤ italic_α. Under the alternative hypothesis, characterized by a higher growth rate of the Poisson rate θ>θsuperscript𝜃𝜃\theta^{\prime}>\thetaitalic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT > italic_θ, the power is computed as Pr(rkcr|n=j=TT1kBj+kT,(1+θ)/(1+θ+))Pr𝑟conditionalsubscript𝑘𝑐𝑟𝑛superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑗subscript𝑘𝑇1superscript𝜃1superscript𝜃\Pr\left(r\geq k_{cr}|n=\sum_{j=T-\ell}^{T-1}k_{Bj}+k_{T},(1+\theta^{\prime})/% (1+\theta^{\prime}+\ell)\right)roman_Pr ( italic_r ≥ italic_k start_POSTSUBSCRIPT italic_c italic_r end_POSTSUBSCRIPT | italic_n = ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_j end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , ( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) / ( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + roman_ℓ ) ). The power quantifies the test’s ability to detect a surge when it truly exists.

The analytical formula for calculating power in the discrete distribution makes it difficult to see the contribution of different parameters from its form directly, as the critical value and the counts should be integers. However, in the figures of the power analysis, we used the exact power rather than the approximation. As a practical alternative, a Gaussian approximation of Binomial distribution with continuity correction can be employed. This approximation will ignore the rounding errors on the power calculation by allowing for the decimals in the values while maintaining an acceptable level of precision. With continuity correction, we can compute the power as Equation 5. The details of the proof are in “Power computation” Section of Supplementary Material.

power=Φ(n(θθ)(1+θ+)(1+θ)Zα(1+θ+)1+θ(1+θ+)(1+θ)1+θ+2n(1+θ))powerΦ𝑛superscript𝜃𝜃1𝜃1superscript𝜃subscript𝑍𝛼1superscript𝜃1𝜃1𝜃1superscript𝜃1superscript𝜃2𝑛1superscript𝜃\mathrm{power}=\Phi\left(\frac{\sqrt{n\ell}(\theta^{\prime}-\theta)}{(1+\theta% +\ell)\sqrt{(1+\theta^{\prime})}}-\frac{Z_{\alpha}(1+\theta^{\prime}+\ell)% \sqrt{1+\theta}}{(1+\theta+\ell)\sqrt{(1+\theta^{\prime})}}-\frac{1+\theta^{% \prime}+\ell}{2\sqrt{n\ell(1+\theta^{\prime})}}\right)roman_power = roman_Φ ( divide start_ARG square-root start_ARG italic_n roman_ℓ end_ARG ( italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - italic_θ ) end_ARG start_ARG ( 1 + italic_θ + roman_ℓ ) square-root start_ARG ( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_ARG end_ARG - divide start_ARG italic_Z start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + roman_ℓ ) square-root start_ARG 1 + italic_θ end_ARG end_ARG start_ARG ( 1 + italic_θ + roman_ℓ ) square-root start_ARG ( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_ARG end_ARG - divide start_ARG 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + roman_ℓ end_ARG start_ARG 2 square-root start_ARG italic_n roman_ℓ ( 1 + italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_ARG end_ARG ) (5)

The expression inside the Φ()Φ\Phi(\cdot)roman_Φ ( ⋅ ) function comprises three terms, each capturing a specific aspect of the analysis. The first term quantifies the impact of the total counts’ magnitude, while the second term relates to the type I error rate. The third term corresponds to the continuity correction term, which can be ignored when the sample size n𝑛nitalic_n is sufficiently large. This formula provides an approximation of the power and allows for a more intuitive understanding of the influence of different parameters.

4.2 Overview of meta-analysis methods

Meta-analysis is known for its ability to enhance statistical power by combining signals of moderate significance, effectively controlling false positives, and enabling comparisons and contrasts across tests and time [YBPN21]. The properties of various p-value combination methods have been extensively studied. For instance, Heard and Rubin-Delanchy [HRD18] observed that Tippett’s and Fisher’s methods are more sensitive to smaller p-values, while Pearson’s methods are more sensitive to larger p-values. They also suggested that Fisher’s and Pearson’s methods are more suitable for testing positive-valued data under the alternative hypothesis, with Fisher’s method performing better for larger values and Pearson’s method for smaller values. Additionally, Stouffer’s method is often preferred for testing real-valued data that approximates a Gaussian distribution.

Among the various combination methods, Stouffer’s and Fisher’s methods have gained significant attention in the literature due to their popularity in meta-analysis. Elston [Els91] noted that when the number of sites is very large, Fisher’s method will give a combined p-value that is close to 0 when the actual p-value is below 1/e1𝑒1/e1 / italic_e; and will give a combined p-value that is close to 1 when the actual p-value is above 1/e1𝑒1/e1 / italic_e. Similarly, Stouffer’s method has the threshold point as 1/2121/21 / 2, while Pearson’s method has 11/e11𝑒1-1/e1 - 1 / italic_e. Rice [Ric90] suggested that methods like Stouffer’s are more appropriate when all tests are homogeneous and the combined p-value can be interpreted as a “consensus p-value”. On the other hand, Fisher’s method is particularly useful when testing against broad alternatives which specifically tests whether at least one component test is significant. It has shown superiority in certain scenarios, such as in GWAS where there may be significant differences in effect sizes between different populations [YBPN21]. In such cases, Stouffer’s and Lancaster’s [Lan61] methods tend to lose power where there are only a few studies show strong evidence of rejecting the hypothesis. This highlights Fisher’s method’s advantage in handling potential negative correlations between entities, which can arise due to factors like competition.

The meta-analysis methods outlined in Table 1 are based on the assumption that test statistics adhere to continuous distributions, with the joint null hypothesis for the p-values being H0:piU[0,1],i=1,,N:subscript𝐻0formulae-sequencesimilar-tosubscript𝑝𝑖𝑈01𝑖1𝑁H_{0}:p_{i}\sim U[0,1],i=1,\dots,Nitalic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT : italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_U [ 0 , 1 ] , italic_i = 1 , … , italic_N [HRD18]. However, this conventional understanding of p-value distributions doesn’t align with the hypothesis test we introduce in our study. There are several reasons contributing to this discrepancy. First, for discrete distributions, such as the Binomial, p-values aren’t uniformly distributed under the null hypothesis due to truncation. Second, given the target as the centralized counts, the decentralized counts exhibit a negative correlation, making the dependence structure for decentralized p-values more complicated. Moreover, when the null hypothesis is rejected, the distribution of p-values undergoes a shift, necessitating strategies to mitigate the resulting decrease in power during meta-analysis. To address these issues, we present a perspective from the explicit mathematical format of the p-values in both centralized and distributed settings, followed by a discussion on how to combine them with minimal loss.

In the Binomial test of the distributed setting, the p-value for site i𝑖iitalic_i can be expressed as Equation 6.

pi=Pr(rkTi|j=TT1kBij+kTi,1+θ1+θ+)=r=0ci(nir)(1ρ)nirρrsubscript𝑝𝑖Pr𝑟conditionalsubscript𝑘𝑇𝑖superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑖𝑗subscript𝑘𝑇𝑖1𝜃1𝜃superscriptsubscript𝑟0subscript𝑐𝑖binomialsubscript𝑛𝑖𝑟superscript1𝜌subscript𝑛𝑖𝑟superscript𝜌𝑟p_{i}=\Pr(r\geq k_{Ti}|\sum_{j=T-\ell}^{T-1}k_{Bij}+k_{Ti},\frac{1+\theta}{1+% \theta+\ell})=\sum_{r=0}^{c_{i}}{n_{i}\choose r}(1-\rho)^{n_{i}-r}\rho^{r}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_Pr ( italic_r ≥ italic_k start_POSTSUBSCRIPT italic_T italic_i end_POSTSUBSCRIPT | ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_i italic_j end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T italic_i end_POSTSUBSCRIPT , divide start_ARG 1 + italic_θ end_ARG start_ARG 1 + italic_θ + roman_ℓ end_ARG ) = ∑ start_POSTSUBSCRIPT italic_r = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( binomial start_ARG italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_r end_ARG ) ( 1 - italic_ρ ) start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_r end_POSTSUPERSCRIPT italic_ρ start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT (6)

where p-value for site i𝑖iitalic_i is the probability of count r𝑟ritalic_r being greater than or equal to kTisubscript𝑘𝑇𝑖k_{Ti}italic_k start_POSTSUBSCRIPT italic_T italic_i end_POSTSUBSCRIPT given the total counts of that site over the baseline period and testing period, and the summation in the second line is calculated for count r𝑟ritalic_r from 0 to where ci:=j=TT1kBij,ni:=j=TT1kBij+kTiformulae-sequenceassignsubscript𝑐𝑖superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑖𝑗assignsubscript𝑛𝑖superscriptsubscript𝑗𝑇𝑇1subscript𝑘𝐵𝑖𝑗subscript𝑘𝑇𝑖c_{i}:=\sum_{j=T-\ell}^{T-1}k_{Bij},n_{i}:=\sum_{j=T-\ell}^{T-1}k_{Bij}+k_{Ti}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT := ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_i italic_j end_POSTSUBSCRIPT , italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT := ∑ start_POSTSUBSCRIPT italic_j = italic_T - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_B italic_i italic_j end_POSTSUBSCRIPT + italic_k start_POSTSUBSCRIPT italic_T italic_i end_POSTSUBSCRIPT.

It is evident that directly combining the p-values of different sites from Equation 6 to Equation 5 without any loss is not feasible, as we need to deal with n choose r, but we have no access to the counts themselves. Therefore, our goal lies in finding better approximations and developing improved methods for combining the approximations, trying our best to use less count information. The selection among various meta-analysis methodologies mirrors the quest for a more precise approximation. We will further explore combination techniques, emphasizing Stouffer’s and Fisher’s methods, and elucidate how mathematical formulations assist in prudently integrating auxiliary information. For instance, we might consider the estimated contributions from different data providers or attribute weights to studies. Weighting stands as a prevalent strategy for assimilating evidence, while discerning the optimal weighting scheme also demands consideration. Beyond introducing weights, Stouffer’s method can be further refined given estimates of the aggregate reports from all sites n=i=1Nni𝑛superscriptsubscript𝑖1𝑁subscript𝑛𝑖n=\sum_{i=1}^{N}n_{i}italic_n = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. This refinement introduces a continuity correction term, tempering over-conservative results when the overall counts are small.

4.3 Stouffer’s method

Stouffer’s method is employed by utilizing the Gaussian approximation of the Binomial parameter, which is based on the central limit theorem. This approach is commonly used when analyzing the Binomial and Poisson distributions, especially when the counts are sufficiently large. To test the success probability ρ𝜌\rhoitalic_ρ using Stouffer’s method, the distribution of c/nρ𝑐𝑛𝜌c/n-\rhoitalic_c / italic_n - italic_ρ is approximated by N(0,ρ(1ρ)/n)𝑁0𝜌1𝜌𝑛N(0,\rho(1-\rho)/n)italic_N ( 0 , italic_ρ ( 1 - italic_ρ ) / italic_n ). The z-score can then be calculated as z=(cnρ)/nρ(1ρ)𝑧𝑐𝑛𝜌𝑛𝜌1𝜌z=(c-n\rho)/\sqrt{n\rho(1-\rho)}italic_z = ( italic_c - italic_n italic_ρ ) / square-root start_ARG italic_n italic_ρ ( 1 - italic_ρ ) end_ARG [BCD02]. The p-value of the Binomial exact test is determined by the cumulative distribution function FBin(c;n,ρ)subscript𝐹Bin𝑐𝑛𝜌F_{\mathrm{Bin}}(c;n,\rho)italic_F start_POSTSUBSCRIPT roman_Bin end_POSTSUBSCRIPT ( italic_c ; italic_n , italic_ρ ). Define rounding error or fluctuation term ϵr=1/2{(nρ+znρ(1ρ))(nρ+znρ(1ρ))}subscriptitalic-ϵ𝑟12𝑛𝜌𝑧𝑛𝜌1𝜌𝑛𝜌𝑧𝑛𝜌1𝜌\epsilon_{r}=1/2-\{(n\rho+z\sqrt{n\rho(1-\rho)})-\left\lfloor(n\rho+z\sqrt{n% \rho(1-\rho)})\right\rfloor\}italic_ϵ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = 1 / 2 - { ( italic_n italic_ρ + italic_z square-root start_ARG italic_n italic_ρ ( 1 - italic_ρ ) end_ARG ) - ⌊ ( italic_n italic_ρ + italic_z square-root start_ARG italic_n italic_ρ ( 1 - italic_ρ ) end_ARG ) ⌋ }, which takes values in the interval [1/2,1/2]1212[-1/2,1/2][ - 1 / 2 , 1 / 2 ]. Thus, the true p-value in terms of the z-score with error terms can be expressed as Equation 7 [BCD02, BR10, Ess45].

p=Φ(z)+((12ρ)(1z2)6+ϵr)Φ(z)nρ(1ρ)+𝐎(n1)𝑝Φ𝑧12𝜌1superscript𝑧26subscriptitalic-ϵ𝑟Φ𝑧𝑛𝜌1𝜌𝐎superscript𝑛1p=\Phi(z)+\left(\frac{(1-2\rho)(1-z^{2})}{6}+\epsilon_{r}\right)\frac{\Phi(z)}% {\sqrt{n\rho(1-\rho)}}+\mathbf{O}(n^{-1})italic_p = roman_Φ ( italic_z ) + ( divide start_ARG ( 1 - 2 italic_ρ ) ( 1 - italic_z start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_ARG start_ARG 6 end_ARG + italic_ϵ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) divide start_ARG roman_Φ ( italic_z ) end_ARG start_ARG square-root start_ARG italic_n italic_ρ ( 1 - italic_ρ ) end_ARG end_ARG + bold_O ( italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) (7)

It should be noted that the denominator nρ(1ρ)𝑛𝜌1𝜌\sqrt{n\rho(1-\rho)}square-root start_ARG italic_n italic_ρ ( 1 - italic_ρ ) end_ARG in the first-order error term indicates that Stouffer’s method may be unreliable for small sample sizes or when the probability is close to 0 or 1 [BCD01].

After applying the Gaussian approximations, the combination of p-values becomes the next focus. One limitation of the naive meta-analysis methods is the assumption of equal contributions across studies, which may not hold true, especially when the studies have significantly different sizes. Determining appropriate weights for different studies poses a challenging task. In the case of Stouffer’s method, some studies in GWAS suggest using the inverse of the standard error or the square root of the sample size as weights [WMLE09]. Our proposed test is a special case of meta-analysis, as the centralized counts are the summation of decentralized counts. In this case, we can combine the approximations without any loss by introducing weights. By obtaining a centralized p-value p𝑝pitalic_p approximates Φ((i=1Nciρi=1Nni)/ρ(1ρ)i=1Nni)Φsuperscriptsubscript𝑖1𝑁subscript𝑐𝑖𝜌superscriptsubscript𝑖1𝑁subscript𝑛𝑖𝜌1𝜌superscriptsubscript𝑖1𝑁subscript𝑛𝑖\Phi\left((\sum_{i=1}^{N}c_{i}-\rho\sum_{i=1}^{N}n_{i})/\sqrt{\rho(1-\rho)\sum% _{i=1}^{N}n_{i}}\right)roman_Φ ( ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_ρ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / square-root start_ARG italic_ρ ( 1 - italic_ρ ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ) using distributed p-value pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT approximates Φ((ciρni)/ρ(1ρ)ni)Φsubscript𝑐𝑖𝜌subscript𝑛𝑖𝜌1𝜌subscript𝑛𝑖\Phi\left((c_{i}-\rho n_{i})/\sqrt{\rho(1-\rho)n_{i}}\right)roman_Φ ( ( italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_ρ italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / square-root start_ARG italic_ρ ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ), it turns out that the weights for aggregating the p-values is the square root of the shares of each entity, which is formulated as Equation 8.

p=Φ(i=1NsiΦ1(pi))𝑝Φsuperscriptsubscript𝑖1𝑁subscript𝑠𝑖superscriptΦ1subscript𝑝𝑖p=\Phi\left(\sum_{i=1}^{N}\sqrt{s_{i}}\Phi^{-1}(p_{i})\right)italic_p = roman_Φ ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT square-root start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG roman_Φ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) (8)

Furthermore, improvements can be made by incorporating a continuity correction when an estimated value for n=i=1Nni𝑛superscriptsubscript𝑖1𝑁subscript𝑛𝑖n=\sum_{i=1}^{N}n_{i}italic_n = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, representing the total counts of all entities, is available. Due to the discreteness of the Binomial distribution and the continuity of the normal distribution, the correction is helpful when n𝑛nitalic_n is not sufficiently large. One commonly used correction is Yates’ correction in the Binomial test, which involves subtracting 1/2121/21 / 2 from the absolute difference between the observed count c𝑐citalic_c and the expected count nρ𝑛𝜌n\rhoitalic_n italic_ρ. Considering our case where c<nρ𝑐𝑛𝜌c<n\rhoitalic_c < italic_n italic_ρ, the approximation of the p-value can be rewritten as Φ((c+1/2ρn)/ρ(1ρ)n)Φ𝑐12𝜌𝑛𝜌1𝜌𝑛\Phi\left((c+1/2-\rho n)/\sqrt{\rho(1-\rho)n}\right)roman_Φ ( ( italic_c + 1 / 2 - italic_ρ italic_n ) / square-root start_ARG italic_ρ ( 1 - italic_ρ ) italic_n end_ARG ).

Similarly, we can derive the combination formula as Equation 9.

p=Φ(i=1NsiΦ1(pi)+1N2ρ(1ρ)n)𝑝Φsuperscriptsubscript𝑖1𝑁subscript𝑠𝑖superscriptΦ1subscript𝑝𝑖1𝑁2𝜌1𝜌𝑛p=\Phi\left(\sum_{i=1}^{N}\sqrt{s_{i}}\Phi^{-1}(p_{i})+\frac{1-N}{2\sqrt{\rho(% 1-\rho)n}}\right)italic_p = roman_Φ ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT square-root start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG roman_Φ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + divide start_ARG 1 - italic_N end_ARG start_ARG 2 square-root start_ARG italic_ρ ( 1 - italic_ρ ) italic_n end_ARG end_ARG ) (9)

The additional term (1N)/(2ρ(1ρ)n)1𝑁2𝜌1𝜌𝑛(1-N)/(2\sqrt{\rho(1-\rho)n})( 1 - italic_N ) / ( 2 square-root start_ARG italic_ρ ( 1 - italic_ρ ) italic_n end_ARG ) accounts for the effect of the continuity correction on the combined p-value. This correction becomes more necessary when the number of entities N𝑁Nitalic_N is large, but the total counts during the baseline and testing procedures are relatively small, indicating more dispersed data. In such cases, the correction term becomes more significant. If other types of coarse-grained evidence are available to estimate the counts’ magnitudes, a correction term can be added to make the combined p-value less conservative.

4.4 Fisher’s method

The statistical tests based on Fisher’s method and Pearson’s method involve taking the logarithm of the p-values and summing them. The rationale behind the log sum approaches is that the p-value FBin(c;n,ρ)subscript𝐹Bin𝑐𝑛𝜌F_{\mathrm{Bin}}(c;n,\rho)italic_F start_POSTSUBSCRIPT roman_Bin end_POSTSUBSCRIPT ( italic_c ; italic_n , italic_ρ ) is upper and lower bounded by exponential functions such as Equation 10. See “Proof for Equation 10” Section of Supplementary Material for the details of the proof.

12nexp(nD(cnρ))pexp(nD(cnρ))12𝑛𝑛𝐷conditional𝑐𝑛𝜌𝑝𝑛𝐷conditional𝑐𝑛𝜌\frac{1}{\sqrt{2n}}\exp\left(-nD(\frac{c}{n}\|\rho)\right)\leq p\leq\exp\left(% -nD(\frac{c}{n}\|\rho)\right)divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_n end_ARG end_ARG roman_exp ( - italic_n italic_D ( divide start_ARG italic_c end_ARG start_ARG italic_n end_ARG ∥ italic_ρ ) ) ≤ italic_p ≤ roman_exp ( - italic_n italic_D ( divide start_ARG italic_c end_ARG start_ARG italic_n end_ARG ∥ italic_ρ ) ) (10)

where D((c/n)|ρ)𝐷conditional𝑐𝑛𝜌D\left((c/n)|\rho\right)italic_D ( ( italic_c / italic_n ) | italic_ρ ) represents the relative entropy (Kullback-Leibler divergence) between (c/n,(nc)/n)𝑐𝑛𝑛𝑐𝑛(c/n,(n-c)/n)( italic_c / italic_n , ( italic_n - italic_c ) / italic_n ) and (ρ,1ρ)𝜌1𝜌(\rho,1-\rho)( italic_ρ , 1 - italic_ρ ), which is (c/n)log(c/(nρ))+((nc)/n)log((nc)/(n(1ρ)))𝑐𝑛𝑐𝑛𝜌𝑛𝑐𝑛𝑛𝑐𝑛1𝜌(c/n)\log\left(c/(n\rho)\right)+((n-c)/n)\log\left((n-c)/(n(1-\rho))\right)( italic_c / italic_n ) roman_log ( italic_c / ( italic_n italic_ρ ) ) + ( ( italic_n - italic_c ) / italic_n ) roman_log ( ( italic_n - italic_c ) / ( italic_n ( 1 - italic_ρ ) ) ).

The logarithm of the p-value are upper bounded by nD((c/n)ρ)𝑛𝐷conditional𝑐𝑛𝜌-nD\left((c/n)\|\rho\right)- italic_n italic_D ( ( italic_c / italic_n ) ∥ italic_ρ ), allowing the summation of logarithmic p-values from different sources to be meaningful. The formula indicates that the choice of ρ𝜌\rhoitalic_ρ and n𝑛nitalic_n is relatively flexible for Fisher’s method, compared with the error term of Stouffer’s method which is in proportion to (nρ(1ρ))1/2superscript𝑛𝜌1𝜌12(n\rho(1-\rho))^{-1/2}( italic_n italic_ρ ( 1 - italic_ρ ) ) start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT. Our experiments also support the idea that when the reported magnitude is small and the testing period is short, like the the “Counts of claims with confirmed COVID-19”, Fisher’s method is more reliable than Stouffer’s.

Different weighting strategies for Fisher’s method have been investigated, and various modifications have been proposed [ZW22, Goo55, Lan61, YBPN21]. However, the optimal weighting scheme remains uncertain. Some studies suggest employing adaptively weighted statistics combined with permutation tests [LT11] or using Monte Carlo algorithms to approximate the rejection region and determine optimal weights. Another approach involves constructing Good’s statistic [Goo55], which is a weighted statistic defined as 2i=1Nwilogpi2superscriptsubscript𝑖1𝑁subscript𝑤𝑖subscript𝑝𝑖-2\sum_{i=1}^{N}w_{i}\log p_{i}- 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with weight wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for site i𝑖iitalic_i. Under the null hypothesis, it follows a chi-squared distribution with 2i=1Nwi2superscriptsubscript𝑖1𝑁subscript𝑤𝑖2\sum_{i=1}^{N}w_{i}2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT degrees of freedom (DF). Here, we use the weighting scheme wi=siNsubscript𝑤𝑖subscript𝑠𝑖𝑁w_{i}=s_{i}Nitalic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_N, where sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the share of each site. This weighting ensures that the resulting chi-squared statistic has a total DF equal to 2N2𝑁2N2 italic_N, i.e., 2i=1NsiNlogpiχ2N2similar-to2superscriptsubscript𝑖1𝑁subscript𝑠𝑖𝑁subscript𝑝𝑖subscriptsuperscript𝜒22𝑁-2\sum_{i=1}^{N}s_{i}N\log p_{i}\sim\chi^{2}_{2N}- 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_N roman_log italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 italic_N end_POSTSUBSCRIPT.

Additionally, some methods leverage the fact that the 1p1𝑝1-p1 - italic_p quantile of the Gamma distribution Gam(α=1,β)Gam𝛼1𝛽\mathrm{Gam}(\alpha=1,\beta)roman_Gam ( italic_α = 1 , italic_β ) is logp/β𝑝𝛽-\log p/\beta- roman_log italic_p / italic_β, i.e., FGam(α=1,β)(logp/β)=1psubscript𝐹Gam𝛼1𝛽𝑝𝛽1𝑝F_{\mathrm{Gam}(\alpha=1,\beta)}(-\log p/\beta)=1-pitalic_F start_POSTSUBSCRIPT roman_Gam ( italic_α = 1 , italic_β ) end_POSTSUBSCRIPT ( - roman_log italic_p / italic_β ) = 1 - italic_p, where β=1/2𝛽12\beta=1/2italic_β = 1 / 2 represents Fisher’s methods. For example, Lancaster’s method [Lan61] sets β=1/2𝛽12\beta=1/2italic_β = 1 / 2 and transforms each pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to the 1pi1subscript𝑝𝑖1-p_{i}1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTth quantile of the Gamma distribution with α=si/2𝛼subscript𝑠𝑖2\alpha=s_{i}/2italic_α = italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / 2. This transformation yields Xi=FGam(si/2,1/2)1(1pi)χsi2subscript𝑋𝑖subscriptsuperscript𝐹1Gamsubscript𝑠𝑖2121subscript𝑝𝑖similar-tosubscriptsuperscript𝜒2subscript𝑠𝑖X_{i}=F^{-1}_{\mathrm{Gam}(s_{i}/2,1/2)}(1-p_{i})\sim\chi^{2}_{s_{i}}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_F start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Gam ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / 2 , 1 / 2 ) end_POSTSUBSCRIPT ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∼ italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT. By additivity, we have i=1NXiχi=1Nsi2similar-tosuperscriptsubscript𝑖1𝑁subscript𝑋𝑖subscriptsuperscript𝜒2superscriptsubscript𝑖1𝑁subscript𝑠𝑖\sum_{i=1}^{N}X_{i}\sim\chi^{2}_{\sum_{i=1}^{N}s_{i}}∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT. In summary, Lancaster’s method generalizes Fisher’s method by assigning different weights to the DF of each source, resulting in a larger total DF compared to Fisher’s method. However, Yoon et al. [YBPN21] demonstrated that the large DF cause the individual distributions to approach the normal distribution, leading to a significant decrease in power. Yoon et al. consequently proposed the wFisher method, which employs a similar weighting scheme but shrinks the total DFs to match those of the original Fisher’s method. Specifically, they constructed the statistics i=1NFGam(wiN/2,1/2)1(1pi)χ2N2similar-tosuperscriptsubscript𝑖1𝑁subscriptsuperscript𝐹1Gamsubscript𝑤𝑖𝑁2121subscript𝑝𝑖subscriptsuperscript𝜒22𝑁\sum_{i=1}^{N}F^{-1}_{\mathrm{Gam}(w_{i}N/2,1/2)}(1-p_{i})\sim\chi^{2}_{2N}∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_F start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Gam ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_N / 2 , 1 / 2 ) end_POSTSUBSCRIPT ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∼ italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 italic_N end_POSTSUBSCRIPT. We observe that the wFisher method exhibits greater stability compared to other weighting methods. The wFisher framework can be written as Equation 11.

p=1Fχ2N2(i=1NFGam(siN2,12)1(1pi))𝑝1subscript𝐹subscriptsuperscript𝜒22𝑁superscriptsubscript𝑖1𝑁subscriptsuperscript𝐹1Gamsubscript𝑠𝑖𝑁2121subscript𝑝𝑖p=1-F_{\chi^{2}_{2N}}\left(\sum_{i=1}^{N}F^{-1}_{\mathrm{Gam}(\frac{s_{i}N}{2}% ,\frac{1}{2})}(1-p_{i})\right)italic_p = 1 - italic_F start_POSTSUBSCRIPT italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_F start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Gam ( divide start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_N end_ARG start_ARG 2 end_ARG , divide start_ARG 1 end_ARG start_ARG 2 end_ARG ) end_POSTSUBSCRIPT ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) (11)

4.5 Evaluation of the surge detection task

In the surge detection task, our evaluation centers on the timeliness of detecting surges, which is represented by binary sequences denoting the presence or absence of a surge. Alerts based on p-values are triggered when these values dip below a predetermined threshold, while growth rate alerts are activated when the growth rate of the Poisson rate parameter exceeds a set threshold. For real data analysis, the p-value alerts from the centralized setting serve as the ground truth. In contrast, the semi-synthetic data analysis uses the growth rate alerts as the ground truth. Once the ground truth is established, both centralized and decentralized p-value alerts are assessed by comparing them to this benchmark.

For each ground truth alert, the reconstructed alerts are considered true positives if they fall within a specified time window, e.g., no earlier than one week before and no later than two weeks after. Otherwise, the reconstructed alerts are classified as false positives. Moreover, any true alerts not matched by the constructed alerts are deemed false negatives. Following this rule, Precision and Recall metrics can be calculated. Precision represents the ratio of true positives (TP) to the summation of true positives and false positives (FP) (TP/(TP + FP)). Recall, on the other hand, denotes the ratio of true positives to the summation of true positives and false negatives (FN) (TP/(TP + FN)). These metrics are computed for different confidence level thresholds. Finally, the Precision-Recall metric is obtained, and the power (equal to Recall) is evaluated while controlling the FDR, which equals 1 - Precision, at 0.10, allowing for an assessment of method performance. It should be noted that the term “power” in this context refers to the power of the classification task, as opposed to the power associated with a Binomial test that was mentioned earlier.

4.6 Semi-synthetic data analysis

The semi-synthetic analysis is conducted under the assumption of noisy data, where the observed signal deviates from the true underlying prevalence. There are several objectives of this analysis. Firstly, it aims to investigate the effects of various factors, such as the number of sites, magnitudes of reports, and the imbalance of shares, while controlling for other dimensions. Secondly, the analysis facilitates the comparison of systematic errors arising from noise in the centralized and federated settings, as well as the assessment of the combination loss during the meta-analysis process. By utilizing real data as a starting point and employing semi-synthetic data analysis, we are able to control and examine the impact of different data features including the total count magnitudes and shares of entities.

The generation of the semi-synthetic data starts with the counts of outpatient insurance claims with a primary diagnosis of COVID-19 in each county provided by Change, covering the period from 2020-08-02 to 2022-07-30 on a daily basis. Initially, a 7-day moving average smoother is applied to the total counts of all counties in a state. We then treat the resulting smoothed values as the underlying occurrence rate confirmed COVID-19 cases each day, from which Poisson-distributed observed counts are simulated. That is, Poisson sampling is performed to generate simulated observations, assuming that the observed data is drawn from a Poisson distribution with the smoothed data serving as the rate parameter. After getting the state-level counts, we multinomially allocate them into different sites with the designed parameter, regarding them as count-level counts. Once the simulated counts are obtained, the next step involves computing the growth rate of the ground truth prevalence and the p-values for the hypothesis test at each time point. Subsequently, alerts are determined based on these computed values and the predetermined thresholds.

In contrast to the previous setting, where we established the ground truth through hypothesis testing on centralized data, we now define it based on the growth of the Poisson rate surpassing a predetermined threshold. This shift is due to the gap between the Poisson rate used for simulation and the Poisson counts in the centralized setting. The discrepancy between the growth rate alert of the Poisson rate and the centralized p-value alert can be ascribed to the inherent properties of the Poisson assumption and the introduced Poisson rate ratio test. Additionally, the distinction between centralized and decentralized p-value alerts directly results from the meta-analysis procedure. By examining and contrasting the errors arising from this bifurcated process, we demonstrate that the recombination cost exerts a comparable influence on the performance as does systematic noise. Moreover, the weighted and corrected Stouffer’s methods and the weighted Fisher’s method exhibit stability across various settings using auxiliary information, illustrating their robustness in practical scenarios.

References

  • [BCD01] Lawrence D Brown, T Tony Cai, and Anirban DasGupta. Interval estimation for a binomial proportion, 2001.
  • [BCD02] Lawrence D Brown, T Tony Cai, and Anirban DasGupta. Confidence intervals for a binomial proportion and asymptotic expansions, 2002.
  • [BR10] Rabi N Bhattacharya and R Ranga Rao. Normal approximation and asymptotic expansions. SIAM, 2010.
  • [Els91] RC Elston. On fisher’s method of combining p-values, 1991.
  • [Ess45] Carl-Gustav Esseen. Fourier analysis of distribution functions. a mathematical study of the laplace-gaussian law, 1945.
  • [Fay14] Michael P Fay. Testing the ratio of two poisson rates, 2014.
  • [GMS95] William Gardner, Edward P Mulvey, and Esther C Shaw. Regression analyses of counts and rates: Poisson, overdispersed poisson, and negative binomial models., 1995.
  • [GNTS08] Kangxia Gu, Hon Keung Tony Ng, Man Lai Tang, and William R Schucany. Testing the ratio of two poisson rates, 2008.
  • [Goo55] IJ Good. On the weighted combination of significance tests, 1955.
  • [HRD18] Nicholas A Heard and Patrick Rubin-Delanchy. Choosing between methods of combining-values, 2018.
  • [HY22] Chia-Ding Hou and Ti-Sung Yang. Distribution of weighted lancaster’s statistic for combining independent or dependent p-values, with applications to human genetic studies, 2022.
  • [Lan61] HO Lancaster. The combination of probabilities: an application of orthonormal functions, 1961.
  • [LT11] Jia Li and George C Tseng. An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies, 2011.
  • [MBG+21] Daniel J McDonald, Jacob Bien, Alden Green, Addison J Hu, Nat DeFries, Sangwon Hyun, Natalia L Oliveira, James Sharpnack, Jingjing Tang, Robert Tibshirani, et al. Can auxiliary indicators improve covid-19 forecasting and hotspot prediction?, 2021.
  • [PW40] J Przyborowski and H Wilenski. Homogeneity of results in testing samples from poisson series: With an application to testing clover seed for dodder, 1940.
  • [RBJ+21] Alex Reinhart, Logan Brooks, Maria Jahja, Aaron Rumack, Jingjing Tang, Sumit Agrawal, Wael Al Saeed, Taylor Arnold, Amartya Basu, Jacob Bien, et al. An open repository of real-time covid-19 indicators, 2021.
  • [Ric90] William R Rice. A consensus combined p-value test and the family-wide significance of component tests, 1990.
  • [RL05] Joseph P Romano and EL Lehmann. Testing statistical hypotheses, 2005.
  • [RWR23] David S Robertson, James MS Wason, and Aaditya Ramdas. Online multiple hypothesis testing, 2023.
  • [SRB+21] Joshua A Salomon, Alex Reinhart, Alyssa Bilinski, Eu Jing Chua, Wichada La Motte-Kerr, Minttu M Rönn, Marissa B Reitsma, Katherine A Morris, Sarah LaRocca, Tamer H Farag, et al. The us covid-19 trends and impact survey: Continuous real-time measurement of covid-19 symptoms, risks, protective behaviors, testing, and vaccination, 2021.
  • [VW20] Vladimir Vovk and Ruodu Wang. Combining p-values via averaging, 2020.
  • [Whi05] Michael C Whitlock. Combining probability from independent tests: the weighted z-method is superior to fisher’s approach, 2005.
  • [WMLE09] Sungho Won, Nathan Morris, Qing Lu, and Robert C Elston. Choosing an optimal method to combine p-values, 2009.
  • [YBPN21] Sora Yoon, Bukyung Baik, Taesung Park, and Dougu Nam. Powerful p-value combination methods to detect incomplete association, 2021.
  • [ZW22] Hong Zhang and Zheyang Wu. The generalized fisher’s combination and accurate p-value calculation under dependence, 2022.