Methodology
- [1] arXiv:2409.09178 [pdf, html, other]
Title: Identification of distributions for risks based on the first moment and c-statistic
Comments: 8 pages, 1 figure
Subjects: Methodology (stat.ME); Computation (stat.CO)
We show that for any family of distributions with support on [0,1] with strictly monotonic cumulative distribution function (CDF) that has no jumps and is quantile-identifiable (i.e., any two distinct quantiles identify the distribution), knowing the first moment and c-statistic is enough to identify the distribution. The derivations motivate numerical algorithms for mapping a given pair of expected value and c-statistic to the parameters of specified two-parameter distributions for probabilities. We implemented these algorithms in R and in a simulation study evaluated their numerical accuracy for common families of distributions for risks (beta, logit-normal, and probit-normal). An area of application for these developments is in risk prediction modeling (e.g., sample size calculations and Value of Information analysis), where one might need to estimate the parameters of the distribution of predicted risks from the reported summary statistics.
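The mapping from a (mean, c-statistic) pair to distribution parameters can be sketched for the beta family. This is a minimal illustration, not the authors' R implementation: it assumes the c-statistic of a continuous risk distribution with density f and mean m is the double integral of p(1-q)f(p)f(q) over p > q, divided by m(1-m), and that for a fixed mean the c-statistic decreases as the beta concentration a + b grows. The function names, the midpoint-rule integration, and the bisection bounds are all hypothetical choices.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p in (0, 1), computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def c_statistic_beta(a, b, n=2000):
    """c-statistic implied by Beta(a, b) predicted risks (midpoint rule)."""
    m = a / (a + b)
    h = 1.0 / n
    cum_controls = 0.0  # running integral of (1 - q) f(q) for q below p
    total = 0.0
    for i in range(n):
        p = (i + 0.5) * h
        f = beta_pdf(p, a, b)
        # Include half of the current cell so the inner integral ends at p.
        cum_mid = cum_controls + 0.5 * (1 - p) * f * h
        total += p * f * cum_mid * h
        cum_controls += (1 - p) * f * h
    return total / (m * (1 - m))

def beta_from_mean_and_c(mean, c_target, lo=0.5, hi=500.0, iters=60):
    """Bisect on the concentration s = a + b (assumed bounds lo, hi)."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on the log scale
        c_mid = c_statistic_beta(mean * mid, (1 - mean) * mid)
        if c_mid > c_target:
            lo = mid  # risks too dispersed: increase concentration
        else:
            hi = mid
    s = math.sqrt(lo * hi)
    return mean * s, (1 - mean) * s
```

For uniform risks, Beta(1, 1), the integral can be done by hand and gives c = 5/6, which is a convenient check on the numerical routine.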
- [2] arXiv:2409.09236 [pdf, html, other]
Title: Off-Policy Evaluation with Irregularly-Spaced, Outcome-Dependent Observation Times
Subjects: Methodology (stat.ME)
While the classic off-policy evaluation (OPE) literature commonly assumes decision time points to be evenly spaced for simplicity, in many real-world scenarios, such as those involving user-initiated visits, decisions are made at irregularly-spaced and potentially outcome-dependent time points. For a more principled evaluation of the dynamic policies, this paper constructs a novel OPE framework, which concerns not only the state-action process but also an observation process dictating the time points at which decisions are made. The framework is closely connected to the Markov decision process in computer science and with the renewal process in the statistical literature. Within the framework, two distinct value functions, derived from cumulative reward and integrated reward respectively, are considered, and statistical inference for each value function is developed under revised Markov and time-homogeneous assumptions. The validity of the proposed method is further supported by theoretical results, simulation studies, and a real-world application from electronic health records (EHR) evaluating periodontal disease treatments.
- [3] arXiv:2409.09310 [pdf, html, other]
Title: Exact Posterior Mean and Covariance for Generalized Linear Mixed Models
Comments: Manuscript under review
Subjects: Methodology (stat.ME)
A novel method is proposed for computing the exact posterior mean and covariance of the random effects given the response in a generalized linear mixed model (GLMM) when the response is not normally distributed. The research solves a long-standing problem in Bayesian statistics that arises when an intractable integral appears in the posterior distribution. It is well known that the posterior distribution of the random effects given the response in a GLMM contains intractable integrals when the response is not normally distributed. Previous methods rely on Monte Carlo simulations for the posterior distributions. They do not provide the exact posterior mean and covariance of the random effects given the response. The special integral computation (SIC) method is proposed to overcome the difficulty. The SIC method does not use the posterior distribution in the computation. Instead, it formulates an optimization problem to accomplish the task. An advantage is that the computation of the posterior distribution is unnecessary. The proposed SIC avoids the main difficulty in Bayesian analysis when intractable integrals appear in the posterior distribution.
- [4] arXiv:2409.09355 [pdf, html, other]
Title: A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions
Comments: 28 pages
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. The traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. A situation of the latter scenario occurs when the candidate predictors involve many binary indicators corresponding to categories of some categorical predictors as well as their interactions. We propose an alternative approach to the shrinkage prediction methods in such a case based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish theoretical validity of the proposed method, and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered.
- [5] arXiv:2409.09440 [pdf, html, other]
Title: Group Sequential Testing of a Treatment Effect Using a Surrogate Marker
Subjects: Methodology (stat.ME)
The identification of surrogate markers is motivated by their potential to make decisions sooner about a treatment effect. However, few methods have been developed to actually use a surrogate marker to test for a treatment effect in a future study. Most existing methods consider combining surrogate marker and primary outcome information to test for a treatment effect, rely on fully parametric methods where strict parametric assumptions are made about the relationship between the surrogate and the outcome, and/or assume the surrogate marker is measured at only a single time point. Recent work has proposed a nonparametric test for a treatment effect using only surrogate marker information measured at a single time point by borrowing information learned from a prior study where both the surrogate and primary outcome were measured. In this paper, we utilize this nonparametric test and propose group sequential procedures that allow for early stopping of treatment effect testing in a setting where the surrogate marker is measured repeatedly over time. We derive the properties of the correlated surrogate-based nonparametric test statistics at multiple time points and compute stopping boundaries that allow for early stopping for a significant treatment effect, or for futility. We examine the performance of our testing procedure using a simulation study and illustrate the method using data from two distinct AIDS clinical trials.
- [6] arXiv:2409.09512 [pdf, html, other]
Title: Doubly robust and computationally efficient high-dimensional variable selection
Subjects: Methodology (stat.ME)
The variable selection problem is to discover which of a large set of predictors is associated with an outcome of interest, conditionally on the other predictors. This problem has been widely studied, but existing approaches lack either power against complex alternatives, robustness to model misspecification, computational efficiency, or quantification of evidence against individual hypotheses. We present tower PCM (tPCM), a statistically and computationally efficient solution to the variable selection problem that does not suffer from these shortcomings. tPCM adapts the best aspects of two existing procedures that are based on similar functionals: the holdout randomization test (HRT) and the projected covariance measure (PCM). The former is a model-X test that utilizes many resamples and few machine learning fits, while the latter is an asymptotic doubly-robust style test for a single hypothesis that requires no resamples and many machine learning fits. Theoretically, we demonstrate the validity of tPCM, and perhaps surprisingly, the asymptotic equivalence of HRT, PCM, and tPCM. In so doing, we clarify the relationship between two methods from two separate literatures. An extensive simulation study verifies that tPCM can have significant computational savings compared to HRT and PCM, while maintaining nearly identical power.
- [7] arXiv:2409.09660 [pdf, html, other]
Title: On the Proofs of the Predictive Synthesis Formula
Comments: 11 pages, no figure, 1 table
Subjects: Methodology (stat.ME)
Bayesian predictive synthesis is useful in synthesizing multiple predictive distributions coherently. However, the proof for the fundamental equation of the synthesized predictive density has been missing. In this technical report, we review the series of research on predictive synthesis, then fill the gap between the known results and the equation used in modern applications. We provide two proofs and clarify the structure of predictive synthesis.
- [8] arXiv:2409.09865 [pdf, html, other]
Title: A general approach to fitting multistate cure models based on an extended-long-format data structure
Subjects: Methodology (stat.ME)
A multistate cure model is a statistical framework used to analyze and represent the transitions that individuals undergo between different states over time, taking into account the possibility of being cured by initial treatment. This model is particularly useful in pediatric oncology, where a fraction of the patient population achieves cure through treatment and therefore will never experience some events. Our study develops a generalized algorithm based on the extended long data format, an extension of the long data format in which a transition can be split into up to two rows, each with a weight reflecting the posterior probability of its cure status. The multistate cure model is fit on top of the current frameworks of the multistate model and the mixture cure model. The proposed algorithm makes use of the Expectation-Maximization (EM) algorithm and a weighted likelihood representation, so that it is easy to implement with standard packages. As an example, the proposed algorithm is applied to data from the European Society for Blood and Marrow Transplantation (EBMT). Standard errors of the estimated parameters are obtained via a non-parametric bootstrap procedure, while the method involving the calculation of the second-derivative matrix of the observed log-likelihood is also presented.
- [9] arXiv:2409.09884 [pdf, html, other]
Title: Dynamic quantification of player value for fantasy basketball
Comments: 22 pages
Subjects: Methodology (stat.ME)
Previous work on fantasy basketball quantifies player value for category leagues without taking draft circumstances into account. Quantifying value in this way is convenient, but inherently limited as a strategy, because it precludes the possibility of dynamic adaptation. This work introduces a framework for dynamic algorithms, dubbed "H-scoring", and describes an implementation of the framework for head-to-head formats, dubbed $H_0$. $H_0$ models many of the main aspects of category league strategy including category weighting, positional assignments, and format-specific objectives. Head-to-head simulations provide evidence that $H_0$ outperforms static ranking lists. Category-level results from the simulations reveal that one component of $H_0$'s strategy is punting a subset of categories, which it learns to do implicitly.
- [10] arXiv:2409.10001 [pdf, html, other]
Title: Generalized Matrix Factor Model
Subjects: Methodology (stat.ME)
This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM) that are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to constrained maximum likelihood estimation, and carefully tailored to be locally concave around the true factor and loading parameters. This statistically guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in matrix factor modeling and leads to feasible central limit theorems for the estimated factors and loadings. We also theoretically establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion to determine the numbers of row and column factors consistently. To numerically compute the constrained maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. An empirical data analysis of the company's operating performance shows that GMFM does clustering and reconstruction well in the presence of discontinuous entries in the data matrix.
- [11] arXiv:2409.10030 [pdf, html, other]
Title: On LASSO Inference for High Dimensional Predictive Regression
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Machine Learning (stat.ML)
LASSO introduces shrinkage bias into estimated coefficients, which can adversely affect the desirable asymptotic normality and invalidate the standard inferential procedure based on the $t$-statistic. The desparsified LASSO has emerged as a well-known remedy for this issue. In the context of high dimensional predictive regression, the desparsified LASSO faces an additional challenge: the Stambaugh bias arising from nonstationary regressors. To restore the standard inferential procedure, we propose a novel estimator called IVX-desparsified LASSO (XDlasso). XDlasso eliminates the shrinkage bias and the Stambaugh bias simultaneously and does not require prior knowledge about the identities of nonstationary and stationary regressors. We establish the asymptotic properties of XDlasso for hypothesis testing, and our theoretical findings are supported by Monte Carlo simulations. Applying our method to real-world applications from the FRED-MD database -- which includes a rich set of control variables -- we investigate two important empirical questions: (i) the predictability of the U.S. stock returns based on the earnings-price ratio, and (ii) the predictability of the U.S. inflation using the unemployment rate.
- [12] arXiv:2409.10174 [pdf, other]
Title: Information criteria for the number of directions of extremes in high-dimensional data
Subjects: Methodology (stat.ME)
In multivariate extreme value analysis, the estimation of the dependence structure in extremes is a challenging task, especially in the context of high-dimensional data. Therefore, a common approach is to reduce the model dimension by considering only the directions in which extreme values occur. In this paper, we use the concept of sparse regular variation recently introduced by Meyer and Wintenberger (2021) to derive information criteria for the number of directions in which extreme events occur, such as a Bayesian information criterion (BIC), a mean-squared error-based information criterion (MSEIC), and a quasi-Akaike information criterion (QAIC) based on the Gaussian likelihood function. As is typical in extreme value analysis, a challenging task is the choice of the number $k_n$ of observations used for the estimation. Therefore, for all information criteria, we present a two-step procedure to estimate both the number of directions of extremes and an optimal choice of $k_n$. We prove that the AIC of Meyer and Wintenberger (2023) and the MSEIC are inconsistent information criteria for the number of extreme directions whereas the BIC and the QAIC are consistent information criteria. Finally, the performance of the different information criteria is compared in a simulation study and applied on wind speed data.
- [13] arXiv:2409.10221 [pdf, other]
Title: bayesCureRateModel: Bayesian Cure Rate Modeling for Time to Event Data in R
Comments: 34 pages, 7 figures
Subjects: Methodology (stat.ME); Computation (stat.CO)
The family of cure models provides a unique opportunity to simultaneously model both the proportion of cured subjects (those not facing the event of interest) and the distribution function of time-to-event for susceptibles (those facing the event). In practice, the application of cure models is mainly facilitated by the availability of various R packages. However, most of these packages primarily focus on the mixture or promotion time cure rate model. This article presents a fully Bayesian approach implemented in R to estimate a general family of cure rate models in the presence of covariates. It builds upon the work by Papastamoulis and Milienos (2024) by additionally considering various options for describing the promotion time, including the Weibull, exponential, Gompertz, log-logistic and finite mixtures of gamma distributions, among others. Moreover, the user can choose any proper distribution function for modeling the promotion time (provided that some specific conditions are met). Posterior inference is carried out by constructing a Metropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the latent cure indicators and Metropolis-Hastings steps with Langevin diffusion dynamics for parameter updates. The main MCMC algorithm is embedded within a parallel tempering scheme by considering heated versions of the target posterior distribution. The package is illustrated on a real dataset analyzing the duration of the first marriage under the presence of various covariates such as the race, age and the presence of kids.
- [14] arXiv:2409.10318 [pdf, html, other]
Title: Systematic comparison of Bayesian basket trial designs with unequal sample sizes and proposal of a new method based on power priors
Subjects: Methodology (stat.ME)
Basket trials examine the efficacy of an intervention in multiple patient subgroups simultaneously. The division into subgroups, called baskets, is based on matching medical characteristics, which may result in small sample sizes within baskets that are also likely to differ. Sparse data complicate statistical inference. Several Bayesian methods have been proposed in the literature that allow information sharing between baskets to increase statistical power. In this work, we provide a systematic comparison of five different Bayesian basket trial designs when sample sizes differ between baskets. We consider the power prior approach with both known and new weighting methods, a design by Fujikawa et al., as well as models based on Bayesian hierarchical modeling and Bayesian model averaging. The results of our simulation study show a high sensitivity to changing sample sizes for Fujikawa's design and the power prior approach. Limiting the amount of shared information was found to be decisive for the robustness to varying basket sizes. In combination with the power prior approach, this resulted in the best performance and the most reliable detection of an effect of the treatment under investigation and its absence.
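Information sharing through a power prior can be illustrated with a beta-binomial model for binary responses. The sketch below is a generic textbook form, not the weighting schemes compared in the paper: it assumes each basket starts from a Beta(a0, b0) prior and that the data of every other basket enter the likelihood raised to a fixed weight between 0 and 1. The function name and the weight matrix are hypothetical.

```python
def power_prior_posterior(responses, sizes, weights, a0=1.0, b0=1.0):
    """Posterior Beta parameters per basket under fixed borrowing weights.

    responses[i] and sizes[i] are the responder count and sample size of
    basket i; weights[i][j] in [0, 1] is the (assumed, user-supplied)
    amount of information basket i borrows from basket j.
    """
    posteriors = []
    for i in range(len(sizes)):
        a = a0 + responses[i]
        b = b0 + sizes[i] - responses[i]
        for j in range(len(sizes)):
            if j == i:
                continue
            # Discounted successes and failures from the other basket.
            a += weights[i][j] * responses[j]
            b += weights[i][j] * (sizes[j] - responses[j])
        posteriors.append((a, b))
    return posteriors
```

With all weights equal to 0 each basket is analyzed on its own, and with all weights equal to 1 the data are fully pooled; capping the weights below 1 is one simple way to limit the amount of shared information when basket sizes differ.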
- [15] arXiv:2409.10352 [pdf, html, other]
Title: Partial Ordering Bayesian Logistic Regression Model for Phase I Combination Trials and Computationally Efficient Approach to Operational Prior Specification
Subjects: Methodology (stat.ME)
Recent years have seen increased interest in combining drug agents and/or schedules. Several methods for Phase I combination-escalation trials have been proposed, among which the partial ordering continual reassessment method (POCRM) has gained great attention for its simplicity and good operational characteristics. However, the one-parameter nature of the POCRM makes it restrictive in more complicated settings such as the inclusion of a control group. This paper proposes a Bayesian partial ordering logistic model (POBLRM), which combines partial ordering and the more flexible (than CRM) two-parameter logistic model. Simulation studies show that the POBLRM performs similarly to the POCRM in non-randomised settings. When patients are randomised between the experimental dose-combinations and a control, performance is drastically improved.
Most designs require specifying hyper-parameters, often chosen from statistical considerations (operational prior). The conventional "grid search" calibration approach requires large simulations, which are computationally costly. A novel "cyclic calibration" has been proposed to reduce the computation from multiplicative to additive. Furthermore, calibration processes should consider wide ranges of scenarios of true toxicity probabilities to avoid bias. A method to reduce scenarios based on scenario-complexities is suggested. This can reduce the computation by more than 500-fold while keeping operational characteristics similar to those of the grid search.
- [16] arXiv:2409.10448 [pdf, html, other]
Title: Why you should also use OLS estimation of tail exponents
Authors: Thiago Trafane Oliveira Santos (1), Daniel Oliveira Cajueiro (2) ((1) Central Bank of Brazil, Brasília, Brazil. Department of Economics, University of Brasilia, Brazil. (2) Department of Economics, University of Brasilia, Brazil. National Institute of Science and Technology for Complex Systems (INCT-SC). Machine Learning Laboratory in Finance and Organizations (LAMFO), Brazil.)
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistics Theory (math.ST)
Even though practitioners often estimate Pareto exponents running OLS rank-size regressions, the usual recommendation is to use the Hill MLE with a small-sample correction instead, due to its unbiasedness and efficiency. In this paper, we advocate that you should also apply OLS in empirical applications. On the one hand, we demonstrate that, with a small-sample correction, the OLS estimator is also unbiased. On the other hand, we show that the MLE assigns significantly greater weight to smaller observations. This suggests that the OLS estimator may outperform the MLE in cases where the distribution is (i) strictly Pareto but only in the upper tail or (ii) regularly varying rather than strictly Pareto. We substantiate our theoretical findings with Monte Carlo simulations and real-world applications, demonstrating the practical relevance of the OLS method in estimating tail exponents.
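The two estimators being compared can be sketched side by side. This is an illustration under stated assumptions rather than the paper's exact procedure: the sample is taken to be strictly Pareto with a known lower bound `x_min`, the MLE uses the familiar (n - 1) small-sample correction for that case, and the OLS rank-size regression uses the rank minus 1/2 shift of Gabaix and Ibragimov, which may differ from the correction the authors derive. The function names are hypothetical.

```python
import math

def hill_mle(sample, x_min=1.0):
    """Pareto exponent by maximum likelihood with the (n - 1) correction."""
    n = len(sample)
    return (n - 1) / sum(math.log(x / x_min) for x in sample)

def ols_rank_size(sample, shift=0.5):
    """Pareto exponent from regressing log(rank - shift) on log(size)."""
    ordered = sorted(sample, reverse=True)  # rank 1 is the largest value
    xs = [math.log(x) for x in ordered]
    ys = [math.log(rank - shift) for rank in range(1, len(ordered) + 1)]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return -sxy / sxx  # the regression slope estimates minus the exponent
```

On a strictly Pareto sample, for example one drawn with `random.paretovariate(2.0)`, both functions should return values near 2, with the OLS estimate somewhat noisier.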
New submissions for Tuesday, 17 September 2024 (showing 16 of 16 entries)
- [17] arXiv:2409.09066 (cross-list from econ.GN) [pdf, html, other]
Title: Replicating The Log of Gravity
Comments: 9 pages, 0 figures, 1 table
Subjects: General Economics (econ.GN); Computation (stat.CO); Methodology (stat.ME)
This document replicates the main results from Santos Silva and Tenreyro (2006) in R. The original results were obtained in TSP back in 2006. The idea here is to be explicit regarding the conceptual approach to regression in R. For most of the replication I used base R without external libraries except when it was absolutely necessary. The findings are consistent with the original article and reveal that the replication effort is minimal, without the need to contact the authors for clarifications or to resort to data transformations or filtering not mentioned in the article.
- [18] arXiv:2409.09243 (cross-list from econ.EM) [pdf, html, other]
Title: Unconditional Randomization Tests for Interference
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
In social networks or spatial experiments, one unit's outcome often depends on another's treatment, a phenomenon called interference. Researchers are interested in not only the presence and magnitude of interference but also its pattern based on factors like distance, neighboring units, and connection strength. However, the non-random nature of these factors and complex correlations across units pose challenges for inference. This paper introduces the partial null randomization tests (PNRT) framework to address these issues. The proposed method is finite-sample valid and applicable with minimal network structure assumptions, utilizing randomization testing and pairwise comparisons. Unlike existing conditional randomization tests, PNRT avoids the need for conditioning events, making it more straightforward to implement. Simulations demonstrate the method's desirable power properties and its applicability to general interference scenarios.
- [19] arXiv:2409.09894 (cross-list from cs.LG) [pdf, html, other]
Title: Estimating Wage Disparities Using Foundation Models
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
One thread of empirical work in social science focuses on decomposing group differences in outcomes into unexplained components and components explained by observable factors. In this paper, we study gender wage decompositions, which require estimating the portion of the gender wage gap explained by career histories of workers. Classical methods for decomposing the wage gap employ simple predictive models of wages which condition on a small set of simple summaries of labor history. The problem is that these predictive models cannot take advantage of the full complexity of a worker's history, and the resulting decompositions thus suffer from omitted variable bias (OVB), where covariates that are correlated with both gender and wages are not included in the model. Here we explore an alternative methodology for wage gap decomposition that employs powerful foundation models, such as large language models, as the predictive engine. Foundation models excel at making accurate predictions from complex, high-dimensional inputs. We use a custom-built foundation model, designed to predict wages from full labor histories, to decompose the gender wage gap. We prove that the way such models are usually trained might still lead to OVB, but develop fine-tuning algorithms that empirically mitigate this issue. Our model captures a richer representation of career history than simple models and predicts wages more accurately. In detail, we first provide a novel set of conditions under which an estimator of the wage gap based on a fine-tuned foundation model is $\sqrt{n}$-consistent. Building on the theory, we then propose methods for fine-tuning foundation models that minimize OVB. Using data from the Panel Study of Income Dynamics, we find that history explains more of the gender wage gap than standard econometric models can measure, and we identify elements of history that are important for reducing OVB.
- [20] arXiv:2409.09903 (cross-list from stat.ML) [pdf, html, other]
Title: Learning large softmax mixtures with warm start EM
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Mixed multinomial logits are discrete mixtures introduced several decades ago to model the probability of choosing an attribute from $p$ possible candidates, in heterogeneous populations. The model has recently attracted attention in the AI literature, under the name softmax mixtures, where it is routinely used in the final layer of a neural network to map a large number $p$ of vectors in $\mathbb{R}^L$ to a probability vector. Despite its wide applicability and empirical success, statistically optimal estimators of the mixture parameters, obtained via algorithms whose running time scales polynomially in $L$, are not known. This paper provides a solution to this problem for contemporary applications, such as large language models, in which the mixture has a large number $p$ of support points, and the size $N$ of the sample observed from the mixture is also large. Our proposed estimator combines two classical estimators, obtained respectively via a method of moments (MoM) and the expectation-maximization (EM) algorithm. Although both estimator types have been studied, from a theoretical perspective, for Gaussian mixtures, no similar results exist for softmax mixtures for either procedure. We develop a new MoM parameter estimator based on latent moment estimation that is tailored to our model, and provide the first theoretical analysis for a MoM-based procedure in softmax mixtures. Although consistent, MoM for softmax mixtures can exhibit poor numerical performance, as observed in other mixture models. Nevertheless, as MoM is provably in a neighborhood of the target, it can be used as a warm start for any iterative algorithm. We study in detail the EM algorithm, and provide its first theoretical analysis for softmax mixtures. Our final proposal for parameter estimation is the EM algorithm with a MoM warm start.
- [21] arXiv:2409.09973 (cross-list from math.ST) [pdf, other]
Title: Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation.
- [22] arXiv:2409.10374 (cross-list from stat.AP) [pdf, html, other]
Title: Nonlinear Causality in Brain Networks: With Application to Motor Imagery vs Execution
Subjects: Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)
One fundamental challenge of data-driven analysis in neuroscience is modeling causal interactions and exploring the connectivity of nodes in a brain network. Various statistical methods, relying on various perspectives and employing different data modalities, are being developed to examine and comprehend the underlying causal structures inherent to brain dynamics. This study introduces a novel statistical approach, TAR4C, to dissect causal interactions in multichannel EEG recordings. TAR4C uses the threshold autoregressive model to describe the causal interaction between nodes or clusters of nodes in a brain network. The perspective involves testing whether one node, which may represent a brain region, can control the dynamics of the other. The node that has such an impact on the other is called a threshold variable and can be classified as causative because its functionality is the leading source operating as an instantaneous switching mechanism that regulates the time-varying autoregressive structure of the other. This statistical concept is commonly referred to as threshold non-linearity. Once threshold non-linearity has been verified between a pair of nodes, the subsequent essential facet of TAR modeling is to assess the predictive ability of the causal node for the current activity of the other and represent causal interactions in autoregressive terms. This predictive ability is what underlies Granger causality. The TAR4C approach can discover non-linear and time-dependent causal interactions without negating the G-causality perspective. The efficacy of the proposed approach is exemplified by analyzing the EEG signals recorded during the motor movement/imagery experiment. The similarities and differences between the causal interactions manifesting during the execution and the imagery of a given motor movement are demonstrated by analyzing EEG recordings from multiple subjects.
Cross submissions for Tuesday, 17 September 2024 (showing 6 of 6 entries)
- [23] arXiv:2306.00453 (replaced) [pdf, html, other]
-
Title: A Gaussian Sliding Windows Regression Model for Hydrological Inference
Authors: Stefan Schrunner, Parham Pishrobat, Joseph Janssen, Anna Jenul, Jiguo Cao, Ali A. Ameli, William J. Welch
Subjects: Methodology (stat.ME)
Statistical models are an essential tool to model, forecast and understand the hydrological processes in watersheds. In particular, the understanding of time lags associated with the delay between rainfall occurrence and subsequent changes in streamflow is of high practical importance. Since water can take a variety of flow paths to generate streamflow, a series of distinct runoff pulses may combine to create the observed streamflow time series. Current state-of-the-art models are not able to sufficiently confront the problem complexity with interpretable parametrization, thus preventing novel insights about the dynamics of distinct flow paths from being formed. The proposed Gaussian Sliding Windows Regression Model targets this problem by combining the concept of multiple windows sliding along the time axis with multiple linear regression. The window kernels, which indicate the weights applied to different time lags, are implemented via Gaussian-shaped kernels. As a result, straightforward process inference can be achieved since each window can represent one flow path. Experiments on simulated and real-world scenarios underline that the proposed model achieves accurate parameter estimates and competitive predictive performance, while fostering explainable and interpretable hydrological modeling.
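The idea of Gaussian-shaped window kernels over time lags can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function names, the normalization of the weights to sum to one, and the example window parameters are all assumptions.

```python
import math

def gaussian_window(center, width, max_lag):
    """Weights over lags 0..max_lag from a Gaussian-shaped kernel, normalized to sum to 1."""
    raw = [math.exp(-0.5 * ((lag - center) / width) ** 2) for lag in range(max_lag + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def windowed_features(rain, windows, max_lag):
    """One feature per window: the kernel-weighted sum of lagged rainfall at each time step."""
    kernels = [gaussian_window(c, s, max_lag) for c, s in windows]
    rows = []
    for t in range(max_lag, len(rain)):
        rows.append([sum(k[lag] * rain[t - lag] for lag in range(max_lag + 1))
                     for k in kernels])
    return rows

# Hypothetical setup: a fast flow path (lag ~1) and a slow one (lag ~5).
rain = [0.0, 2.0, 5.0, 1.0, 0.0, 0.0, 3.0, 4.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0]
features = windowed_features(rain, windows=[(1.0, 1.0), (5.0, 2.0)], max_lag=10)
```

Each column of `features` could then serve as a predictor in a multiple linear regression on streamflow, so that each regression coefficient is associated with one window.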
- [24] arXiv:2306.15947 (replaced) [pdf, html, other]
-
Title: Separable pathway effects of semi-competing risks using multi-state models
Subjects: Methodology (stat.ME)
Semi-competing risks refer to the phenomenon where a primary event (such as mortality) can ``censor'' an intermediate event (such as relapse of a disease), but not vice versa. Under the multi-state model, the primary event consists of two specific types: the direct outcome event and an indirect outcome event developed from intermediate events. Within this framework, we show that the total treatment effect on the cumulative incidence of the primary event can be decomposed into three separable pathway effects, capturing treatment effects on population-level transition rates between states. We next propose two estimators for the counterfactual cumulative incidences of the primary event under hypothetical treatment components. One estimator is given by the generalized Nelson--Aalen estimator with inverse probability weighting under covariates isolation, and the other is given based on the efficient influence function. The asymptotic normality of these estimators is established. The first estimator only involves a propensity score model and avoids modeling the cause-specific hazards. The second estimator has robustness against the misspecification of submodels. As an illustration of its potential usefulness, the proposed method is applied to compare effects of different allogeneic stem cell transplantation types on overall survival after transplantation.
- [25] arXiv:2311.00528 (replaced) [pdf, html, other]
-
Title: On the Comparative Analysis of Average Treatment Effects Estimation via Data Combination
Subjects: Methodology (stat.ME)
There is growing interest in exploring causal effects in target populations via data combination. However, most approaches are tailored to specific settings and lack comprehensive comparative analyses. In this article, we focus on a typical scenario involving a source dataset and a target dataset. We first design six settings under covariate shift and conduct a comparative analysis by deriving the semiparametric efficiency bounds for the ATE in the target population. We then extend this analysis to six new settings that incorporate both covariate shift and posterior drift. Our study uncovers the key factors that influence efficiency gains and the ``effective sample size'' when combining two datasets, with a particular emphasis on the roles of the variance ratio of potential outcomes between datasets and the derivatives of the posterior drift function. To the best of our knowledge, this is the first paper that explicitly explores the role of the posterior drift functions in causal inference. Additionally, we propose novel methods for conducting sensitivity analysis to address violations of transportability between the two datasets. We empirically validate our findings by constructing locally efficient estimators and conducting extensive simulations. We demonstrate the proposed methods in two real-world applications.
- [26] arXiv:2311.14220 (replaced) [pdf, html, other]
-
Title: Assumption-Lean and Data-Adaptive Post-Prediction Inference
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
A primary challenge facing modern scientific research is the limited availability of gold-standard data which can be costly, labor-intensive, or invasive to obtain. With the rapid development of machine learning (ML), scientists can now employ ML algorithms to predict gold-standard outcomes with variables that are easier to obtain. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce PoSt-Prediction Adaptive inference (PSPA) that allows valid and powerful inference based on ML-predicted data. Its "assumption-lean" property guarantees reliable statistical inference without assumptions on the ML prediction. Its "data-adaptive" feature guarantees an efficiency gain over existing methods, regardless of the accuracy of ML prediction. We demonstrate the statistical superiority and broad applicability of our method through simulations and real-data applications.
- [27] arXiv:2402.18745 (replaced) [pdf, other]
-
Title: Degree-heterogeneous Latent Class Analysis for High-dimensional Discrete Data
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
The latent class model is a widely used mixture model for multivariate discrete data. Besides the existence of qualitatively heterogeneous latent classes, real data often exhibit additional quantitative heterogeneity nested within each latent class. The modern latent class analysis also faces extra challenges, including the high-dimensionality, sparsity, and heteroskedastic noise inherent in discrete data. Motivated by these phenomena, we introduce the Degree-heterogeneous Latent Class Model and propose an easy-to-implement HeteroClustering algorithm for it. HeteroClustering uses heteroskedastic PCA with l2 normalization to remove degree effects and perform clustering in the top singular subspace of the data matrix. We establish the result of exact clustering under minimal signal-to-noise conditions. We further investigate the estimation and inference of the high-dimensional continuous item parameters in the model, which are crucial to interpreting and finding useful markers for latent classes. We provide comprehensive procedures for global testing and multiple testing of these parameters with valid error controls. The superior performance of our methods is demonstrated through extensive simulations and applications to three diverse real-world datasets from political voting records, genetic variations, and single-cell sequencing.
- [28] arXiv:2404.03878 (replaced) [pdf, other]
-
Title: Wasserstein F-tests for Fréchet regression on Bures-Wasserstein manifolds
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
This paper considers the problem of regression analysis with a random covariance matrix as the outcome and Euclidean covariates in the framework of Fréchet regression on the Bures-Wasserstein manifold. Such regression problems have many applications in single cell genomics and neuroscience, where covariance matrices are measured over a large set of samples. Fréchet regression on the Bures-Wasserstein manifold is formulated as estimating the conditional Fréchet mean given covariates $x$. A non-asymptotic $\sqrt{n}$-rate of convergence (up to $\log n$ factors) is obtained for our estimator $\hat{Q}_n(x)$ uniformly for $\left\|x\right\| \lesssim \sqrt{\log n}$, which is crucial for deriving the asymptotic null distribution and power of our proposed statistical test for the null hypothesis of no association. In addition, a central limit theorem for the point estimate $\hat{Q}_n(x)$ is obtained, giving insights into a test for covariate effects. The null distribution of the test statistic is shown to converge to a weighted sum of independent chi-squares, which implies that the proposed test has the desired significance level asymptotically. Also, the power performance of the test is demonstrated against a sequence of contiguous alternatives. Simulation results show the accuracy of the asymptotic distributions. The proposed methods are applied to a single cell gene expression data set that shows the change of gene co-expression network as people age.
- [29] arXiv:2406.04072 (replaced) [pdf, html, other]
-
Title: Variational Prior Replacement in Bayesian Inference and Inversion
Journal-ref: Geophysical Journal International (2024): ggae334
Subjects: Methodology (stat.ME); Mathematical Physics (math-ph); Geophysics (physics.geo-ph)
Many scientific investigations require that the values of a set of model parameters are estimated using recorded data. In Bayesian inference, information from both observed data and prior knowledge is combined to update model parameters probabilistically by calculating the posterior probability distribution function. Prior information is often described by a prior probability distribution. Situations arise in which we wish to change prior information during the course of a scientific project. However, estimating the solution to any single Bayesian inference problem is often computationally costly, as it typically requires many model samples to be drawn, and the data set that would have been recorded if each sample were true must be simulated. Recalculating the Bayesian inference solution every time prior information changes can therefore be extremely expensive. We develop a mathematical formulation that allows the prior information that is embedded within a solution to be changed using variational methods, without recalculating the original Bayesian inference. In this method, existing prior information is removed from a previously obtained posterior distribution and is replaced by new prior information. We therefore call the methodology variational prior replacement (VPR). We demonstrate VPR using a 2D seismic full waveform inversion example, in which VPR provides similar posterior solutions to those obtained by solving independent inference problems using different prior distributions. The former can be completed within minutes on a laptop computer, whereas the latter requires days of computations using high-performance computing resources. We demonstrate the value of the method by comparing the posterior solutions obtained using three different types of prior information: uniform, smoothing and geological prior distributions.
- [30] arXiv:2407.08911 (replaced) [pdf, html, other]
-
Title: Computationally efficient and statistically accurate conditional independence testing with spaCRT
Subjects: Methodology (stat.ME); Applications (stat.AP)
We introduce the saddlepoint approximation-based conditional randomization test (spaCRT), a novel conditional independence test that effectively balances statistical accuracy and computational efficiency, inspired by applications to single-cell CRISPR screens. Resampling-based methods like the distilled conditional randomization test (dCRT) offer statistical precision but at a high computational cost. The spaCRT leverages a saddlepoint approximation to the resampling distribution of the dCRT test statistic, achieving very similar finite-sample statistical performance with significantly reduced computational demands. We prove that the spaCRT $p$-value approximates the dCRT $p$-value with vanishing relative error, and that these two tests are asymptotically equivalent. Through extensive simulations and real data analysis, we demonstrate that the spaCRT controls Type-I error and maintains high power, outperforming other asymptotic and resampling-based tests. Our method is particularly well-suited for large-scale single-cell CRISPR screen analyses, facilitating the efficient and accurate assessment of perturbation-gene associations.
- [31] arXiv:2408.06612 (replaced) [pdf, html, other]
-
Title: Double Robust high dimensional alpha test for linear factor pricing model
Subjects: Methodology (stat.ME)
In this paper, we investigate alpha testing for high-dimensional linear factor pricing models. We propose a spatial-sign-based max-type test to handle sparse alternative cases. Additionally, we prove that this test is asymptotically independent of the spatial-sign-based sum-type test proposed by Liu et al. (2023). Based on this result, we introduce a Cauchy Combination test procedure that combines both the max-type and sum-type tests. Simulation studies and real data applications demonstrate that the newly proposed test procedure is robust not only to heavy-tailed distributions but also to the sparsity of the alternative hypothesis.
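The combination step can be illustrated with a short sketch. It assumes the standard Cauchy combination rule with equal weights; the function name `cauchy_combination` and the two example p-values are hypothetical and are not taken from the paper.

```python
import math

def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule (equal weights by default)."""
    n = len(p_values)
    if weights is None:
        weights = [1.0 / n] * n
    # Map each p-value to a standard Cauchy quantile and take the weighted sum.
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # Survival function of the standard Cauchy distribution evaluated at the statistic.
    return 0.5 - math.atan(stat) / math.pi

# Hypothetical p-values from a max-type test and a sum-type test.
p_max, p_sum = 0.01, 0.20
p_combined = cauchy_combination([p_max, p_sum])
```

With a single input the rule returns that p-value unchanged, and a very small p-value from either test tends to dominate the combined result, which is why this kind of combination can remain sensitive to both sparse and dense alternatives.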
- [32] arXiv:2409.07125 (replaced) [pdf, html, other]
-
Title: Integrating Multiple Data Sources with Interactions in Multi-Omics Using Cooperative Learning
Comments: 22 pages, 6 figures
Subjects: Methodology (stat.ME)
Modeling with multi-omics data presents multiple challenges such as the high-dimensionality of the problem ($p \gg n$), the presence of interactions between features, and the need for integration between multiple data sources. We establish an interaction model that allows for the inclusion of multiple sources of data from the integration of two existing methods, pliable lasso and cooperative learning. The integrated model is tested both on simulation studies and on real multi-omics datasets for predicting labor onset and cancer treatment response. The results show that the model is effective in modeling multi-source data in various scenarios where interactions are present, both in terms of prediction performance and selection of relevant variables.
- [33] arXiv:2206.10240 (replaced) [pdf, html, other]
-
Title: Core-Elements for Large-Scale Least Squares Estimation
Comments: Accepted by Statistics and Computing
Subjects: Computation (stat.CO); Methodology (stat.ME)
The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive applications in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\mathrm{nnz}(X)+rp^2)$ computational cost, where $X$ is an $n\times p$ predictor matrix, $r$ is the number of elements selected from each column of $X$, and $\mathrm{nnz}(\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
- [34] arXiv:2307.02616 (replaced) [pdf, html, other]
-
Title: Federated Epidemic Surveillance
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Methodology (stat.ME)
Epidemic surveillance is a challenging task, especially when crucial data is fragmented across institutions and data custodians are unable or unwilling to share it. This study aims to explore the feasibility of a simple federated surveillance approach. The idea is to conduct hypothesis tests for a rise in counts behind each custodian's firewall and then combine p-values from these tests using techniques from meta-analysis. We propose a hypothesis testing framework to identify surges in epidemic-related data streams and conduct experiments on real and semi-synthetic data to assess the power of different p-value combination methods to detect surges without needing to combine the underlying counts. Our findings show that relatively simple combination methods achieve a high degree of fidelity and suggest that infectious disease outbreaks can be detected without needing to share even aggregate data across institutions.
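One classical meta-analysis technique for combining p-values is Fisher's method, sketched below. The abstract does not say which combination methods were evaluated, so treating Fisher's method as one of them is an assumption; the function name `fisher_combine` and the per-site p-values are hypothetical.

```python
import math

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p) follows a chi-square law with 2k df under the null."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom 2k the chi-square survival function has a closed form:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical p-values, one per data custodian; only these values leave each site.
site_p_values = [0.04, 0.10, 0.03]
combined_p = fisher_combine(site_p_values)
```

Only the p-values are shared in this sketch, never the underlying counts, which mirrors the federated setting described in the abstract.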
- [35] arXiv:2307.10272 (replaced) [pdf, html, other]
-
Title: A Shrinkage Likelihood Ratio Test for High-Dimensional Subgroup Analysis with a Logistic-Normal Mixture Model
Comments: 34 pages
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
In subgroup analysis, testing the existence of a subgroup with a differential treatment effect serves as protection against spurious subgroup discovery. Despite its importance, this hypothesis test is irregular: the parameter characterizing subgroup classification is not identified under the null hypothesis of no subgroup. Due to this irregularity, the existing methods have the following two limitations. First, the asymptotic null distribution of test statistics often takes an intractable form, which necessitates computationally demanding resampling methods to calculate the critical value. Second, the personal attributes characterizing subgroup membership are not allowed to be high-dimensional. To solve these two problems simultaneously, this study develops a shrinkage likelihood ratio test for the existence of a subgroup using a logistic-normal mixture model. The proposed test statistics are built on a modified likelihood function that shrinks possibly high-dimensional unidentified parameters toward zero under the null hypothesis while retaining power under the alternative.
- [36] arXiv:2401.04778 (replaced) [pdf, html, other]
-
Title: Generative neural networks for characteristic functions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
We provide an algorithm to simulate from a (multivariate) characteristic function, which is only accessible in a black-box format. The method is based on a generative neural network, whose loss function exploits a specific representation of the Maximum-Mean-Discrepancy metric to directly incorporate the targeted characteristic function. The algorithm is universal in the sense that it is independent of the dimension and that it does not require any assumptions on the given characteristic function. Furthermore, finite sample guarantees on the approximation quality in terms of the Maximum-Mean-Discrepancy metric are derived. The method is illustrated in a simulation study.
- [37] arXiv:2403.13153 (replaced) [pdf, other]
-
Title: Tensor Time Series Imputation through Tensor Factor Modelling
Comments: 78 pages, 13 figures
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We propose tensor time series imputation when the missing pattern in the tensor data can be general, as long as any two data positions along a tensor fibre are both observed for enough time points. The method is based on a tensor time series factor model with Tucker decomposition of the common component. One distinguishing feature of the tensor time series factor model used is that there can be weak factors in the factor loadings matrix for each mode. This reflects reality better when real data can have weak factors which drive only groups of observed variables, for instance, a sector factor in a financial market driving only stocks in a particular sector. Using the data with missing entries, asymptotic normality is derived for rows of estimated factor loadings, while consistent covariance matrix estimation enables us to carry out inferences. As a first in the literature, we also propose a ratio-based estimator for the rank of the core tensor under general missing patterns. Rates of convergence are spelt out for the imputations from the estimated tensor factor models. Simulation results show that our imputation procedure works well, with asymptotic normality and corresponding inferences also demonstrated. Re-imputation performance is also gauged, and we demonstrate that using a slightly larger rank than estimated gives superior re-imputation performance. A Fama-French portfolio example with matrix returns and an OECD data example with a matrix of economic indicators are presented and analyzed, showing the efficacy of our imputation approach compared to direct vector imputation.
- [38] arXiv:2403.16828 (replaced) [pdf, html, other]
-
Title: Asymptotics of predictive distributions driven by sample means and variances
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
Let $\alpha_n(\cdot)=P\bigl(X_{n+1}\in\cdot\mid X_1,\ldots,X_n\bigr)$ be the predictive distributions of a sequence $(X_1,X_2,\ldots)$ of $p$-dimensional random vectors. Suppose $$\alpha_n= \mathcal{N} _p (M_n,Q_n)$$ where $M_n=\frac{1}{n}\sum_{i=1}^nX_i$ and $Q_n=\frac{1}{n}\sum_{i=1}^n(X_i-M_n)(X_i-M_n)^t$. Then, there is a random probability measure $\alpha$ on the Borel subsets of $\mathbb{R}^p$ such that $\lVert\alpha_n-\alpha\rVert\overset{a.s.}\longrightarrow 0$ where $\lVert\cdot\rVert$ is total variation distance. An explicit expression for $\alpha$ is provided and the convergence rate of $\lVert\alpha_n-\alpha\rVert$ is shown to be arbitrarily close to $n^{-1/2}$. Moreover, it is still true that $\lVert\alpha_n-\alpha\rVert\overset{a.s.}\longrightarrow 0$ even if $\alpha_n=\mathcal{L}(M_n,Q_n)$ where $\mathcal{L}$ belongs to a class of distributions much larger than the normal. The predictives $\alpha_n$ are useful in various frameworks, including Bayesian predictive inference and predictive resampling. Finally, the asymptotic behavior of copula-based predictive distributions (introduced in [13]) is investigated and a numerical experiment is performed.
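The recursion behind these predictive distributions can be sketched in the univariate case $p=1$, where $M_n$ is the sample mean and $Q_n$ is the sample variance with divisor $n$. This is only an illustration of predictive resampling under that simplifying assumption; the function names, the seed, and the example data are hypothetical.

```python
import random

def predictive_params(xs):
    """Return (M_n, Q_n) for a univariate sample, with Q_n using divisor n as in the abstract."""
    n = len(xs)
    m = sum(xs) / n
    q = sum((x - m) ** 2 for x in xs) / n
    return m, q

def predictive_resample(xs, steps, rng):
    """Repeatedly draw X_{n+1} ~ N(M_n, Q_n) and append it to the sequence."""
    path = list(xs)
    for _ in range(steps):
        m, q = predictive_params(path)
        path.append(rng.gauss(m, q ** 0.5))
    return path

rng = random.Random(0)
observed = [1.2, 0.7, 1.9, 1.1]  # hypothetical data
path = predictive_resample(observed, 200, rng)
m_final, q_final = predictive_params(path)
```

Running the recursion for many steps gives one draw of the pair `(m_final, q_final)`, which approximates the parameters of a single realization of the random limit $\alpha$; repeating with different seeds would trace out the variability of that limit.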