Manuscript under review

  • Regularizing BELIEF with Sequency Lasso. (with Zhang, W., Ni, H., Liu, Y., Zhang, K.)

  • SorBET: A Fast and Powerful Algorithm to Test Dependence of Variables. (with Zhang, W., Baiocchi, M., Li, Y., Zhang, K.)

  • Bayesian Analysis of Multiple Index Additive Models. (with Z. Liang, J. Liu, and X. Wang)
    Abstract

    In statistical applications with a large number of covariates, it is common to first reduce the dimension of the predictors. This article presents a Bayesian Multi-Index Additive Model (BMIAM) that combines the two steps of dimension reduction and statistical modelling. The indexes are reparameterized with hyperspherical coordinates, and the ridge functions are modelled using Bayesian B-splines. A Markov chain Monte Carlo algorithm is developed to sample from the posterior of the proposed model. Furthermore, posterior consistency of prediction is established under certain regularity conditions. The performance of the proposed model is demonstrated through simulations and real-world data applications, which show that the proposed methods provide reliable estimates of the dimension reduction space and good predictability of responses.
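
    The hyperspherical reparameterization of an index vector can be illustrated with a short sketch. This is not the authors' implementation; the function names are illustrative and NumPy is assumed. A unit index vector in R^p is mapped to p-1 angles and back:

      import numpy as np

      def angles_to_unit_vector(theta):
          """Map p-1 hyperspherical angles to a unit vector in R^p."""
          theta = np.asarray(theta, dtype=float)
          p = theta.size + 1
          beta = np.ones(p)
          for j in range(p - 1):
              beta[j] *= np.cos(theta[j])
              beta[j + 1:] *= np.sin(theta[j])
          return beta  # has unit Euclidean norm up to rounding error

      def unit_vector_to_angles(beta):
          """Inverse map, assuming the conventional angle ranges."""
          beta = np.asarray(beta, dtype=float)
          theta = np.zeros(beta.size - 1)
          for j in range(beta.size - 1):
              tail = np.linalg.norm(beta[j:])   # norm of the remaining coordinates
              theta[j] = 0.0 if tail == 0 else np.arccos(np.clip(beta[j] / tail, -1, 1))
          return theta  # note: the sign of the last coordinate is not tracked here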

  • Rate optimal multiple testing procedure in high-dimensional regression. (with P. Ji) — arXiv preprint.
    Abstract

    Multiple testing and variable selection have gained much attention in statistical theory and methodology research. They address the same problem of identifying the important variables among many (Jin, 2012), yet there is little overlap between the two bodies of literature. Research on variable selection has focused on selection consistency, i.e., both type I and type II errors converging to zero. This is only possible when the signals are sufficiently strong, unlike in many modern applications. In the regime where the signals are both rare and weak, a certain amount of false discoveries must be allowed, as long as some error rate can be controlled. In this paper, motivated by the work of Ji and Jin (2012) and Jin (2012) in the rare/weak regime, we extend their UPS procedure for variable selection to multiple testing. Under certain conditions, the new UPT procedure achieves the fastest convergence rate of marginal false non-discovery rates while controlling the marginal false discovery rate at any designated level alpha asymptotically. Numerical results are provided to demonstrate the advantage of the proposed method.

    arXiv

  • Model-free multiple testing for matrix-valued predictors with false discovery control. (with L. Yan and X. Zhang)
    Abstract

    Identifying influential variables in high-dimensional matrix-valued data while controlling the false discovery rate (FDR) is a critical challenge in modern data science. We propose a novel, model-free procedure specifically designed for simultaneous row and column selection in matrix predictor regression. Our approach utilizes folding selection subspaces (FSS) to formulate structured hypotheses and employs data splitting to construct mirror statistics from FSS estimators. This design bypasses restrictive model specification and the need for p-value computation. Key theoretical contributions include establishing the asymptotic distribution of FSS estimators and proving that the mirror statistic is asymptotically symmetric with respect to zero under the null hypothesis. Using this symmetry, we develop a multiple hypothesis testing procedure with data-driven thresholds that provably controls the row-wise and column-wise FDR at the desired level asymptotically. The framework is further extended to control element-wise FDR under specific structural assumptions. Extensive simulations and a real data analysis demonstrate the superior performance of the proposed method over existing approaches across various settings.
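
    The data-driven threshold exploits the null symmetry of the mirror statistics: the left tail estimates the number of false discoveries in the right tail. A minimal sketch of this generic thresholding rule, assuming the mirror statistics have already been computed (this is not the paper's FSS construction itself):

      import numpy as np

      def mirror_fdr_threshold(m, q=0.1):
          """Smallest threshold t whose estimated FDR, i.e.
          #{j: m_j <= -t} / max(#{j: m_j >= t}, 1), is at most q."""
          m = np.asarray(m, dtype=float)
          for t in np.sort(np.abs(m[m != 0])):
              if np.sum(m <= -t) / max(np.sum(m >= t), 1) <= q:
                  return t
          return np.inf  # no feasible threshold: select nothing

      def mirror_select(m, q=0.1):
          """Indices whose mirror statistic exceeds the data-driven threshold."""
          m = np.asarray(m, dtype=float)
          return np.where(m >= mirror_fdr_threshold(m, q))[0]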

  • On the Testing of Multiple Hypothesis in Sliced Inverse Regression. (with X. Xing)
    Abstract

    We consider multiple testing in the general regression framework, aiming to study the relationship between a univariate response and a p-dimensional predictor. To test the hypothesis concerning the effect of each predictor, we construct an Angular Balanced Statistic (ABS) based on the sliced inverse regression estimator without assuming a model for the conditional distribution of the response. Based on the limiting distribution results developed in this paper, we show that ABS is asymptotically symmetric with respect to zero under the null hypothesis. We then propose a Model-free multiple Testing procedure using Angular balanced statistics (MTA) and show theoretically that its false discovery rate is less than or equal to a designated level asymptotically. Numerical evidence shows that the MTA method is much more powerful than its alternatives, subject to the control of the false discovery rate.

    arXiv

  • BEAUTY Powered BEAST. (with K. Zhang, W. Zhang, and W. Zhou)
    Abstract

    We study inference about the uniform distribution with the proposed binary expansion approximation of uniformity (BEAUTY) approach. Through an extension of the celebrated Euler's formula, we approximate the characteristic function of any copula distribution with a linear combination of means of binary interactions of marginal binary expansions. This novel characterization enables a unification of many important existing tests through approximations from some quadratic forms of symmetry statistics, where the deterministic weight matrix characterizes the power properties of each test. To achieve a uniformly high power, we study test statistics with data-adaptive weights through an oracle approach, referred to as the binary expansion adaptive symmetry test (BEAST). By utilizing the properties of the binary expansion filtration, we show that the Neyman-Pearson test of uniformity can be approximated by an oracle weighted sum of symmetry statistics. The BEAST with this oracle leads all existing tests we considered in empirical power against all complex forms of alternatives. This oracle therefore sheds light on the potential of substantial improvements in power and on the form of optimal weights under each alternative. By approximating this oracle with data-adaptive weights, we develop the BEAST that improves the empirical power of many existing tests against a wide spectrum of common alternatives while providing clear interpretation of the form of non-uniformity upon rejection. We illustrate the BEAST with a study of the relationship between the location and brightness of stars.

    arXiv
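
    A simplified illustration of the binary expansion machinery for two uniform margins: each margin is expanded into depth-d binary bits, cross-interactions of the bits yield symmetry statistics, and a weight matrix turns them into a quadratic-form test statistic. The depth, normalization, and weighting below are illustrative choices, not the paper's:

      import itertools
      import numpy as np

      def binary_bits(u, depth):
          """Depth-d binary expansion of u in [0, 1), recoded as +/-1 variables."""
          u = np.asarray(u, dtype=float)
          bits = np.empty((u.size, depth))
          frac = u.copy()
          for k in range(depth):
              frac = frac * 2
              digit = np.floor(frac)          # k-th binary digit, 0 or 1
              bits[:, k] = 2 * digit - 1      # recode as -1 / +1
              frac -= digit
          return bits

      def symmetry_statistics(u, v, depth=2):
          """Normalized sums of cross-interactions of the marginal binary bits."""
          au, av = binary_bits(u, depth), binary_bits(v, depth)
          n = au.shape[0]
          subsets = [list(s) for r in range(1, depth + 1)
                     for s in itertools.combinations(range(depth), r)]
          stats = [(np.prod(au[:, s1], axis=1) * np.prod(av[:, s2], axis=1)).sum()
                   / np.sqrt(n)
                   for s1 in subsets for s2 in subsets]
          return np.array(stats)              # roughly N(0, 1) entries under independence

      def quadratic_form_statistic(stats, weights):
          """Quadratic-form test statistic for a chosen deterministic weight matrix."""
          return float(stats @ weights @ stats)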

Publications on Statistical Methodology

  • Ultra-high Dimensional Threshold Selection for Quantile Feature Screening with FDR Error Rate Control: A Case Study on High Blood Pressure Analysis. (with S. Sanni and Y. Yu). Biometrics, 2025. Tentatively accepted.
    Abstract

    Identifying important genetic risk factors for high blood pressure is paramount for targeted interventions and comprehensive management strategies for this prevalent health condition. With ultra-high-dimensional genetic factors, feature screening plays a vital role in retaining the important risk factors and screening out the unimportant ones. However, its efficacy depends on selecting the appropriate threshold in different data contexts. We propose a novel Quantile Mirror (QM) approach for data-adaptive threshold selection with false discovery rate (FDR) control in quantile feature screening, where abnormally high blood pressure levels are of most interest. The symmetric QM statistics enable the estimation of the FDR. Multiple data splitting is employed for result stability, and different thresholds are allowed for different quantiles. Additionally, we introduce the Quantile REflection via Data Splitting (QREDS) and the Hard Threshold with QREDS feature screening procedures. We apply our proposed methods to the Framingham Heart Study (FHS) data. While many of the retained high-blood-pressure-related genetic features are validated by the medical literature, some new risk factors are discovered. We show that our proposed procedures asymptotically control the FDR under suitable conditions. We further demonstrate the performance of the proposed methods through extensive numerical analysis and simulation studies.

  • Discussion on “Data Fission: Splitting a Single Data Point”. Journal of the American Statistical Association, 2025, Volume 120, pp. 178–179.

    Journal link

  • Randomized Multiarm Bandits: An Improved Adaptive Data Collection Method. (with B. Ji and T. Wang). Statistical Analysis and Data Mining, 2024, Volume 17, Issue 2, e11681.
    Abstract

    In many scientific experiments, multi-armed bandits are used as an adaptive data collection method. However, this adaptive process can lead to a dependence that renders many commonly used statistical inference methods invalid. An example of this is the sample mean, which is a natural estimator of the mean parameter but can be biased. This can cause test statistics based on this estimator to have an inflated type I error rate, and the resulting confidence intervals may have significantly lower coverage probabilities than their nominal values. To address this issue, we propose an alternative approach called randomized multiarm bandits (rMAB), which combines a randomization step with a chosen MAB algorithm; by selecting the randomization probability appropriately, optimal regret can be achieved asymptotically. Numerical evidence shows that the bias of the sample mean based on the rMAB is much smaller than that of other methods. The test statistic and confidence interval produced by this method also perform much better than those of its competitors.

    Journal link
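
    A minimal sketch of the randomization idea: at each round, with a small probability the arm is drawn uniformly at random, and otherwise a base bandit algorithm is followed. The base algorithm (UCB1) and the fixed randomization probability below are illustrative stand-ins, not the paper's recommended choices:

      import numpy as np

      def randomized_bandit(pull, n_arms, horizon, eps=0.05, seed=0):
          """Run a UCB1-style bandit, but with probability eps pull a uniformly
          random arm instead; return per-arm pull counts and sample means."""
          rng = np.random.default_rng(seed)
          counts = np.zeros(n_arms, dtype=int)
          means = np.zeros(n_arms)
          for t in range(1, horizon + 1):
              if t <= n_arms:                      # initialization: pull each arm once
                  arm = t - 1
              elif rng.random() < eps:             # randomization step
                  arm = int(rng.integers(n_arms))
              else:                                # base bandit algorithm (UCB1 here)
                  ucb = means + np.sqrt(2 * np.log(t) / counts)
                  arm = int(np.argmax(ucb))
              reward = pull(arm)
              counts[arm] += 1
              means[arm] += (reward - means[arm]) / counts[arm]
          return counts, means

      # Example with two Gaussian arms (true means 0.1 and 0.3):
      counts, means = randomized_bandit(
          lambda a: np.random.normal([0.1, 0.3][a], 1.0), n_arms=2, horizon=1000)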

  • Bayesian Mixed Effect Higher-Order Hidden Markov Models with Applications to Predictive Healthcare Using Electronic Health Records. (with Y. Liao, Y. Xiang, and D. Ai). IISE Transactions, 2025, Volume 57, Issue 2, pp. 186–198.
    Abstract

    The disease progression dynamics observed in electronic health records often reflect patients’ health condition evolution, holding the promise of enabling the development of clinical predictive models. These dynamics, however, generally display significant variability among patients, due to some critical factors (e.g., gender and age) and patient-level heterogeneity. Moreover, future health state may not only depend on the current state, but also more distant history states due to the complicated disease progression. To capture this complex transition behavior and address mixed effects in clinical prediction problems, we propose a novel and flexible Bayesian Mixed-Effect Higher-Order Hidden Markov Model (MHOHMM), and develop a classifier based on MHOHMMs. A range of MHOHMMs are designed to capture different data structures and the optimal one is identified by using the k-fold cross-validation approach. An effective two-stage Markov chain Monte Carlo (MCMC) sampling algorithm is designed for model inference. A simulation study is conducted to evaluate the performance of the proposed sampling algorithm and the MHOHMM-based classification method. The practical utility of the proposed framework is demonstrated by a case study on the acute hypotensive episode prediction for intensive care unit patients. Our results show that the MHOHMM-based framework provides good prediction performance.

    Journal link

  • Controlling false discovery rate using Gaussian mirrors. (with X. Xing and J. Liu). Journal of the American Statistical Association, 2023, Volume 118, Issue 541, pp. 222–241.
    Abstract

    Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem. We here propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian perturbation, and proceeds with a chosen regression method, such as ordinary least squares or the Lasso (the mirror variables can also be created after selection). The mirror variables naturally lead to test statistics effective for controlling the FDR. Under a mild assumption on the dependence among the covariates, we show that the FDR can be controlled at any designated level asymptotically. We also demonstrate through extensive numerical studies that the GM method is more powerful than many existing methods for selecting relevant variables subject to FDR control, especially for cases when the covariates are highly correlated and the influential variables are not overly sparse.

    Journal link  |  arXiv
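
    A minimal sketch of the mirror-variable construction with an OLS fit (assuming centered data and n larger than p + 1). The scale c_j used below is a simple placeholder, not the paper's derived choice, and the resulting statistics would then be thresholded with a mirror-type FDR rule like the one sketched earlier in this document:

      import numpy as np

      def gaussian_mirror_stats(X, y, seed=0):
          """For each predictor j, replace x_j by the pair (x_j + c_j z_j, x_j - c_j z_j)
          with z_j ~ N(0, I), refit OLS, and form M_j = |b_j+ + b_j-| - |b_j+ - b_j-|."""
          rng = np.random.default_rng(seed)
          n, p = X.shape
          stats = np.empty(p)
          for j in range(p):
              z = rng.standard_normal(n)
              c_j = X[:, j].std()                  # placeholder scale, not the paper's c_j
              X_j = np.column_stack([X[:, j] + c_j * z,
                                     X[:, j] - c_j * z,
                                     np.delete(X, j, axis=1)])
              beta, *_ = np.linalg.lstsq(X_j, y, rcond=None)  # no intercept: data assumed centered
              b_plus, b_minus = beta[0], beta[1]
              stats[j] = abs(b_plus + b_minus) - abs(b_plus - b_minus)
          return stats                             # large positive values suggest signal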

  • On F-modelling based Empirical Bayes Estimation of Variances. (with Y. Kwon). Biometrika, 2023, Volume 110, Issue 1, pp. 69–81.
    Abstract

    We consider the problem of empirical Bayes estimation of multiple variances when provided with sample variances. Assuming an arbitrary prior on the variances, we derive different versions of the Bayes estimators using different loss functions. For one particular loss function, the resulting Bayes estimator relies on the marginal cumulative distribution function of the sample variances only. When replacing it with the empirical distribution function, we obtain an empirical Bayes version called F-modeling based empirical Bayes estimator of variances. We provide theoretical properties of this estimator and further demonstrate its advantages through extensive simulations and real data analysis.

    Journal link  |  arXiv

  • Local False Discovery Rate Based Methods for Multiple Testing of One-Way Classified Hypotheses. (with S. K. Sarkar). Electronic Journal of Statistics, 2022, Volume 16, Issue 2, pp. 6043–6085.
    Abstract

    This paper continues the line of research initiated in Liu et al. (2016) on developing a novel framework for multiple testing of hypotheses grouped in a one-way classified form using hypothesis-specific local false discovery rates (Lfdr's). It is built on an extension of the standard two-class mixture model from single to multiple groups, defining the hypothesis-specific Lfdr as a function of the conditional Lfdr for the hypothesis given that it is within a significant group and the Lfdr for the group itself, and involving a new parameter that measures the grouping effect. This definition captures the underlying group structure for the hypotheses belonging to a group more effectively than the standard two-class mixture model. Two new Lfdr based methods, possessing meaningful optimalities, are produced in their oracle forms. One, designed to control false discoveries across the entire collection of hypotheses, is proposed as a powerful alternative to simply pooling all the hypotheses into a single group and using the commonly used Lfdr based method under the standard single-group two-class mixture model. The other is proposed as an Lfdr analog of the method of Benjamini and Bogomolov (2014) for selective inference. It controls an Lfdr based measure of false discoveries associated with selecting groups concurrently with controlling the average of within-group false discovery proportions across the selected groups. Simulation studies and real-data application show that our proposed methods are often more powerful than their relevant competitors.

    Journal link  |  arXiv

  • Where to find needles in a haystack? TEST, 2022, Volume 31, Issue 1, pp. 148–174.
    Abstract

    In many existing methods of multiple comparison, one starts with either Fisher's p-value or the local fdr. One commonly used p-value, defined as the tail probability exceeding the observed test statistic under the null distribution, fails to use information from the distribution under the alternative hypothesis. The targeted region of signals could be wrong when the likelihood ratio is not monotone. The oracle local fdr based approaches could be optimal because they use the probability density functions of the test statistic under both the null and alternative hypotheses. However, the data-driven version could be problematic because of the difficulty and challenge of probability density function estimation. In this paper, we propose a new method, Cdf and Local fdr Assisted multiple Testing method (CLAT), which is optimal for cases when the p-value based methods are optimal and for some other cases when p-value based methods are not. Additionally, CLAT only relies on the empirical distribution function which quickly converges to the oracle one. Both the simulations and real data analysis demonstrate the superior performance of the CLAT method. Furthermore, the computation is instantaneous based on a novel algorithm and is scalable to large data sets.

    Journal link  |  arXiv

  • Global testing under the sparse alternatives for single index models. (with Q. Lin and J. Liu). In Festschrift in Honor of R. Dennis Cook, 2021.
    Abstract

    For the single index model with Gaussian design satisfying rank(var(E[x|y])) = 1, where the link function is unknown and the coefficient vector is a sparse p-dimensional unit vector with at most s nonzero entries, we are interested in testing the null hypothesis that the coefficient vector is zero against the alternative that some entries are nonzero. Assuming that var(E[x|y]) is non-vanishing, we define the generalized signal-to-noise ratio (gSNR) of the model as the unique non-zero eigenvalue of var(E[x|y]). We establish the detection boundary for both the single index model and the single index model with additive noise. It is rather surprising that the detection boundary for the single index model with additive noise matches that for linear regression models. These results pave the road for a thorough theoretical analysis of single/multiple index models in high dimensions.

    Journal link  |  arXiv

  • Sparse sliced inverse regression via lasso. (with Q. Lin and J. Liu). Journal of the American Statistical Association, 2019, Volume 114, Issue 528, pp. 1726–1739.
    Abstract

    For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if p/n goes to 0, where p is the dimension and n is the sample size. Thus, when p is of the same or a higher order of n, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when p is of order o(n²λ²), where λ is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples.

    Journal link  |  arXiv
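
    A simplified sketch of the Lasso-SIR idea: estimate var(E[x|y]) by slicing the response, build artificial responses from its top eigenvectors, and feed them to a Lasso. The slice count, penalty level, and scaling of the artificial response below are illustrative simplifications (scikit-learn is assumed), not the paper's exact construction:

      import numpy as np
      from sklearn.linear_model import Lasso

      def lasso_sir(X, y, n_slices=10, n_dir=1, alpha=0.1):
          """Estimate a sparse basis of the SDR space via slicing plus Lasso."""
          n, p = X.shape
          Xc = X - X.mean(axis=0)
          slices = np.array_split(np.argsort(y), n_slices)
          M = np.vstack([Xc[idx].mean(axis=0) for idx in slices])   # slice means
          cov_hat = M.T @ M / n_slices                              # estimate of var(E[x|y])
          eigvec = np.linalg.eigh(cov_hat)[1][:, -n_dir:]           # top eigenvectors
          betas = []
          for k in range(n_dir):
              y_tilde = np.empty(n)                                 # artificial response
              for h, idx in enumerate(slices):
                  y_tilde[idx] = M[h] @ eigvec[:, k]                # slice-mean projection
              betas.append(Lasso(alpha=alpha).fit(Xc, y_tilde).coef_)
          return np.column_stack(betas)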

  • On consistency and sparsity for sliced inverse regression in high dimensions. (with Q. Lin and J. Liu). Annals of Statistics, 2018, Volume 46, No. 2, pp. 580–610.
    Abstract

    We provide here a framework to analyze the phase transition phenomenon of sliced inverse regression (SIR), a supervised dimension reduction technique introduced by Li (1991). Under mild conditions, the asymptotic ratio ρ = lim p/n is the phase transition parameter and the SIR estimator is consistent if and only if ρ = 0. When the dimension p is greater than n, we propose a diagonal thresholding screening SIR (DT-SIR) algorithm. This method provides us with an estimate of the eigen-space of the covariance matrix of the conditional expectation, var(E[x|y]). The desired dimension reduction space is then obtained by multiplying the inverse of the covariance matrix on the eigen-space. Under certain sparsity assumptions on both the covariance matrix of the predictors and the loadings of the directions, we prove the consistency of DT-SIR in estimating the dimension reduction space in high dimensional data analysis. Extensive numerical experiments demonstrate the superior performance of the proposed method in comparison to its competitors.

    Journal link  |  arXiv
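
    The diagonal thresholding screening step can be sketched as follows: compute the diagonal of the sliced estimate of var(E[x|y]) and keep the coordinates whose entries exceed a threshold; the subsequent eigen-decomposition and multiplication by the inverse covariance matrix on the screened set are omitted here. The threshold rate is illustrative, not the paper's constant:

      import numpy as np

      def dt_sir_screen(X, y, n_slices=10, threshold=None):
          """Return the coordinates kept by diagonal thresholding of the
          sliced estimate of var(E[x|y])."""
          n, p = X.shape
          Xc = X - X.mean(axis=0)
          slices = np.array_split(np.argsort(y), n_slices)
          slice_means = np.vstack([Xc[idx].mean(axis=0) for idx in slices])
          diag_hat = (slice_means ** 2).mean(axis=0)  # diagonal of the sliced estimate
          if threshold is None:
              threshold = np.sqrt(np.log(p) / n)      # illustrative rate only
          return np.where(diag_hat > threshold)[0]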

  • Sample size determination for a three-arm equivalence trial of Poisson and Negative binomial responses. (with Y. W. Chang and Y. Tsong). Journal of Biopharmaceutical Statistics, 2017, Volume 27, Issue 2, pp. 239–256.
    Abstract

    Assessing equivalence or similarity has drawn much attention recently as many drug products have lost or will lose their patents in the next few years, especially certain best-selling biologics. To claim equivalence between the test treatment and the reference treatment when assay sensitivity is well established from historical data, one has to demonstrate both superiority of the test treatment over placebo and equivalence between the test treatment and the reference treatment. Thus, there is urgency for practitioners to derive a practical way to calculate sample size for a three-arm equivalence trial. The primary endpoints of a clinical trial may not always be continuous, but may be discrete. In this paper, the authors derive the power function and discuss the sample size requirement for a three-arm equivalence trial with Poisson and negative binomial clinical endpoints. In addition, the authors examine the effect of the dispersion parameter on the power and the sample size by varying its value from small to large. In extensive numerical studies, the authors demonstrate that the required sample size depends heavily on the dispersion parameter. Therefore, misusing a Poisson model for negative binomial data can easily lose up to 20% of power, depending on the value of the dispersion parameter.

    Journal link

  • A new approach to multiple testing of grouped hypotheses. (with Y. Liu and S. K. Sarkar). Journal of Statistical Planning and Inference, 2015, Volume 179, pp. 1–14.
    Abstract

    A two-fold loop testing algorithm (TLTA) is proposed for testing grouped hypotheses controlling false discoveries. It is constructed by decomposing a posterior measure of false discoveries across all hypotheses into within- and between-group components, allowing a portion of the overall FDR level to be used to maintain control over within-group false discoveries. Numerical calculations performed under certain model assumptions for the hidden states of the within-group hypotheses show its superior performance over its competitors that ignore the group structure, especially when only a few of the groups contain the signals, as expected in many modern applications. We offer a data-driven version of the TLTA by estimating the parameters using EM algorithms and provide simulation evidence of its favorable performance relative to these competitors. Real data applications have also produced encouraging results for the TLTA.

    Journal link

  • Capturing the severity of type II errors in high-dimensional multiple testing. (with L. He and S. K. Sarkar). Journal of Multivariate Analysis, 2015, Volume 142, pp. 106–116.
    Abstract

    The severity of type II errors is frequently ignored when deriving a multiple testing procedure, even though utilizing it properly can greatly help in making correct decisions. This paper puts forward a theory behind developing a multiple testing procedure that can incorporate the type II error severity and is optimal in the sense of minimizing a measure of false non-discoveries among all procedures controlling a measure of false discoveries. The theory is developed under a general model allowing arbitrary dependence by taking a compound decision theoretic approach to multiple testing with a loss function incorporating the type II error severity. We present this optimal procedure in its oracle form and offer numerical evidence of its superior performance over relevant competitors.

    Journal link

  • A Bayesian approach to construct multiple confidence intervals of selected parameters with sparse signals. (with S. K. Sarkar). Statistica Sinica, 2015, Volume 25, Number 2, pp. 725–742.
    Abstract

    Selective inference using multiple confidence intervals is an emerging area of statistical research whose importance has been realized only very recently. We consider making such inference in the context of analyzing data with sparse signals in a Bayesian framework. Although the traditional posterior credible intervals are immune to selection, they can have low power in detecting the true signals because they cover the no-signal value too often when the sparse nature of the data is not properly taken into account. We demonstrate this phenomenon using a canonical Bayes model with the parameters of interest following a zero-inflated mixture prior. We propose a new method of constructing multiple intervals for any given selection rule taking a Bayesian decision theoretic approach under such a model. It involves the local fdr, the posterior probability of a parameter being null, which is commonly used in multiple testing. It controls an overall measure of error rate, the Bayes or posterior false coverage rate, at a desired level among the selected intervals. We apply this method to the regression problem and demonstrate via simulations as well as data analyses that it is much more powerful in terms of enclosing zero less frequently than the traditional and some alternative methods.

    Journal link

  • Applying multiple testing procedure to detect changes in East African vegetation. (with N. Clements, S. K. Sarkar, and D. Kim). Annals of Applied Statistics, 2014, Volume 8, No. 1, pp. 286–308.
    Abstract

    The study of vegetation fluctuations gives valuable information toward effective land use and development. We consider this problem for the East African region based on the Normalized Difference Vegetation Index (NDVI) series from satellite remote sensing data collected between 1982 and 2006 over 8-kilometer grid points. We detect areas with significant increasing or decreasing monotonic vegetation changes using a multiple testing procedure controlling the mixed directional false discovery rate (mdFDR). Specifically, we use a three-stage directional Benjamini-Hochberg (BH) procedure with proven mdFDR control under independence and a suitable adaptive version of it. The performance of these procedures is studied through simulations before applying them to the vegetation data. Our analysis shows increasing vegetation in the Northern hemisphere as well as coastal Tanzania and generally decreasing Southern hemisphere vegetation trends, which are consistent with historical evidence.

    Journal link  |  arXiv

  • Sample size determination for a three-arm equivalence trial of normally distributed responses. (with Y. W. Chang, Y. Tsong and X. Dong). Journal of Biopharmaceutical Statistics, 2014, Volume 24, Issue 6, pp. 1190–1202.
    Abstract

    The equivalence assessment is often conducted through a three-arm clinical trial (namely, test, reference, and placebo) and it usually consists of three tests. The first two tests are to demonstrate the superiority of the test and the reference treatment to the placebo, and they are followed by an equivalence test between the test treatment and the reference treatment. When the response variable is continuous, equivalence is commonly defined in terms of mean difference, mean ratio, or ratio of mean differences, that is, the ratio of the mean difference between the test and the placebo to the mean difference between the reference and the placebo. These equivalence tests can be performed with both a hypothesis-testing approach and a confidence-interval approach. The advantage of applying the equivalence test by ratio of mean differences is that it can test both superiority of the test treatment over placebo and equivalence between the test and the reference simultaneously through a single hypothesis. In this article, we derive the test statistics and the power function for the ratio of mean differences hypothesis and solve for the required sample size for a three-arm clinical trial. Examples of required sample sizes are given in this article and compared with the required sample size of the traditional mean difference equivalence test. After a careful examination, we suggest increasing the power of the ratio of mean differences approach by appropriately adjusting the lower limit of the equivalence interval.

    Journal link

  • An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data. (with W. Wang and Z. Wei). Annals of Applied Statistics, 2013, Volume 7, No. 4, pp. 2229–2248.
    Abstract

    Because of the decreasing cost and high digital resolution, next-generation sequencing (NGS) is expected to replace the traditional hybridization-based microarray technology. For genetic studies, the first-step analysis of NGS data is often to identify genomic variants among sequenced samples. Several statistical models and tests have been developed for variant calling in NGS studies. The existing approaches, however, are based on either conventional Bayesian or frequentist methods, which are unable to address the multiplicity and testing efficiency issues simultaneously. In this paper, we derive an optimal empirical Bayes testing procedure to detect variants in NGS studies. We utilize the empirical Bayes technique to exploit the across-site information among many testing sites in NGS data. We prove that our testing procedure is valid and optimal in the sense of rejecting the maximum number of nonnulls while the Bayesian false discovery rate is controlled at a given nominal level. We show by both simulation studies and real data analysis that our testing efficiency can be greatly enhanced over the existing frequentist approaches that fail to pool and utilize information across the multiple testing sites.

    Journal link  |  arXiv

  • Empirical Bayes confidence intervals for selected parameters in high-dimensional data. (with J. T. G. Hwang). Journal of the American Statistical Association, 2013, Volume 108, Issue 502, pp. 607–618.
    Abstract

    Modern statistical problems often involve a large number of populations and hence a large number of parameters that characterize these populations. It is common for scientists to use data to select the most significant populations, such as those with the largest t statistics. The scientific interest often lies in studying and making inferences regarding these parameters, called the selected parameters, corresponding to the selected populations. Current statistical practice either applies a traditional procedure as if there were no selection, which is not valid, or uses a Bonferroni-type procedure, which is valid but very conservative and often uninformative. In this article, we propose valid and sharp confidence intervals that allow scientists to select parameters and to make inferences for the selected parameters based on the same data. This type of confidence interval allows the users to zero in on the most interesting selected parameters without collecting more data. The validity of confidence intervals is defined as the controlling of Bayes coverage probability so that it is no less than a nominal level uniformly over a class of prior distributions for the parameter. When a mixed model is assumed and the random effects are the key parameters, this validity criterion is exactly the frequentist criterion, since the Bayes coverage probability is identical to the frequentist coverage probability. Assuming that the observations are normally distributed with unequal and unknown variances, we select parameters with the largest t statistics. We then construct sharp empirical Bayes confidence intervals for these selected parameters, which have either a large Bayes coverage probability or a small Bayes false coverage rate uniformly for a class of priors. Our intervals, applicable to any high-dimensional data, are applied to microarray data and are shown to be better than all the alternatives. It is also anticipated that the same intervals would be valid for any selection rule.

    Journal link

  • Empirical Bayes false coverage rate controlling confidence intervals. (with J. T. G. Hwang). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2012, Volume 74, Issue 5, pp. 871–891.
    Abstract

    Benjamini and Yekutieli suggested that it is important to account for multiplicity correction for confidence intervals when only some of the selected intervals are reported. They introduced the concept of the false coverage rate (FCR) for confidence intervals which is parallel to the concept of the false discovery rate in the multiple-hypothesis testing problem and they developed confidence intervals for selected parameters which control the FCR. Their approach requires the FCR to be controlled in the frequentist's sense, i.e. controlled for all the possible unknown parameters. In modern applications, the number of parameters could be large, as large as tens of thousands or even more, as in microarray experiments. We propose a less conservative criterion, the Bayes FCR, and study confidence intervals controlling it for a class of distributions. The Bayes FCR refers to the average FCR with respect to a distribution of parameters. Under such a criterion, we propose some confidence intervals, which, by some analytic and numerical calculations, are demonstrated to have the Bayes FCR controlled at level q for a class of prior distributions, including mixtures of normal distributions and zero, where the mixing probability is unknown. The confidence intervals are shrinkage-type procedures which are more efficient for the θ that have a sparsity structure, which is a common feature of microarray data. More importantly, the centre of the proposed shrinkage intervals reduces much of the bias due to selection. Consequently, the proposed empirical Bayes intervals are always shorter in average length than the intervals of Benjamini and Yekutieli and can be only 50% or 60% as long in some cases. We apply these procedures to the data of Choe and colleagues and obtain similar results.

    Journal link

  • Double shrinkage empirical Bayesian estimation for unknown and unequal variances. Statistics and Its Interface, 2010, Volume 3, pp. 533–541.
    Abstract

    In this paper, we construct a point estimator when assuming unequal and unknown variances by using the empirical Bayes approach in the classical normal mean problem. The proposed estimator shrinks both means and variances, and is thus called the double shrinkage estimator. Extensive numerical studies indicate that the double shrinkage estimator has lower Bayes risk than the estimator which shrinks the means alone, and the naive estimator which has no shrinkage at all. We further use a spike-in data set to assess different estimating procedures. It turns out that our proposed estimator performs the best and is thus strongly recommended for applications.

    Journal link

  • Empirical Bayes confidence intervals shrinking both means and variances. (with J. T. G. Hwang and J. Qiu). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2009, Volume 71, Issue 1, pp. 265–285.
    Abstract

    We construct empirical Bayes intervals for a large number p of means. The existing intervals in the literature assume that variances are either equal or unequal but known. When the variances are unequal and unknown, the suggestion is typically to replace them by unbiased estimators. However, when p is large, there would be advantage in “borrowing strength” from each other. We derive double-shrinkage intervals for means on the basis of our empirical Bayes estimators that shrink both the means and the variances. Analytical and simulation studies and application to a real data set show that, compared with the t-intervals, our intervals have higher coverage probabilities while yielding shorter lengths on average. The double-shrinkage intervals are on average shorter than the intervals from shrinking the means alone and are always no longer than the intervals from shrinking the variances alone. Also, the intervals are explicitly defined and can be computed immediately.

    Journal link

Publications on Statistical Applications

  • How TikTok “Locks In” Its Users with Reference to What Happened in the Homeland Version Douyin Application. (with Y. Yang, Z. Li, S. (Zi Tong) Peng, L. Dou, C. Li, and Y. Wang). Antitrust Report, pp. 2–23, November 2023.
  • A Feature Sampling Strategy for Analysis of High Dimensional Genomic Data. (with J. Zhang, K. Zhang, and Z. Wei). IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019, Volume 16, No. 2, pp. 434–441.
    Abstract

    With the development of high throughput technology, it has become feasible and common to profile tens of thousands of gene activities simultaneously. These genomic data typically have sample sizes of hundreds or fewer, which is much smaller than the feature size (number of genes). In addition, the genes, in particular the ones from the same pathway, are often highly correlated. These issues pose a great challenge for selecting meaningful genes from a large number of (correlated) candidates in many genomic studies. Quite a few methods have been proposed to attack this challenge. Among them, regularization-based techniques, e.g., the lasso, are much more appealing because they can do model fitting and variable selection at the same time. However, the lasso regression has its known limitations. One is that the number of genes selected by the lasso cannot exceed the number of samples. Another limitation is that, if causal genes are highly correlated, the lasso tends to select only one or a few genes from them. Biologists, however, desire to identify them all. To overcome these limitations, we present here a novel, robust, and stable variable selection method. Through simulation studies and a real application to transcriptome data, we demonstrate the superiority of the proposed method in selecting highly correlated causal genes. We also provide some theoretical justifications for this feature sampling strategy based on mean and variance analyses.

    Journal link
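
    The abstract does not spell out the algorithm, so the sketch below shows only a generic feature-subsampling scheme in the same spirit: fit a Lasso on many random feature subsets and aggregate how often each feature is selected. It is not the paper's method; the function name, subset fraction, and penalty are illustrative (scikit-learn assumed):

      import numpy as np
      from sklearn.linear_model import Lasso

      def feature_sampling_selection(X, y, n_rounds=200, subset_frac=0.3,
                                     alpha=0.1, seed=0):
          """Selection frequency of each feature across Lasso fits on random subsets."""
          rng = np.random.default_rng(seed)
          n, p = X.shape
          k = max(1, int(subset_frac * p))
          hits = np.zeros(p)
          draws = np.zeros(p)
          for _ in range(n_rounds):
              subset = rng.choice(p, size=k, replace=False)
              coef = Lasso(alpha=alpha).fit(X[:, subset], y).coef_
              draws[subset] += 1
              hits[subset[np.abs(coef) > 1e-8]] += 1
          return hits / np.maximum(draws, 1)          # selection frequency per feature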

  • Network analysis in detection of early-stage mild cognitive impairment. (with H. Ni, J. Qin, L. Zhou, J. Wang and F. Hou). Physica A: Statistical Mechanics and its Applications, 2017, Volume 478, pp. 113–119.
    Abstract

    The detection of and intervention for early-stage mild cognitive impairment (EMCI) are of vital importance. However, the pathology of EMCI remains largely unknown, which makes clinical diagnosis challenging. In this paper, resting-state functional magnetic resonance imaging (rs-fMRI) data from EMCI patients and normal controls are analyzed using complex network theory. We construct the functional connectivity (FC) networks and employ the local false discovery rate approach to detect the abnormal functional connectivities that appear in EMCI patients. Our results show that abnormal functional connectivities are present in EMCI patients and that the affected brain regions are mainly distributed in the frontal and temporal lobes. In addition, to quantitatively characterize the statistical properties of the FC networks, we employ the entropy of the degree distribution as an index, together with other well-established measures, i.e., the clustering coefficient and the graph efficiency. We find that this index, better than the widely used measures, may serve as an auxiliary and potential marker for the detection of EMCI.

    Journal link
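
    The entropy of the degree distribution used as the index above can be computed directly from a binary adjacency matrix; thresholding the FC matrix into an adjacency matrix is assumed to have been done beforehand, and the function name is illustrative:

      import numpy as np

      def degree_distribution_entropy(adjacency):
          """Shannon entropy of the degree distribution of an undirected graph."""
          A = np.asarray(adjacency)
          degrees = A.sum(axis=1).astype(int)     # node degrees
          counts = np.bincount(degrees)
          pk = counts[counts > 0] / degrees.size  # empirical degree distribution
          return float(-(pk * np.log(pk)).sum())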

  • Determining candidate single nucleotide polymorphisms in acquired laryngotracheal stenosis. (with M. Anis, J. Khurana, E. Krynetskiy and A. Soliman). The Laryngoscope, 2018, Volume 128, Issue 3, pp. E111–E116.
    Abstract

    Despite wide adoption of strategies to prevent injury from prolonged intubation and tracheotomy, acquired laryngotracheal stenosis (ALTS) has not disappeared. ALTS’ persistence may be due to patient factors that confer unique susceptibility for some. We sought to identify genetic markers in genes associated with wound healing that could be associated with ALTS.

    Journal link

  • Translational genomics of acquired laryngotracheal stenosis. (with M. Anis, J. Khurana, E. Krynetskiy and A. Soliman). The Laryngoscope, 2014, Volume 124, Issue 5, pp. E175–E179.
    Abstract

    Acquired laryngotracheal stenosis (ALTS) results from abnormal mucosal wound healing after laryngeal and/or tracheal injury. Patients with ALTS often present late after significant reduction of the airway lumen and onset of symptoms. Motivated by the need for earlier detection of affected patients, we sought to investigate genetic markers for ALTS that would identify susceptible patients.

    Journal link