Manuscripts under review
- Rate optimal multiple testing procedure in high-dimensional regression. (with P. Ji)
Multiple testing and variable selection have gained much attention in statistical theory and methodology research. They address the same problem of identifying the important variables among many (Jin, 2012), yet there is little overlap between the two literatures. Research on variable selection has focused on selection consistency, i.e., both type I and type II errors converging to zero. This is possible only when the signals are sufficiently strong, which is contrary to many modern applications. In the regime where the signals are both rare and weak, a certain amount of false discoveries is inevitable and must be tolerated, as long as some error rate can be controlled. In this paper, motivated by the work of Ji and Jin (2012) and Jin (2012) in the rare/weak regime, we extend their UPS procedure for variable selection to multiple testing. Under certain conditions, the new UPT procedure achieves the fastest convergence rate of marginal false non-discovery rates while controlling the marginal false discovery rate at any designated level alpha asymptotically. Numerical results are provided to demonstrate the advantage of the proposed method.
@article{ji2014rate,
title={Rate optimal multiple testing procedure in high-dimensional regression},
author={Ji, Pengsheng and Zhao, Zhigen},
journal={arXiv preprint arXiv:1404.2961},
year={2014}
}
- On the Testing of Multiple Hypotheses in Sliced Inverse Regression
(with X. Xing)
We consider multiple testing in a general regression framework aiming to study the relationship between a univariate response and a p-dimensional predictor. To test the hypothesis of the effect of each predictor, we construct a mirror statistic based on the sliced inverse regression estimator, without assuming a model for the conditional distribution of the response. Using the limiting distribution results developed in this paper, we show that the mirror statistic is asymptotically symmetric about zero under the null hypothesis. We then propose the Model-free Multiple testing procedure using Mirror statistics and show theoretically that the false discovery rate of this method is less than or equal to a designated level asymptotically. Numerical evidence shows that the proposed method is substantially more powerful than its alternatives, subject to the control of the false discovery rate.
- BEAUTY Powered BEAST. (with K. Zhang and W. Zhou)
We study inference about the uniform distribution with the proposed binary expansion approximation of uniformity (BEAUTY) approach. Through an extension of the celebrated Euler's formula, we approximate the characteristic function of any copula distribution with a linear combination of means of binary interactions of marginal binary expansions. This novel characterization enables a unification of many important existing tests through approximations by quadratic forms of symmetry statistics, where the deterministic weight matrix characterizes the power properties of each test. To achieve a uniformly high power, we study test statistics with data-adaptive weights through an oracle approach, referred to as the binary expansion adaptive symmetry test (BEAST). By utilizing the properties of the binary expansion filtration, we show that the Neyman-Pearson test of uniformity can be approximated by an oracle weighted sum of symmetry statistics. The BEAST with this oracle leads all existing tests we considered in empirical power against all complex forms of alternatives. This oracle therefore sheds light on the potential of substantial improvements in power and on the form of optimal weights under each alternative. By approximating this oracle with data-adaptive weights, we develop the BEAST that improves the empirical power of many existing tests against a wide spectrum of common alternatives while providing clear interpretation of the form of non-uniformity upon rejection. We illustrate the BEAST with a study of the relationship between the location and brightness of stars. (An illustrative R sketch of the symmetry statistics follows the citation entry below.)
@article{zhang2021beauty,
title={BEAUTY Powered BEAST},
author={Zhang, Kai and Zhao, Zhigen and Zhou, Wen},
journal={arXiv preprint arXiv:2103.00674},
year={2021}
}
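Illustrative example (not the authors' software): a minimal R sketch of the symmetry statistics, computed as sample means of ±1 bit interactions from the marginal binary expansions of bivariate data with uniform margins. The depth-2 choice, the function names, and the simulated dependent pair below are assumptions made only for illustration.
binary_bits <- function(u, depth = 2) {
  # n x depth matrix: the k-th column is the k-th binary-expansion bit of u, recoded as +/-1
  sapply(1:depth, function(k) 2 * (floor(u * 2^k) %% 2) - 1)
}
symmetry_stats <- function(u, v, depth = 2) {
  Au <- binary_bits(u, depth)
  Av <- binary_bits(v, depth)
  # sample means of all cross-margin bit interactions; all are close to 0 under uniformity
  stats <- outer(1:depth, 1:depth,
                 Vectorize(function(k, l) mean(Au[, k] * Av[, l])))
  dimnames(stats) <- list(paste0("u_bit", 1:depth), paste0("v_bit", 1:depth))
  stats
}
set.seed(1)
n <- 500
u <- runif(n)
v <- (u + rnorm(n, sd = 0.1)) %% 1   # dependent pair whose margins remain uniform
symmetry_stats(u, v)                 # large entries indicate the form of non-uniformity
Data-adaptive weighting of such statistics is what the BEAST described above optimizes; the oracle weights depend on the alternative.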
- Discussion on “Data Fission: Splitting a Single Data Point”
Journal of the American Statistical Association. To appear.
- Randomized Multiarm Bandits: An Improved Adaptive Data Collection Method.
Statistical Analysis and Data Mining (with B. Ji and my student T. Wang).
In many scientific experiments, multi-armed bandits are used as an adaptive data collection method. However, this adaptive process can lead to a dependence that renders many commonly used statistical inference methods invalid. An example of this is the sample mean, which is a natural estimator of the mean parameter but can be biased. This can cause test statistics based on this estimator to have an inflated type I error rate, and the resulting confidence intervals may have significantly lower coverage probabilities than their nominal values. To address this issue, we propose an alternative approach called randomized multiarm bandits (rMAB). This combines a randomization step with a chosen MAB algorithm, and by selecting the randomization probability appropriately, optimal regret can be achieved asymptotically. Numerical evidence shows that the bias of the sample mean based on the rMAB is much smaller than that of other methods. The test statistic and confidence interval produced by this method also perform much better than those of its competitors.
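Illustrative example (settings assumed here, not from the paper): a minimal R sketch combining a fixed-probability uniform randomization step with a standard UCB bandit, in the spirit of the rMAB described above, for a two-armed Gaussian reward problem.
run_rmab <- function(mu = c(0, 0.2), n_pulls = 2000, eps = 0.1) {
  K <- length(mu)
  pulls <- rep(0, K)
  sums  <- rep(0, K)
  for (t in 1:n_pulls) {
    if (t <= K) {
      arm <- t                                  # initialization: pull each arm once
    } else if (runif(1) < eps) {
      arm <- sample.int(K, 1)                   # randomization step
    } else {
      ucb <- sums / pulls + sqrt(2 * log(t) / pulls)
      arm <- which.max(ucb)                     # otherwise follow plain UCB
    }
    reward <- rnorm(1, mean = mu[arm])
    pulls[arm] <- pulls[arm] + 1
    sums[arm]  <- sums[arm] + reward
  }
  sums / pulls                                  # sample means of the arms
}
set.seed(1)
mean(replicate(200, run_rmab()[1]))   # average sample mean of arm 1 (true mean 0)
Re-running with eps = 0 (pure UCB) typically shows a larger negative bias for the inferior arm, illustrating the adaptive-collection bias that the paper aims to reduce.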
- Bayesian Mixed Effect Higher-Order Hidden Markov Models with Applications to Predictive Healthcare Using Electronic Health Records.
IISE Transactions. (with Y. Liao, Y. Xiang, and D. Ai)
The disease progression dynamics observed in electronic health records often reflect patients’ health condition evolution, holding the promise of enabling the development of clinical predictive models. These dynamics, however, generally display significant variability among patients, due to some critical factors (e.g., gender and age) and patient-level heterogeneity. Moreover, the future health state may depend not only on the current state but also on more distant past states because of the complicated disease progression. To capture this complex transition behavior and address mixed effects in clinical prediction problems, we propose a novel and flexible Bayesian Mixed-Effect Higher-Order Hidden Markov Model (MHOHMM), and develop a classifier based on MHOHMMs. A range of MHOHMMs are designed to capture different data structures, and the optimal one is identified by using the k-fold cross-validation approach. An effective two-stage Markov chain Monte Carlo (MCMC) sampling algorithm is designed for model inference. A simulation study is conducted to evaluate the performance of the proposed sampling algorithm and the MHOHMM-based classification method. The practical utility of the proposed framework is demonstrated by a case study on acute hypotensive episode prediction for intensive care unit patients. Our results show that the MHOHMM-based framework provides good prediction performance.
@article{Liao:Xiang:Zhao:Ai:2024,
author = {Ying Liao and Yisha Xiang and Zhigen Zhao and Di Ai},
title = {Bayesian mixed-effect higher-order hidden Markov models with applications to predictive healthcare using electronic health records},
journal = {IISE Transactions},
volume = {0},
number = {0},
pages = {1-13},
year = {2024}
}
- Controlling false discovery rate using Gaussian mirrors. (with X. Xing and J. Liu)
Journal of the American Statistical Association, 2023. Volume 118, Issue 541, 222-241. Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem. We here propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian perturbation, and proceeds with a chosen regression method, such as ordinary least squares or the Lasso (the mirror variables can also be created after selection). The mirror variables naturally lead to test statistics effective for controlling the FDR. Under a mild assumption on the dependence among the covariates, we show that the FDR can be controlled at any designated level asymptotically. We also demonstrate through extensive numerical studies that the GM method is more powerful than many existing methods for selecting relevant variables subject to FDR control, especially for cases when the covariates are highly correlated and the influential variables are not overly sparse. (An illustrative R sketch of the mirror construction follows the citation entry below.)
@article{Xing:Zhao:Liu:2021,
title={Controlling false discovery rate using Gaussian mirrors},
author={Xing, Xin and Zhao, Zhigen and Liu, Jun S},
journal={Journal of the American Statistical Association},
volume={118},
number={541},
pages={222--241},
year={2023}
}
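Illustrative example (a rough simplification, not the authors' implementation): an R sketch of the mirror construction with ordinary least squares. Each predictor is replaced by a pair formed by adding and subtracting a Gaussian perturbation, and the signed mirror statistics drive a symmetry-based FDR cutoff. The perturbation scale, the cutoff form, and the simulated data are assumptions for illustration; the paper chooses the perturbation scale more carefully and also treats the Lasso case.
gm_select <- function(X, y, q = 0.1) {
  n <- nrow(X); p <- ncol(X)
  M <- numeric(p)
  for (j in 1:p) {
    z  <- rnorm(n) * sd(X[, j])          # Gaussian perturbation (scale simplified here)
    Xj <- cbind(X[, j] + z, X[, j] - z, X[, -j, drop = FALSE])
    b  <- coef(lm(y ~ Xj))[2:3]          # coefficients of the mirror pair
    M[j] <- abs(b[1] + b[2]) - abs(b[1] - b[2])
  }
  ts  <- sort(abs(M))                    # smallest cutoff with estimated FDP <= q
  fdp <- sapply(ts, function(t) sum(M < -t) / max(sum(M > t), 1))
  t_hat <- ts[which(fdp <= q)[1]]
  if (is.na(t_hat)) integer(0) else which(M > t_hat)
}
set.seed(1)
n <- 300; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(2, 5)) + rnorm(n)
gm_select(X, y, q = 0.1)                 # ideally recovers predictors 1 through 5
The cutoff exploits the fact that, for null variables, the mirror statistic is asymptotically symmetric about zero, so the count of large negative statistics estimates the number of false positives among the large positive ones.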
- On F-modelling based Empirical Bayes Estimation of Variances. (with my student Y. Kwon)
Biometrika, Volume 110, Issue 1, March 2023, Pages 69–81. We consider the problem of empirical Bayes estimation of multiple variances when provided with sample variances. Assuming an arbitrary prior on the variances, we derive different versions of the Bayes estimators using different loss functions. For one particular loss function, the resulting Bayes estimator relies only on the marginal cumulative distribution function of the sample variances. When replacing it with the empirical distribution function, we obtain an empirical Bayes version, called the F-modelling based empirical Bayes estimator of variances. We provide theoretical properties of this estimator and further demonstrate its advantages through extensive simulations and real data analysis.
@article{kwon:zhao:2022,
title={On F-Modelling based Empirical Bayes Estimation of Variances},
author={Kwon, Yeil and Zhao, Zhigen},
journal={Biometrika},
volume={110},
number={1},
pages={69--81},
year={2023}
}
Click here to download the F-EBV code
Example:
# simulate p = 1000 sample variances, each with df = 5 degrees of freedom
p = 1000
df = 5
sigmaSq = rgamma(p, 2, 2)            # true variances drawn from a Gamma(2, 2) prior
sSq = rchisq(p, df) * sigmaSq/df     # sample variances: sigma^2 * chi^2_df / df
FEB.est = FEB(sSq, df)               # F-modelling based empirical Bayes estimates
- Local False Discovery Rate Based Methods for Multiple Testing of One-Way Classified Hypotheses. (with S. K. Sarkar)
Electronic Journal of Statistics, 2022. Vol. 16, Issue 2, Pages 6043-6085. This paper continues the line of research initiated in Liu et al. (2016) on developing a novel framework for multiple testing of hypotheses grouped in a one-way classified form using hypothesis-specific local false discovery rates (Lfdr's). It is built on an extension of the standard two-class mixture model from a single group to multiple groups, defining the hypothesis-specific Lfdr as a function of the conditional Lfdr for the hypothesis given that it is within a significant group and the Lfdr for the group itself, and involving a new parameter that measures the grouping effect. This definition captures the underlying group structure for the hypotheses belonging to a group more effectively than the standard two-class mixture model. Two new Lfdr-based methods, possessing meaningful optimalities, are produced in their oracle forms. One, designed to control false discoveries across the entire collection of hypotheses, is proposed as a powerful alternative to simply pooling all the hypotheses into a single group and using a commonly used Lfdr-based method under the standard single-group two-class mixture model. The other is proposed as an Lfdr analog of the method of Benjamini and Bogomolov (2014) for selective inference. It controls an Lfdr-based measure of false discoveries associated with selecting groups concurrently with controlling the average of within-group false discovery proportions across the selected groups. Simulation studies and real-data applications show that our proposed methods are often more powerful than their relevant competitors.
@article{sarkar2017local,
title={Local false discovery rate based methods for multiple testing of one-way classified hypotheses},
author={Sarkar, Sanat K and Zhao, Zhigen},
journal={Electronic Journal of Statistics},
volume={16},
number={2},
pages={6043--6085},
year={2022}
- Where to find needles in a haystack?
TEST, 2022. Vol. 31, Issue 1, 148-174. In many existing methods of multiple comparison, one starts with either Fisher's p-value or the local fdr. The commonly used p-value, defined as the tail probability of exceeding the observed test statistic under the null distribution, fails to use information from the distribution under the alternative hypothesis. The targeted region of signals could be wrong when the likelihood ratio is not monotone. The oracle local fdr based approaches could be optimal because they use the probability density functions of the test statistic under both the null and alternative hypotheses. However, the data-driven version could be problematic because density estimation is difficult and challenging. In this paper, we propose a new method, the Cdf and Local fdr Assisted multiple Testing method (CLAT), which is optimal for cases when the p-value based methods are optimal and for some other cases when p-value based methods are not. Additionally, CLAT relies only on the empirical distribution function, which quickly converges to its oracle counterpart. Both simulations and real data analysis demonstrate the superior performance of the CLAT method. Furthermore, the computation is instantaneous based on a novel algorithm and is scalable to large data sets.
@article{zhao2019,
title={Where to find needles in a haystack?},
author={Zhao, Zhigen},
journal={TEST},
volume={31},
number={1},
pages={148--174},
year={2022}
}
- Global testing under the sparse alternatives for single index models. (with Q. Lin and J. Liu)
Festschrift in Honor of R. Dennis Cook, 2021. For the single index model with Gaussian design satisfying rank(var(E[x|y])) = 1, where the link function is unknown and the coefficient vector is a sparse p-dimensional unit vector with at most s nonzero entries, we are interested in testing the null hypothesis that the coefficient vector, viewed as a whole, is zero against the alternative that some entries are nonzero. Assuming that var(E[x|y]) is non-vanishing, we define the generalized signal-to-noise ratio (gSNR) of the model as the unique non-zero eigenvalue of var(E[x|y]). We establish the detection boundary for both the single index model and the single index model with additive noise. It is rather surprising that the detection boundary for the single index model with additive noise matches that for linear regression models. These results pave the way for a thorough theoretical analysis of single/multiple index models in high dimensions.
@article{lin2018global,
title={Global testing under the sparse alternatives for single index models},
author={Lin, Qian and Zhao, Zhigen and Liu, Jun S},
journal={arXiv preprint arXiv:1805.01820},
year={2018}
}
- Sparse sliced inverse regression via lasso. (with Q. Lin and J. Liu)
Journal of the American Statistical Association, 2019. Vol. 114, Issue 528, Pages 1726-1739. For multiple index models, it has recently been shown that sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if p/n goes to 0, where p is the dimension and n is the sample size. Thus, when p is of the same or a higher order than n, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when p is of order o(n^2 \lambda^2), where \lambda is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples. Supplementary materials for this article are available online. (A simplified illustrative R sketch follows the citation entry below.)
@article{lin2019sparse,
title={Sparse sliced inverse regression via lasso},
author={Lin, Qian and Zhao, Zhigen and Liu, Jun S},
journal={Journal of the American Statistical Association},
volume={114},
number={528},
pages={1726--1739},
year={2019},
publisher={Taylor \& Francis}
}
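Illustrative example (a simplified sketch, assuming the glmnet package is available): an R version of the single-direction Lasso-SIR idea, in which the response is sliced, var(E[x|y]) is estimated from the slice means, the leading eigenvector is turned into an artificial response, and a Lasso is fit to that response. The slicing scheme, the scaling of the artificial response, and the tuning below are simplifications; the paper's exact construction and its theory differ in details.
library(glmnet)
lasso_sir <- function(X, y, H = 10) {
  n  <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  slice <- cut(rank(y, ties.method = "first"), H, labels = FALSE)  # H slices of y
  nh  <- as.vector(table(slice))
  bar <- rowsum(Xc, slice) / nh                    # slice means of the predictors
  Lambda <- crossprod(sqrt(nh / n) * bar)          # estimate of var(E[x|y])
  eig <- eigen(Lambda, symmetric = TRUE)
  eta <- eig$vectors[, 1]
  lam <- eig$values[1]
  y_tilde <- drop(bar %*% eta)[slice] / lam        # artificial response, one value per observation
  fit <- cv.glmnet(Xc, y_tilde)
  as.matrix(coef(fit, s = "lambda.min"))[-1, 1]    # sparse estimate of the direction
}
set.seed(1)
n <- 400; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))
y <- drop(sin(X %*% beta)) + rnorm(n, sd = 0.2)    # single index model with an unknown link
which(lasso_sir(X, y) != 0)                        # ideally concentrates on predictors 1 to 5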
- On consistency and sparsity for sliced inverse regression in high dimensions. (with Q. Lin and J. Liu)
Annals of Statistics, 2018. Vol. 46, No. 2, Pages 580-610. We provide here a framework to analyze the phase transition phenomenon of sliced inverse regression (SIR), a supervised dimension reduction technique introduced by Li (1991). Under mild conditions, the asymptotic ratio rho = lim p/n is the phase transition parameter, and the SIR estimator is consistent if and only if rho = 0. When the dimension p is greater than n, we propose a diagonal thresholding screening SIR (DT-SIR) algorithm. This method provides us with an estimate of the eigen-space of the covariance matrix of the conditional expectation, var(E[x|y]). The desired dimension reduction space is then obtained by multiplying the inverse of the covariance matrix on the eigen-space. Under certain sparsity assumptions on both the covariance matrix of the predictors and the loadings of the directions, we prove the consistency of DT-SIR in estimating the dimension reduction space in high-dimensional data analysis. Extensive numerical experiments demonstrate the superior performance of the proposed method in comparison to its competitors.
@article{lin2018consistency,
title={On consistency and sparsity for sliced inverse regression in high dimensions},
author={Lin, Qian and Zhao, Zhigen and Liu, Jun S and others},
journal={The Annals of Statistics},
volume={46},
number={2},
pages={580--610},
year={2018},
publisher={Institute of Mathematical Statistics}
}
- Sample size determination for a three-arm equivalence trial of Poisson and Negative binomial responses. (with my student Y. W. Chang and Y. Tsong)
Journal of Biopharmaceutical Statistics, 2017. Vol. 27, Issue 2, 239-256. Assessing equivalence or similarity has drawn much attention recently, as many drug products have lost or will lose their patents in the next few years, especially certain best-selling biologics. To claim equivalence between the test treatment and the reference treatment when assay sensitivity is well established from historical data, one has to demonstrate both superiority of the test treatment over placebo and equivalence between the test treatment and the reference treatment. Thus, there is urgency for practitioners to derive a practical way to calculate the sample size for a three-arm equivalence trial. The primary endpoints of a clinical trial may not always be continuous, but may be discrete. In this paper, the authors derive the power function and discuss the sample size requirement for a three-arm equivalence trial with Poisson and negative binomial clinical endpoints. In addition, the authors examine the effect of the dispersion parameter on the power and the sample size by varying it from small to large. In extensive numerical studies, the authors demonstrate that the required sample size heavily depends on the dispersion parameter. Therefore, misusing a Poisson model for negative binomial data can easily result in a loss of power of up to 20%, depending on the value of the dispersion parameter.
@article{chang2017sample,
title={Sample size determination for a three-arm equivalence trial of Poisson and negative binomial responses},
author={Chang, Yu-Wei and Tsong, Yi and Zhao, Zhigen},
journal={Journal of biopharmaceutical statistics},
volume={27},
number={2},
pages={239--256},
year={2017},
publisher={Taylor \& Francis}
}
- A new approach to multiple testing of grouped hypotheses. (with my student Y. Liu and S. K. Sarkar)
Journal of Statistical Planning and Inference, 2016. Vol. 179, 1-14. A two-fold loop testing algorithm (TLTA) is proposed for testing grouped hypotheses while controlling false discoveries. It is constructed by decomposing a posterior measure of false discoveries across all hypotheses into within- and between-group components, allowing a portion of the overall FDR level to be used to maintain control over within-group false discoveries. Numerical calculations performed under certain model assumptions for the hidden states of the within-group hypotheses show its superior performance over competitors that ignore the group structure, especially when only a few of the groups contain the signals, as expected in many modern applications. We offer a data-driven version of the TLTA by estimating the parameters using EM algorithms and provide simulation evidence of its favorable performance relative to these competitors. Real data applications have also produced encouraging results for the TLTA.
@article{LIU20161,
title = {A new approach to multiple testing of grouped hypotheses},
journal = {Journal of Statistical Planning and Inference},
volume = {179},
pages = {1-14},
year = {2016},
issn = {0378-3758},
doi = {https://doi.org/10.1016/j.jspi.2016.07.004},
url = {https://www.sciencedirect.com/science/article/pii/S0378375816300854},
author = {Yanping Liu and Sanat K. Sarkar and Zhigen Zhao},
keywords = {False discovery rate, Grouped hypotheses, Large-scale multiple testing},
}
- Capturing the severity of Type II errors in high-dimensional multiple testing. (with L. He and S. K. Sarkar)
Journal of Multivariate Analysis, 2015. Vol. 142, 106-116. The severity of type II errors is frequently ignored when deriving a multiple testing procedure, even though utilizing it properly can greatly help in making correct decisions. This paper puts forward a theory behind developing a multiple testing procedure that can incorporate the type II error severity and is optimal in the sense of minimizing a measure of false non-discoveries among all procedures controlling a measure of false discoveries. The theory is developed under a general model allowing arbitrary dependence by taking a compound decision theoretic approach to multiple testing with a loss function incorporating the type II error severity. We present this optimal procedure in its oracle form and offer numerical evidence of its superior performance over relevant competitors.
@article{HE2015106,
title = {Capturing the severity of type II errors in high-dimensional multiple testing},
journal = {Journal of Multivariate Analysis},
volume = {142},
pages = {106-116},
year = {2015},
issn = {0047-259X},
doi = {https://doi.org/10.1016/j.jmva.2015.08.005},
url = {https://www.sciencedirect.com/science/article/pii/S0047259X1500189X},
author = {Li He and Sanat K. Sarkar and Zhigen Zhao},
}
- A Bayesian approach to construct multiple confidence intervals of selected parameters with sparse signals. (with S. K. Sarkar)
Statistica Sinica, 2015. Volume 25, Number 2, 725-742. Selective inference using multiple confidence intervals is an emerging area of statistical research whose importance has been realized only recently. We consider making such inference in the context of analyzing data with sparse signals in a Bayesian framework. Although the traditional posterior credible intervals are immune to selection, they can have low power in detecting the true signals because they cover the no-signal value too often if the sparse nature of the data is not properly taken into account. We demonstrate this phenomenon using a canonical Bayes model with the parameters of interest following a zero-inflated mixture prior. We propose a new method of constructing multiple intervals for any given selection rule, taking a Bayesian decision theoretic approach under such a model. It involves the local fdr, the posterior probability of a parameter being null, which is commonly used in multiple testing. It controls an overall measure of error rate, the Bayes or posterior false coverage rate, at a desired level among the selected intervals. We apply this method to the regression problem and demonstrate via simulations as well as data analyses that it is much more powerful, in terms of enclosing zero less frequently, than the traditional and some alternative methods.
@article{10.2307/24311042,
ISSN = {10170405, 19968507},
URL = {http://www.jstor.org/stable/24311042},
author = {Zhigen Zhao and Sanat K. Sarkar},
journal = {Statistica Sinica},
number = {2},
pages = {725--741},
publisher = {Institute of Statistical Science, Academia Sinica},
title = {A BAYESIAN APPROACH TO CONSTRUCTING MULTIPLE CONFIDENCE INTERVALS OF SELECTED PARAMETERS WITH SPARSE SIGNALS},
volume = {25},
year = {2015}
}
- Applying multiple testing procedures to detect change in East African vegetation. (with N. Clements, S. K. Sarkar, and D. Kim)
Annals of Applied Statistics, 2014. Volume 8, No. 1, 286-308. The study of vegetation fluctuations gives valuable information toward effective land use and development. We consider this problem for the East African region based on the Normalized Difference Vegetation Index (NDVI) series from satellite remote sensing data collected between 1982 and 2006 over 8-kilometer grid points. We detect areas with significant increasing or decreasing monotonic vegetation changes using a multiple testing procedure controlling the mixed directional false discovery rate (mdFDR). Specifically, we use a three-stage directional Benjamini-Hochberg (BH) procedure with proven mdFDR control under independence and a suitable adaptive version of it. The performance of these procedures is studied through simulations before applying them to the vegetation data. Our analysis shows increasing vegetation in the Northern hemisphere as well as coastal Tanzania and generally decreasing Southern hemisphere vegetation trends, which are consistent with historical evidence.
@article{10.1214/13-AOAS686,
author = {Nicolle Clements and Sanat K. Sarkar and Zhigen Zhao and Dong-Yun Kim},
title = {{Applying multiple testing procedures to detect change in East African vegetation}},
volume = {8},
journal = {The Annals of Applied Statistics},
number = {1},
publisher = {Institute of Mathematical Statistics},
pages = {286 -- 308},
keywords = {directional false discovery rate, East Africa vegetation, False discovery rate, NDVI},
year = {2014},
doi = {10.1214/13-AOAS686},
URL = {https://doi.org/10.1214/13-AOAS686}
}
- Sample size determination for a three-arm equivalence trial of normally distributed responses. (with my student Y. W. Chang, Y. Tsong and X. Dong)
Journal of Biopharmaceutical Statistics, 2014. Vol. 24, Issue 6, 1190-1202. The equivalence assessment is often conducted through a three-arm clinical trial (namely, test, reference, and placebo) and it usually consists of three tests. The first two tests are to demonstrate the superiority of the test and the reference treatment to the placebo, and they are followed by an equivalence test between the test treatment and the reference treatment. When the response variable is continuous, equivalence is commonly defined in terms of mean difference, mean ratio, or ratio of mean differences, that is, the mean difference of the test and the placebo to the mean difference of the reference and the placebo. These equivalence tests can be performed with both a hypothesis-testing approach and a confidence-interval approach. The advantage of applying the equivalence test by ratio of mean differences is that it can test both superiority of the test treatment over placebo and equivalence between the test and the reference simultaneously through a single hypothesis. In this article, we derive the test statistics and the power function for the ratio of mean differences hypothesis and solve the required sample size for a three-arm clinical trial. Examples of required sample size are given in this article, and are compared with the required sample size by the traditional mean difference equivalence test. After a careful examination, we suggest increasing the power of the ratio of mean differences approach by appropriately adjusting the lower limit of the equivalence interval.
@article{chang2014sample,
title={Sample size determination for a three-arm equivalence trial of normally distributed responses},
author={Chang, Yu-Wei and Tsong, Yi and Dong, Xiaoyu and Zhao, Zhigen},
journal={Journal of biopharmaceutical statistics},
volume={24},
number={6},
pages={1190--1202},
year={2014},
publisher={Taylor \& Francis}
}
- An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data. (with W. Wang and Z. Wei)
Annals of Applied Statistics, 2013. Volume 7, No. 4, 2229-2248. Because of the decreasing cost and high digital resolution, next-generation sequencing (NGS) is expected to replace the traditional hybridization-based microarray technology. For genetic studies, the first-step analysis of NGS data is often to identify genomic variants among sequenced samples. Several statistical models and tests have been developed for variant calling in NGS studies. The existing approaches, however, are based on either conventional Bayesian or frequentist methods, which are unable to address the multiplicity and testing efficiency issues simultaneously. In this paper, we derive an optimal empirical Bayes testing procedure to detect variants for NGS studies. We utilize the empirical Bayes technique to exploit the across-site information among the many testing sites in NGS data. We prove that our testing procedure is valid and optimal in the sense of rejecting the maximum number of nonnulls while the Bayesian false discovery rate is controlled at a given nominal level. We show by both simulation studies and real data analysis that our testing efficiency can be greatly enhanced over the existing frequentist approaches, which fail to pool and utilize information across the multiple testing sites.
@article{zhao2013empirical,
title={An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data},
author={Zhao, Zhigen and Wang, Wei and Wei, Zhi},
journal={The Annals of Applied Statistics},
volume={7},
number={4},
pages={2229--2248},
year={2013},
publisher={Institute of Mathematical Statistics}
}
- Empirical Bayes confidence intervals for selected parameters in high dimensional data. (with J. T. Hwang)
Journal of the American Statistical Association, 2013. Volume 108, Issue 502, 607-618. Modern statistical problems often involve a large number of populations and hence a large number of parameters that characterize these populations. It is common for scientists to use data to select the most significant populations, such as those with the largest t statistics. The scientific interest often lies in studying and making inferences regarding these parameters, called the selected parameters, corresponding to the selected populations. The current statistical practices either apply a traditional procedure assuming there was no selection, a practice that is not valid, or they use a Bonferroni-type procedure that is valid but very conservative and often noninformative. In this article, we propose valid and sharp confidence intervals that allow scientists to select parameters and to make inferences for the selected parameters based on the same data. This type of confidence interval allows the users to zero in on the most interesting selected parameters without collecting more data. The validity of confidence intervals is defined as the controlling of Bayes coverage probability so that it is no less than a nominal level uniformly over a class of prior distributions for the parameter. When a mixed model is assumed and the random effects are the key parameters, this validity criterion is exactly the frequentist criterion, since the Bayes coverage probability is identical to the frequentist coverage probability. Assuming that the observations are normally distributed with unequal and unknown variances, we select parameters with the largest t statistics. We then construct sharp empirical Bayes confidence intervals for these selected parameters, which have either a large Bayes coverage probability or a small Bayes false coverage rate uniformly for a class of priors. Our intervals, applicable to any high-dimensional data, are applied to microarray data and are shown to be better than all the alternatives. It is also anticipated that the same intervals would be valid for any selection rule. Supplementary materials for this article are available online.
@article{hwang2013empirical,
title={Empirical Bayes confidence intervals for selected parameters in high-dimensional data},
author={Hwang, JT Gene and Zhao, Zhigen},
journal={Journal of the American Statistical Association},
volume={108},
number={502},
pages={607--618},
year={2013},
publisher={Taylor \& Francis}
}
- Empirical Bayes false coverage rate controlling confidence intervals. (with J. T. Hwang)
Journal of the Royal Statistical Society, Series B, 2012. Volume 74, Issue 5, 871-891. Benjamini and Yekutieli suggested that it is important to account for multiplicity correction for confidence intervals when only some of the selected intervals are reported. They introduced the concept of the false coverage rate (FCR) for confidence intervals, which is parallel to the concept of the false discovery rate in the multiple-hypothesis testing problem, and they developed confidence intervals for selected parameters which control the FCR. Their approach requires the FCR to be controlled in the frequentist sense, i.e. controlled for all the possible unknown parameters. In modern applications, the number of parameters could be large, as large as tens of thousands or even more, as in microarray experiments. We propose a less conservative criterion, the Bayes FCR, and study confidence intervals controlling it for a class of distributions. The Bayes FCR refers to the average FCR with respect to a distribution of parameters. Under such a criterion, we propose some confidence intervals, which, by some analytic and numerical calculations, are demonstrated to have the Bayes FCR controlled at level q for a class of prior distributions, including mixtures of normal distributions and zero, where the mixing probability is unknown. The confidence intervals are shrinkage-type procedures which are more efficient for parameters that have a sparsity structure, which is a common feature of microarray data. More importantly, the centre of the proposed shrinkage intervals reduces much of the bias due to selection. Consequently, the proposed empirical Bayes intervals are always shorter in average length than the intervals of Benjamini and Yekutieli and can be only 50% or 60% as long in some cases. We apply these procedures to the data of Choe and colleagues and obtain similar results.
@article{zhao2012empirical,
title={Empirical Bayes false coverage rate controlling confidence intervals},
author={Zhao, Zhigen and Gene Hwang, JT},
journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
volume={74},
number={5},
pages={871--891},
year={2012},
publisher={Wiley Online Library}
}
- Double shrinkage empirical Bayesian estimation for unknown and unequal variances.
Statistics and Its Interface, 2010. Volume 3, 533-541. In this paper, we construct a point estimator when assuming unequal and unknown variances by using the empirical Bayes approach in the classical normal mean problem. The proposed estimator shrinks both means and variances, and is thus called the double shrinkage estimator. Extensive numerical studies indicate that the double shrinkage estimator has lower Bayes risk than the estimator which shrinks the means alone, and the naive estimator which has no shrinkage at all. We further use a spike-in data set to assess different estimating procedures. It turns out that our proposed estimator performs the best and is thus strongly recommended for applications.
@article{zhao2010double,
title={Double shrinkage empirical Bayesian estimation for unknown and unequal variances},
author={Zhao, Zhigen},
journal={Statistics and Its Interface},
volume={3},
number={4},
pages={533--541},
year={2010},
publisher={International Press of Boston}
}
- Empirical Bayes confidence intervals shrinking both means and variances. (with J. T. Hwang and J. Qiu)
Journal of the Royal Statistical Society, Series B, 2009. Volume 71, Issue 1, 265-285. We construct empirical Bayes intervals for a large number p of means. The existing intervals in the literature assume that the variances are either equal or unequal but known. When the variances are unequal and unknown, the suggestion is typically to replace them by unbiased estimators. However, when p is large, there would be an advantage in 'borrowing strength' from each other. We derive double-shrinkage intervals for means on the basis of our empirical Bayes estimators that shrink both the means and the variances. Analytical and simulation studies and application to a real data set show that, compared with the t-intervals, our intervals have higher coverage probabilities while yielding shorter lengths on average. The double-shrinkage intervals are on average shorter than the intervals from shrinking the means alone and are always no longer than the intervals from shrinking the variances alone. Also, the intervals are explicitly defined and can be computed immediately.
@article{hqz2009empirical,
title={Empirical Bayes confidence intervals shrinking both means and variances},
author={Gene Hwang, JT and Qiu, Jing and Zhao, Zhigen},
journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
volume={71},
number={1},
pages={265--285},
year={2009},
publisher={Wiley Online Library}
}
Publications on Statistical Applications
- How TikTok “Locks In” Its Users with Reference to What Happened in the Homeland Version Douyin Application (with Yang Yang, Zi'ang Li, Sabrina (Zi Tong) Peng, Liyu Dou, Changyu Li, and Yinqi Wang)
Antitrust Report, 2-23, November 2023.
- A Feature Sampling Strategy for Analysis of High Dimensional Genomic Data. (with J. Zhang, K. Zhang, and Z. Wei)
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019. Vol. 16, No. 2, 434-441. With the development of high throughput technology, it has become feasible and common to profile tens of thousands of gene activities simultaneously. These genomic data typically have a sample size of hundreds or fewer, which is much smaller than the feature size (number of genes). In addition, the genes, in particular the ones from the same pathway, are often highly correlated. These issues impose a great challenge for selecting meaningful genes from a large number of (correlated) candidates in many genomic studies. Quite a few methods have been proposed to attack this challenge. Among them, regularization-based techniques, e.g., the lasso, are especially appealing because they can do model fitting and variable selection at the same time. However, the lasso regression has known limitations. One is that the number of genes selected by the lasso cannot exceed the number of samples. Another limitation is that, if causal genes are highly correlated, the lasso tends to select only one or a few of them. Biologists, however, desire to identify them all. To overcome these limitations, we present here a novel, robust, and stable variable selection method. Through simulation studies and a real application to transcriptome data, we demonstrate the superiority of the proposed method in selecting highly correlated causal genes. We also provide some theoretical justifications for this feature sampling strategy based on mean and variance analyses.
@article{zhang2017feature,
title={A feature sampling strategy for analysis of high dimensional genomic data},
author={Zhang, Jie and Zhao, Zhigen and Zhang, Kai and Wei, Zhi},
journal={IEEE/ACM transactions on computational biology and bioinformatics},
volume={16},
number={2},
pages={434--441},
year={2019},
publisher={IEEE}
}
- Network analysis in detection of early-stage mild cognitive impairment. (with H. Ni, J. Qin, L. Zhou, J. Wang and F. Hou)
Physica A: Statistical Mechanics and its Applications, 2017. Vol. 478, 113-119. The detection of and intervention for early-stage mild cognitive impairment (EMCI) is of vital importance. However, the pathology of EMCI remains largely unknown, making clinical diagnosis challenging. In this paper, the resting-state functional magnetic resonance imaging (rs-fMRI) data derived from EMCI patients and normal controls are analyzed using complex network theory. We construct the functional connectivity (FC) networks and employ the local false discovery rate approach to detect the abnormal functional connectivities that appear in the EMCI patients. Our results demonstrate that abnormal functional connectivities appear in the EMCI patients and that the affected brain regions are mainly distributed in the frontal and temporal lobes. In addition, to quantitatively characterize the statistical properties of FCs in the complex network, we employ the entropy of the degree distribution together with other well-established measures, namely the clustering coefficient and the graph efficiency. We find that the entropy index, better than the widely used clustering coefficient and efficiency measures, may serve as an assistant and potential marker for the detection of EMCI.
@article{NI2017113,
title = {Network analysis in detection of early-stage mild cognitive impairment},
journal = {Physica A: Statistical Mechanics and its Applications},
volume = {478},
pages = {113-119},
year = {2017},
issn = {0378-4371},
doi = {https://doi.org/10.1016/j.physa.2017.02.044},
url = {https://www.sciencedirect.com/science/article/pii/S0378437117301905},
author = {Huangjing Ni and Jiaolong Qin and Luping Zhou and Zhigen Zhao and Jun Wang and Fengzhen Hou},
}
- Determining candidate single nucleotide polymorphisms in acquired laryngotracheal stenosis. (with M. Anis, J. Khurana, E. Krynetskiy and A. Soliman)
The Laryngoscope, 2018. Vol. 128, Issue 3, E111-E116.
Despite wide adoption of strategies to prevent injury from prolonged intubation and tracheotomy, acquired laryngotracheal stenosis (ALTS) has not disappeared. ALTS' persistence may be due to patient factors that confer unique susceptibility for some. We sought to identify genetic markers in genes associated with wound healing that could be associated with ALTS.
@article{anis2018determining,
title={Determining candidate single nucleotide polymorphisms in acquired laryngotracheal stenosis},
author={Anis, Mursalin M and Krynetskaia, Natalia and Zhao, Zhigen and Krynetskiy, Evgeny and Soliman, Ahmed MS},
journal={The Laryngoscope},
volume={128},
number={3},
pages={E111--E116},
year={2018},
publisher={Wiley Online Library}
}
- Translational genomics of acquired laryngotracheal stenosis. (with M. Anis, J. Khurana, E. Krynetskiy and A. Soliman)
The Laryngoscope, 2014. Vol. 124, Issue 5, E175-E179.
Acquired laryngotracheal stenosis (ALTS) results from abnormal mucosal wound healing after laryngeal and/or tracheal injury. Patients with ALTS often present late after significant reduction of the airway lumen and onset of symptoms. Motivated by the need for earlier detection of affected patients, we sought to investigate genetic markers for ALTS that would identify susceptible patients.
@article{anis2014translational,
title={Translational genomics of acquired laryngotracheal stenosis},
author={Anis, Mursalin M and Zhao, Zhigen and Khurana, Jasvir and Krynetskiy, Evgeny and Soliman, Ahmed MS},
journal={The Laryngoscope},
volume={124},
number={5},
pages={E175--E179},
year={2014},
publisher={Wiley Online Library}
}