Guan-Hua Huang, Ph.D.
Institute of Statistics
National Chiao Tung University
1001 Ta Hsueh Road
Hsinchu 300, TAIWAN

Tel: 03-513-1334
Fax: 03-572-8745
Office: 423 Assembly Building 1

Latent Class Modeling

My primary research focuses on developing statistical methods for problems in which the process of interest is unobservable. In many medical studies, the definitive outcome is inaccessible, and a valid surrogate endpoint is measured in place of the clinically most meaningful endpoint. I have developed a latent variable model for analyzing this kind of data structure [1]. The model summarizes the unobservable definitive outcome as an underlying categorical variable and incorporates covariate effects on both the underlying and the measured variables. Importantly, the model framework guarantees identifiability of the two types of covariate effects [1, 3].

I also provide theory and practical methods for selecting the number of underlying variable categories [2]. The proposed approach is based on an analogous method used in factor analysis and does not require repeated model fitting under different numbers of categories.

Statisticians typically estimate the parameters of latent variable models using the Expectation-Maximization algorithm. I propose an alternative two-stage optimization-based approach to model fitting [4]. The proposed approach is theoretically justifiable, directly checks the conditional independence assumption, and converges much faster than the full likelihood approach when analyzing high-dimensional data.
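For context, the standard Expectation-Maximization fit mentioned above can be sketched for the simplest case. This is a minimal illustration, not the two-stage optimization-based approach of [4]: it assumes binary measured items, no covariates, and conditional independence of items within a class. The function name `fit_latent_class_em`, its parameters, and the simulated data are all hypothetical.

```python
import numpy as np

def fit_latent_class_em(Y, n_classes, n_iter=200, seed=0):
    """Plain EM for a latent class model with binary items.

    Assumes items are conditionally independent given the class.
    Y is an (n_subjects, n_items) array of 0/1 values.
    """
    rng = np.random.default_rng(seed)
    n_items = Y.shape[1]
    pi = np.full(n_classes, 1.0 / n_classes)                # class prevalences
    p = rng.uniform(0.25, 0.75, size=(n_classes, n_items))  # P(item = 1 | class)
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities for each subject
        log_post = Y @ np.log(p).T + (1 - Y) @ np.log(1 - p).T + np.log(pi)
        log_post -= log_post.max(axis=1, keepdims=True)     # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update prevalences and item response probabilities
        pi = np.clip(post.mean(axis=0), 1e-12, None)
        pi /= pi.sum()
        p = (post.T @ Y) / post.sum(axis=0)[:, None]
        p = np.clip(p, 1e-6, 1 - 1e-6)                      # keep logs finite
    return pi, p, post

# Simulated example (hypothetical data): two classes, five binary items
rng = np.random.default_rng(1)
true_class = rng.choice(2, size=500, p=[0.3, 0.7])
true_p = np.array([[0.9] * 5, [0.1] * 5])
Y = (rng.random((500, 5)) < true_p[true_class]).astype(float)
pi_hat, p_hat, post = fit_latent_class_em(Y, n_classes=2)
```

Each iteration touches the full likelihood over all items, which gives a rough sense of why alternatives to EM can be attractive for high-dimensional data.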

I also propose a Bayesian framework for jointly estimating the number of latent classes and the model parameters [5]. The proposed approach applies reversible jump Markov chain Monte Carlo to analyze finite mixtures of multivariate multinomial distributions. In the paper, we also develop a procedure for the unique labelling of the classes.

Relevant publications:
3. Huang GH* (2005). Model identifiability. Encyclopedia of Statistics in Behavioral Science. Editors: Brian S. Everitt and David C. Howell. Wiley, New York. Volume 3, 1249-1251.
4. Huang GH*, Wang SM, Hsu CC (2011). Optimization-based model fitting for latent class and latent profile analyses. Psychometrika 76, 584-611.
5. Pan JC, Huang GH* (2014). Bayesian inferences of latent class models with an unknown number of classes. Psychometrika 79, 621-646.

Genetic Analysis

Recently, I have been working on genetic analysis studies. The first study is on endophenotype validation. Endophenotypes, which involve the same biological pathways as diseases but are presumably closer to the relevant gene action than diagnostic phenotypes, have emerged as an important concept in the genetic studies of complex diseases. In this study, we develop a formal statistical methodology for validating endophenotypes. We also propose an index to be used as an operational criterion for validation [6].

The second study is for the analysis of gene expression microarray data. Through the analysis of spike-in, RT-PCR and cross-laboratory benchmark datasets, we evaluate combinations of the most popular preprocessing and differential expression detection methods in terms of accuracy and inter-laboratory consistency [7]. Our results provide general guidelines for selecting preprocessing and differential expression methods in analyzing Affymetrix GeneChip array data.

The third study is on genotype imputation accuracy. Many researchers use the genotype imputation approach to predict the genotypes at rare variants that are not directly genotyped in the study sample. One important question in genotype imputation is how to choose a reference panel that will produce high imputation accuracy in a population of interest. Using whole genome sequence data from the Genetic Analysis Workshop 18 data set, this report compares genotype imputation accuracy among reference panels representing different degrees of genetic similarity to a study sample of admixed Mexican Americans [8].
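As a rough illustration of what "imputation accuracy" can mean, the sketch below compares imputed genotypes with sequenced genotypes at the same variants. The two measures used here (genotype concordance and squared correlation of allele counts) are common choices and are an assumption; they are not necessarily the measures used in [8]. The function name and the toy data are hypothetical.

```python
import numpy as np

def imputation_accuracy(true_genotypes, imputed_genotypes):
    """Return (concordance, squared correlation) for genotypes coded 0/1/2."""
    truth = np.asarray(true_genotypes, dtype=float)
    imputed = np.asarray(imputed_genotypes, dtype=float)
    # Concordance: share of genotypes imputed exactly right
    concordance = float(np.mean(truth == imputed))
    # Squared Pearson correlation between true and imputed allele counts
    r = np.corrcoef(truth, imputed)[0, 1]
    return concordance, float(r ** 2)

# Toy data (hypothetical): minor-allele counts for eight individuals at one variant
truth = [0, 1, 2, 1, 0, 2, 1, 0]
imputed = [0, 1, 2, 0, 0, 2, 1, 1]
concordance, r_squared = imputation_accuracy(truth, imputed)
```

Computing such a summary once per candidate reference panel is one simple way to compare panels on the same study sample.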

The fourth study uses a Bayesian formulation of a clustering procedure to identify gene-gene interactions in case-control studies, called the Algorithm via Bayesian Clustering to Detect Epistasis (ABCDE). The ABCDE uses Dirichlet process mixtures to model SNP marker partitions, and uses Gibbs weighted Chinese restaurant sampling to simulate posterior distributions of these partitions. This study also develops permutation tests to validate the disease association for SNP subsets identified by the ABCDE, which can yield results that are more robust to model specification and prior assumptions [9].
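To give a sense of how a Dirichlet process induces random partitions of SNP markers, the sketch below draws one partition from a plain Chinese restaurant process prior. This is only the prior construction, not the Gibbs weighted Chinese restaurant sampler of the ABCDE, which additionally weights the seating by the case-control data. The function name, the concentration parameter `alpha`, and the use of marker indices as items are assumptions made for illustration.

```python
import random

def sample_crp_partition(n_items, alpha, seed=0):
    """Draw one partition of items 0..n_items-1 from a Chinese restaurant process."""
    rng = random.Random(seed)
    tables = []  # tables[k] holds the item indices seated at table k
    for item in range(n_items):
        # An existing table is chosen in proportion to its size,
        # a new table in proportion to alpha.
        weights = [len(table) for table in tables] + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([item])
        else:
            tables[k].append(item)
    return tables

# Partition 20 hypothetical SNP markers into clusters
partition = sample_crp_partition(n_items=20, alpha=1.0, seed=42)
```

Larger values of `alpha` tend to produce more, smaller clusters; a posterior sampler would replace the size-only weights with weights that also reflect how well each marker fits a cluster.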

Relevant publications:
6. Huang GH*, Hsieh CC, Chen CH, Chen WJ (2009). Statistical validation of endophenotypes using a surrogate endpoint analytic analogue. Genetic Epidemiology 33, 549-558.
8. Huang GH*, Tseng YC (2014). Genotype imputation accuracy with different reference panels in admixed populations. BMC Proceedings 8(Suppl 1):S64.
9. Chen SP, Huang GH* (2014). A Bayesian clustering approach for detecting gene-gene interactions in high-dimensional genotype data. Statistical Applications in Genetics and Molecular Biology 13, 275-297.

Joint Analysis of Transition Probabilities

I have also worked on the analysis of age-related maculopathy (ARM), a leading cause of vision loss in people aged 65 or older. ARM is distinctive in that it can progress, regress, disappear, and recur. I develop a transition model for jointly studying the relationship of incidence, progression, regression and disappearance probabilities with risk factors [10]. The developed method can be widely applied to other diseases with similar transitional characteristics.
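The four kinds of probabilities can be read off a matrix of transitions between disease states at consecutive visits. The sketch below estimates that matrix by simple counting. It ignores risk factors entirely, so it illustrates only the quantities being modeled, not the joint transition model of [10]. The three-state coding (0 = no ARM, 1 = early ARM, 2 = late ARM), the function name, and the toy data are assumptions.

```python
import numpy as np

def transition_matrix(prev_states, next_states, n_states):
    """Empirical transition probabilities; row = state at one visit, column = state at the next."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(prev_states, next_states):
        counts[a, b] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # A state never observed at the earlier visit yields a row of NaN
    with np.errstate(invalid="ignore", divide="ignore"):
        return counts / row_totals

# Toy data (hypothetical): states of ten eyes at two consecutive visits
prev_visit = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
next_visit = [0, 0, 0, 1, 0, 1, 1, 2, 1, 2]
P = transition_matrix(prev_visit, next_visit, n_states=3)

incidence = P[0, 1] + P[0, 2]   # no ARM -> any ARM
progression = P[1, 2]           # early -> late
regression = P[2, 1]            # late -> early
disappearance = P[1, 0]         # early -> no ARM
```

A regression-type transition model would let each of these entries depend on covariates instead of treating them as fixed proportions.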

Relevant publications:
10. Huang GH* (2008). Integrated analysis of incidence, progression, regression and disappearance probabilities. BMC Medical Research Methodology 8:40.

Multiple Ordinal Measurements

Analysis of "multiply-measured" ordinal outcomes is another research topic. My co-authors and I have detailed the challenges of, and strategies for, analyzing such data [11]. We also apply generalized estimating equations methodology to multiple ordinal measurements and develop graphical diagnostic displays to evaluate the adequacy of models [12].

Relevant publications:
11. Bandeen-Roche K, Huang GH, Munoz B, Rubin GS (1999). Determination of risk factor associations with questionnaire outcomes: a methods case study. American Journal of Epidemiology 150, 1165-1178.
12. Huang GH, Bandeen-Roche K, Rubin GS (2002). Building marginal models for multiple ordinal measurements. Journal of the Royal Statistical Society. Series C (Applied Statistics) 51, 37-57.


last updated February 13, 2015