152 pages, English, 2010
Published on: 1 January 2010
Number of reads: 45
Language: English
File size: 1 MB
Bayesian Mixtures for Cluster Analysis
and Flexible Modeling of Distributions
Dissertation by
Arno Fritsch
submitted to the Department of Statistics,
Technische Universität Dortmund, Germany
in fulfillment of the requirements for
the degree Doktor der Naturwissenschaft
Submitted March 2010
Oral examination held on 11th June 2010
Primary Supervisor: Prof. Dr. Katja Ickstadt
Secondary Supervisor: Prof. Dr. Claus Weihs
Acknowledgements
Thanks to ...
• ... my supervisor Katja Ickstadt, who gave me enough time to pursue
my own research and always encouraged me in my work.
• ... Claus Weihs for refereeing the thesis and to Jörg Rahnenführer
and Marco Grzegorczyk for completing the examination committee.
• ... the European Union and the state of North Rhine-Westphalia for
paying me money during my time at the Centre of Applied Proteomics
(ZAP).
• ... my colleagues from the chair "Mathematische Statistik und
biometrische Anwendungen": Björn Bornkamp, Evgenia Freis, Martin
Schäfer and Jakob Wieczorek, for proofreading this thesis. You earned
your candy bars well!
• ... Brigitte Koths, Eva Brune and Jadwiga Schall for administrative
support.
• ... Uwe Ligges and his "Rechnerhiwis" Sebastian Krey and Olaf
Mersmann for making my computer run almost all of the time.
• ... the people from the seventh floor for a nice working environment
and the talks during lunch and the weekly "cake break".
• ... Oliver Kuß for the idea and data for the goalkeeper application.
• ... Björn Bornkamp for many interesting discussions about Bayesian
statistics, for providing me with useful hints and references and for
help with programming in C.
• ... my family for supporting me with the non-statistical aspects of life.
• ... my sister Barbara for giving me the idea to study statistics.
• ... my daughter Zoe for being a well-behaved baby and allowing me to
get enough sleep to finish my thesis in time.
• ... my wife Anna for being my reference prior.
Contents
1 Introduction
2 Bayesian Mixtures
2.1 Finite Mixtures
2.1.1 Basic Definition
2.1.2 Priors
2.1.3 Identifiability
2.1.4 Model Fitting
2.1.5 Advantages of a Bayesian Approach
2.1.6 Choice of Number of Components K
2.2 Infinite Mixtures
2.2.1 Stick-Breaking Priors
2.2.2 The Dirichlet Process
2.2.3 Model Fitting
2.2.4 Extensions of the Dirichlet Process
3 Flexible Modeling with Bayesian Mixtures
3.1 Approximation of Distributions
3.1.1 Density Estimation
3.1.2 Consistency of Posterior Distribution
3.1.3 Hierarchical Models
3.2 Application: Goalkeepers' Performance in Saving Penalties
3.2.1 Background
3.2.2 Models
3.2.3 Choice of Covariates
3.2.4 Results
4 Introduction to Cluster Analysis
4.1 Classical Methods
4.1.1 Partitioning Clustering
4.1.2 Hierarchical Clustering
4.2 Similarity Measures for Clusterings
5 Cluster Analysis with Bayesian Mixtures
5.1 Conditions for Cluster Analysis: An Example
5.2 Priors Induced by Bayesian Mixtures
5.2.1 Priors on Clusterings
5.2.2 Priors on Number of Clusters
5.2.3 Priors on Pairwise Clustering Probabilities
5.2.4 Prior Setting in Dirichlet Process Mixture Models
5.3 Clustering With a Fixed Number of Clusters
5.3.1 Label-Switching
5.3.2 Identifiability Constraints
5.3.3 Relabeling Algorithms
5.4 Clustering With a Varying Number of Clusters
5.4.1 The Posterior Similarity Matrix
5.4.2 Clustering Methods Based on the Posterior Similarity Matrix
5.4.3 Optimization of Criteria
5.5 Applications
5.5.1 Simulation Study
5.5.2 Leukemia Data
5.5.3 Galactose Data
5.5.4 Iris Data
6 Conclusions and Outlook
A Additional Proof
B Additional Graphs and Tables
C Details on MCMC Sampler
D Sensitivity Analysis of Simulation
E R Package mcclust
Bibliography
List of Figures
1.1 Densities of three mixtures of two normals exhibiting bimodality, heavy tails and skewness
2.1 Illustration of stick-breaking construction process.
2.2 Chinese restaurant process.
3.1 Counts of penalties per goalkeeper and histogram of relative frequencies of saved penalties per goalkeeper.
3.2 Posterior expected random effects distributions.
3.3 Posterior expected probabilities of saving a penalty for the Normal and DP model.
4.1 Dendrogram of genetic distances.
5.1 Probability that cluster membership Z is not uncertain.
5.2 Two mixtures of two normals with interesting features.
5.3 Priors induced by two different choices of $p(\alpha)$ on $\gamma = 1/(\alpha+1)$.
5.4 Illustration of label-switching.
5.5 Visualizations of the posterior similarity matrix.
5.6 Adjusted Rand Index with true clustering for clusterings estimated with six different methods.
5.7 Principal components of leukemia data.
5.8 Posterior similarity matrix for leukemia data.
5.9 Clusterings of leukemia data.
5.10 Mean expression levels of genes from the galactose pathway.
5.11 Iris data. Petal width against petal length and sepal length.
B.1 VI-distance to true clustering for clusterings of simulation study.
B.2 Pairwise posterior probabilities $\pi_{ij}$ for two observations i.
B.3 Principal components of galactose data.
List of Tables
3.1 Average deviance, effective number of parameters and DIC for the different models.
3.2 Estimated odds ratios with 95% credible intervals in the DP model
4.1 The contingency table of two clusterings
5.1 Pairwise prior clustering probabilities for several mixture models
5.2 Optimization results for posterior expectation of Binder's loss
5.3 Mean number of clusters found in the simulation study.
5.4 Mean number of singletons and large clusters for equal cluster size data.
5.5 Similarity measure with true clustering for yeast galactose data.
5.6 Contingency tables of iris grouping with clusterings estimated by different methods.
B.1 Ranking of goalkeepers based on the Dirichlet process mixture model (3.3).
D.1 Average adjusted Rand index with the true clustering for equal cluster size data and different prior settings.
Chapter 1
Introduction
Classical statistical methods often make parametric assumptions about
distributions, the most common one being the assumption of a normal
distribution. Although parametric distributions provide a reasonable
approximation to the truth in many cases, there are also many situations
where their use is not justified. A wide variety of nonparametric methods
have been developed to alleviate this problem. These are, for example,
based on ranks or the empirical distribution function. Many nonparametric
methods, however, lack interpretability. Finite mixture models assume that
a distribution is a combination of several parametric distributions. They
offer a compromise between the interpretability of parametric models and
the flexibility of nonparametric models. Although they are strictly
speaking still parametric models, they are