17
pages
English
Documents
Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres
17
pages
English
Documents
Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres
Publié par
Langue
English
MODAL CLUSTERING IN A UNIVARIATE CLASS OF
PRODUCT PARTITION MODELS
David B. Dahl
dbdahl@stat.wisc.edu
Department of Statistics, and
Department of Biostatistics & Medical Informatics
University of Wisconsin – Madison
November 3, 2003
Technical Report #1085
Department of Statistics
University of Wisconsin – Madison
The author wishes to thank his Ph.D. adviser Michael Newton for helpful advice and Philippe
Broet for access to data used in the example. The author is supported by the National Institutes of
Health grant number EY07119. Software implementing the algorithm presented in this paper will
be available shortly athttp://www.stat.wisc.edu/˜dbdahl/modal/.MODAL CLUSTERING IN A UNIVARIATE CLASS OF
PRODUCT PARTITION MODELS
David B. Dahl
Abstract
This paper presents an algorithm for finding the maximum a posteriori (MAP) clustering
in a class of univariate product partition models. While the number of possible clusterings of
n observations grows according to the Bell exponential number, the dynamic programming
2algorithm presented here exploits properties of the model to provide anO(n ) search. Hence,
the algorithm can be used to find the MAP clustering for tens of thousands of univariate data
points, whereas previously it could only be approximated through a stochastic search. Inte
grating over the latent location variables in a Dirichlet Process mixture (DPM) model leads
to a product partition model. The paper shows that several univariate, conjugate DPM mix
ture models satisfy the conditions for the mode finding algorithm. The clustering algorithm is
demonstrated with using data from a microarray experiment to detect differential gene expres
sion involving 4,608 genes.
Key Words: Model based clustering; Product partition model; Dirichlet Process mixture
model; MAP clustering
1 Introduction
The paper is organized as follows. Section 2 reviews product partition models which have the
key feature that the likelihood, prior, and posterior for a random partition of observed data are
products over partition components. After defining a class of product partition models, Section
3 proves two theorems regarding modal partitions. This motivates a dynamic programming algo
rithm to find the mode of the partition likelihood and the partition posterior for this class of models.
Section 4 describes how integrating out the component location in a conjugate Dirichlet Process
mixture (DPM) model leads to a product partition model with a particular partition prior. Section
5 provides several specific conjugate DPM models which fall into the class of models to which6
6
our mode finding algorithm applies. Finally, Section 6 presents an illustration of the algorithm
using data from Broet,¨ Richardson, and Radvanyi (2002). The example clusters 4,608 genes from
an experiment to detect differential gene expression using microarray technology and Broet¨ et al.
(2002).
2 Product Partition Models
The mode finding algorithm presented in this paper is relevant to a class of product partition mod
els. In this section, general product partition models are reviewed and notation is adopted. The
next section defines the subclass of interest.
A partitionη ={S ,...,S } of a setS ={1,...,n} has the following properties:1 q 0
• S =∅ fori = 1,...,q,i
• S ∩S =∅ fori =j, andi j
q
• ∪ S =S .j 0j=1
The sets S ,...,S are referred to as partition components. When S = {1,2,3}, for example,1 q 0
there are five possible partitions, namely:
{1,2,3} {1},{2,3} {1,2},{3}
{1,3},{2} {1},{2},{3}.
The set partition {1,2,3} denotes that all three objects belong to the same component, while
{1},{2},{3} is the partition placing each object into its own component. The Bell number, de
notedB(n), is the number of partitions ofn objects and has the recurrence relationB(n + 1) =
Pn n B(k), whereB(0) = 1. Table 1 contains Bell numbers for variousn.
k=0 k
Product partition models, introduced by Hartigan (1990) and Barry and Hartigan (1992), assume
that observations in different components of a partition of the data are independent. That is, then 3 7 50 100 200
47 115 275B(n) 5 877 ≈ 10 ≈ 10 ≈ 10
Table 1: Bell exponential numbersB(n) for variousn. The Bell number counts the number of set partitions
of{1,...,n}.
likelihood for a partitionη ={S ,...,S } with observed datay = (y ,...,y ) is a product over1 q 1 n
the components:
q
Y
p(y|η)∝ f(y ), (1)Sj
j=1
wherey is the vector of observations corresponding to the elements of the componentS . TheS jj
component likelihoodf(y ) — that is, the likelihood contribution from aS — is de S
fined for any non empty componentS ⊂S and can take any number of forms. Specific examples0
off(y ) are given in Section 4.S
The prior distribution for a partitionη is also taken to be a product over the partition components:
q
Y
p(η)∝ h(S ), (2)j
j=1
whereh(S )≥ 0 is defined for each non emptyS ⊂S andc is a normalizing constant. As withj 0 2
the likelihood contributionp(y ),h(S ) is left unspecified for now, but a particular form forh(S )S j jj
is discussed in Section 4.
All inference concerningη is made from the posterior distributionp(η|y). By Bayes theorem, the
posterior is expressed as the product of partition components:
p(η|y) =p(y|η)p(η)/p(y)
q qh ih iY Y
∝ f(y ) h(S )S jj
(3)
j=1 j=1
q
Y
= f(y )h(S ).S jj
j=1
In the context of cluster analysis, a set partition η defines a clustering for observed data y. The
componentsS ,...,S define groups of data referred to as clusters.1 q3 MAP Partition in a Class of Product Partition Models
This section defines a class of product partition models, discusses a strategy to verify that a partic
ular model falls into this class, and then introduces a mode finding algorithm applicable to these
models. The algorithm is defined and its validity is established. Finally, the efficiency of the
algorithm is discussed.
3.1 Class of Models under Consideration
This section gives two conditions related to the mode finding algorithm to be presented shortly.
Condition 1 must hold in order to be able to find the mode of the partition likelihood using the
algorithm. If Condition 2 also holds, then the algorithm can also be used to find the mode of the
partition posterior. Before stating the conditions related to the mode finding algorithm, the key
concept of overlapping components is introduced.
Definition 1 (Overlapping Components). Two partition components S and S are said to bei j
overlapping components ifS contains an integer which is between the smallest and largest integersi
ofS , or vice versa.j
For example,S ={1,3} andS ={2} are overlapping components, butS ={1} andS ={2,3}i j i j
are not.
Condition 1. The likelihood componentf(y ) in (1) has the property that, for some permutationS
of the datay, the product of the likelihood components of any overlapping componentsS andSi j
∗ ∗ ∗ ∗is not greater than the product from some otherS andS , whereS andS representi j i j
a reallocation of the elements of S and S such that the number of elements of S and S is,i j i j
respectively, preserved.
An immediate implication of Condition 1 is that there is a mode of the partition likelihoodp(y|η)
in (1) that does not contain overlapping components, a key feature which will be exploited by the
mode finding algorithm. In the next subsection, a strategy is presented that is useful in verifying
that a particular product partition model satisfies Condition 1. Before presenting the strategy, a
related condition concerning the partition prior is stated.Condition 2. The functionh(S) in (2) only depends upon the number of elements contained inS.
An immediate implication of Conditions 1 and 2 is that there exists a mode of the partition posterior
p(η|y) in (3) that does not contain overlapping clusters and, thus, the algorithm is applicable to
find the posterior mode.
3.2 Strategy for Verifying Condition 1
Condition 2 can be verified by trivial inspection, but it is not immediately obvious how to ver-
ify Condition 1 for a particular f(y ). This subsection presents a general strategy for verifyingS
Condition 1 in the case of univariate data.
The strategy proceeds as follows. Permute the datay such thaty ≤...≤y and suppose thatS1 n i
andS are two overlapping components. Leta andc be the smallest and largest integers inS andj i
letb be the smallest integer inS . SinceS andS are overlapping components, assume (withoutj i j
loss of generality) thata<b<c (otherwise, interchangeS andS when defininga,b, andc).i j
[ [ [ [LetS = (S \{a})∪{b} andS = (S \{b})∪{a}. That is, defineS andS fromS andSi j i ji j i j
] ]
by simply swappinga andb. Likewise, defineS andS by swappingc andb inS andS . Noticei ji j
that swapping integers preserves the number of elements in each component.
Condition 1 is satisfied if
f(y )f(y )≤f(y [)f(y [) (4)S Si j S Si j
or
f(y )f(y )≤f(y ])f(y ]). (5)S Si j S S
i j
Consider two mutually exclusive and exhaustive cases. The first case assumes thatf(y )f(y )≤[ [S S
i j
f(y ])f(y ]). Exploiting properties of the particularf(y ), show that this implies (5). The secondSS S
i j
case is the complement of the first: f(y [)f(y [)>f(y ])f(y ]). Again, using properties of theS S S Si j i j
particular f(y ), show that this second case implies (4). Thus, (4) or (5) holds and, therefore,S
Condition 1 is satisfie