Tree clustering for layout-based document image retrieval

icon

9

pages

icon

English

icon

Documents

2011

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

icon

9

pages

icon

English

icon

Documents

2011

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

Tree clustering for layout-based document image retrievalSimone Marinai, Emanuele Marino, Giovanni SodaDipartimento di Sistemi e InformaticaUniversita` di FirenzeVia S.Marta, 3 - 50139 Firenze - Italymarinai@dsi.unifi.itAbstract vector spaces is still the subject of active research. Somemethods (e.g. [2] [15]) have been proposed to search inWe describe a system for the retrieval on the basis of lay- high dimensional spaces, however these methods degener-out similarity of document images belonging to collections ate to the linear complexity when dealing with spaces hav-stored in digital libraries. Layout regions are extracted and ing more than a few dozens of dimensions. One interestingrepresented with the XY tree. The proposed indexing method feature of the Cluster Tree approach [15] is the use of a clus-combines a new tree clustering algorithm (based on Self Or- tering algorithm in the indexing phase, so as to capture theganizing Maps) with Principal Component Analysis. The uneven distribution of patterns in the feature space.combination of these techniques allows us to retrieve the In this paper we tackle the document indexing by explor-most similar pages from large collections without the need ing the effectiveness of a tree clustering method based onfor a direct comparison of the query page with each indexed Self-Organizing Maps (SOM) [10] extending an approachdocument. that we recently proposed for the efficient word retrievalfrom large document ...
Voir icon arrow

Publié par

Publié le

24 juin 2011

Nombre de lectures

57

Langue

English

Poids de l'ouvrage

1 Mo

Tree clustering for layout-based document image retrieval
Simone Marinai, Emanuele Marino, Giovanni Soda
Dipartimento di Sistemi e Informatica
Universit`a di Firenze
Via S.Marta, 3 - 50139 Firenze - Italy
marinai@dsi.unifi.it
Abstract
We describe a system for the retrieval on the basis of lay-
out similarity of document images belonging to collections
stored in digital libraries. Layout regions are extracted and
represented with the XY tree. The proposed indexing method
combines a new tree clustering algorithm (based on Self Or-
ganizing Maps) with Principal Component Analysis. The
combination of these techniques allows us to retrieve the
most similar pages from large collections without the need
for a direct comparison of the query page with each indexed
document.
1. Introduction
The use of Document Image Retrieval (DIR) techniques
is essential to build successful Digital Libraries [1]. Among
several DIR paradigms, the layout-based retrieval aims at
identifying relevant documents on the basis of similarities
in the page layout. We recently addressed the layout-based
DIR by means of a XY tree-based layout representation
and subsequent tree comparison [13]. To compare the trees
we used both tree edit distance algorithms and similarities
based on a fixed-length encoding of the tree structure by
means of N-gram inspired techniques that allow us to de-
scribe the tree with a 388-dimensional feature vector. In
these methods we used tree grammars to model the varia-
tions in the trees of similar pages which are due to segmen-
tation errors or to structural differences.
Both approaches, and in particular the method based on
the tree edit distance, have significant computational costs
when dealing with large document collections. In the re-
lated problem of exact indexing of Dynamic Time Warping
a fast lower bounding function that helps prune sequences
that could not possibly be the best matches has been re-
cently proposed [9]. However, these techniques have not
yet been adapted to the tree edit distance computation.
On the other hand, efficient search in high dimensional
vector spaces is still the subject of active research. Some
methods (e.g. [2] [15]) have been proposed to search in
high dimensional spaces, however these methods degener-
ate to the linear complexity when dealing with spaces hav-
ing more than a few dozens of dimensions. One interesting
feature of the Cluster Tree approach [15] is the use of a clus-
tering algorithm in the indexing phase, so as to capture the
uneven distribution of patterns in the feature space.
In this paper we tackle the document indexing by explor-
ing the effectiveness of a tree clustering method based on
Self-Organizing Maps (SOM) [10] extending an approach
that we recently proposed for the efficient word retrieval
from large document collections [11].
The method described in this paper groups trees con-
sidering the similarity between vectorial encodings of their
structure. To reduce the search complexity we project the
data in each cluster with Principal Component Analysis
(PCA) and then perform an efficient search in the projected
space. In the retrieval phase the user provides to the sys-
tem a prototype document (with a query by example mech-
anism) and a XY tree is extracted from the query page. A
similarity measure is then computed between the query tree
and the indexed pages by using the SOM clustering and pro-
jected spaces.
The paper is organized as follows: in Section 2 we ana-
lyze some alternatives to compute the tree clustering. The
approaches proposed to perform document indexing and re-
trieval are discussed in Sections 3 and 4, respectively. Some
experimental results are reported in Section 5, whereas a fi-
nal discussion is drawn in Section 6.
2. SOM clustering of page layout
One essential step in the proposed document image re-
trieval system is the clustering of pages on the basis of lay-
out similarity by means of Self Organizing Maps. The Self
Organizing Map (SOM [10]) is a special kind of artificial
neural network that is based on competitive learning. The
purpose of the SOM training is the computation of an opti-
Figure 1. Example of one trained recursive SOM. Empty neurons (not labeled) are activated by inter-
nal or leaf nodes.
mal clustering of a collection of patterns in
.
I
n
t
h
e
S
e
l
f
Organizing Map the neurons are typically arranged in a two
dimensional lattice: the feature map. Each neuron receives
inputs from the input layer (vectors in
) and from the
other neurons in the map. During the learning the network
performs clustering and the neurons are moved in the lat-
tice so as to reflect cluster similarity by means of distances
in the map. To each element in the map it is associated one
real vector (in
) that can be considered as a prototype of
the patterns in the cluster.
The standard SOM model [10] deals with fixed-size vec-
tors that are mapped into a topological map at the end of the
learning process. To compute this map we might represent
each page with a fixed-size vector (e.g. obtained by com-
puting appropriate features in the regions defined by a grid
superimposed to the page [8, 14]). Pages in digital libraries
are frequently perceived as relevant for a query even when
the layout objects have a similar arrangement, but different
physical positions (e.g. a section title that can be in any
position in text columns). To capture these similarities the
grid representation can be less useful than structural repre-
sentations of the page layout, like the XY tree [4, 6]. In the
latter case the layout similarity can be computed by means
of an appropriate tree similarity. In our framework, we first
build a structural page representation (based on XY trees)
and then compute a SOM clustering either using appropri-
ate models designed to deal with Directed Acyclic Graphs
(DAG) [7], or using the standard algorithm with a suitable
tree encoding. These two approaches are shortly analyzed
in the following.
2
I
hs
T
T
vL
vL
hs
vl
D
E
A
vl-vL-hs
vl-hs-vL
hs-I-hs
hs-T-T
vl-hs-I
vl-hs-hs
hs-hs-T
hs-hs-T
B
C
F
A-B-C
D-E-F
MXY tree
Example page
tree-patterns
Balanced
tree-patterns
Chain
Figure 2.
One example page together with the corre-
sponding MXY tree, and the
tree-patterns
occurring in the
tree. For instance there are three
chain tree-patterns
and in
particular the pattern
hs-hs-T
occurs twice.
Artificial neural networks are mostly used to process
items represented with fixed-size feature vectors.
The
Multi Layer Perceptron (MLP) and the Self-Organizing
Map (SOM) are typical examples of these models in the su-
pervised and non-supervised frameworks, respectively. In
recent years, both the MLP and SOM have been extended
to deal with structured data by means of the recursive model
(see [12] for additional references). In particular, the recur-
sive SOM (R-SOM) has been introduced in [7] with the aim
of processing structured data encoded as labeled Directed
Acyclic Graphs (DAGs) in an unsupervised fashion. The
extension from the standard SOM is obtained by taking into
account the unfolding approach initially proposed for recur-
r
e
n
t
M
L
P
s
a
n
d
t
h
e
n
u
s
e
d
i
n
r
e
c
u
r
s
i
v
e
M
L
P
s
.
To simplify the following description we will refer to
trees instead of DAGs. In the unfolding framework each
node of the tree is encoded with a real vector containing fea-
tures of the node and information provided by its children.
This vector is used as input to a neural network (an MLP in
the case of recursive MLP) and the network output is then
used as input to its father node in a recursive way. The ex-
tension to the SOM (see [7] for details) allows to cluster
nodes of the whole training set collection into neurons of
the SOM map. At the end of the training, some neurons of
the SOM will be active mostly for leaf or internal nodes,
whereas other neurons will correspond to roots.
Given a trained R-SOM it is therefore possible to retrieve
the most similar pages by identifying the neuron that is acti-
vated by the query tree root after processing all its subtrees.
The indexed trees (correspondingto pages) associated to the
winning node will define the closest pages with respect to
the query.
To evaluate this idea we trained one R-SOM with 6,000
pages that have been manually labeled with a class name
(e.g. “title page”, “image”). The size of the R-SOM is 60 x
70, and it is significantly larger than the SOM described in
Section 2.2 so as to leave space for both internal nodes and
root nodes. At the end of the R-SOM training process, we
associated each tree with the SOM neuron activated by its
root. Lastly, we labeled the neurons with the most frequent
classes among the associated trees. The result of this pro-
cess is shown in Figure 1, where graphical symbols denote
the different classes associated with each root node.
By analyzing Figure 1, we can see that the neurons cor-
responding to each class (identified with the same sym-
bol) are spread in the map and usually are not clustered
together. Moreover, when inspecting the content of each
neuron, we noticed that pages with different layout are fre-
quently mixed together. This is probably due to the unfold-
ing process, that gives more emphasis to the information
provided by nodes close to the root. We therefore adopted a
different approach to perform the tree clustering for layout-
based DIR as described in the next section.
To solve the above mentioned problems, we propose to
build one vector-based tree representation to be used to train
a SOM map in order to cluster the pages on the basis of
layout similarity. In this section we first describe the word
encoding approach, and then analyze the training algorithm.
2.2.1 Vector-based tree encoding
In our DIR system we use an extension of the XY tree rep-
resentation (the MXY tree) that splits the page layout also
3
H
E
A
B
C
G
F
D
Figure 3. Example of a SOM trained with the proposed algorithm. Each neuron is represented with
the thumbnail of the page whose tree coincides with the cluster center. Pointed regions identify
close neurons having homogeneous content:
A)
Title pages;
B
) Pages with equations and formulas;
C
) Two column text;
D
) Single columns text;
E
) Two column text with a picture on the top;
F
)
T
w
o
column text with a section title in the middle of one column;
G
) Similar to
F
but the section title is at
the top of one column;
H
) Two column text with one picture in one of the two columns.
4
along horizontal and vertical ruling lines in addition to the
usual cuts along white spaces [4]. The MXY tree nodes
contain symbolic attributes describing the purpose of the
node. In particular, internal nodes represent the cut strat-
egy considered (we can have horizontal/vertical cuts along
either spaces or lines), whereas leaves correspond to homo-
geneous blocks in the page (text, image, horizontal/vertical
line). The MXY tree data structure is encoded into a fixed-
size feature vector by taking into account occurrences of
tree-patterns
made by three nodes [3]. Trees composed of
three nodes can have two basic structures. The first pat-
tern has the root and two children and is called
balanced
tree-pattern
. The second pattern is made by the root, one
child, and a child of the second node and is called
chain
tree-pattern
(see Figure 2 as an example).
Since node labels are in a fixed number, then the number
of possible
tree-patterns
is fixed as well. Frequently, sim-
ilar pages have some
tree-patterns
in common and contain
the same number of occurrences of a given
tree-pattern
(for
instance table pages usually contain a large number of
tree-
patterns
with lines). Unfortunately, some patterns appear
roughly in every document, and therefore these patterns are
not very useful for measuring page similarities.
These peculiarities are very similar to the use of index
terms in classic Information Retrieval. We extended the
vector model approach (used in IR for textual documents) to
our representation based on
tree-patterns
. The vector values
are weighted so as to provide more importance to most dis-
criminant terms. One common approach relies on the well
known
tf-idf
weighting scheme. The weight assigned to the
-th
tree-pattern
in page
is computed by:
(1)
where
is the frequency of the
-th
tree-pattern
in the
tree corresponding to page
normalized with respect to
the maximum
tree-pattern
frequency in the tree associated
to page
, N is the total number of pages, and
is the
number of pages containing the
-th
tree-pattern
.
2.2.2 SOM training
By using the approach described in Section 2.2.1 it is possi-
ble to represent each tree in the collection with a fixed-size
feature vector that corresponds to the
tf-idf
weighting of the
occurrences of
tree patterns
in the tree. These vectors can
be used to compute a SOM clustering.
In our model the SOM is used to build a document im-
age database where the clusters contain similar documents
(from the page layout point of view). In Figure 3 we report
one trained SOM obtained with the approach summarized in
this section. The size of the map is 15 x 10 and it is smaller
than the R-SOM described in Section 2.1 since in this case
we do not need to leave space for the internal tree nodes. A
careful inspection of Figure 3 allows us to identify homoge-
neous regions (groups of neurons containing similar pages).
To facilitate this inspection, we marked regions with similar
layout as detailed in the figure caption. The contents of the
clusters are more homogeneous than before, even if in this
case we used a larger dataset (described in Section 5) with
a broad range of books.
3. Document image indexing
After computing a suitable SOM clustering of pages we
can use this map to index the pages. When dealing with
large collections, in order to reduce the SOM training time,
we can compute the SOM by using a sub-set of the pages to
be indexed. A segmentation algorithm is used to extract the
MXY tree from each indexed page. Next, the trees are en-
coded obtaining a vectorial representation whose items are
related to the occurrences of the
tree-patterns
in the MXY
tree. One problem of this approach is the high dimension-
ality of the feature vector (388 dimensions) that is reflected
into a long training time for the SOM. To reduce the cost
of the search in each cluster we also compute, during the
indexing, the projection that best represents the data with
PCA.
The following procedure is repeated for each SOM clus-
ter in order to compute a low dimensional hyperplane by
means of Principal Component Analysis (see e.g.[5], page
568). We first compute the mean vector
and the
co-
variance matrix
of the data in the cluster. The eigenvec-
tors and eigenvalues of
are computed and the eigenvec-
tors are sorted according to decreasing eigenvalue. The first
eigenvectors (
) are combined as columns of
the
matrix
.
I
t
i
s
n
o
w
p
o
s
s
i
b
l
e
t
o
p
r
o
j
e
c
t
t
h
e
d
a
t
a
i
n
the cluster (for instance one point
) onto the
-th dimen-
sional subspace according to
(2)
Summarizing, the document indexing is composed by
the following steps: 1) layout analysis and tree encoding 2)
SOM training and PCA computation from a sub-set of the
documents to be indexed 3) projection of all the documents
in the lower dimensional space.
4. Document image retrieval
Document images are retrieved by using a “query by ex-
ample” approach: given a sample page, the system com-
putes the layout similarity of it with all the indexed pages.
The pages in the database are afterwards ranked on the basis
of this similarity and shown to the user. The page similar-
ity is evaluated considering the distance between the feature
vectors describing the layout.
5
Cluster 2
Cluster 1
Cluster 0
Final rank
Figure 4. From top: Final ranking, and contents of the ranked lists of the three closest clusters for
the query page shown in Figure 2. The query is a page inserted during the digitzation in order to
separate different parts of a volume.
6
Example
14.1
11.9
5.2
13.4
10.1
6.6
2.2
8.7
# queries
10
10
5
10
Table 1. Results obtained for 35 query pages belonging to four classes.
is the number of correct
pages in the final ranking list;
is the number of subsequent correct pages at the top of the list.
For both values we removed the query page (retrieved as well) from the counts. For each class the
average values of
and
are shown.
Without taking into account efficiency issues, one simple
approach for document image retrieval would rely on the
linear comparison of each indexed page with the query doc-
ument image by measuring the distance between the cor-
responding feature vectors. Unfortunately, this approach is
not feasible when large databases are considered. The pro-
posed method takes into account the main features of SOM
clustering and PCA to efficiently address the document re-
trieval. Let us describe the four main steps of the algorithm.
1.
The query
The query page is encoded in the same way as indexed
pages. Obviously, when the query page belongs to the
indexed one, then the page is retrieved in the first posi-
tion.
2.
Cluster search
In the second step we identify the three clusters hav-
ing prototypes closer to the query so as to restrict the
subsequent search. Due to layout variations and to the
presence of noise in scanned pages, it is not usual that
all the most similar pages are contained in the closest
cluster. To tackle this problem we consider the three
closest clusters. In so doing we reduce on the average
the number of patterns to be processed by the subse-
quent step by a factor 100 (for a map having size of 15
x 20).
3.
Search in clusters
After selecting the three closest clusters we identify
the most similar words by means of PCA. During the
search in the three clusters we look for the
nearest
neighborhoods in the projected spaces.
4.
Final ranking
The previous step allows us to identify the most sim-
ilar pages in the projected spaces of the three closest
clusters. To merge the three lists of top-ranked pages
and refine the final ranking we compute the distance in
the original space between the query tree and the trees
in the three lists. It is worthwhile to remark that the
computation in the original space is performed for a
small number of pages and therefore it is not problem-
atic from the computational point of view.
We show in Figure 4 the first 20 pages in the three clus-
ters closest to the query image shown in Figure 2, as well as
the first 20 pages in the final ranking. In this case the first
cluster contains six matching pages, whereas the other clus-
ters contain five and four matching pages, respectively. The
final ranking (on top of the figure) contains seven matching
pages in the first positions of the ranking.
5. Experimental results
To perform our experiments on a large data-set we ex-
tracted the MXY trees and we indexed the pages from 53
books with a total of 22,523 pages. The size of the SOM
is 10 x 15, therefore on the average each neuron contains
around 150 pages. Moreover, we should notice that there
are several neurons corresponding to most common layouts
and usually those neurons are very populated. It is inter-
esting to summarize the disk space required to store these
7
Figure 5. Four examples of retrieval results. Each answer set is sorted left-to-right, top-bottom. The
first image is always the query one.
8
documents. The compressed TIF images need 3.3 GBytes;
the MXY trees 90 MBytes; the cluster information (centers,
projected vectors and original vector-based tree encoding)
47 MBytes.
To perform some preliminary experiments with this data-
set we selected 35 pages whose layout can be grouped into
four main categories. The standard technique to evaluate in-
formation retrieval systems is the computation of precision-
recall plots. However, to build these plots we need to know
all the documents satisfying the query. Since this informa-
tion is not available for this large data-set, we identified the
number of correct answers in the final ranking proposed by
the system. Figure 5 contains four examples of queries to-
gether with the final answer set. Table 1 summarizes the
results obtained for the 35 query pages. Since Table 1 con-
tains average values, it is not possible to describe each indi-
vidual query. For instance, for the queries shown in Figure 5
several correct answers are highlighted. However, in other
cases there are few correct answers since the layout of the
query page is not very common.
6. Conclusions
In this paper we proposed a new approach to perform
document image retrieval relying on XY tree page layout
representation. The most important feature of the proposed
method is the use of Self Organizing Maps to perform doc-
ument layout clustering so as to reduce the computational
cost during retrieval phase. We first empirically compared
two methods to perform the tree clustering and then inte-
grated the most promising one into the query retrieval sys-
tem. Preliminary experimental results with a data-set con-
taining more than 22,000 pages show promising results.
References
[1] H. S. Baird. Digital libraries and document image analysis.
In
Proc. 7th ICDAR
, pages 2–14, 2003.
[2] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree:
an index structure for high-dimensional data. In
Proc. Proc.
22nd VLDB
, pages 28–39, 1996.
[3] F. Cesarini, M. Lastri, S. Marinai, and G. Soda. Page classi-
fication for meta-data extraction from digital collections. In
Proc. DEXA 01
, pages 82–91, 2001.
[4] F. Cesarini, S. Marinai, and G. Soda.
Retrieval by lay-
out similarity of documents represented with MXY trees.
In
Document Analysis Systems V
, pages 353–364. Springer
Verlag, LNCS 2423, 2002.
[
5
]
R
.
O
.
D
u
d
a
,
P
.
H
a
r
t
,
a
n
d
D
.
G
.
S
t
o
r
k
.
Pattern Classification
.
John Wiley & sons, 2001.
[6] P. Duygulu and V. Atalay.
A hierarchical representation
of form documents for identification and retrieval.
IJDAR
,
5(1):17–27, November 2002.
[7] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A self-
organizing map for adaptive processing of structured data.
IEEE Transactions on Neural Networks
, 14(3):491–505,
2003.
[8] J.Hu, R.Kashi, and G.Wilfong. Comparision and classifica-
tion of documents based on layout similarity.
Information
Retrieval
, 2(2/3):227–243, May 2000.
[9] E. Keogh and C. A. Ratanamahatana. Exact indexing of dy-
namic time warping.
Knowledge and Information Systems
,
7:358–386, 2005.
[10] T. Kohonen.
Self-organizing maps
. Springer Series in Infor-
mation Sciences, 2001.
[11] S. Marinai, S. Faini, E. Marino, and G. Soda. Efficient word
retrieval by means of SOM clustering and PCA. In
DAS
2006
, pages 336–347. Springer Verlag- LNCS 3872, 2006.
[12] S. Marinai, M. Gori, and G. Soda. Artificial neural networks
for document analysis and recognition.
IEEE Transactions
on PAMI
, 27(1):23–35, 2005.
[13] S. Marinai, E. Marino, and G. Soda. Layout based document
image retrieval by means of XY tree reduction. In
Proc. 8th
ICDAR
, pages 432–436, 2005.
[14] A. Tzacheva, Y. El-Sonbaty, and E. A. El-Kwae. Document
image matching using a maximal grid approach. In
Pro-
ceedings of the SPIE Document Recognition and Retrieval
IX
, pages 121–128, 2002.
[15] D. Yu and A. Zhang. Clustertree: integration of cluster rep-
resentation and nearest-neighbor search for large data sets
with high dimensions.
IEEE Transactions on Knowledge
and Data Discovery
, 15(5):1316–1337, 2003.
9
Voir icon more
Alternate Text