Diploma Thesis

Document Layout Analysis
Diplomarbeit im Fach Informatik
Image Understanding and Pattern Recognition Group
Deutsches Forschungszentrum für Künstliche Intelligenz
Fachbereich Informatik
Technische Universität Kaiserslautern
Prof. Dr. Thomas Breuel

vorgelegt von:
Joost van Beusekom
Matrikelnummer: 34 33 90

Gutachter:
Prof. Dr. Thomas Breuel
Betreuer:
Daniel Keysers
Faisal Shafait

Contents
1 Introduction

3

2 Distance Measures for Document Layouts
2.1 Introduction . . . . . . . . . . . . . . . . . . . .
2.2 Related Work . . . . . . . . . . . . . . . . . . .
2.3 Properties of a Good Distance Measure . . . .
2.4 Definitions . . . . . . . . . . . . . . . . . . . . .
2.5 The Benchmarking Distance Measure . . . . . .
2.6 Distance Measures based on a Block Distance
Method for the Matching Problem . . . . . . .
2.6.1 Overview . . . . . . . . . . . . . . . . .
2.6.2 Block Distances . . . . . . . . . . . . . .
2.6.3 Matching . . . . . . . . . . . . . . . . .
2.7 Evaluation of Distances Measures . . . . . . . .
2.7.1 Evaluation Method . . . . . . . . . . . .
2.7.2 Layout Analysis Algorithms . . . . . . .
2.8 Results . . . . . . . . . . . . . . . . . . . . . . .
2.8.1 Error Rates . . . . . . . . . . . . . . . .
2.8.2 Evaluation of Matching Methods . . . .
2.8.3 Evaluation of Block Distances . . . . . .
2.8.4 Comparison to the FIRE Results . . . .
2.8.5 Error Types . . . . . . . . . . . . . . . .
2.9 Application: Layout Analysis by Example . . .
2.9.1 Evaluation . . . . . . . . . . . . . . . .
2.9.2 Results . . . . . . . . . . . . . . . . . .
2.10 Conclusion and Future Work . . . . . . . . . .

. . .
. . .
. . .
. . .
. . .
and
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .

.
.
.
.
.
a
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
Solution
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .

3 Layout to HTML Converter
3.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Functionalities and Description of the Graphical User Interface
3.2.1 Mouse Buttons: Manual Segmentation . . . . . . . . . .
3.2.2 Menu entries . . . . . . . . . . . . . . . . . . . . . . . .
3.2.3 Buttons . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.4 Checkboxes . . . . . . . . . . . . . . . . . . . . . . . . .
3.3 Handling PDF Files . . . . . . . . . . . . . . . . . . . . . . . .

1

.
.
.
.
.
.
.

5
5
6
8
9
10
11
11
12
15
22
22
24
30
30
30
33
35
36
40
40
41
43
44
44
45
45
45
50
51
52

A Matching Algorithms
A.1 Hungarian Algorithm . . . . . . . . . . . . . . . . . . . . . . . .
A.2 Minimum Weight Edge Cover Algorithm . . . . . . . . . . . . .
A.3 Transportation Problem Solving Algorithm . . . . . . . . . . .
A.3.1 Minimum Cost Method for Finding a Feasible Solution .
A.3.2 Optimisation of the Initial Solution . . . . . . . . . . . .
A.3.3 An Example for Solving the Transportation Problem . .

2

.
.
.
.
.
.

53
53
54
54
54
57
60

Chapter 1

Introduction
For many years people claim that, after thousands of years of predominance of
paper, it will disappear as medium to save and transport information. Although
many products have been developed to reduce the use of paper, for the moment
it is not predictable if paper will ever disappear as information storage. As
nevertheless more and more processes of everyday life are done electronically,
there is a strong need to be able to convert printed documents into electronic
ones. One step to do so is to use optical character recognition (OCR).
One important step in OCR systems is the manipulation of the document
layout. Although the text contains most of the information of a document, the
layout also has a certain importance. Imagine for example how an OCR software
should try to convert a scanned image to a word processor conform document
without considering the layout. No proper reconstruction of the document can
be done.
Or, if a three column newspaper should be displayed on a PDA display,
the document has to be “serialised” in order that the images stay on the right
logical position (near the paragraph referring to them) or that the order of the
paragraphs is conserved. In a first step this requires to know how the layout
originally looked like.
Another example is layout based document image retrieval: as many people
recognise documents in a first step by the layout, it would also be favourable to
have some possibility to search for documents having a layout similar to a given
one. Therefore it is necessary to be able to compare layouts. Considering these
applications, it becomes important to take care of the problem of document
layout analysis.
In layout analysis there are a lot of different areas of research. A good
overview is given by Cattoni et al. in [7]. One major part of the publications
are algorithms used to segment a page into homogeneous areas, which is called
page segmentation. Another part does logical layout analysis and tries to label
the different blocks according to their function in the text (e.g. author, title,
abstract, footnote, etc.). On the domain of benchmarking for layout analysis
algorithms also a bunch of work has been done. Benchmarking tries to evaluate
how good a given algorithm works, e.g. how close the automatically created
layout is to the ground truth, normally produced by a human.
Many methods and algorithms exist in layout analysis for many different
applications. Despite this collection of methods, there are only few methods for
3

distance measuring for layouts. Distance measures can be used for several applications, e.g. document image retrieval by layout analysis, page segmentation
by example and benchmarking for page segmentation algorithms. This is the
reason why in this diploma thesis methods for distance measuring have been
analysed. This will be discussed in the first part of this thesis.
In the second part, a tool is presented to visualise the results of layout
analysis algorithms. This tool can also be used to convert segmentations of
pages to HTML tables, which are a widely spread format that allows simple
visualisation on the one hand, and further processing of the segmentation data
on the other hand.

4

Chapter 2

Distance Measures for
Document Layouts
2.1

Introduction

The importance of layout analysis having been described in Chapter 1, it also
is important to be able to compare document layouts. This comparison can
be done using a layout distance measure. A distance measure for layouts is
for example used to benchmark layout analysis algorithms: the output of an
algorithm has to be compared to ground truth in order to rate it. Another
application is layout based document retrieval: in this case a query layout is
compared to a database of known layouts in order to find layouts that are similar, e.g. layouts from the same journal or magazine. The main application for
the distance measure presented here will be layout based document image retrieval. A third application could be the combination of different layout analysis
algorithms: one approach would be to take the output of several known layout
analysis algorithms and to compute the distance for these outputs to a database
of known “good” layouts. Then the output with the smallest distance is taken
as the result of the layout analysis process. The hope is that by this process
the specific errors that each of the known algorithms has, will have less effect
on the final resulting layout. Another approach could be to take one layout
analysis algorithm, to compute a possible layout in the first step, then compare
this layout with the database and to transfer the layout of the best match to
the new document. This kind of application is also known as layout analysis by
example.
There are two categories of layout information: the geometric layout information and the logical layout information. The geometric layout information
consists of data describing how the layout is divided into different homogeneous
regions. In our case these regions are rectangular blocks that are represented
by two points, the lower left and upper right corner coordinates. An example
of a document and its geometric layout information can be found in Figure 2.1
and in Table 2.1.
Logical layout information goes one step further and classifies the blocks into
different categories: e.g. title, abstract and author. In this diploma thesis we
focus on the geometric layout information and do not discuss the logical layout

5

information.
In this chapter two ideas will be analysed: one distance measure, proposed
by Liang et al. in [18], that was initially designed for layout analysis algorithm
benchmarking and a distance measure based on a block distance for blocks for
two layouts. This method was initially used to compare the output of a segmentation algorithm to the ground truth for a given document image. Based on
block information, it computes the different segmentation errors by comparing
the overlapping area of the blocks.
The second method uses various distance measures to compute the distance
between two blocks and three matching functions that can be composed to a
wide variety of different distance measures. As matching methods the minimum
weight edge cover, the assignment problem and the Earth Mover’s Distance have
been chosen. The block distances presented here are based on geometric features of pairs of blocks as e.g. the difference in width, the distance between their
centres and their overlapping area. Different combinations of block distances
with matching methods have been tested. This will be discussed in the second
part of this chapter. An example for one possible combination could be the difference in width of two blocks as block distance, combined with the assignment
problem for the matching step.
Block number
1
2
3
4
5
6
7
8
9
10

Xlow
184
182
181
185
183
184
184
1053
847
1793

Ylow
2923
2716
1538
1334
1201
1052
810
805
840
652

Xhigh
1403
1406
1520
484
744
1521
605
1516
855
1804

Yhigh
2989
2853
2439
1366
1241
1139
856
856
866
669

Table 2.1: The coordinates of the corner points of the blocks of the geometric
layout information as shown in Figure 2.1.

2.2

Related Work

Most of the work in the field of document image retrieval uses specific features
computed on the document image, e.g. font information, connected components,
and texture. An overview of the publications on this subject can be found in [9].
Shin, Doerman and Rosenfeld present in [28] a method using structure-based
features for document classification and retrieval. Eglin and Bres [10] present
a document similarity that in a first step labels the different regions (e.g. text,
image, heading) in the document and then compares these labelled regions to
the regions of another document.
We disregard the image as information and use only block information in
order to analyse to what extent the geometric layout information alone can be
used for document image retrieval. To do so, methods are needed that only
6

Figure 2.1: Example of a page with its geometric layout information. The
original page is painted in grey, the geometric layout information blocks in
black. The black point near the centre of the abstract represents the centre of
gravity of all the blocks.

7

use the geometric layout information to measure the distance. However, the
above-mentioned features are likely to aid the retrieval if include in the process
described here.
Hu et al. present in [13] a two step method for layout comparison. They use
different methods to compute the distance between image rows after a segmentation into a grid of equal-sized cells. Each cell is identified as text cell if at least
half of the cell is part of some text block. In the other case it is a white space
cell. Document images are then compared using dynamic programming on the
row-based representation of the documents. In [12] the use of clustering and a
hidden Markov model for learning of prototypes is discussed in more detail. The
test data used comprises five classes (1-column and 2-column letter, 1-column
and 2-column journal, magazine) and an average error rate of 21.4% is reported.
Unfortunately, the data used are not available, so a direct comparison with the
results is not possible.
Benchmarking methods for layout analysis algorithms have to compare the
output of the algorithms to the ground truth. Because these measuring methods
yield a quantitative description of the difference between two layouts, they can
be used as a distance measure for the task of document image retrieval.
Mao and Kanungo present in [19] their page segmentation evaluation toolkit
in which they use morphological operators to define the sets of missed, merged,
split, and falsely detected text lines. The error types are weighted and a total
metric is obtained by summing up the error types multiplied by their weight.
Liang et al. propose in [18] a method that uses another approach to find
the different errors and then also computes the total error by summing up the
different errors multiplied by their weight.
In [31] Yanikoglu and Vincent present a method similar to the ones mentioned above, but instead of working with regions or blocks, the number of errors
are counted on the basis of pixels. The total error is then again obtained by
summing up the number of errors multiplied by the weighted cost.

2.3

Properties of a Good Distance Measure

The distance measure for layouts for image retrieval should give a small distance
when two layouts are similar and a long distance when two layouts are different.
As mentioned above, the layout is described as a set of blocks. In order to
measure the similarity (or dissimilarity) between two layouts, blocks from one
layout have to be compared to blocks from another layout and a definition for
similarity has to be defined based on these blocks. The following criteria have
been regarded as useful for this measurement:
• Position: If the positions of the blocks of the two layouts are similar then
this will be in favour for the analysed layouts.
• Width: If the width of the blocks is the same, the two layouts may have
the same column width.
• Area: If the two blocks of different layouts differ in area this might be an
indicator that these blocks should not be matched to each other.
Furthermore, it would be useful if the distance measure would be tolerant
for some typical errors that occur during layout analysis, namely splitting and
8

merging errors. It should also allow a few false and missing errors. A merging
error occurs when two blocks of the query layout are transformed into one block
of the reference layout. A splitting error occurs if the inverse case happens. A
false error occurs when one block from the query layout is not matched to any
other block of the reference layout. A missing error occurs when a block from
the reference layout is not matched to a block of the query layout. A last type
of errors, spurious errors, are those errors that do not fit in any of the error
categories mentioned above. Examples for these errors can be found in Figure
2.2.
Another important feature for a distance measure used for queries is that it
should be fast to compute. The distance measure should be able to compare one
query document to a whole database of documents in an “acceptable” amount
of time, where “acceptable” may be defined e.g. that the user obtains the results
for a query in less than 10 seconds, if run on actual hardware.

2.4

Definitions

In the following sections we use these definitions:
• layout: a layout L is a set of blocks Bi : L = {B1 , ..., Bn }
• block: a block B is a pair of points Pi : B = (P1 , P2 ), the lower left and
the upper right corner.
• point: a point P is a defined by a pair of coordinates x, y: P = (x, y)

Figure 2.2: The red part is an example of a merging error, the blue of splitting
and the cyan one of false and the green one of a missing error; the pink one is
a correct match and the yellow ones are spurious errors.
9

2.5

The Benchmarking Distance Measure

Layout analysis algorithm benchmarking tries to find some objective measure
for the performance of a layout analysis algorithm. Benchmarking of layout
analysis algorithms is often done by comparing the resulting layout of a layout
analysis algorithm to the manually extracted layout of the same page, called
ground truth. The idea that lead to the use of this method as distance measure
for layouts was, that for benchmarking layout analysis algorithms, the output
of these algorithms, namely layouts, have to be compared to the ground truth.
So the idea was to take one of these benchmarking methods and try to use its
distance measure for our purposes.
One algorithm that was simple to implement and fast to compute was the
one presented by Liang in [18]. Because of its small computational effort it
would be a good choice for a distance measure for document image retrieval.
This method is based on the computation of the overlapping area of the matched
blocks. But instead of using this area as distance measure, it uses the relations
it gets from this overlapping area analysis to find out what different type of
matching errors have been made by the segmentation algorithm.
Six kind of matches have to be distinguished: one-to-one match (no error),
one-to-zero match (missing error), zero-to-one match (false error), one-to-many
match (splitting error), many-to-one match (merging error) and many-to-many
match (spurious error). Then a certain weight is assigned to every kind of mismatch. The final score is obtained by summing up all the mismatches weighted
with the specified value and dividing this by the total number of boxes in the
two layouts.
The analysis of the overlapping areas, in order to get the number of the
different sorts of matches, is done by computing two “area overlap” matrices Σ
and T . The entries σij and τij are defined below:
σij =

Area(Bi ∩ Bj )
Area(Bi )

(2.1)

τij =

Area(Bi ∩ Bj )
Area(Bj )

(2.2)

where Bi ∈ L1 and Bj inL2 are two blocks from the two layouts we want to
compare.
After having computed these values for every i and j, the different kind of
errors can be extracted by the following rules:
• if σij ≃ 1 and τij ≃ 1: one-to-one match
• if σij ≃ 0, ∀j: one-to-zero match
• if τij ≃ 0, ∀i: one-to-zero match
P
• if σij < 1, ∀j and N
j=1 τij ≃ 1: one-to-zero match
• if τij < 1, ∀i and

PM

i=1

σij ≃ 1: one-to-zero match

• in all the other cases: many-to-many matches
10

where M and N are the number of blocks of the two layouts being compared.
In the case of one-to-one match the blocks are correctly matched (the two
blocks have nearly perfect overlap). A one-to-zero match occurs when one block
has no common area with any other block and so this can be seen as a misdetection. The opposite case, namely the zero-to-one match is called a false alarm.
A one-to-many match is called splitting detection and the many-to-one match a
merging detection. The many-to-many detections are called spurious detections.
The different weights for the types of detections are given in Table 2.2.
The performance measure f (the measure we use as distance measure) is
then defined as:
P
t∈types nt × wt
f=
(2.3)
|P1 | + |P2 |
where types is the set of different error types, |P1 | and |P2 | are the number of
blocks in layouts L1 and L2 respectively and nt is the number of errors of type
t, where t ∈ types.
This measure f is then considered as a distance measure between two layouts.
Type
Weight

Correct
0.0

Merge
0.5

Split
0.5

Miss
1.0

False
1.0

Spurious
1.0

Table 2.2: The error-weights for the different kinds of matchings

2.6

2.6.1

Distance Measures based on a Block Distance and a Solution Method for the Matching Problem
Overview

As the basic entity of a geometric document layout information is a block, it is
self-evident that a distance measure for two layouts can based on some distance
measure for two blocks in order to obtain a global distance. The idea that
emerged from this observation was to try different distance measures for blocks
(in the following called “block distance”) to get one global distance measure for
two layouts.
Using these block distances, for every pair of blocks of the two layouts the
distance between these blocks is computed. The problem that arises is to match
the blocks from the query layout to the reference layout in order to minimise
the total distance obtained by summing up the block distances of the blocks
that are matched. This problem is referred to as the “matching problem”.
Considering that there are different methods to compute the block distance
and also three different methods to solve the matching problem, several combinations with these two steps can be formed. In this work we evaluate the
different methods for both steps quantitatively.
An illustration of this two-step distance measure can be found in Figure 2.3
and an example illustrating the matching in Figure 2.5.

11

In the first part of this chapter we will present a number of block distances.
The second part then presents the three different matching methods that have
been analysed.

2.6.2

Block Distances

In this part different block distances are presented that can be used to compute
the distance between two blocks. This first computation step returns the cost
matrices needed for the second step, the matching.
Relative and Absolute Block Positions
The coordinates of the blocks give the exact position of the blocks in the original
image. If the same image is translated, e.g. due to a different position of
the document on the scanner, the difference for these two layouts, based on
absolute block positions will probably high. Therefore it is useful to compute
the positions of the blocks relative to some reference point. A possible choice
for this reference point is the centre of gravity of the document layout. This
centre of gravity is computed by considering pixels that are part of a block
as “mass” and then by computing the centre of mass of the layout. Other
methods to make the distance measures robust to translations are also possible.
Robustness against rotation is not discussed here, as normally a deskewing step
is done before the segmentation step.
L_1

L_2

Manhattan Distance
Width
Height
Difference of Area
...

Block Distance Measure

d(b_i, b_j)

Assignment
Edge Cover
Earth Mover’s Distance

Matching

d(L_1, L_2)

Figure 2.3: An illustration of the distance measure construction kit. As input it
gets two layouts, then it computes all the possible distances between the blocks
of these layouts and in the last step the matching is done in order to obtain the
total distance.

12

Manhattan Distance of Corner Points
The idea that led to this distance measure was to create a distance measure
that implements some simple heuristics for a distance measure between two
blocks. As stated above, the criteria for similarity between blocks are similar
positions, similar widths and similar areas of the blocks. In order to include
these features into one distance measure, we opted for the sum of the Manhattan distances of the corner points of the bounding box as a possible distance
measure. Mathematically speaking this means:
Let Bi = (Pk , Pl ) and Bm = (Pn , Pp ) two blocks of the layouts La and
Lb respectively. Then the block distance Dmh is obtained by summing up the
Manhattan distances of the corner points of block Bi and Bm .
Dmh (Bi , Bm ) = dmh (Pk , Pn ) + dmh (Pl , Pp )

(2.4)

dmh (Pa , Pb ) = |xa − xb | + |ya − yb |

(2.5)

where

is the Manhattan distance between two points.
As in this distance position and also area take influence, this might be a
reasonable block distance measure to use.
Overlapping Area
This method computes the distance between two boxes by their overlapping
area. The overlapping area is defined as the number of pixels that belong to the
two boxes being compared and that have the same global coordinates on the
page.
For every pair of boxes (Bi , Bk ), where Bi ∈ La and Bk ∈ Lb the following
distance is computed:
Dov (Bi , Bk ) = 1 −

2 × Ov(Bi , Bk )
Area(Bi ) + Area(Bk )

(2.6)

where Area(Bi ) is the number of pixels (area) of box Bi and Ov(Bi , Bk ) is the
overlapping area of block Bi and Bk . We then obtain for every pair of blocks a
value between 0 and 1 where 0 is a perfect overlap and 1 is no overlap at all.
This block distance also incorporates position as well as area and aspect-ratio
into one measure. Only taking the overlapping area as block distance has one
major drawback: in case of non-overlap, the distance will be 1, no matter how
far the two blocks really are. Therefore we tried to add, instead of using only
the overlap, the normalised Manhattan distance of the corner points in case that
there would be no overlap. So for a block having no overlap in common with
any other block we get a distance between 1 and 2. The Manhattan distance of
the corner points can be normalised by dividing the obtained distance by twice
the maximal possible Manhattan distance. This maximum possible Manhattan
distance is given by the global lowest leftmost point xmin , ymin and the highest
rightmost point xmax , ymax . For the ease of computation one can also take the
sum of the width and the height of the picture as maximal possible Manhattan
distance.
13

Other simple block distances
It is clear that we can construct various other block distances. As they are
almost self-explanatory, they are only defined here:
• Difference in width:
Dw (Bi , Bk ) = |width(Bi ) − width(Bk )|

(2.7)

• Difference in height:
Dh (Bi , Bk ) = |height(Bi ) − height(Bk )|

(2.8)

• Product of difference in width and difference in height:
Dp (Bi , Bk ) = Dh (Bi , Bk ) × Dw (Bi , Bk )

(2.9)

• Distance of block centre:
Dbc (Bi , Bk ) = dm (centre(Bi ), centre(Bk ))

(2.10)

It is important to see that these block distances presented here only constitute different choices we can make. It is not said that all these block distances
are meaningful or will give good results. The idea is to evaluate how well these
simple measurements work and then to try to obtain a better distance measure
by combining different simple block distances, each containing a different kind
of information, to one new block distance. Examples for these simple block
distances can be found in Figure 2.4.
Combination of different block distances
Instead of choosing one block distance to compute the distance between two
blocks it might be useful to combine different block distances into one new
block distance.
The difference in width, for example, contains information concerning the
number of columns we have. But we do not have any information about position.
So we might add, for example, the distance of centres of the blocks to the
difference in width. As one can see, we obtain a lot of different possibilities
bringing some new problems:
• Scale: the different distance measures may have different scales, so they
should all be converted to the same scale, e.g. if we want add area to
Manhattan distance.

14

• Addition or Multiplication: to combine these different distances there are
different methods, especially adding or multiplying: instead of adding
different block distances they also could be multiplied. This is only meaningful in a few cases where one of the block distances has the ability of
defining a perfect match, as e.g. for the overlap: if the overlap distance
equals 0, than the position and the size and the aspect-ratio of the two
blocks is the same, so we have a perfect match. Other ways of combining
the simple block distances are also possible.
• Weighting: different block distances could be weighted differently according to their importance regarding our criteria, e.g. if we add the difference
in width with the difference in height, it would be useful to give more
importance to the difference in width, as for our purpose, column width
is more important than paragraph height.
So we get a lot of possible methods where it is interesting to see how each
of them will perform.

2.6.3

Matching

In this section the different matching methods will be presented. The matching
step matches the blocks from the query layout to the blocks of the reference
layout in a way to minimise the total distance. The total distance is obtained
by summing up the costs of the matches, which are equal to the block distance
between the two blocks participating in the matching. An example is shown in
Figure 2.5.
In the literature the matching problems as used here normally are presented
using the term “cost” instead of “distance”. When we present the general
problem we will use “cost”, if we are applying the method to our purposes we
will use the term “block distance”.
Three different matching methods will be discussed:

Figure 2.4: Example for simple block distances: the difference in width is 2
units, the difference in height is 2 units, the product of difference in width and
difference in height equals 4. The number of overlapping pixels equals zero, so
the overlapping area distance will be 1.45 as the maximum Manhattan distance
is 11, and the Manhattan distance of the corner points equals 10 units.
15

• Assignment Problem: each block is matched at most once.
• Minimum Weight Edge Cover Problem: each block is matched at least
once.
• Earth Mover’s Distance / Transportation Problem: each block is matched
partially to at least one other block.
The name of Earth Mover’s Distance was chosen by Yossi Rubner in [25] as
some CAD programs for road design have a function that allows to compute
the optimal way to move earth from roadcuts to roadfills. The Earth Mover’s
Distance is based on the transportation problem. A very common example
to explain the transportation problem is the one of factories producing some
goods and consumers consuming these goods. Before the consumers can do
so, the goods have to be transported from the factory to the consumers and
this transportation has a certain cost. The assignment problem is a special
case of the transportation problem: the usual example to explained is uses
agents (instead of factories) and tasks (instead of consumers). One agent is
assigned to one task. This assignment also has a certain cost, and as for the
transportation problem, the optimal solution is wanted. The minimum-weight
edge cover problem obtained its name out of graph theory, where the edge cover
for a graph denotes a set of edges that connects the vertices. The minimum
weight derives from the fact that for this application, the edges have a weight
(or cost) and the edges covering the vertices have to be chosen in a way to
minimise the total weight.
Matching by Solving the Assignment Problem
As mentioned above, the aim of the matching step is to match blocks from the
query layout to the blocks of the reference layout by minimising the total cost.
The total cost is the sum of all matches between two blocks multiplied by their
cost that in our case will be given by the corresponding block distance.
For the “Assignment Problem” we allow each block to be matched at most
once and every block that can be matched should be matched. So we have
exactly min(|L1 |, |L2 |) matches. It may happen, that, if the number of blocks
of the two layouts differ, some blocks are not matched. These will be called
“unmatched” blocks.
As the assignment problem is a bipartite graph matching problem, it can be
defined the following way (definitions are taken from [22]):
Let G = (V, E) be a graph consisting of a finite set V of vertices and a finite
set of edges E. A bipartite graph G is a graph whose vertex set V (G) can be
partitioned into two non-empty subsets X and Y and where for every pair of
vertices of one set, no common edge exists.
Now we consider our two layouts L1 and L2 as part of a bipartite graph in
the way that V = (L1 ∪ L2 ).
For every possible pair of blocks Bm ∈ L1 , Bn ∈ L2 we assign costs for an
eventual edge (Bm , Bn ) given by a block distance dgd (Bm , Bn ). We now want
to determine edges between these two subsets such that the number of edges is
equal to the minimum of the size of L1 and L2 and the total cost is minimal.
This problem can be solved by the “Hungarian Algorithm” (also called
“Munkres’ Algorithm”). The assignment problem can be described by a matrix,
16

d(Ba, B1) = 0
Ba

B1

Bb

B2 B3 B4

d(Bb, B2) = 4
d(Bb, B3) = 1
d(Bb, B4) = 4
Figure 2.5: Example of matching blocks: in this example every block of the
query layout (left) is matched to a block of the reference layout (right). The
total costs are given by 0 + 4 + 1 + 4 = 9. The block distance used is the
Manhattan distance of the block centres. This is a minimum weight edge cover
solution. The assignment problem solution would match block Bb to B3, B2
and B4 would remain unmatched.

17

where the entries mij are equal to the block distance between block i and block
j. An example can be found in Table 2.3 and 2.6.3. An example for the optimal
solution for the given cost matrix can be found in Table 2.6.3. A description
and an example for the Hungarian algorithm can be found in Appendix A.1.

Ba
Bb
Bc

B1
dgd (B1 , Ba )
dgd (B1 , Bb )
dgd (B1 , Bc )

B2
dgd (B2 , Ba )
dgd (B2 , Bb )
dgd (B2 , Bc )

B3
dgd (B3 , Ba )
dgd (B3 , Bb )
dgd (B3 , Bc )

Table 2.3: Symbolic cost matrix. dgd stands for a block distance.

Ba
Bb
Bc

B1 B2
1
2
2
4
3
6
Cost matrix

B3
3
6
9

Ba
Bb
Bc

B1 B2
0
0
0
1
1
0
Solution

B3
1
0
0

Table 2.4: An example for an assignment problem and its solution. Each “1”
represents an assignment: block Bc is assigned to block B1 , block Bb to block
B2 and block Ba to block B3 . The optimal solution has a total cost of 10.

Handling non quadratic problems In our application of the assignment
problem it often happens that the problem is not quadratic. This is the case
when the number of blocks in the two layouts differ. After the assignment there
are a number of blocks that are not matched. We have to take care about
handling these blocks appropriately. Simply ignoring them would give good
similarities for layouts consisting of a few blocks only. Penalising them by some
value afterwards could be a possibility but also inserting dummy blocks with a
certain penalty distance in order to get a quadratic problem could be a solution,
which is the approach we followed.
Matching by Solving the Minimum Weight Edge Cover Problem
The minimum weight edge cover problem consists of finding matches for the
same problem as the assignment problem, with the difference that every block
of L1 is connected to at least on block of L2 and vice versa. This problem is
solved using the Hungarian algorithm that was used in the preceding section to
solve the assignment problem. A description and an example for the algorithm
can be found in Appendix A.2. An example for the assignments done by the
minimum weight edge cover matching can be found in Table 2.5.
Matching using the Earth Mover’s Distance
One other interesting approach we analysed is to use the Earth Mover’s Distance
(EMD) as matching method. Instead of matching entire blocks, here blocks can
be divided into parts (in our case pixels) that are assigned to other blocks.

18

Ba
Bb
Bc

B1 B2 B3
1
2
3
2
4
6
3
6
9
Cost matrix

B4
4
8
12

Ba
Bb
Bc

B1 B2 B3
0
0
1
0
1
0
1
0
0
Solution

B4
1
0
0

Table 2.5: An example for minimum weight edge cover based matching. The
left table contains the initial problem, the right one the optimal matching with
a total cost of 14.
The EMD was used in [26] for image retrieval of colour images. Simply
spoken, the idea behind it was to calculate the cost to transform the histogram
of one image to the histogram of the other. In our case we want to build the
reference layout by moving pixels from blocks of the query layout to blocks of
the reference layout. As moving these pixels has a certain cost, a total cost can
be computed to convert one layout into another. Blocks as well as histograms
are called “signatures” in this concept, which is a more general concept.
Signatures
A signature is defined as a set of feature clusters represented by their mean
and by the fraction of pixels (“earth”) that belongs to this cluster: signature
S = {sj = (mj , wj )} where mj is the mean value of the cluster and wj the
fraction of pixels that belong to cluster j. A page layout can be considered as
a signature: the blocks represent the clusters and the fraction is given by the
area of the blocks (the number of pixels in that block).
Transportation Problem
The computation of the EMD is based on the solution of the transportation
problem. The following example illustrates what the transportation problem is:
Imagine there are two factories producing a certain amount of goods each, e.g.
Factory 1 (F1) produces 50 units of goods, Factory 2 (F2) produces 30 units
of goods. On the other side there are 3 consumers that all want to consume a
certain amount of goods. Consumer 1 (C1) needs 25 units, consumer 2 (C2)
45 and consumer 3 (C3) 10 units of goods. In order to get the goods from a
factory to a consumer, they have to be transported, and this transportation
has a certain cost. For example: the transportation of one unit of goods from
F1 to C1 has a cost of 15, from F1 to C2 30, etc. These informations can be
summarised in the so called “transportation tableau” that can be seen in Table
2.6.

F1
F2
Demand

C1
24
30
25

C2
30
40
45

C3
40
42
10

Supply
50
50

Table 2.6: The transportation tableau

19

Coming back to the initial idea of “earth moving” the factories will be replaced by the clusters of a signature S1 and the consumers by the clusters of a
second signature S2 . And the costs to move one unit from signature S1 to signature S2 is given by a block distance between the clusters of S1 and S2 . After
having found the solution for the transportation problem for two signatures we
obtain the total cost of transforming signature S1 to signature S2 . This cost
is divided by the total flow (the total number of transported units) and then
defined as Earth Mover’s Distance.
The algorithm for solving the transportation problem is described in Appendix A.3. A solution for the cost matrix in Table 2.6 can be found in Table
2.7.
Implementation Details For the assignment problem we chose to make the
problem quadratic, in case the number of blocks in the two layouts differs by
inserting dummy blocks. These dummy blocks need a certain penalty distance.
Depending on this distance the algorithm tends to make different mistakes: if
no penalty is assigned, layouts with few blocks will be too good a match. If the
penalty value is too high, the matching method is very inflexible if the number of
blocks in the two layouts differ (which may happen if e.g. splitting or merging
errors occur). Various methods of defining this penalty have been analysed,
e.g. the mean block distance value, no penalty at all, half of the maximum
block distance value, etc., and we opted for a penalty value that is given by the
maximum block distance value that exists between two real blocks.
Normalising the total distance by the number of blocks has also been tried
but did not give better results, due to the fact that by dividing, we lose the
information about the number of blocks per page.
For the minimum weight edge cover method we did not need to specify any
parameters.
Examples for the matching result for the assignment and the minimum
weight edge cover problem are shown in Figure 2.6. The query document (left)
contains blocks on the right side that are not part of the page layout but parts
of blocks from the facing page. These artefacts come from the scanning process.
The Voronoi layout analysis algorithm was used to extract these layouts. As
one can see, the minimum weight edge cover method find a correct match (the
same journal), regardless of these “artefacts”. The assignment problem based
method returns a wrong layout as layouts with approximately the same number
of blocks are preferred, due to the penalty value for unmatched blocks.
For the Earth Mover’s Distance we had to find a solution for the case that
two layouts have different number of pixels (only pixels belonging to blocks
are considered). As the area plays the role of demand and supply in the transportation problem, the initial transportation problem is unbalanced. So dummy

F1
F2

C1
5
20

C2
45
0

C3
0
10

Dummy
0
20

Table 2.7: Optimal solution for the transportation problem given in Table 2.6.
The total cost Ct is obtained by summing up the product of the allocation times
the costs for each cell. Ct = 5 × 24 + 45 × 30 + 20 × 30 + 10 × 42 + 20 × 0 = 2490
20

(a) Best match for the left side query layout by assignment problem matching
using the overlapping area as block distance.

(b) Best match for the left side query layout by minimum weight edge cover
matching using the overlapping area as block distance.

Figure 2.6: Examples for the two matching methods. The lines indicate which
blocks are matched to each other. The block distance used in both cases is the
overlapping area. The best match for the assignment method finds a wrong
layout (not the same journal), whereas the minimum edge cover method finds
a correct layout (same journal). The layouts were extracted using the Voronoi
layout analysis algorithm.

21

blocks have to be inserted to solve the transportation problem. These dummy
blocks do not have a position nor a size, they simply are pixel “producers” or
pixel “consumers”, in order to solve the transportation problem.
For the usual application of the transportation problem, this method works
fine as there can only be transported as much as there is supplied and needed.
The fact that the demand or the supply are too high will have no effect on our
total cost.
This is not appropriate for our purpose. We want to take into account all
pixels, even if the number of pixels in the blocks of the two layouts may differ.
As for the assignment problem, there are at least two possibilities: make the
problem a balanced one or penalise the unmatched pixels afterwards. As the
second solution needs some intelligent way of setting a penalty value and the
first solution gave better results, we opted for the first solution: we normalise
each layout to a size of one and each block does not have a fixed size in pixels
but only a fraction of pixels of the layout that belong to it.
An example how the pixels from one layout are matched to the other layout
for the two methods is shown in Figure 2.7.

2.7
2.7.1

Evaluation of Distances Measures
Evaluation Method

In this section we describe the evaluation method we chose for our distance
measures. The problem with evaluating the distance measures is to define what
a good result should be like. It should return for a given query layout a reference
layout from a database that “looks” similar, but the problem is to provide a
ground truth for that notion without too much manual labelling.
One possibility would be to let a human decide whether the layout found in
the database looks similar to the query layout. This idea has several drawbacks,
especially the time consuming task of visually comparing the results. Another
problem is the lack of objectivity, as every user has a slightly different opinion
of what is similar and what not. So the first tests, run on the UW3 Database,
only gave very little clues about how good the distance measures performed.
A more objective method for evaluation is needed. The idea was use some
document image database that is sorted according to different journals or magazines. As for one specific journal the layout should look very similar for different
articles, a query with a document from one given journal should return a layout
of the same journal.
The database we chose for this task is the MARG (Medical Article Records
Ground-truth) database. It contains 815 scanned documents of first pages of
medical journals, sorted by type (9 different types) and journal (161 different
journals). Further details about the MARG database1 can be found in [11]. The
9 different layout types are shown in Figure 2.8.
As different journals in this database are published by the same publisher
and publishers often use similar layouts for their journals, we furthermore manually sorted the MARG database according to the publisher. After sorting we
obtained 59 different publishers.
1 http://marg.nlm.nih.gov/index2.asp

22

(a) EMD using absolute area of blocks

(b) EMD using normalised area of blocks

Figure 2.7: (a) Result of the EMD without normalisation. (b) Result of the
EMD with the areas normalised to one. The left side shows the query image,
the right side the best match. The colours indicate the corresponding pixels,
where in (a) black is used to indicate unmatched pixels. The block distance
used is the sum of the Manhattan Distance of the corner points.

23

Title
Authors
Affiliation
Abstract
Other

F

A

B

C

E

G

H

I

O

Figure 2.8: The 9 different layout types of the MARG database. Type “O”
stands for “Other” and groups different layouts that do not fit into any of the
other 8 types.
The testing method used is the leaving-one-out test. As the MARG database
has no explicit training or test set, each of the 815 documents is used once as
query layout and is then compared to the remaining 814 layouts. The layout
with the smallest distance then is the result of the query. If the result of the
query is from the same class (journal, publisher or type respectively), the results
considered as correct, otherwise it is considered as an error.
As layout information we first used the layout ground truth of the MARG
database. However, this ground truth only contains information about four
special kinds of blocks, namely “author”, “abstract”, “affiliation” and “title”.
After a few tests, it became clear that this partial layout information was not
a good basis to test the performance of the distance measures, because many
blocks are disregarded. Completely segmented pages are a better choice to run
these tests.
We therefore chose to use known layout algorithms to extract the layout.
This is also the case which is more relevant to a practical task, because in these
it cannot be expected to have access to manually extracted layout information.
Although the algorithms will not yield perfect segmentations, they should produce similar errors for similar layouts. The algorithms we used for extracting
the layout information are: Voronoi algorithm [16], XY-Cut algorithm [21], Run
length smearing algorithm [30], Docstrum algorithm [23], Whitespace algorithm
by Baird [2] and Whitespace algorithm by Breuel [4]. An example of a page
segmented by all of these algorithms can be found in 2.12.

2.7.2

Layout Analysis Algorithms

In the following, we briefly describe the different layout analysis algorithms used.

24

XY-Cut
The idea of this algorithm and also its implementation are very simple: the
pixels of the image of the document are projected horizontally and vertically.
Then we look for the largest possible white gap in the projection and split the
image into two sub-images at this gap. We repeat this procedure recursively
until a stopping criterion is fulfilled.
We obtain by this method a list of horizontal and vertical splits that segments
the whole image into different regions. If we now shrink these regions to the
smallest possible rectangle covering all black pixels in this region we obtain the
different blocks. Depending on the stop criterion, we obtain lager or smaller
blocks.
This method also a few drawbacks. It has problems with separator lines and
also with black copy borders that are typical for scanned or copied book pages.
In the case of presence of such borders, the algorithm will not cut anywhere
because it does not find a gap at all. That is the reason why we first need to
remove the black borders from the images of the database before running the
XY-Cut algorithm. It also can only segment Manhattan layouts. These are
layouts consisting of rectangular blocks only.An example of a page segmented
by the XY-Cut method can be found in Figure 3.1 on page 46.
Run Length Smearing Algorithm
This algorithm by [30] is based on the smearing of black pixels in a binarised
image. The smearing process smears the black pixels (represented by a 1) over
the page in a manner that small white pixels (represented by a 0) are blackened.
This smearing is handled by two simple rules:
• Rule 1: 0’s are changed to 1’s if the number of adjacent 0’s is less than
or equal to a certain threshold C (if the length of a sequence of 0’s is less
than or equal to a threshold, the 0’s are changed to 1’s)
• Rule 2: 1’s stay unchanged
Consider the following example, where 0 represents white and 1 black pixels
and where the first line represents a line of pixels in its original form and the
second line the output of the smearing step of the first line. The smearing
threshold C is set to C = 4.
0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1
First this smearing process is done with the entire document image in horizontal direction with a threshold Ch = 300 and a first smeared image is obtained.
Then the same is done in vertical direction with threshold Cv = 500. These
thresholds have been fixed empirically. Then, these two bitmaps are combined
with the pixel-wise AND operation. An example can be found in Figure 2.10
This bitmap is then smoothed again by the smearing algorithm with a threshold
Cs = 30. Now we have the final image as can be seen in Figure 2.10(e).

25

After this step the bounding boxes have to be extracted, which can be done
by a connected component analysis. This step takes as input the image and returns the bounding boxes for each connected component. A connected component is a set of black pixels that are connected to each other. We are considering
the upper, the lower, the left and the right pixels for this neighbourhood, a so
called 4-neighbourhood (in contrast to the 8-neighbourhood where the diagonal
neighbours are considered, too). The bounding box of a connected component
is the smallest rectangular box that contains all pixels of the connected component. An example of connected components can be found in Figure 2.9.

Figure 2.9: An example of connected components. The grey boxes are the
bounding boxes of the connected components.
Although this method is quite simple and in some cases performs well, there
are these three parameters that depend on the resolution of the image and
that have to be adapted correctly. Furthermore it only segments pages having
rectangular blocks.
Docstrum
The Docstrum method extracts in the first step different important measurements of the document image. In the second step it groups the different connected components to different regions via these measurements.
The Docstrum method is based on connected components that usually represent individual characters, part of characters and merged characters as well
as some punctuation signs. For each of these connected components we compute the k-nearest-neighbour. The k-nearest-neighbour computes the k first
connected components that are the closest to the initial connected component,
using some block distance. In this case the Euclidean distance is used, and
the number k is set to k = 5. For each connected component we then get 5
neighbours to which we have a distance d and we can compute an angle θ. So
for every pair of connected components i and j we get Dij (d, θ).
When this data is plotted one can see different clusters and one can extract
the different distances that interest us, namely distance to the next character
in the same line and distance between lines as well as the skew angle. This is
the angle that gives us the page rotation (e.g.: the page was slightly rotated.
Using these distances, merging of connected components to lines by the
between-characters spacings can be done. Afterwards these lines can be merged
to blocks using the between-lines spacings in order to obtain a fully segmented
page.
26

(a) Original document image (b) After horizontal smearing

(d) After AND

(c) After vertical smearing

(e) After Smoothing

Figure 2.10: Example of the different smearing processes

27

Voronoi
The Voronoi document segmentation algorithm uses so called Voronoi Diagrams
to segment a page. A Voronoi point diagram is a diagram that splits up a region
consisting of points so that for each point, the area next to him is associated
with it. Examples of point Voronoi diagrams can be found in Figure 2.11. The
lines on the border of a region are called Voronoi edges.
The Voronoi segmentation algorithm uses a so called area Voronoi diagram.
Therefore the border points of the connected components are used. For a certain
number of these border points the Voronoi diagrams are constructed. Then, all
the Voronoi edges of points of the same connected component are deleted. After
this step each connected component is surrounded by a box that in general is
not rectangular block. Then, in a last step, further Voronoi edges are deleted
in order to obtain regions. These regions then have to be fitted into a block as
we are only considering Manhattan layouts.
Whitespace Algorithms
The Whitespace algorithm by Baird [2] uses another approach to segment the
page: instead of using the printed black area to segment the page, it tries to
find white spaces between the columns and the lines in order to segment the
document image.
The first step consists of finding maximal white rectangles. These are rectangles that cannot grow in size without adding one or more black pixels. Then
the rectangles that were found will be rated according to the area and the
aspect-ratio of the rectangle. The N -best rectangles will then be chosen as a
partial description of the background. Out of this description the blocks can be
extracted.
The main difference between Baird’s and Breuel’s method is the rectangle
finding step. A few parameters also change, but the approach remains the same.
Example for whitespace analysis can be found in [5]

Figure 2.11: Example of a Voronoi point diagram for three points. The red
dots are the points for which the Voronoi borders (black lines) are drawn. This
example is used to illustrate the idea of the Voronoi diagram. It is not drawn
exactly.

28

(a) Original document image

(b) Run-length smearing algorithm

(c) XY-Cut

(d) Whitespace (Breuel)

(e) Whitespace (Baird)

(f) Docstrum

(g) Voronoi

Figure 2.12: Example for layout analysis algorithms

29

2.8
2.8.1

Results
Error Rates

There are three different error rates that were computed using the MARG
database in order to evaluate the different distance measures:
• Journal (JOUR): This is the error rate for finding the correct journal for a
given query document out of a specific journal. This error rate is the one
that should give us, according to the main idea of this evaluation method,
a good overview about how good the distance measure works, as we expect
to find for a given query document a document of the same journal. There
are document images from 159 different journals.
• Type (TYPE): This error rate gives the ratio of misclassifications of the
document type. This error rate should be low, as the distance measure
should be able to identify the right layout type. Furthermore, only nine
different layout types have to be distinguished.
• Publisher (PUB): This error rate gives the ratio of misclassifications of
the document publisher, e.g. if the query document is from publisher Elsevier but has been classified as publisher Springer. This error rate is less
important because it may be that two journals from the same publisher
have different layout, and also the inverse case is quite frequent.
As the images of the MARG database are scanned very precisely, it is not
necessary to compute the coordinates of the blocks relative to the centre of
gravity of the layout. All results displayed here were gained with absolute
coordinates only.

2.8.2

Evaluation of Matching Methods

As mentioned in Section 2.6.3, three different matching methods have been
tested: the assignment problem, the minimum weight edge cover problem and
the earth mover’s distance based on the transportation problem.
These three methods have been tested with different block distances in order
to find out whether one method has an overall advantage over the other or if
the performance of the matching depends on the block distance, in a way that
a good block distance may compensate a bad matching.
As the ground truth contains only incomplete page segmentations, we ran the
major part of the tests on the output of different layout analysis algorithms. As
[27] stated that the Voronoi layout analysis algorithm is generally a reasonable
choice, we chose this algorithm to illustrate our results in this part.
Table 2.8 shows the error rates obtained by applying the three different
matching methods. The results are representative for various tests with different
layout analysis algorithms that have been performed: in short one can say that
the minimum weight edge cover method performs acceptably, whereas the EMD
and the assignment method perform worse, with a small advantage for the EMD.
Liang et al.’s method, originally used for benchmarking, is not appropriate for
our purposes, according to the results obtained. An example for the best match
with the benchmarking method can be found in 2.13. As one can see there,
the matching is somehow reasonable but there are many errors. One reason
30

why it performs not that good could be the fact that we implemented the
method on our own, so differences may exist between our implementation and
the original implementation. Another problem could be that the method to find
the matchings does not account for the size of the blocks. A small block that is
matched wrongly gets the same penalty as a huge block, although mismatched
small blocks do not necessarily mean, that the two layouts are not similar,
whereas huge blocks that cannot be matched correctly have a higher impact on
the visual difference of the layouts. One other problem are the many parameters
that need to be set: the penalty values on the one hand and the threshold that
defines the limits the interval of acceptable values for the different steps of the
matching computation (e.g. 0 has to be defined as an interval, because perfect
overlap is very improbable).
The reason why the assignment problem performs not as good lies in the
penalty for unmatched blocks: for unbalanced assignment problems (L1 and L2
have different number of blocks) unmatched blocks remain, which are penalised
with the maximum occurring block distance value. This implies that layouts
with similar number of blocks will be preferred, which is not always a wanted
effect. Different other methods for penalising have been tested, without big improvement (the error rate varies, but it stays far away from that of the minimum
edge cover): instead of giving a high penalty value for dummy blocks we gave
them no penalty at all. Although one might think that this approach is a bad
idea, it gave slightly better results. Another idea was instead of penalising the
dummy blocks or not penalising them at all, we gave them some neutral cost,
e.g. the mean distance of all the possible distances. This solution gave worse
results than penalising them. So we conclude that the assignment method has
one major drawback, namely the handling of unmatched blocks.
Compare the EMD to the assignment method, we observe that the EMD
performs slightly better. If we interpret these results we can say that prohibiting
the blocks from splitting up (as does the assignment problem) is not a good
idea, as it penalises very much the splitting and merging errors, errors that
occur very frequently in document layout analysis. Splitting blocks up into very
tiny parts (pixels) is not a good idea either. As one can see in Figure 2.7,
pixels from one block may be spread everywhere on the page, a problem that is
triggered by the transportation problem, but that is not necessarily wanted for
document layout comparison. The minimum weight edge cover method is the
most “natural” method for matching layout blocks: merging and splitting are
not too expensive, as two or more blocks may be matched to the same block
and based on a good block distance blocks are not matched all over the page.
Another important result is the runtime: in Table 2.9 the runtimes for the
leaving-one-out test can be found. The block distance used is overlap combined
Distance Measure
Edge Cover
Assignment
EMD
Liang et al.

JOUR
32.8
52.0
52.3
97.2

TYPE
8.2
22.9
20.2
80.0

PUB
7.6
25.4
23.9
93.5

Table 2.8: Comparison of different matching methods (error rates in [%]). The
used block distance is the overlapping area.
31

(a) Best match for Liang et al. method

(b) Best correct match for Liang et al. method

Figure 2.13: Example of matching done by Liang et al. method. The upper
figure shows the best match, the lower the best correct match (the best match
from the same journal type). Big black rectangles in blocks symbolise manyto-many matches, small rectangles zero-to-one or one-to-zero matches. Lines
symbolise one-to-many, many-to-many or one-to-one matches.

32

with Manhattan distance in case of non overlap. We can see, that the assignment
method and the minimum weight edge cover do not differ much. The fact that
the minimum edge cover is faster although it uses the assignment method is
explained by the size of the problem: for the minimum edge cover the assignment
problem is solved for a quadratic size of min(m, n), where m is the number of
block in layout L1 and n the number of blocks in layout L2 . When we use the
assignment problem as matching method, we make the problem quadratic to
the size of max(m, n). Compared to the Earth Mover’s Distance, both are very
fast. The 28 minutes that the Earth Mover’s Distance need can certainly be
improved, but is stay a more complex problem as the assignment problem.
Matching Method
Minimum weight edge cover
Assignment
Earth Mover’s Distance

Runtime
0m54s
1m02s
28m31s

Table 2.9: Comparison of the runtimes for the matching methods. The tests
were run on an Opteron 2.4GHz.

2.8.3

Evaluation of Block Distances

As shown in the preceding part, the best matching method of the three proposed
methods is the minimum weight edge cover matching. In order to test the
different block distances we used the minimum weight edge cover together with
a few block distances and compared the results. The results were obtained with
the blocks extracted by the Voronoi layout analysis algorithm on the MARG
database. In Table 2.10 the different error rates for the various block distances
can be found.
“Overlap” uses the overlapping area as block distance. So for every pair of
blocks we obtain a value between 0 and 1. If two blocks have no common area
at all, they will get the distance 1. In that case no conclusions can be made how
similar these two blocks are. Therefore we used the block distance “Overlap
+ Manhattan”, that, in case of non-overlap, adds the sum of the Manhattan
distances of the corner points divided by the maximal possible distance on the
page (sum of length and width of the image) to the 1 of the overlap distance.
This way we obtain a value between 0 and 2 for every block and we have some
information which block could be better or worse to match to if we have no
overlap.
“Manhattan Dist. of Corners” simply sums up the Manhattan distances of
the corner points of the two blocks. “Euclidean Dist. of Corners” does the
same but instead of the Manhattan distance it uses the Euclidean Distance to
measure the distance between two points.
”Manh. Dist. of Block Centres” computes the Manhattan distance of the
centre points of two blocks. ”Eucl. Dist. of Block Centres” uses the Euclidean
distance instead of the Manhattan distance.
“Difference in Width” uses the difference in width of two blocks as block
distance, so no explicit position information is used at all.
“Diff. Height × Diff.Width” uses the product of the difference in width and
the difference in height of two blocks. Instead of multiplying these two, one can
33

also sum them up. This is done in “Diff. Height + Diff.Width”.
“Difference in Area” uses the square root of the difference of the area of the
two blocks.
“Difference in Height” computes the difference of the height of two blocks.
As we can see, the overlapping area as block distance works quite well,
compared to the other methods. This comes from the fact that the overlapping
area depends on the position and on the size and also on the aspect-ratio of the
blocks, so a lot of information is contained within this single measurement. The
fact that adding the Manhattan distance of the blocks in case of non-overlap did
not improve noticeably the overall performance may come from the observation
that the best matches normally are made between blocks that are similar and
the rest of the blocks is matched against some other block, although these are
not necessarily similar. So it does not make a difference if we match them
randomly as done for the first method or if we try to improve the matching by
adding some additional feature.
Another conclusion that can be drawn is that the difference in width of two
blocks contains much more information for layout comparison than the difference
in height of the blocks. This is quite obvious as the column width for all the
blocks in a column is the same, but the height of these blocks may vary. In
addition, the column width for one journal is typically fixed, so it is a better
measure than the difference in height to identify the journal.
Furthermore it can be seen that the Manhattan distance has slightly better
results than the Euclidean distance, although the difference is not large.
Comparing “Diff. Height × Diff.Width” with “Diff. Height + Diff.Width”,
it can be seen that in this case the multiplication is the better operation to
combine the two distances. Although the journal error rate is slightly higher,
the type and the publisher error rates are lower. Using the sum to combine the
two measures will lead to small distances only for blocks with approximately
the same length and the same width, whereas the multiplication will lead to
small distances if either the difference in length or the difference in height are
small. A possible explanation for this behaviour may be, that due to the fixed
column width, it happens frequently, for layouts with the same column width,
that the total difference in width will become small, although the difference in
Block Distance
Overlap + Manhattan
Overlap
Manhattan Dist. of Corners
Euclidean Dist. of Corners
Manh. Dist. of Block Centres
Eucl. Dist. of Block Centres
Difference in Width
Diff. Height + Diff.Width
Diff. Height × Diff.Width
Difference in Area
Difference in Height

JOUR
31.2
32.8
39.7
40.7
41.6
43.7
47.4
49.6
50.7
81.8
88.1

TYPE
7.4
8.2
11.3
11.9
13.1
14.3
19.4
17.2
13.2
54.3
60.1

PUB
7.0
7.6
10.5
11.5
13.9
14.8
20.4
18.4
14.6
63.3
70.9

Table 2.10: Comparison of the different block distances (error rates in [%]).

34

height may still be large.
Having a general look at the error rates, even the best of our block distances
has an error rate of 31%. This seems to suggest that a lot of improvement should
still be possible. However, it is unclear what the class overlap of this task is,
i.e. how many journal pages cannot be distinguished by the layout alone.
Recall also that the distance measures use only the layout information, i.e.
the corner coordinates of the blocks. It is highly likely that better error rates
can be achieved when including more information about the blocks, e.g. their
texture, the distribution of bounding box sizes, or the output of a text/graphics
classifier. Furthermore, the method is able to determine the correct layout type
in 92.7% of the cases, which seems a reasonable basis for its use in document
image retrieval.

2.8.4

Comparison to the FIRE Results

To compare the results of our document image retrieval method by comparing
layouts to FIRE2 (Flexible Image Retrieval Engine), that originally was developed to do content based images retrieval on pictures, in a first step the original
images of the MARG database are used. FIRE computes a different features
and then runs the same test as described in Section 2.7.1 (leaving-one-out). The
computed features are:
• Image Features: The images are scaled down to the same size and then
compared using the Euclidean distance, c.f. [8].
• Tamura Texture Features: Coarseness, contrast and directionality of the
images are computed and compared. Further details can be found in [29].
The results are shown in Table 2.11. For the document type one can see, that
this method achieves better results as the presented distance measure, keeping
in mind that FIRE uses the original more information (original image) as the
distance measure (only block information).
To have a comparable results, the same test on images created of the Voronoi
segmentation information was run on FIRE. The images contained only black
blocks, so that there is no more text and it uses the same information as the
distance measure (block information). For this test two other features were
computed: Tamura and thumbnails. The weight is 2 for Tamura and 1 for the
image features. The error rate can be found in the last line of Table 2.11. It
shows that the distance measure presented in this work performs slightly better
then the FIRE.
Features
Only Tamura
Only Image Features
Tam. & Img Feat. (1:1)
Blocks, Tam. & Img Feat. (2:1)

JOUR
24.2
34.1
21.8
33.0

TYPES
8.2
16.3
7.1
10.6

Table 2.11: The error rates in [%] of FIRE.
2 http://www-i6.informatik.rwth-aachen.de/˜deselaers/fire.html

35

2.8.5

Error Types

Having a closer look at the different kinds of errors that the overlapping distance
measure produces, we can see in Table 2.12 the different error types together
with the number of occurrences of each error. The last column contains the
percentage of the error type with respect to the total number of errors. The
four error types are the following:
• ¬J ¬T ¬P: neither the journal nor the document type nor the publisher
have been recognised (total failure).
• ¬J ¬T P: only the publisher has been recognised.
• ¬J T ¬P: only the type has been recognised.
• ¬J T P: type and publisher have been recognised, only the journal is
wrong.
The error where only the journal is wrong is the most frequent. Having a
look at an example of this error in Figure 2.14 one can see that this type of
error it is comprehensible, as the layouts look even for a human person quite
similar. Although only one example is shown, more examples have been looked
at and nearly all of them had a high similarity between the query layout and
the false match.
By visually controlling errors where neither the right journal, the right type
nor the right publisher were found, one can see that these are real failures of
the distance measure. An example for such a failure can be seen in Figure 2.15.
The upper figure shows the best match found. The lower figure contains the
best match for a layout of the same journal as the query layout. This best
match layout fits very well, but nevertheless the global best match has a lower
distance to the query image. Although the matching is good, it seems as if the
splits are more costly than the differing block widths and positions in the global
best match. In order to have an idea in what scale the differences of the two
matches differ, the measured distances are given here: the upper match has a
total distance of 5.72 whereas the lower has a distance of 6.63.
An example for the right publisher but wrong journal and wrong type can
be found in Figure 2.17(a). As shown in the example, the two layouts are very
similar. Having a closer look to the errors of this type revealed that the database
is not error-free. The layouts in the example are of the same type, although the
database keeps them as two different types, which is clearly wrong. All 22 errors
of this type have been visually checked and in 6 cases, there was a real error. In
the other 16 cases, the documents in the database were wrongly classified. An
example for one of these errors in the database can be found in Figure 2.16.
Errors where the neither the journal nor the publisher are correct found are
rare. In most cases these errors have quite different layouts (12 cases), in some
rare cases they result in similar layouts (5 cases). Being difficult to draw a
proper conclusion out of the visual control of these errors, it showed that most
of the errors leading to different layouts were layouts with a lot of blocks and
some segmentations errors. An example for this error type can be seen in Figure
2.17(b).
Having seen these error types, further investigation can to be done: the first
thing to try is to reduce the real failures. This could e.g. be done by using
36

Error Types
¬J ¬T ¬P
¬J ¬T P
¬J T ¬P
¬J T P

Absolute
39
22
17
175

Relative
15.4%
8.7%
6.7%
69.1%

Table 2.12: The different error types together with their absolute and relative
number of occurrences

Figure 2.14: Example of a falsely matched journal, although type and publisher are correct. Overlap as block distance and minimum weight edge cover
as matching method were used.

37

(a) Wrong journal, wrong publisher, wrong type

(b) Best correct match

Figure 2.15: Example for a failure. The upper image shows the best match, the
lower the best match for a layout of the same journal, type and publisher.

Figure 2.16: Example of an error in the MARG database. According to the
database, the type of both documents is different, which is obviously not true.

38

(a) Journal and type are wrong, publisher is correct

(b) Journal and publisher are wrong, type is correct

Figure 2.17: Examples of errors where either the publisher or the type is wrong.

39

k-nearest neighbour, where k is greater than 1. So the final match would be a
reference layout that, of k similar layouts, has the most layouts from the same
journal in this k-neighbourhood. This idea has been tested, results can be found
in Table 2.13, but no improvement has been observed for k = 1, ..., 5.

2.9

Application: Layout Analysis by Example

In this section we outline a method using the distance measure introduced in
the preceding section, that does layout analysis by example. The general idea of
layout analysis by example is to segment a new document by using documents for
which the layout is already known. One could e.g. take one document of a three
column magazine, run a segmentation algorithm on it, take the output and look
for a similar layout in the database. Then, the layout of this similar document is
used to segment the new document. This could be done by adapting the blocks
of the new layout heuristically using the information of the best matching layout,
which is not a trivial problem: given two layouts, one new and not error-free
layout and one known correct layout. By what methods can one combine the
two layouts into one layout that should be more correct as the initial layout
given by the segmentation algorithm?
In order to avoid this problem, we propose another more simple approach:
we used three page segmentation algorithms, namely Voronoi, Docstrum and
Whitespace by Baird to obtain three different segmentations of the same document. These layouts are then compared to a database of known layouts. The
layout of the algorithm that has the smallest distance to a document in the
database is used as the output of this combined segmentation process. By this
method we hope that the different weaknesses of the segmentation algorithms
will become less important.

2.9.1

Evaluation

The method has been tested on the “UW 3” database. The 1600 documents
of the database has been divided into two sets: 878 test documents and 722
reference documents (the database). The two sets are exactly the same as in
[27], in order to be able to compare the performance of this composed method
to the single segmentation algorithms Voronoi, Docstrum and Whitespace.
The method to combine the three different segmentation algorithms works
as follows:
Err.
k=3
k=4
k=5
k=6
k=7

¬J ¬T ¬P
42
49
58
65
72

¬J ¬T P
24
25
27
23
23

¬J T ¬P
18
19
25
27
29

¬J T P
176
183
191
194
201

Jour
31.9%
33.9%
36.9%
76.9%
39.9%

Type
8.0%
9.1%
10.4%
10.8%
11.6%

Pub
7.4%
8.3%
10.2%
11.2%
12.4%

Table 2.13: Number of errors and error rates for the k-nearest neighbours,
k = 3...7.

40

• Step 1: Take a document of the test set and segment it using the three page
segmentation algorithms: Voronoi, Docstrum and Whitespace by Baird.
• Step 2: Compare the three layouts obtained in step 1 to the ground truth
of the reference set using the proposed distance measure. Save the distance
of the best match for each of the three layouts (the best match for each
of the three layouts may be different!).
• Step 3: Choose the algorithm that produced the layout with the shortest
distance to its best match as the winner. The layout produced by this
algorithm is the final segmentation.
The evaluation of this method is done by using the results presented in [27]:
for each document, the algorithm that gave the best segmentation result was
chosen as an optimal solution for the method presented here. Then the error
rate of this optimal choice would be 2.8%. Always choosing Voronoi as best
algorithm would give an error rate of 5.5%. For Docstrum it would be 6.0% and
for Whitespace 9.9%.

2.9.2

Results

The error rates are shown in Table 2.14. The block distance used is the overlapping area combined with the Manhattan distance. As one can see, the method
we proposed does not give the improvement hoped for. It is slightly better as
simply choosing one of the three layouts at random, but it stays above the error
rate one would have if always choosing the Voronoi page segmentation algorithm. A test with the difference in width as block distance gave slightly better
results. An error rate of 5.6% was obtained. Other block distances have been
tested without any improvement of the error rate. Normalizing the distance
by the number of blocks of the new layout has also been tested but did not
succed in any improvement. The same is true for using relative coordinates of
the blocks instead of absolute ones.
Interpreting these results, different conclusions are possible and further tests
have to be done in order to find out what conclusion is the right one:
• Wrong Approach: it may be that the chosen approach is wrong. The
approach bases on the hope, that there is some document in the reference
set where the ground truth layout is similar to the segmented layout and
that then this segmented document is similar to the ground truth. If this
is not the case, e.g. if there is a document where the segmented layout is
very similar to one layout of the ground truth but the original documents
are not the same (this may be due to segmentation errors), then this
method will fail.
• Reference Set: it may be that the reference set is to small to contain a
good match for the segmented output. As the number of different layouts
is even for Manhattan layouts very large, it may happen that there is no
layout in the reference set that is similar to the segmented layout.
• Bad Distance Measure: another possible explanation could be that the
distance measure for comparing two layouts is not adapted for this type
of application. The difference in width giving better results as the overlapping area could be an indication for this hypothesis.
41

In order to find out if the distance measure works well, the following test
has been done: instead of comparing the segmented page to the database, it
was only compared to its ground truth. If the distance measure works well, the
error rate for this test should be near to the optimal. Considering the error
rate of 6.2% for this test, we can conclude, that there is some problem with the
distance measure. An example for a match for this test can be found in Figure
2.18. There you can see, that the distance measure has problems handling small
blocks that probabely come from noise. Further tests have to be done to find
out what exactly causes this approach to fail.

Figure 2.18: Example for comparison of a new document to its ground truth.
Small blocks are problematic as their cost is very high to match them, although
it would be better to ignore them. N.B.: Missing matching lines are caused by
an error in the drawing function (blocks in the middle of the right column are
matched but the line is not visible.
Settings
k=1
k=2
k=3
k=5
k=7
Rel. position, k = 1
Normalised , k = 1
Choosing at random
Block distance: Diff. in width
Block distance: Diff. in height
Block distance: Sum of Manh. dist. of corners
Comparison to ground truth:
Optimal Choice
Always Voronoi
Always Dosctrum
Always Whitespace

Error Rate
6.6
6.6
6.5
6.5
6.5
6.4
7.3
7.0
5.9
6.7
6.3
6.2
2.8
5.5
6.0
9.9

Table 2.14: The error rates for layout analysis by example in [%].

42

2.10

Conclusion and Future Work

We presented a distance measure for document layouts. It is a two step method
that first computes the distance between blocks and then computes a matching
between these blocks. Three different methods were proposed: assignment problem solving, minimum weight edge cover and Earth Mover’s Distance. Different
block distances have also been proposed and tested together with the matching
methods using the publicly available MARG database.
Testing these methods lead us to conclude that the best matching method for
document image retrieval by layout is the minimum weight edge cover method,
as it gave the lowest error rates. Furthermore we saw that the Earth Mover’s
Distance is not appropriate for document image retrieval by layout as its time
consumption is to high.
For the block distances we observed that the overlapping area is the best
choice, as it combines height, width and aspect-ratio into one distance. By
combining it with the Manhattan distance of the corner points of the block it
could slightly be improved. Concerning the computation time we can conclude,
that albeit there are differences for the different block distances, the overall
computation time is quite low, as the block distances are all very simple.
Concerning the different error types we saw, that a small number of errors
were due to errors in the MARG database. Most of the errors are due to different
journals from the same publisher and the same type. Given these facts, it is
traceable that the distance measure has problems to keep apart these layouts,
as even for a human person, the layouts look similar. The rest of the errors are
more problematic errors, as the distance measure in these cases returns layouts
that from a human point of view do not look similar. On the bases of which
further investigation can be done to find out how to improve the algorithm.
An application for the distance measure, namely layout analysis by example was presented, too. Although the results were not as positive as hoped,
different hypothesis for these high error rates could be established and further
examination is needed in order to find out if the approach is a possible step to
improve page segmentation algorithms. If not, other approaches are possible:
using the method described by Breuel in [6] to build a hierarchical tree of all
possible layouts, the distance measure could be used to find the best match of
all these possible segmentations and use that one as segmentation to choose.

43

Chapter 3

Layout to HTML Converter
Extracting geometric layout information is an important process step in layout
analysis. In order to visualise the results of such an algorithm we are interested
in a tool that allows us to check these in a simple manner.
Furthermore it would be interesting to have a common interface for layout
information that allows, apart from simple visualisation, also further processing
as well. HTML tables have been chosen as a good format that allows to visualise
the results, to present them on the net and that is a wide-spread language.
This chapter is divided into three parts: the first part contains a short
description about the requirements for this program. The second part explains
the different functionalities, and the last part gives some information about the
implementation details.

3.1

Requirements

In this part, the different exigencies to the tool are described. The tool should
be able to do the following things:
• Manual Segmentation: The tool should assist the user in manually segmenting a document image into different zones. This should be done by
the XY-Cut manner of splitting recursively horizontally and vertically to
achieve the desired segmentation. The user should have the possibility
to cut at a certain position in the image, simply by clicking on it. Left
mouse clicks should produce horizontal cuts, right mouse clicks vertical
cuts. The results of each cut should be displayed immediately, so that the
user can verify his segmentation.
• Export to HTML: To display the segmentation in a widely used format, the
tool should be able to convert the segmentation to HTML code. The XYCut method of splitting the page recursively into various parts perfectly
fits into the concept of nested HTML tables that can be built recursively,
too. The images, that are the result of the split original image, should be
placed in the right table to guarantee that the resulting HTML code produces the same layout as the original image. Furthermore, white borders
in the partial images should be removed to reduce the space needed to save
partial images. The tool should also allow drawing the HTML borders to
44

visualise the structure of the HTML table and the segmentation. For PDF
files it should paste the text into the concerning table cells instead of the
image.
• Import of segmentation information: As a wide variety of layout analysis
algorithms exists and their code is also available, the tool should offer an
interface to import the output of these algorithms. As most algorithms
output blocks, and not zones as does XY-Cut, the tool should be able
to import this block data and display it on the original image. It also
should be able to convert this block data into an HTML table, although
the conversion from blocks to tables is not trivial.
• Usability: The tool should be easy to use and provide a few little useful
functionalities for the user, as e.g. undoing, redoing, zooming in and
zooming out and displaying crosshairs to simplify the positioning of the
cut to be done.
• Automatic Segmentation: As manual segmentation can be time consuming, the tool should also provide a simple method for automatic segmentation. This segmentation should use XY-Cut algorithm.
• Input file format: As there is a large number of different file formats for
images, a few of them had to be chosen. TIF and PPM should be readable
by the tool, as TIF and PPM are a very wide spread format for document
images. In addition, PDF files should be a possible input. For these, the
original text should be extracted to paste it onto the HTML table.

3.2
3.2.1

Functionalities and Description of the Graphical User Interface
Mouse Buttons: Manual Segmentation

The manual segmentation consists of cutting the page into different parts by
splitting it horizontally and vertically at different positions, just like the XYCut algorithm does. An example for consecutive horizontal and vertical cuts
can be found in Figure 3.1.
The splitting is done by selecting a position in the image and then clicking
the left mouse button for a horizontal split and the right mouse button for a
vertical split. After each split the new segmentation is drawn in green lines.
To facilitate the positioning of the splitting, crosshairs can be activated. A
screenshot of the segmentation process can be seen in Figure 3.2.

3.2.2

Menu entries

There are only two menus, the “File” and the “Edit” menu. The “File” menu
contains the usual items to open and close a file, to quit the program and to
export the segmented page to HTML. The “Edit” menu offers a method for
undoing and redoing segmentation steps and a possibility to activate crosshairs.
In the following, the menu entries are explained in more detail.

45

Figure 3.1: An example for consecutive cuts that segment a page into different
zones.

Figure 3.2: A screenshot of the manual segmentation process. The green line
represents a horizontal cut, the blue lines represent the crosshairs.
46

Open
The open dialog lets you choose the file the user wants to open. Supported file
formats are:
• TIF, PPM: input image to be segmented and converted to HTML
• PDF: input PDF file to be segmented and converted to HTML
• DAT: text file containing coordinates of the corner points in xl yl xh yh integer format of the blocks, where (xl , yl ) defines the lower left and (xh , yh )
the upper right corner of the block. Choosing a DAT file opens a new
“Open” dialog where the user has to choose the image this block information belongs to. There is no check whether it is the right file or not, nor a
check if the DAT file has the right format; the user has to take care of it.
An example for external block data displayed on the original document
can be found in Figure 3.3
Close, Quit
The “Close” function resets the graphical user interface to the initial state and
internally resets all the data structures so that every segmentation information
is lost. “Quit” closes the program.
Export as HTML
The “Export as HTML” function fulfils the main requirement of the program:
it converts a set of horizontal and vertical cuts into an HTML table. This

Figure 3.3: A screenshot of the GUI displaying block information (red blocks)
on the original document image.
47

conversion process is described in the following.
Starting with the current segmentation given by the cuts, done by the user
or obtained by the automatic segmentation, the export method builds a nested
HTML table that looks the same as the segmented image. A nested HTML
table is a table that contains one or more other tables, that in turn also can
contain tables, and so on. The original image is split into different subimages
and these are then pasted into the different table cells.
An example of an HTML version of a segmented image can be found in Figure
3.4. In Figure 3.5 the same segmentation is shown, this time with HTML table
borders.
Undo, Redo, Crosshair
“Undo” removes the currently last cut from the list and redraws the blocks we
obtain. “Redo” moves the recently removed cut back to the list and redraws
the blocks. “Crosshairs” activates or deactivates the crosshair.

Figure 3.4: A screenshot of an HTML version of a segmented page in a browser
window.

48

Figure 3.5: A screenshot of an HTML version with borders of a segmented page
in a browser window.

49

3.2.3

Buttons

A screenshot of the buttons can be found in Figure 3.6.
Zoom in, Zoom out
These buttons are used to zoom in and to zoom out. The factor of zooming
is diminished or increased by 0.1 per step. The minimal factor is 0.2 and the
maximal 10.0.
XY-Cut Image
Clicking this button starts the XY-Cut segmentation algorithm. This algorithm
recursively splits the page in horizontal or vertical direction. The coordinates
where it should split are found by finding the largest possible gaps in the X
and Y projection of the image. A more detailed description of this algorithm
can be found in Section 2.7.2. The result of the segmentation method is then
displayed on the screen. The user has the possibility to manipulate the obtained
segmentation by undoing segmentation steps and by adding new cuts.
XY-Cut Boxes
This method is used to convert the output blocks of some general layout segmentation algorithms to a HTML table. Because converting some arbitrary
blocks to a table is not that trivial, the following method was chosen for this
task:
First, an image is created in which the blocks are simply painted black.
Then, the XY-Cut algorithm splits the image into different regions that can
be converted without problem to a nested HTML table. An illustration of this
procedure can be found in Figure 3.7.

Figure 3.6: A screenshot of the menubar of the HTML conversion tool.

B1: x0, y0, x1, y1
B2: x0, y0, x1, y1
...
B6: x0, y0, x1, y1

<!DOCTYPE html...
<head>
<title>MLAN</title>
...

Figure 3.7: An illustration of the conversion from block data to an HTML
table. On the left side we have the coordinates of the blocks obtained by some
segmentation algorithm. These are converted to an image containing black
blocks. On this image we then run the XY-Cut algorithm. The last step then
converts the result of the XY-Cut algorithm to HTML code.

50

Redraw Blocks
When blocks from another segmentation layout are used as input, these are
also represented on the image. If the user now wants to segment the page, the
program automatically changes the presentation and draws the regions obtained
by the cuts. By pressing this button, the user will see the original blocks from
the input data.
Redraw Cuts
This button has the inverted functionality of the preceding one: instead of
changing the presentation from cuts to blocks, this button changes the representation from blocks to cuts. This can be used if the user wants to check the
result of the automatic segmentation with the input blocks.
Extract Boxes from Colour
Instead of saving the output blocks from layout segmentation process in separate
(text) files, one could also save this information by coding the different blocks
in different colours. So each block will be identified by its own specific colour.
This function finds the minimum bounding box for all the pixels of the different
colour values. So if a bitone image is used, it returns the minimum bounding
box of all the black pixels in the page.

3.2.4

Checkboxes

This section describes the functionality of the checkboxes in the tool bar.
Conserve Original Size in HTML Export
This checkbox is only useful when converting a PDF document. If this checkbox
is enabled, the program tries to conserve the original size of the tables in HTML
export. In the other case, the program does not specify the size of the text cells
but simply uses the normal text flow to define the size of the blocks.
HTML Breaks in HTML Export
This checkbox is only useful when converting a PDF document. Enabling this
checkbox will bring the program to insert <br> at the end of the line. In the
other case it will simply let the text flow decide when to insert a new line.
HTML Borders
This option sets the border size of the HTML tables to some value greater than
zero. In the normal case, the borders have size 0 in order to maintain the original
aspect of the document if the HTML conversion is looked in a browser.
Invert X Axis, Invert Y Axis
These two checkboxes allow to invert the axis of the coordinate system. This is
needed as the coordinate system of images may differ. If the user then loads a

51

DAT file with blocks, the blocks will be in the wrong place. Inverting the axis
will solve this problem.

3.3

Handling PDF Files

As implementing a parser for PDF files is very time consuming and not trivial
and as no free PDF parser library is available, we chose the “pdftotext” program
to extract the text. A little modification had to be done in order to output the
coordinates of the characters and not only the characters.
When the user opens a PDF file, he will be asked to chose the page number
he wants to convert. Then this page will be converted to PNM in background
and the text together with the coordinate information is extracted. Now the
user can segment the page. When exporting it to HTML, the text, instead of
the image, is put into the table. If one cell of the table contains no character,
we presume that there should be an image and copy the corresponding part of
the image into that cell. As we are removing the white borders of the partial
images before saving, entirely white images will be recognised and not saved.
An example of the HTML version of a segmented PDF page can be found
in Figure 3.8.

Figure 3.8: A screenshot of the HTML version of a segmented PDF page. The
border option is turned on and no breaks are inserted.
52

Appendix A

Matching Algorithms
A.1

Hungarian Algorithm

A detailed description of the algorithm can be found in [17]. The algorithm
for solving the assignment problem works as follows (the description and the
example are taken from [24]):
• Step 0: If necessary, rotate the cost matrix so that the number of rows is
less or equal to the number of columns; k = min(n, m), where n equals the
number of rows and m the number of columns of the assignment problem
table.
• Step 1: ∀ row: find the smallest element and subtract it from every element
in this row.
• Step 2: Find a cell containing a zero, called Z: if in the row or column
of Z there is no starred zero, Z will be starred. Repeat this step for all
elements in the matrix.
• Step 3: Cover all columns containing a starred zero. If the number of
covered columns equals k, k = min(n, m), then we are finished, Goto
Done. Else continue with step 4.
• Step 4: Find an uncovered zero and prime it (Z’). If no starred zero Z*
can be found in the row of Z’, Goto step 5. Else cover the row of Z’ and
uncover the column of Z*. Continue until no uncovered zeros are left.
Save smallest uncovered value and goto step 6.
• Step 5: Starting at the Z’ found in step 4, we build a path by alternatively
looking for a starred zero in the column and for a primed zero in the row.
Continue until we find a primed zero where no starred zero can be found
in the same column. For each zero of the path, change starred zeros to
unstarred zeros and change primed zeros to starred zeros. Delete every
prime and uncover every line. Goto step 3
• Step 6: Add the value of step 4 to every element of every uncovered row.
Subtract the value from every element of every uncovered column. Goto
step 4
53

• Done: A starred zero in cell i, j indicates that the block in row i is assigned
to block in column j.
An example for the different steps of this algorithm can be found in Table
A.1.

A.2

Minimum Weight Edge Cover Algorithm

The steps to solve the minimum weight edge cover problem using the Hungarian
algorithm, are the following (the description of the algorithm is from [15] and
from [14]):
• Step 1: find for each node the edge with the minimum weight
• Step 2: for each possible edge, subtract the minimum weights of the two
nodes of this edge
• Step 3: find the minimum value of the matrix obtained in step 2 and
subtract this from every value in order to obtain only positive values.
Apply the Hungarian algorithm.
• Step 4: from the result of the Hungarian algorithm, delete every edge that
has a nonzero weight
• Step 5: for each unmatched node add an edge with minimum weight to
the cover
An example for this method can be found in Table A.2.

A.3

Transportation Problem Solving Algorithm

As mentioned before, the transportation problem has to be solved in order to
compute the EMD. A mathematically more profound approach can be found in
[3]. To explain the algorithm the balanced case is considered. This means that
the sum of all the demands equals the sum of all the supplies. Although this
condition is not always satisfied for signatures, this does not present a problem,
as in this case either a dummy supply or a dummy demand row can be added.
This dummy produces or consumes the difference of supplies and demands between the two signatures. If demand and supply are equal, the problem is called
“balanced”, the other case “unbalanced” transportation problem.
Solving the transportation problem is done in two major steps: the first step
consists of finding an initial feasible solution, The second step is optimising this
solution. For the first part the minimum cost method is used to find a feasible
solution. For the second part the transportation simplex method is used.

A.3.1

Minimum Cost Method for Finding a Feasible Solution

A sketch of the algorithm is given here:

54

B1 B2 B3
Ba
1
2
3
Bb
2
4
6
Bc
3
6
9
Initial Cost Matrix

B1 B2
Ba
0
1
Bb
0
2
Bc
0
3
After Step 1

B3
2
3
6

B1 B2
Ba
0*
1
Bb
0
2
Bc
0
3
After Step 2

B3
2
3
6

B1 B2
Ba
0*
1
Bb
0
2
Bc
0
3
After Step 3

B3
2
3
6

B1 B2
Ba
0*
1
Bb
0
2
Bc
0
3
After Step 4

B3
2
3
6

B1 B2
Ba
0*
0
Bb
0
1
Bc
0
2
After Step 6

B3
1
3
5

B1 B2
Ba
0* 0´
Bb
0´
1
Bc
0
2
After Step 4

B3
1
3
5

B1 B2
Ba 0* 0´
Bb
0´
1
Bc
0
2
After Step 5

B3
1
3
5

B1 B2
Ba
0
0*
Bb 0*
1
Bc
0
2
After Step 3

B3
1
3
5

B1 B2
Ba
0
0*
Bb 0*
1
Bc
0
2
After Step 4

B3
1
3
5

B1 B2
Ba
0
0*
Bb 0*
1
Bc
0
2
After Step 6

B3
0
2
4

B1 B2
Ba
0
0*
Bb 0*
1
Bc
0
2
After Step 4

B3
0´
2
4

B1 B2
Ba
1
0*
Bb
0*
0
Bc
0
1
After Step 6

B3
0´
1
3

B1 B2
Ba
1
0*
Bb
0* 0´
Bc
0´
1
After Step 4

B3
0´
1
3

B1 B2
Ba
1
0*
Bb 0* 0´
Bc 0´
1
After Step 5

B3
0´
1
3

B1 B2
Ba
1
0
Bb
0
0*
Bc
0*
1
After Step 3

B3
0*
1
3

Ba
Bb
Bc

B1 B2
1
0
0
0*
0*
1
Done

B3
0*
1
3

Table A.1: An example for the different steps of the Hungarian algorithm for
solving the assignment problem. Rows or columns in italic symbolise the covering of rows or columns. Values in bold are marked cells that are needed in
further steps.

55

Ba
Bb
Bc

B1
1
2
3

B2
2
4
6

B3
3
6
9

B4
4
8
12

Ba
Bb
Bc

Initial Cost Matrix

B1
1
2
3

B2
2
4
6

B3
3
6
9

B4
4
8
12

Minima found in step 1

B1 B2 B3 B4
Ba
-1
-1
-1
-1
Bb
-1
0
1
2
Bc
-1
1
3
5
Step 2: after subtracting the minima

Ba
Bb
Bc

B1 B2 B3
0
0
0
0
1
2
0
2
4
After step 3

B4
0
3
5

B1 B2 B3 B4
Ba
0
0
1
0
Bb
0
1
0
0
Bc
1
0
0
0
Result of Hungarian Algorithm

Ba
Bb
Bc

B1 B2 B3
0
0
1
0
1
0
1
0
0
After step 5

B4
1
0
0

Table A.2: An example for minimum weight edge cover problem algorithm based
on the Hungarian algorithm.

56

• Step 1: Get the cell with the least costs and allocate the most possible
resources, given by the minimum of demand and supply for this cell. Reduce demand and supply in this column and row by the allocated amount.
If total demand is greater than zero, repeat step 1. Else stop.
This method starts with looking for the cell with the least costs to allocate
one unit (this means that one unit of goods from a specified producer is assigned
to a consumer). After having found this cell, the maximum of units that can be
allocated is determined. This value is given by the minimum of the supply of
this row and the demand of this column. Then the supply and demand for this
row and this column are reduced by this minimum and allocated to the cell.
We now have allocated a transport of min(supp(i), dem(j)) units from F (i) to
C(j), and the remaining demands and supplies were adjusted accordingly. This
process is repeated until no demand and no supply remains (keeping in mind
that only balanced problems are considered, both demand and supply come
simultaneously to zero).
Other methods to find this initial feasible solution are possible. As the
number of iterations of the next part of the algorithm depends on this first
solution, it is of interest to find a good feasible solution (a solution that is
as close to the optimum as possible). This is the reason why we chose the
minimum cost method. Other possibilities to find an initial feasible solution are
the Northwest corner method and Vogel’s approximation model. Descriptions
of these methods can be found in [1]. An example can be found in Table A.8.
Variants of Vogelś method can be found in [20].

A.3.2

Optimisation of the Initial Solution

After having found an initial feasible solution it has to be optimised in order
to get an optimal solution. From the first step we get an array that contains
allocation values in different cells. An example can be found in Table A.3. The
cells having an allocation that is greater than zero will be called “basic cells”,
the other cells will be called “non-basic cells” (other terms used are “non-empty”
and “empty” cells or “occupied” and “non-occupied”cells).
The algorithm now optimises this first solution by repeating the following
three steps:
• Step 1: For each non-basic cell, compute the reduced costs. Select the
non-basic cell with the most negative reduced cost, in other words, the
cell with the lowest reduced cost smaller then zero. If no such cell exists,
then the optimal solution is found.
• Step 2: Generate a stepping stone path from the cell we found in step 1
to the same cell by walking over basic cells.

F1
F2
Remaining Dem.

C1
25 24
0 30
0

C2
5 30
40 40
0

C3
0 40
10 42
0

Dummy
20 0
00
0

Remaining Supp.
0
0

Table A.3: The solution found using minimum cost method. Values in italic
denote the costs for this cell, values in bold the allocation made for this cell.
57

• Step 3: Determine the minimum possible allocation for this path. Update
the allocation matrix by adding or subtracting the minimum possible allocation to the cells on the path. The basic cell with the minimum possible
allocation is changed to a non-basic cell as its allocation becomes zero. If
more than one basic cell on the path is degraded to zero, choose arbitrarily
one that will be changed to a non-basic cell. The non-basic cell we started
with becomes a basic cell. Continue with step 1.
Step 1: Computing reduced costs
This step uses as input the solution obtained from the minimum cost method or
from the preceding iteration and calculates the reduced costs matrix. This matrix gives an overview how the total costs change if the allocations are changed
(e.g. units from one producer are assigned to a different consumer). The reduced cost matrix, that is obtained in this step, indicates for each non-basic cell
how much costs can be saved by assigning one unit of goods to this non-basic
cell. This method works as follows:
Let ui be a variable associated with row i and vj a variable associated with
column j of the transportation problem matrix (we only consider the “cost”
cells, not the last column nor the last row that contain only the supplier and
consumer information). The number of variables is n+m, where n is the number
of blocks of layout L1 and m is the number of blocks of layout L2 .
The relationship for basic cells is given by: ci,j = ui + vj , where ci,j is the
cost of the cell in row i and column j. The relationship for non-basic cells is
given by: rci,j = ci,j − ui − vj , where rci,j is the reduced cost of the non-basic
cell in row i and column j.
We then obtain a set of linear equations with n + m − 1 equations and n + m
of variables. In order to be able to solve this system, the value of ui is set to
zero: ui = 0. Now the linear set of equations can be solved and a value for each
ui and vj is gained and the reduced costs matrix is computed. An example for
the reduced costs matrix can be found in Table A.4.
Step 2: Stepping Stone Path Finding Method
After having computed the reduced cost matrix the non-basic cell with the most
negative reduced cost is chosen. This value indicates that the total costs of the
solution will be reduced by this value for every unit allocated to this cell. So
we are interested to choose the cell with the most negative reduced cost. And if

F1
F2
vi

C1
24
-4 30
24

C2
30
40
30

C3
+8 40
42
32

Dummy
0
-10 0
0

ui
0
10

Table A.4: The reduced cost matrix of the initial solution. Values in italic
denote the costs for this cell, values in bold the reduced costs for the non-basic
cell. Assigning one unit of goods to cell (F2, Dummy) would decrease the total
cost by 10.

58

there is no negative reduced cost cell, the solution we have cannot be improved
and we have an optimal solution.
In order to allocate one unit to this cell, units from other basic cell have
to be removed, as it is not allowed to change the total demand or supply of
one row or column, as these are given as part of the problem. Removing one
unit in this cell implies adding one unit in another cell in order to meet the
constraints of total demand and and total supply for each row and column. So
a stepping stone path is needed that starts and ends in the cell with the most
negative number and that only passes basic cells (except for the starting cell).
An example can be found in Figure A.1. An example of a stepping stone path
for the example above can be found in Table A.5.
The algorithm works as follows:
• Step 1: Mark the starting cell with a “+” sign
• Step 2: Check for a basic unvisited cell in the same row or in the same
column. If there is one, go to step 3. Else go to step 4.
• Step 3: If the cell is the starting cell and the path has a length greater
than 3, stop. Else mark the cell as visited, add it to the path and mark it
with the correct sign: “+”, if the sign of the last cell was a “-” and “-” if
the sign of the last cell on the path was as “+”.
• Step 4: Go one step back on the path. Go to step 2.

2+
5−

1−

1+
9−S+

Figure A.1: Example of a stepping stone path. The “S” cell is the starting
point. The blue line gives the stepping stone path. The Numbers in the cell are
the assignments of the basic cells. The maximum possible new assignment is 1.

F1
F2
vi

C1
25
0
24

C2
5 “+”
40 “-”
30

C3
0
10 0
32

Dummy
20 “-”
0 Start “+”
0

ui
0
10

Table A.5: The reduced cost matrix of the initial solution. Values in italic
denote the sign for the cell (if it is part of the path, in the other case it is set
to 0), values in bold are the current allocations of the cell.

59

Step 3: Updating the Assignment Matrix
After having found the path, we look for the minimum of all the allocations in
basic cells made on this path. This minimum is then added or subtracted to
all the cells that are part of the path, depending on the sign the cell obtained
during the stepping stone path finding method. So the starting cell will be
transformed from a non-basic to a basic cell and one cell on the path will be
transformed from a basic to a non-basic cell because its allocation will become
zero. If the allocation of more than one basic cell is reduced to zero only one of
these cells is changed to a non-basic cell. Now we continue with Step 1.
When the optimisation of the solution is finished, that means when the
reduced costs matrix obtained after step 1 has no more negative number in any
of its cells, the total costs of the optimal solution can be computed by summing
up the product of the allocation of every cell with its cost.

A.3.3

An Example for Solving the Transportation Problem

Given the following transportation problem
Example, Part 1: Initial Solution:
Given the transportation tableau in Table A.6 After the first iteration of the
minimum cost method we obtain the following tableau:
Example, Part 2: Finding the optimal solution
After obtaining a first feasible solution we optimise the solution progressively:
The total cost Ct is obtained by summing up the product of the allocation
times the costs for each cell: Ct = 5×24+45×40+20×30+10×42+20×0 = 2490.

F1
F2
Demand

C1
24
30
25

C2
30
40
45

C3
40
42
10

Dummy
0
0
20

Supply
50
50

Table A.6: The transportation tableau
60

F1
F2
Demand

C1
0 24
0 30
25

C2
0 30
0 40
45

C3
0 40
0 42
10

Dummy
20 0
00
0

Supply
30
50

F1
F2
Demand

After second iteration:
C1
C2
C3
Dummy
25 24 0 30 0 40
20 0
0 30
0 40 0 42
00
0
45
10
0

Supply
5
50

F1
F2
Demand

After third iteration:
C1
C2
C3
Dummy
25 24 5 30 0 40
20 0
0 30
0 40 0 42
00
0
40
10
0

Supply
0
50

F1
F2
Demand

After fourth iteration:
C1
C2
C3
Dummy
25 24
5 30
0 40
20 0
0 30
40 40 0 42
00
0
0
10
0

Supply
0
10

F1
F2
Demand

After fifth and last iteration:
C1
C2
C3
Dummy
25 24
5 30
0 40
20 0
0 30
40 40 10 42
00
0
0
0
0

Supply
0
0

Table A.7: The minimum cost method. Values in italic denote the costs for this
cell, values in bold the allocation made for this cell.

61

1. Iteration:
Computing reduced costs:
C1
C2
C3
Dummy ui
F1 25 0
50
0 +8
20 0
0
F2
0 -4 40 0 10 0
0 -10
10
vi
24
30
32
0
Select cell (F2, Dummy) as starting cell for the stepping stone path.
Stepping Stone path:
c(2,4) “+” → c(1,4) “-” → c(1,2) “+” → c(2,2) “-”
Minimum possible allocation on this path: 20 units.
Then we update the allocations and compute the new reduced costs:
2. Iteration:
C1
C2
C3
Dummy ui
F1 25 0 25 0 0 +8
0 +10
0
F2
0 -4 20 0 10 0
20 0
10
vi
24
30
36
-6
Select cell (F2, C1) as starting cell for the stepping stone path.
Stepping Stone path:
c(2,1) “+” → c(1,1) “-” → c(1,2) “+” → c(2,2) “-”
Minimum possible allocation on this path: 20 units.
Then we update the allocations and compute the new reduced costs:
3. Iteration:
C1
C2
C3
Dummy ui
F1
50
45 0 0 +4
0 +6
0
F2 20 0 0 +4 10 0
20 0
10
vi
24
30
36
-6
STOP, as no non-basic cell has a reduced cost less than zero.
Table A.8: The transportation simplex method. Values in italic denote the
reduced costs for this cell, values in bold the allocation made for this cell.

62

List of Figures
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18

Example for geometric layout information . . .
Segmentation errors . . . . . . . . . . . . . . .
Overview of the distance measure method . . .
Example for block distances . . . . . . . . . . .
Example for block matching . . . . . . . . . . .
Examples for two matching methods . . . . . .
Examples for EMD matching . . . . . . . . . .
MARG document layout types . . . . . . . . .
Example for connected components . . . . . . .
Example for the run-length smearing algorithm
Example of a Voronoi diagram . . . . . . . . .
Example for layout analysis algorithms . . . . .
Example of Liang et al. method . . . . . . . . .
Example of mis-match . . . . . . . . . . . . . .
Example of a distance measure failure . . . . .
Example of an error in the MARG database . .
Examples of publisher or type error . . . . . . .
Example for comparison to ground truth . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

7
9
12
15
17
21
23
24
26
27
28
29
32
37
38
38
39
42

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8

Example of segmentation using XY-cut . . . . . . . .
Screenshot of manual segmentation process . . . . . .
Screenshot of a block information on original image . .
Screenshot of a converted layout . . . . . . . . . . . .
Screenshot of a converted layout with HTML borders
Screenshot of the menubar . . . . . . . . . . . . . . . .
Example for block data to HTML table conversion . .
Screenshot of a converted PDF file . . . . . . . . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

46
46
47
48
49
50
50
52

A.1 Example of a stepping stone path . . . . . . . . . . . . . . . . . .

59

63

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

List of Tables
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11
2.12
2.13
2.14

Coordinates of blocks . . . . . . . . . . . . . . . . . .
Error-weights for Liang’s et al. method . . . . . . . .
Symbolic cost matrix . . . . . . . . . . . . . . . . . . .
Example for an assignment problem . . . . . . . . . .
Example for the minimum weight edge cover problem
Transportation tableau . . . . . . . . . . . . . . . . . .
Optimal solution . . . . . . . . . . . . . . . . . . . . .
Comparison of different matching methods . . . . . . .
Comparison of the runtimes for the matching methods
Comparison of different block distances . . . . . . . .
Error rates for FIRE . . . . . . . . . . . . . . . . . . .
Error type distribution . . . . . . . . . . . . . . . . . .
Results for k-nearest neighbour . . . . . . . . . . . . .
Error rates for Layout analysis by example . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

6
11
18
18
19
19
20
31
33
34
35
37
40
42

A.1
A.2
A.3
A.4
A.5
A.6
A.7
A.8

Example for the Hungarian algorithm . . . . . . . . .
Example for the minimum weight edge cover problem
Result for minimum cost method . . . . . . . . . . . .
Example of reduced cost matrix . . . . . . . . . . . . .
Example for the reduced cost matrix . . . . . . . . . .
Example for the transportation problem . . . . . . . .
Example for the minimum cost method . . . . . . . .
Example for the transportation simplex method . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

55
56
57
58
59
60
61
62

64

Bibliography
[1] N. F. Angel.
Transportation and Assignment Solution Methods.
http://www.ferrum.edu/fangel/qm/module b.pdf, January 2006.
[2] H. S. Baird. Background Structure in Document Images. Bunke, H. and
Wang, P. S. P. and Baird, H. S. (Eds.), Document Image Analysis, World
Scientific, Singapore, pages 17–34, 1994.
[3] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali. Linear Programming and
Network Flows. Wiley, 1990.
[4] T. M. Breuel. Two Geometric Algorithms for Layout Analysis. In DAS
’02: Proceedings of the 5th International Workshop on Document Analysis
Systems V, pages 188–199, London, UK, 2002. Springer-Verlag.
[5] T. M. Breuel. High Performance Document Layout Analysis. In Symposium
on Document Image Understanding Technology, Greenbelt, Maryland, 2003.
[6] T.M. Breuel. Layout Analysis by Exploring the Space of Segmentation
Parameters. In Proceedings of the International Association for Pattern
Recognition Workshop (Document Analysis Systems), 2000.
[7] R. Cattoni, T. Coianiz, S. Messelodi, and C. M. Modena. Geometric Layout Analysis Techniques for Document Image Understanding: a Review.
Technical report, IRST, Trento, Italy, 1998.
[8] T. Deselaers, D. Keysers, and H. Ney. Features for Image Retrieval –
A Quantitative Comparison. In DAGM 2004, Pattern Recognition, 26th
DAGM Symposium, LNCS, Tbingen, Germany, September 2004.
[9] D. Doermann. The Indexing and Retrieval of Document Images: A Survey. Technical Report LAMP-TR-013,CFAR-TR-878,CS-TR-3876, University of Maryland, College Park, February 1998.
[10] V. Eglin and S. Bres. Document Page Similarity Based on Layout Visual
Saliency: Application to Query by Example and Document Classification.
In Seventh International Conference on Document Analysis and Recognition (ICDAR’03), volume 2, pages 1208–1212, August 2003.
[11] G. Ford and G. R. Thoma. Ground Truth Data for Document Image
Analysis. In Proceedings of the 2003 Symposium on Document Image Understanding and Technology, pages 199–205, April 2003.

65

[12] J. Hu, R. Kashi, and G. T. Wilfong. Document Classification Using Layout
Analysis. In DEXA Workshop, pages 556–560, September 1999.
[13] J. Hu, R. Kashi, and G. T. Wilfong. Document Image Layout Comparison
and Classification. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’99), pages 285–288, September
1999.
[14] J. Keijsper and R. Pendavingh. An Efficient Algorithm for MinimumWeight Bibranching. Journal of Combinatorial Theory. Series B. Graph
Theory and Matroid Theory, 2:130–145, 1998.
[15] D. Keysers, T. Deselaers, and H. Ney. Pixel-to-Pixel Matching for Image
Recognition using Hungarian Graph Matching. In DAGM 2004, Pattern
Recognition, 26th DAGM Symposium, volume 3175 of Lecture Notes in
Computer Science, pages 154–162, Tübingen, Germany, August 2004.
[16] K. Kise, A. Sato, and M. Iwata. Segmentation of Page Images Using the
Area Voronoi Diagram. Comput. Vis. Image Underst., 70(3):370–382, 1998.
[17] D. E. Knuth. The Stanford GraphBase: A Platform for Combinatorial
Computing. Addison-Wesley, Reading, MA, 1994.
[18] J.L. Liang, I.T. Phillips, and R.M. Haralick. Performance Evaluation of
Document Structure Extraction Algorithms. Computer Vision and Image
Understanding, 84(1):144–159, October 2001.
[19] S. Mao and T. Kanungo. Empirical Performance Evaluation Methodology
and Its Application to Page Segmentation Algorithms. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(3):242–256, 2001.
[20] M. Mathirajan and B. Meenakshi. Experimental Analysis of some Variants
of Vogel’s Approximation Method. Asia-Pacific Journal of Operational
Research, 21(4), 2004.
[21] G. Nagy, S. Seth, and M. Viswanathan. A Prototype Document Image
Analysis System for Technical Journals. Computer, 7(25):10–22, 1992.
[22] H. Q. Ngo.
Lecture 1:
Matchings on Bipartite
http://www.cse.buffalo.edu/˜hungngo/teaching.html, 2004.

Graphs.

[23] L. O’Gorman. The Document Spectrum for Page Layout Analysis. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 15(11):1162–
1173, 1993.
[24] B. Pilgrim. CSC 445 Computer Algorithms.
bob.pilgrim/445/munkres.html, 2005.

http://216.249.163.93/

[25] Y. Rubner, L. Guibas, and C. Tomassi. The Earth Mover’s Distance, MultiDimensional Scaling, and Color-Based Image Retrieval. Proceedings of the
DARPA Image Understanding Workshop, pages 661–668, May 1997.
[26] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover’s Distance as
a Metric for Image Retrieval. International Journal of Computer Vision,
40(2):99–121, 2000.
66

[27] F. Shafait, D. Keysers, and T. M. Breuel. Performance Comparison of Six
Algorithms for Page Segmentation. In 7th IAPR Workshop on Document
Analysis Systems (DAS), volume 3872 of LNCS, pages 368–379, Nelson,
New Zealand, Feb 2006. Springer.
[28] C. Shin, D. Doermann, and A. Rosenfeld. Classification of Document Pages
Using Structure-Based Features. International Journal on Document Analysis and Recognition, 3(4):232–247, 2001.
[29] H. Tamura, S. Mori, and T. Yamawaki. Textual Features Corresponding
to Visual Perception. In IEEE Transactions on Systems, Man and Cybernetics, pages 406–472, June 1978.
[30] K. Y. Wong, R. G. Casey, and F. M. Wahl. Document Analysis System.
IBM Journal of Research and Development, 26(6):647–656, November 1982.
[31] B. A. Yanikoglu and L. Vincent. Pink Panther: A Complete Environment
for Ground-Truthing and Benchmarking Document Page Segmentation.
Pattern Recognition, 31(9):1191–1204, September 1998.

67