Document Layout Analysis as Solving Multiple Constraints Problem
Fu Chang
Document Analysis and Recognition Laboratory
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
fchang@iis.sinica.edu.tw
hierarchical order (Figure 1). Individual characters are
organized into textlines, which in turn are organized into
paragraphs etc. In the past, many methods look for entering points in the hierarchy. For example, the recursive XY cuts method [1] and maximal white-rectangles method
[2] seek to enter at the top of the hierarchy, each using
different clues to decompose large objects into smaller
ones. The run-length smearing method [3] seeks to enter
at the bottom of the hierarchy, by gluing foreground pixels into connected pieces. The textline construction
method [4] may be said to enter the hierarchy in the middle, by constructing textlines out of character components.
All the above methods work reasonably well in certain
contexts. They would nevertheless face problems when
their contextual assumptions are violated or their parameter values are incorrectly estimated. For example,
the X-Y recursive method does not work at the environment where no simple cutting lines exist. The whiterectangle method and smearing method rely upon certain
threshold values (the size of white rectangles for the former, and the distance of white gaps for the latter). The
textline method cannot be applied without limit when
both horizontal and vertical reading orders are allowed,
as in Chinese documents.

Abstract
The aim of document layout analysis is to uncover the
hierarchical layout structure by means of a series of split
and merge operations. The validity of such operations
depends on certain parameters that need to be properly
estimated. Moreover, the values of those parameters cannot be calculated independently, but are parts of the
problems to be solved. The layout analysis is thus considered as multiple constraints problem whose unknown
variables are the elements of the hierarchical structure,
and the constraint conditions specify the relations between those elements. To solve this problem, we start
with a solution that satisfies some but not all the constraints. We then gradually improve the solution by estimating the values of parameters, also appearing in the
constraints, based on the temporary solutions obtained in
previous stages. This solution method proves to be extremely useful for analyzing stylized but complicated
documents, such as Chinese documents that are known to
allow both horizontal and vertical reading orders as well
complicated compositional structure even on the character level.

1. Introduction

2. Problem Setting

The textual contents of documents are organized in

In our view, all methods for layout analysis employ

Figure 1. The layout structure of a Chinese article is displayed. Headlines (striped boxes) are horizontal. The rest of
textlines (gray boxes) are vertical. Dashed lines form the borders of the article.

1

I

A

II

p
q

B

C

Figure 2. Two paragraphs, I and II, are enclosed in dashed lines. Paragraph I contains horizontal textline A. Paragraph II contains B, C and other vertical textlines. Each labeled textline is identified by the same fill pattern. The spacing
within B ( = distance between p and q) is less than that between B and C. The spacing within paragraph I ( = distance
between B and C) is less than that between I and II.
horizontal (vertical) textlines are properly aligned along
their top (right) and bottom (left) edges; textlines contained in horizontal (vertical) paragraphs are properly
aligned along their top (right) and bottom (left) edges.

some split and merge operations. Run-length smearing,
for example, is a merge operation while X-Y cut is a split
operation. To get a reasonable outcome, these operations
rely upon certain parameter values to function. Those
values are part of the problem that needs to be solved.
Document layout analysis is a multiple constraint
problem, where the objects (paragraphs, textlines, and
characters, etc.) to be constructed or identified are the
unknown variables built into the constraints. There are
actually three sets of constraints that are met by most
Chinese documents.

C. Spacing Constraints: the spacing within a textline
is less than the spacing between textlines; moreover, the
spacing within a paragraph is less than the spacing between paragraphs (Figure 2).
Each set of constraints involves more than one unknown variable, and each unknown variable occurs in
more than one set of constraints. The third set of constraints, moreover, incorporates some parameters (inner
spacing and outer spacing) whose values depend on the
attributes of other variables (textlines and paragraphs).
The methods [1-4] attempt to solve this problem in a
straightforward manner. They seek a solution for some
unknown variable, which in turn can be used to solve for

A. Composition Constraints: paragraphs are composed of contiguous textlines of similar sizes, which in
turn are composed of contiguous characters of similar
sizes; characters are composed of contiguous components,
i.e., clusters of connected foreground pixels.
B. Alignment Constraints: characters contained in

(a)

(b)

Figure 3. (a) Two textlines (enclosed by dashed lines) grow from the same (gray) component box. The horizontal line
has wider inner spacing than the vertical line. (b) The vertical textline (enclosed by dashed lines) extends beyond a white
rectangle (dotted box) with relatively wide spacing.

2

(a)

(b)

Figure 4. (a) Broken pieces falling within the same paragraph can be connected to make complete pieces. (b) Textlines belonging to different columns can be separated by means of recursive X-Y cuts.
find reliable elements to start with for the textline construction.
The textlines constructed thus far have two drawbacks.
First, starting from the same initial point, it is possible to
construct two textlines, one going vertically and one
horizontally (Figure 3a), but only one of them can be
legitimate. Secondly, some textlines may be overextended
to the area of other textlines. This is most likely to occur
when the two textlines have opposite reading order (Figure 3b).

another unknown variable, and so on. Unfortunately, for
most hard constraint problems, straightforward solutions
rarely exist. As an alternative, we propose the following
method. We start with an approximate solution, that is, a
solution satisfying some but not all the constraints. Based
on this temporary solution, we then estimate the unknown parameter values appearing in other constraints.
We can then derive a better solution (that is, a solution
that satisfies more constraints) by taking reference of the
estimated values. The solution thus obtained can still be
improved if more or better knowledge is generated.

Textline Consolidation Stage

3. Solution Method

By referring to the spacing constraint for textlines
and for paragraphs, we are able to eliminate those textlines that are illegitimate and also prune those that are
overextended (Figure 3b).

In actually solving the problem, an evolutionary path
has to be specified whose aim is to gradually improve the
solutions for unknown variables. At each stage, moreover,
some constraint conditions are used for the construction
of new solution or the refinement of current solutions.
Brief descriptions of these stages are given in the following. More details are given in [5:Section 3].

Paragraph Construction Stage
This is rather a simple operation that follows what is
suggested by the composition constraint for paragraphs.

Textline Construction Stage

Adjustment Stage

Textlines are constructed according to the requirements embodied in the composition and alignment constraint for textlines. The composing elements of textlines
are characters that are unfortunately unknown at this
stage (this is the case for Chinese documents, since a
Chinese character can be a composition of more than one
component). But using some statistical means, we can
A
B

By taking reference of the solutions obtained for the
higher levels of the layout structure, it is possible to adjust the solutions for the lower level objects. For example,
a merge operation can be applied to join those properly
aligned textlines that fall within the same paragraph
(Figure 4a). A further split operation may also be re-

C D

E

Figure 5. A horizontal textline (enclosed in dashed lines) incorporates irregular components, where A and B should
merge into one box, C and D into another box, and E should be split into two smaller boxes. The unusual size of box E is
caused by accidental touch of two nearby characters.

3

We would like to emphasize that the methodology
proposed here is for analyzing very stylized documents in
terms of layout complexity. Moreover, the constraint
conditions considered can be met by many Chinese documents, but certainly not by all of them. Some special
documents (business cards, for example) may require that
one or several of the conditions be removed or weakened,
and that a few others be added. Moreover, the validity of
those constraints has not been largely tested by nonChinese documents.
However, one nice thing about this approach is that it
can easily adjust the solution path according to changes
in the constraint specification. The reason for this easy
adjustment is that constraint conditions can be naturally
classified into categories. (We categorized them into
three sets). Thus, when one category of constraints remains unchanged, all the corresponding operations remain unchanged. The idea of modularity thus naturally
grows in this solution method. However, a problem that
remains to be solved is to classify the types of documents
satisfying different sets of constraints.

quired for separating two columns within the same article
(Figure 4b). These operations are made to comply with
the requirement imposed by the alignment constraint for
paragraphs. Furthermore, spilt and merge operations can
be applied to the components found within the same textline (Figure 5), conforming to the requirement that each
textline is composed of characters of similar size.

4. Testing the Methodology
For testing the document layout methodology, we have
prepared a set of samples collected from magazine and
newspapers. There are approximately 1400 of them.
Many samples consist of a single article, while others
have more than one article. We classify these samples
according to their layout structures into seven categories:
˙horizontal headlines with L-shaped contents
˙horizontal headlines with rectangle-shaped contents
˙vertical headlines with L-shaped contents
˙vertical head-lines and rectangle-shaped contents
˙mixture of texts and pictures
˙mixture of texts and tables
˙mixture of Chinese and English characters within
the same textlines

References
1. G. Nagy, S. C. Seth, and S. D. Stoddard, Document
analysis with an expert system, Proceedings Pattern
Recognition in Practice II, Amsterdam, 1985.
2. H. S. Baird, S. E. Jones, and S. J. Fortune, Image
segmentation by shape-directed covers, Proceedings
10th ICPR, Atlantic City, pp. 820-825, 1990.
3. F. M. Wahl, K. Y. Wong, and R. G. Casey, Block segmentation and text extraction in mixed text/image
documents, Comput. Vision Graphics Image Process,
vol. 20, pp. 375-390, 1982.
4. T. Pavlidis and J. Zhou, Page segmentation and classification, CVGIP: graphical models and image processing, vol. 54, pp. 484-496, 1992.
5. F. Chang, Retrieving Information from Document
Images: Problems and Solutions, Intern. J. Document
Analysis and Recognition, Special Issues on Document Analysis for Office Systems, to appear.
6. H. Baird, Anatomy of a versatile page reader, Proceedings IEEE, vol. 80, pp. 1059-1065, 1992.
7. A. Dengel, Initial learning of document structure,
Proceedings Intern. Confern. Document Analysis and
Recognition, Tsukuba, 1993.
8. J. Fisher, S. Hinds, and K. D’Amato, A rule-based
system for document image segmentation, Proceedings 10th Intern. Conf. Pattern Recognition, pp. 567572, Atlantic City, 1990.
9. H. Fujisawa and Y. Nakano, A top-down approach for
the analysis of documents, Proceedings 10th Intern.

Each time when we make ready a version of layoutanalysis program for testing, we randomly select a few
samples from each of the seven categories, plus some
“tough” samples that are always selected for the testing
purpose. In the last testing we made, 50 samples were
chosen and two problems were detected out of the test
results. Both of the problems have to do with the assumptions being taken by the layout-analysis program being
tested.
The first problem occurs at two parallel textlines, each
consisting of two characters only. The spacing between
the two textlines is actually smaller than the spacing
within each textline. This fact obviously violates our
spacing consumption for textlines. Since this kind of
problem occurs rarely in regular documents, we decide to
do nothing about it. The second problem has to do with
the listed items that contain only two characters. Since
our program assumes that a textline has to contain at
least three characters, those items were not segmented as
textlines. To solve this problem, care must be taken to
deal with listed objects. This is a subject we would rather
not go any further in this article.

5. Discussions and Unsolved Problem

4

Conf. Pattern Recognition, pp. 113-122, Atlantic City,
1990.
10. A. K. Jain, B. Yu, Document representation and its
application to page decomposition, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 20, pp.
294-308, 1998.
11. L. O’Gorman and R. Kasturi eds., Document Image
Analysis, Los Alamitos, IEEE CS Press, 1995.

5