Document Structure and Layout Analysis

Anoop M. Namboodiri (1) and Anil K. Jain (2)

(1) International Institute of Information Technology, Hyderabad, 500 019, India, anoop@iiit.ac.in
(2) Michigan State University, East Lansing, MI 48824, USA, jain@cse.msu.edu

1 Introduction

A document image is composed of a variety of physical entities or regions, such as text blocks, lines, words, figures, tables, and background. We could also assign functional or logical labels, such as sentences, titles, captions, author names, and addresses, to some of these regions. The process of document structure and layout analysis tries to decompose a given document image into its component regions and understand their functional roles and relationships. The processing is carried out in multiple steps, such as preprocessing, page decomposition, and structure understanding. We will look into each of these steps in detail in the following sections.

Document images are often generated from physical documents by digitization using scanners or digital cameras. Many documents, such as newspapers, magazines, and brochures, contain very complex layouts due to the placement of figures, titles, and captions, complex backgrounds, artistic text formatting, etc. (see Figure 1). A human reader uses a variety of additional cues, such as context, conventions, and information about the language or script, along with a complex reasoning process, to decipher the contents of a document. Automatic analysis of an arbitrary document with a complex layout is an extremely difficult task and is beyond the capabilities of state-of-the-art document structure and layout analysis systems. This is interesting, since documents, unlike natural images, are designed to be effective and clear for human interpretation.

Fig. 1. Examples of document images with complex layouts.

As mentioned before, we distinguish between the physical layout of a document and its logical structure [4]. One could also divide the document analysis process into two parts accordingly.

1.1 Physical Layout and Logical Structure

The physical layout of a document refers to the physical location and boundaries of the various regions in the document image. The process of Document Layout Analysis aims to decompose a document image into a hierarchy of homogeneous regions, such as figures, background, text blocks, text lines, words, characters, etc. The algorithms for layout analysis could be classified primarily into two groups depending on their approach. Bottom-up algorithms start with the smallest components of a document (pixels or connected components) and repeatedly group them to form larger, homogeneous regions. In contrast, top-down algorithms start with the complete document image and divide it repeatedly to form smaller and smaller regions. Each approach has its own advantages and works well in specific situations. In addition, one could also employ a hybrid approach that uses a combination of top-down and bottom-up strategies.

In addition to the physical layout, documents contain further information about their contents, such as titles, paragraphs, captions, etc. Such labels are logical or functional in nature, as opposed to the structural labels of regions assigned by layout analysis. Most documents also contain the notion of reading order, which is a sequencing of the textual contents that makes comprehension of the document easier. Languages such as Arabic, Chinese, etc. can have different reading directions as well (right-to-left, top-to-bottom).
The set of logical or functional entities in a document, along with their inter-relationships, is referred to as the Logical Structure of the document. The analysis of the logical structure of a document is usually performed on the results of the layout analysis stage. However, in many complex documents, layout analysis would require some of the logical information about the regions to perform correct segmentation.

Fig. 2. Schematic diagram of a document layout and structure analysis system.

Most document images contain noise and artifacts that are introduced during the document generation or scanning phase. In order to make the analysis algorithms more robust to this noise, the layout and structure analysis is usually preceded by a preprocessing stage. The preprocessing stage consists of tasks such as noise removal, background separation, skew detection and correction, etc. Figure 2 shows a schematic diagram of a document layout and structure analysis system. We will discuss each of the modules in detail in the remainder of this chapter.

2 Preprocessing

The preprocessing stage in document understanding primarily involves the following processes: i) removal of noise and other artifacts that are introduced during the scanning phase and during the production of the document, ii) separation of the background regions of the document from the foreground, and iii) correction of skew that is introduced in the document image during scanning/acquisition. We will look into each of these problems and present commonly used techniques in this section. Some of these preprocessing algorithms could also be implemented as part of other modules during layout and structure analysis. However, preprocessing algorithms that are more specific to other modules of document understanding, such as character recognition, are not studied here. Such methods might include binarization of a gray-level image and scaling of characters to a standard size.

2.1 Noise Removal

Noise is a common problem in most image understanding tasks. Common sources of noise include white noise introduced by interference in the sensors and amplification circuits of the digitization mechanism (scanner or camera), salt-and-pepper noise, quantization artifacts, etc. These are well-known noise sources that are compensated for by using techniques such as median filtering, dithering, low-pass filtering, etc. In this section, we will look at one of the artifacts that is specific to document images, namely halftones.

Dealing with Halftones: Halftoning is the process by which documents with smooth gray-level or color changes are printed using fewer gray levels or colors. This is achieved by placing dots of various sizes, depending on the desired tone, at different parts of the image (see Figure 3(a)). The method is also used in color documents when one would like to create continuously varying colors with few-color printing. This artifact becomes especially apparent in high-resolution scans, which are very common with modern sensors. There exist a variety of methods for halftoning, depending on the resolution and gray levels available in the printing process [2]. Many algorithms that are designed for processing gray-level images break down in such situations.

One of the most commonly used solutions to this problem is to detect halftoned areas using a model of the halftoning process. The detected regions are either processed separately or converted to continuous gray scale using an appropriate low-pass filter, which is often followed by a sharpening filter to reduce the blur (see Figures 3(b) and 3(c)).

Fig. 3. Halftoning: (a) an image printed using halftoning, (b) result of applying a Gaussian blur to (a), and (c) result of sharpening (b).
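To make the blur-then-sharpen idea concrete, here is a minimal sketch in Python using NumPy and SciPy. It assumes the halftoned region has already been located; the sigma values and the unsharp-mask weight are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def descreen(region, blur_sigma=2.0, sharpen_amount=0.6):
    """Convert a halftoned region to continuous tone: low-pass filter to merge
    the halftone dots, then apply a mild unsharp mask to recover some of the
    edge contrast lost to the blur."""
    region = region.astype(np.float64)
    smoothed = gaussian_filter(region, sigma=blur_sigma)         # removes the dot pattern
    fine_detail = smoothed - gaussian_filter(smoothed, sigma=1.0)
    sharpened = smoothed + sharpen_amount * fine_detail          # unsharp masking
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```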
2.2 Foreground Detection

One of the important problems in document structure understanding is that of separating the foreground from the background. The problem is relatively simple in the case of the many documents that have a white or plain background. Even in such documents, determining the exact pixels that belong to the foreground is a challenging problem due to the sampling and quantization of slanted edges and lines. One could take two different approaches to solve this problem. We could assign every pixel to either foreground or background based on a specific criterion, such as a threshold, and assume that the later processing algorithms will accommodate the errors made during foreground detection. A second approach is to assign a probability measure or a fuzzy membership to each pixel for the foreground and background classes. This uncertainty is propagated to the following stages, which utilize the context to arrive at a final decision. In most applications, the first approach yields useful results, and hence a hard classification is used due to its simplicity.

The problem of foreground detection is much more challenging in the case of documents with complex backgrounds (see Figure 1(a)). A variety of approaches have been used to decide whether a particular pixel belongs to the foreground or background. A common approach is to compute statistical or spectral properties of image patches and assign them to foreground or background using a classifier trained on labeled samples. Statistical features used for this purpose include the variance and moment features computed in a local neighborhood. Spectral features, such as the responses from a bank of Gabor filters, have been used with good success for this purpose [9]. A third approach is to train a classifier, such as a neural network, that directly accepts the image values in the neighborhood of a pixel as input for classification. Figure 4 shows a sample document with a complex background and color-reversed text, along with the separated foreground and background regions.

The problem of foreground detection could also be posed as that of detecting text, since many documents contain only text in the foreground. One of the cases that many text detection algorithms find difficult to handle is that of reverse-color text, i.e., light-colored text on a dark background. One approach to overcome this difficulty is to run an edge detection algorithm on the document image and use the properties of the edges to determine whether a pixel is part of text (foreground) or not (background). These algorithms use properties of character edges, such as the presence of parallel lines and the uniformity of color between them, for classification.

Fig. 4. (a) A document image with a complex background, and (b) the foreground and background separated from (a).
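As a minimal sketch of the patch-based classification idea, the following Python function labels each patch as foreground or background using only its local gray-level variance and a fixed threshold; the patch size and threshold are illustrative assumptions, and a real system would use richer features (for example, Gabor responses) and a trained classifier.

```python
import numpy as np

def label_foreground(gray, patch=16, var_threshold=400.0):
    """Label non-overlapping patches of a gray-level image as foreground (True)
    or background (False) based on local intensity variance: text and graphics
    produce high local variance, while plain or slowly varying backgrounds do not."""
    h, w = gray.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = gray[y:y + patch, x:x + patch].astype(np.float64)
            if block.var() > var_threshold:
                mask[y:y + patch, x:x + patch] = True
    return mask
```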
2.3 Skew Correction

Skew is introduced in a document image when a document is scanned or imaged at an angle with respect to the reference axes (see Figure 5). Skew correction plays an important role in the effectiveness of many document analysis algorithms, such as text line estimation, region boundary detection, etc. For example, algorithms based on projection profiles assume an axis-aligned scan. The primary challenge in skew correction is the estimation of the exact skew angle of a document image.

Fig. 5. Example of a document image with a skew of 5 degrees.

A variety of techniques are used for the detection of skew. Most of them assume the presence of some text component in the document and estimate the orientation of the text lines using different methods. The commonly used techniques are projection profiles and line fitting. In the projection profile based approach, we compute the histogram of text pixels along the vertical axis (the vertical projection) and try to maximize the distance between consecutive line profiles, called the line spacing, while rotating the document image. We will examine the projection profile based method in the next section. One could also estimate the skew of a document by fitting a line to the centroids of the connected components within a text line and measuring its angle with respect to the horizontal axis, which can then be corrected.

3 Representing Document Structure and Layout

A critical component in facilitating the understanding of document structure and layout is its representation in memory. An appropriate representation can encapsulate the prior knowledge about the documents being processed. In addition, the representation can also help to organize the extracted information and guide the analysis algorithm.

A document contains multiple regions that are spatially and logically related to each other. Such a model is captured well in the formalism of a graph. A graph G is defined as a set of vertices, V, and a set of edges, E. Each edge, e ∈ E, is an ordered pair of vertices, (v1, v2), where v1, v2 ∈ V. The regions in a document image correspond to the vertices or nodes of a graph, while a relationship between two regions is denoted by an edge between the corresponding nodes. For a document, we need to specify the relationship between two regions, rather than just stating that one exists. For this purpose, we associate an attribute with each edge that specifies the nature of the relationship. The vertices also have attributes that specify the properties of a region, such as its type, size, location, etc. Such graphs, where the vertices and edges have associated attributes, are called attributed graphs.

The graph representation can encapsulate the relationships between regions and their properties. However, in many documents, the attributes are not conclusive during the analysis stage. One needs to handle the ambiguity and estimate the most likely set of values for the attributes. To facilitate this processing, and to enable the graphs to model the uncertainties in the knowledge itself, one could describe the attributes in terms of probability distributions. Such graphs are referred to as attributed random graphs. In attributed random graphs, each attribute is defined by the parameters of a probability distribution (say, a Gaussian with specified mean and variance). Special cases of attributed random graphs, such as First Order Gaussian Graphs, have been successfully used in modeling and classifying document images [7].
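As a minimal sketch of such an attributed graph, the following Python fragment stores regions as vertices and labeled relations as edges; the particular attribute names (type, bbox) and the relation label used in the example are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """A vertex of the attributed graph: one document region and its attributes."""
    region_id: int
    attributes: dict   # e.g. {"type": "text_block", "bbox": (x0, y0, x1, y1)}

@dataclass
class DocumentGraph:
    """An attributed graph: regions as vertices, labeled relations as edges."""
    vertices: dict = field(default_factory=dict)   # region_id -> Region
    edges: list = field(default_factory=list)      # (src_id, dst_id, {"relation": ...})

    def add_region(self, region_id, **attributes):
        self.vertices[region_id] = Region(region_id, attributes)

    def relate(self, src_id, dst_id, relation):
        self.edges.append((src_id, dst_id, {"relation": relation}))

# Example: a caption placed below a figure.
g = DocumentGraph()
g.add_region(1, type="figure", bbox=(100, 100, 400, 300))
g.add_region(2, type="caption", bbox=(100, 310, 400, 340))
g.relate(2, 1, relation="below")
```

In an attributed random graph, each attribute value would be replaced by the parameters of a distribution (for example, a mean and a variance) rather than a fixed value.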
The graph representation, however, fails to capture the hierarchical nature of document structure and layout. To accomplish this, one could use a rooted tree for representing the document layout and logical structure [1]. The root of the tree represents the complete document, while each node of the tree represents a specific region within the document. The children of a node form the subregions that compose the parent region. For example, consider a business card recognition application. A child node of the root could represent the address field, and its children in turn could represent fields such as street, city, zip code, etc. A node representing the name field could be another child of the root, and hence a sibling of the address node. Such a representation incorporates the prior information about the logical structure of a business card. Similar to attributed graphs, a tree-based representation, where each node and edge is associated with a set of properties or attributes, is known as an attributed tree. The trees could also be represented implicitly in terms of a set of rules [11]. Probabilistic versions of tree representations are quite useful for handling the uncertainties in the structure and layout of real-life documents.

One of the most powerful ways to express hierarchical structures is to use formal grammars. The classes of regular and context-free grammars are extremely useful in describing the structure of most documents. A document is represented as a sequence of terminal symbols, T, which correspond to the smallest units that we would like to represent in a document (pixels, characters, regions, etc.). A formal grammar is defined as a set of rules that transform a starting symbol, corresponding to the complete document, into a set of terminal symbols, each representing a specific primitive. The primitives are the smallest units that one needs to address during document structure and layout analysis. These could be units such as words, characters, pixels, or connected components. Note that, unlike the generic rule-based representation of trees mentioned before, formal grammars restrict the types of rules that can be used, which makes their analysis and estimation more tractable. We also define a set of intermediate symbols, called non-terminals, which correspond to the different regions in a document layout. The non-terminals could correspond to physical regions, such as text regions, columns, paragraphs, lines, etc., or to structural units, such as headings, captions, footnotes, etc. The grammar transforms the start symbol into a string of non-terminals and terminals. Each non-terminal is in turn replaced by other non-terminals or terminal symbols, according to the rules defined by the grammar. This is repeated until the string consists only of terminal symbols.
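As a minimal sketch, the rules of such a stochastic grammar can be stored as a mapping from each non-terminal to its alternative right-hand sides and their probabilities; the dictionary below mirrors the grammar shown in Figure 6 below, and the data structure itself (parsing is not shown) is an illustrative choice.

```python
# Each non-terminal maps to a list of (right-hand side, probability) alternatives.
# Upper-case names are non-terminals, lower-case names are terminals, as in Figure 6.
grammar = {
    "START":       [(["TITLE", "COLUMN", "COLUMN"], 0.5),
                    (["TITLE", "COLUMN"], 0.5)],
    "TITLE":       [(["text_line"], 1.0)],
    "COLUMN":      [(["TEXT_BLOCKS"], 1.0)],
    "TEXT_BLOCKS": [(["TEXT_BLOCK", "space", "TEXT_BLOCKS"], 0.8),
                    (["TEXT_BLOCK"], 0.2)],
    "TEXT_BLOCK":  [(["TEXT_LINES"], 1.0)],
    "TEXT_LINES":  [(["text_line", "newline", "TEXT_LINES"], 0.9),
                    (["text_line"], 0.1)],
}

# The probability of a derivation is the product of the probabilities of the
# rules it uses; the most probable derivation is the preferred interpretation.
```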
(a)
[0.5] <START> → <TITLE> <COLUMN> <COLUMN>
[0.5] <START> → <TITLE> <COLUMN>
[1.0] <TITLE> → <text line>
[1.0] <COLUMN> → <TEXT BLOCKS>
[0.8] <TEXT BLOCKS> → <TEXT BLOCK> <space> <TEXT BLOCKS>
[0.2] <TEXT BLOCKS> → <TEXT BLOCK>
[1.0] <TEXT BLOCK> → <TEXT LINES>
[0.9] <TEXT LINES> → <text line> <newline> <TEXT LINES>
[0.1] <TEXT LINES> → <text line>

(b)
<START>
→ <TITLE> <COLUMN>
→ <text line> <COLUMN>
→ <text line> <TEXT BLOCKS>
→ <text line> <TEXT BLOCK> <space> <TEXT BLOCKS>
→ <text line> <text line> <newline> <TEXT BLOCK> <space> <TEXT BLOCKS>
→ <text line> <text line> <newline> <text line> <space> <TEXT BLOCKS>
→ <text line> <text line> <newline> <text line> <space> <TEXT BLOCK>
→ <text line> <text line> <newline> <text line> <space> <text line>

Fig. 6. Representing a document in terms of formal grammars: (a) example of a stochastic context-free grammar that derives a text document with a title (upper-case symbols are non-terminals, lower-case symbols are terminals); (b) a sample document with a title and three lines, derived using the grammar in (a).

A specific interpretation of the layout or structure of a document corresponds to a specific sequence of grammar rules (productions) that generates the terminal symbols of the document. Such a sequence of productions is referred to as a derivation. However, there could be multiple derivations corresponding to a particular sequence of terminals. This would mean multiple interpretations of the structure or layout. The process of document analysis is hence to determine the correct derivation for a particular document. See Figure 6 for an example of a grammar and a sequence of derivation steps that describes a specific document. One could associate a probability measure with each of the grammar rules, representing the probability that the rule forms a part of the interpretation of a document. Such a grammar, where a probability is associated with each production, is referred to as a stochastic grammar. Stochastic grammars are especially useful for integrating multiple sources of evidence and estimating the most probable parse, or interpretation, of a given document. One could also attach attributes to each of the nodes in the grammar, such as the properties of a region. Such grammars are called attributed grammars and are commonly used for layout and structure analysis.

4 Document Layout Analysis

The primary purpose of a document layout analysis module is to identify the various physical regions in a document and their characteristics. A maximal region in a document image is a maximal homogeneous area of the image. The property of homogeneity in the case of layout refers to the type of region, such as text block, image, graphic, text line, word, etc. These regions are not mutually exclusive, and one region could contain other region types within it. Before proceeding further, we define a segmentation of an image as follows:

Definition 1. A segmentation of an image is a set of mutually exclusive and collectively exhaustive subregions of the image. Given an image, I, a segmentation is defined as S = {R1, R2, ..., Rn}, such that R1 ∪ R2 ∪ · · · ∪ Rn = I, and Ri ∩ Rj = ∅ for all i ≠ j.

Based on the above observations, one could define layout analysis as follows:
Definition 2. The process of document layout analysis decomposes a document image into a hierarchy of maximally homogeneous regions, where each region is repeatedly segmented into maximal sub-regions of a specific type. For a document image, D, the layout is a sequence of segmentations, S1, S2, ..., Sn, where S1 is a segmentation of the complete document image, and Si is a segmentation of one of the regions in Sj, 1 ≤ j < i.

Note that the regions of each segmentation are also defined to be homogeneous, in addition to being maximal. In fact, there exists a unique segmentation of a document image, given a specific definition of homogeneity of the regions. However, in practice, one could get multiple interpretations due to an ill-defined homogeneity criterion. For example, two text blocks that are close to each other may be considered a single block if the homogeneity measure used is not able to differentiate between them based on the spacing and alignment of the two blocks.

4.1 Approaches to Layout Analysis

Document layout analysis algorithms are primarily divided, based on their order of processing, into top-down approaches and bottom-up approaches. Top-down approaches start with the complete document image and repeatedly split it into smaller homogeneous regions. The splitting is stopped when all the resulting regions correspond to the primitives in terms of which the document is to be described. The primitives could be pixels, connected components, words, etc., depending on the application. Conversely, bottom-up approaches start with the primitives and repeatedly group them into larger regions such as words, lines, text blocks, columns, etc. There are also hybrid algorithms that combine the two approaches. The primary challenges in any of these approaches are:

– Region Segmentation: To split a document image into component regions, we need to decide where the homogeneity criterion is not satisfied. The homogeneity is often defined based on a parameter vector, θ, whose value plays a critical role in the success of the algorithm. This is also true for bottom-up approaches, when we try to group smaller components into regions. Computing a parameter vector, θ, that works well for all possible documents is often not possible. Current approaches try to tune the parameters based on feedback from later processing stages such as character recognition. The problem is also present at lower levels of segmentation, such as line and word segmentation.

– Region Classification: The second challenge is to decide the type of a particular region, once it is segmented. Typical problems are text versus image classification and the identification of graphics and tables. Region classification is a prerequisite for further processing of many region types.

A variety of algorithms have been proposed to perform layout analysis of documents. We will now look at a few popular algorithms that illustrate the three approaches mentioned above.

Examples of Algorithms: Typical top-down approaches proceed by dividing a document image into smaller regions using the horizontal and vertical projection profiles. The X-Y cut algorithm [13] starts dividing a document image into sections based on valleys in the projection profiles (see Figure 7). The algorithm repeatedly partitions the document by alternately projecting the regions of the current segmentation on the horizontal and vertical axes. The splitting is stopped when a particular criterion that decides the atomicity of a region is met.
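A minimal sketch of the recursive splitting idea is given below in Python: each call projects the current region onto one axis, cuts at sufficiently wide background valleys, and recurses on the other axis. The gap and size thresholds are illustrative assumptions, and production implementations use more careful valley detection and stopping criteria.

```python
import numpy as np

def xy_cut(page, min_gap=15, min_size=25):
    """Recursive X-Y cut on a binary page image (foreground pixels = 1).
    Returns leaf regions as (top, left, bottom, right) tuples."""
    leaves = []

    def valleys(profile):
        """Centres of zero (background) runs in the profile that are at least min_gap long."""
        cuts, start = [], None
        for i, empty in enumerate(profile == 0):
            if empty and start is None:
                start = i
            elif not empty and start is not None:
                if i - start >= min_gap:
                    cuts.append((start + i) // 2)
                start = None
        return cuts

    def split(top, left, bottom, right, axis, other_failed):
        if bottom - top < min_size or right - left < min_size:
            leaves.append((top, left, bottom, right))
            return
        block = page[top:bottom, left:right]
        profile = block.sum(axis=1 - axis)   # axis 0: one bin per row, axis 1: one bin per column
        cuts = valleys(profile)
        if not cuts:
            if other_failed:                 # neither direction can be split further
                leaves.append((top, left, bottom, right))
            else:                            # try the other direction on the same region
                split(top, left, bottom, right, 1 - axis, True)
            return
        bounds = [0] + cuts + [len(profile)]
        for s, e in zip(bounds[:-1], bounds[1:]):
            if axis == 0:
                split(top + s, left, top + e, right, 1, False)
            else:
                split(top, left + s, bottom, left + e, 0, False)

    split(0, 0, page.shape[0], page.shape[1], 0, False)
    return leaves
```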
Fig. 7. Horizontal and vertical projection profiles of a document image.

Projection profile based techniques are extremely sensitive to the skew of the document. A small amount of skew could completely change the nature of the projection, resulting in no clear valleys. Hence, extreme care has to be taken while scanning documents, or a reliable skew correction algorithm has to be applied before the segmentation process.

Efficient and accurate methods have been devised for documents with a white background and a Manhattan (rectangular) layout of the regions, based on geometric approaches that cover the background. Examples include the shape-directed cover algorithm by Baird et al. [17] and the white-stream based segmentation by Pavlidis and Zhou [20]. These are top-down approaches that divide a document image into smaller regions based on the structure of the background. The primary idea is to detect a set of maximal rectangles that do not contain any foreground pixels. These rectangles divide the document image into regions that may contain text, images, or graphics. A region classification algorithm could be applied to them before further processing. The algorithms mentioned above are quite complex to implement due to the representation of intermediate results and the many special cases that need to be handled. The whitespace cover algorithm by Breuel [19] is an efficient variation of the same idea using geometric principles. The algorithm is simpler due to the lack of special cases and performs very well for this class of documents.

Bottom-up approaches need to define primitive components to start the grouping process. The Docstrum algorithm by O'Gorman [15] proceeds by identifying a set of connected components in the document as the primitives. Large noise or non-text components are then removed from this set. The k nearest neighbors of each component are identified, and text lines are formed based on a threshold on the angle between component centroids. Histograms of component spacings are used to identify the inter-character distance within a word and the distance between words. Transitive closures of neighboring components using the within-line distance and between-line distance are computed to form text lines and text blocks. The algorithm has a set of threshold parameters, which are set by experiments on a set of documents. The algorithm performs well for English-language documents with a wide variety of layouts. Figure 8 shows two examples of document images segmented using Docstrum. As we can see, the algorithm handles fonts of various sizes and styles well, even with relatively complex layouts.

Fig. 8. Typical examples of document images segmented using the Docstrum algorithm. Detected lines of text are shown in boxes.

The Voronoi diagram based algorithm by Kise et al. [14] is another effective method that works by grouping the connected components in a page. We start by computing a Voronoi tessellation of the document image. The neighbor graph of connected components is computed from the Voronoi diagram. The algorithm then uses a threshold based on the area of and distance between adjacent components in the Voronoi diagram to decide region boundaries. Text lines are identified based on a threshold on the inter-component distance, computed from its distribution in the neighbor graph. The results are highly accurate on documents with a white background, where the connected components can be identified reliably.
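The grouping step that these bottom-up methods share can be sketched as follows: connected components are merged into the same text line whenever they are close together and roughly horizontally aligned. The union-find formulation, the fixed distance and angle thresholds, and the assumption of horizontal text are all illustrative simplifications; Docstrum, for instance, estimates such thresholds and the dominant orientation from nearest-neighbor and spacing histograms.

```python
import numpy as np

def group_into_lines(centroids, max_dist=40.0, max_angle_deg=15.0):
    """Group connected-component centroids (x, y) into text lines with union-find:
    two components join the same line if they are close together and the segment
    joining their centroids is nearly horizontal."""
    n = len(centroids)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    pts = np.asarray(centroids, dtype=np.float64)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = pts[j] - pts[i]
            dist = np.hypot(dx, dy)
            angle = abs(np.degrees(np.arctan2(dy, dx)))
            angle = min(angle, 180.0 - angle)        # fold to [0, 90] degrees
            if dist <= max_dist and angle <= max_angle_deg:
                union(i, j)

    lines = {}
    for i in range(n):
        lines.setdefault(find(i), []).append(i)
    return list(lines.values())                      # lists of component indices per line
```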
Figure 9 shows two examples of document images segmented using the Voronoi-based algorithm. As with Docstrum, the Voronoi-based algorithm is able to handle very complex layouts and performs well on most pages in English.

Fig. 9. Typical examples of document images segmented using the Voronoi algorithm. Detected lines of text are shown with the boundaries of the Voronoi cells.

The run-length smearing algorithm of Wahl et al. [21] is another example of a bottom-up approach. Kopec and Chou [16] describe an algorithm for segmenting a column of text that is modeled using a generative Markov model. However, this model makes the assumption that the connected components are recognizable and belong to a specific vocabulary.

Pavlidis and Zhou [23] proposed a hybrid algorithm using a split-and-merge strategy. Here, a top-down approach is used to arrive at an over-segmented document image. Regions that are similar and nearby are then merged to form homogeneous regions.

5 Understanding Document Structure

Structure understanding deals with logical descriptions of regions rather than their physical attributes.

Definition 3. The logical structure of a document image is a mapping from the physical regions in the document to their logical labels.

Document structure analysis is the process of assigning logical labels to the physical regions identified during layout analysis. The logical labels include title, abstract, sub-title, paragraph, sentence, word, header, footer, caption, page number, etc. Even though logical structure analysis is defined to follow layout analysis, in practice the two processes could be combined into a single document understanding framework.

One of the popular approaches to defining the logical structure of a document is to treat the set of regions in a document as a string of symbols. A grammar that describes the logical structure of an arbitrary document under consideration is defined over these symbols. The process of structure analysis then computes the most likely parse (sequence of grammar rules) that generates the observed string of symbols. A problem arises when there are multiple parses corresponding to a given document and we have to decide which interpretation to choose. The grammar rules are augmented with attributes of the regions and their relationships to accommodate this situation. The most likely parse could be defined using a cost function that takes the attribute values into account. One could also use a stochastic grammar, which gives a likelihood value for a parse depending on the rules used and the terminal attributes. Each stochastic grammar rule has an associated probability, which, along with the region attributes, can be used to determine the most likely parse.

Related methods use other representations of the logical structure, such as the tree or graph representations discussed in Section 3. The approach is similar to what we described for grammar-based analysis. The domain knowledge is stored as a set of constraints on the structure of the tree or graph, instead of grammar rules. The problem is to identify the variation of the tree, according to the definition, that best matches the observed document [24].

Rule-based systems have also been proposed to determine the logical structure of a document. Note that these rules are different from grammar rules, which formally describe a set of permissible derivations.
Rules in a rule-based system could denote actions to be taken in a particular situation or encode domain knowledge. The DeLoS system proposed by Niyogi and Srihari [26] uses three levels of rules that encode domain knowledge, control information, and strategies. The approach works well even for complex documents such as newspaper pages. However, the rules themselves are hard-coded, and rule-based systems do not have the ability to learn from a set of sample documents.

A second approach is to apply domain knowledge about the layout and structure to the results of OCR and graphics recognition on a given document [25]. This approach assumes that the layout identification stage correctly identifies the text regions, which are then recognized by an OCR engine. The approach is very powerful, since it can make use of the contents of the text to decide its function; note that this information was not available to the previous methods. For example, the OCR-based approach can use the words in a text line near an image to decide whether or not it is a caption.

6 Performance Evaluation

There are many factors that affect the evaluation of a document segmentation algorithm. To perform an evaluation, one needs to decide on the following points: i) a set of test documents, ii) the ground-truth segmentation of the chosen documents, iii) an evaluation or performance metric that is unambiguous and meaningful, and iv) a set of statistics over the performance metric to summarize the results over the complete document set.

Algorithms can be compared against each other only when they are tested on a common dataset that has a sufficient number and variety of documents. In addition, one should also agree on a common ground truth. This is often difficult, as even human experts could assign different ground truths to a specific document. One needs to develop standardized datasets with ground truth to overcome these problems. The University of Washington datasets and the NIST document image dataset [30] are among the most commonly used ground-truthed datasets for performance evaluation. The datasets need to be updated and expanded, or new ones created, according to future requirements. Details about some of the current datasets can be found at [31].

To evaluate the performance of a segmentation algorithm, we need to compare the result of segmentation against the ground truth. However, this is not a trivial task, since the result of a page segmentation algorithm is not a single mathematical quantity or model that can be compared with the ground truth to compute the error. General image segmentation algorithms output a set of regions for an image, which could be compared against the ground truth for overlap. However, document segmentation poses additional problems due to the hierarchical nature of the segmentation. The approaches to defining a performance metric can be primarily classified into two categories:

– Overlap-based performance metrics: These are similar to image segmentation evaluation measures. The metric is defined based on the overlap area of labeled regions against regions of the same label in the ground truth. We could compute the overlap at the lowest level of the hierarchy or report results at multiple levels. The performance metric could be any function of the overlap area.
For example, the accuracy, ρ(G, S), could be measured as:

ρ(G, S) = [O_I(G, S) − (M_I(G, S) + S_I(G, S))] / Area(G),    (1)

where I is the document image, G is the ground truth, and S is the result of layout and structure analysis. The function O_I measures the overlap area of regions between G and S where the labels agree, M_I and S_I measure the areas of missing and spurious regions in S compared to G, and Area(G) is the total area of the regions in the image (or G).

– Split and merge based metrics: Such metrics are based on the observation that most of the errors made by document segmentation algorithms stem from the splitting of uniform regions and the merging of two different neighboring regions. We measure the accuracy of a segmentation based on the number of split and merge operations that are required to transform the result of segmentation into the ground truth. We could also include missing and spurious regions in the metric. For example, Mao and Kanungo [27] define the accuracy of text line detection, ρ(G, S), as:

ρ(G, S) = (#L − #{C_L ∪ S_L ∪ M_L}) / #L,    (2)

where # stands for the cardinality operator, L is the set of all text lines in the document, and C_L, S_L, and M_L represent the sets of spurious, split, and merged lines, respectively.

Common statistical measures that are used for summarizing error results include the mean and variance of the accuracy over the complete test set. Cross-validation accuracy is reported for algorithms that estimate parameters or learn from a dataset.
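As a minimal sketch of a split-and-merge style text-line score in the spirit of equation (2), the function below takes ground-truth and detected lines as sets of primitive identifiers (for example, connected-component indices) and counts lines that are split, merged, or missed; the matching rule and the error categories are illustrative simplifications of the definitions used by Mao and Kanungo.

```python
def textline_accuracy(gt_lines, det_lines):
    """Fraction of ground-truth text lines that are neither split, merged, nor missed.
    gt_lines and det_lines are lists of sets of primitive IDs (e.g., connected
    components) belonging to each ground-truth / detected text line."""
    if not gt_lines:
        return 0.0
    split, merged, missed = set(), set(), set()
    for g, gt in enumerate(gt_lines):
        hits = [d for d, det in enumerate(det_lines) if gt & det]
        if not hits:
            missed.add(g)              # no detected line covers this ground-truth line
        elif len(hits) > 1:
            split.add(g)               # spread over several detected lines
    for det in det_lines:
        covered = [g for g, gt in enumerate(gt_lines) if gt & det]
        if len(covered) > 1:
            merged.update(covered)     # one detected line spans several ground-truth lines
    erroneous = split | merged | missed
    return (len(gt_lines) - len(erroneous)) / len(gt_lines)
```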
7 Handwritten Document Analysis

The problem of document understanding can be extended to the domain of handwritten documents as well: the task is to decipher the physical layout and logical structure of a handwritten page. This is much more complicated due to the lack of physical organization in most handwritten documents, and hence the problem has not been studied in a generic setting. However, handwritten document analysis has been used extensively in specific areas such as postal address recognition and the recognition of hand-filled forms. In the case of hand-filled forms, the format of the form is assumed to be known. This helps the algorithms to remove the form background from the document image. In addition, recognition is also facilitated in many form-filling applications, since the characters are often well separated and printed. One of the most impressive success stories of handwritten document understanding has been in the field of postal automation. The problem is to recognize the destination address on an envelope for automatic routing of mail. The problem of postal address recognition is discussed in detail elsewhere in this book. A third successful application area of handwritten document analysis is the recognition of mathematical equations. Most of the successful systems that carry out handwritten document analysis have been domain-specific, with clear and simple document structures. A generic handwritten document analyzer is still beyond the reach of the state of the art in the field. We now look into a special case of handwritten documents that has received attention, namely online handwritten documents.

7.1 On-line Documents

Online documents are captured during the writing process using a digitizer, such as special pens and pads, touch-sensitive screens, or vision-based pen-tip trackers. They encode the dynamic information of the writing process in addition to the static physical layout. The data in online documents is stored as a sequence of strokes, where a stroke is the trace of the pen between a pen-down and the following pen-up. The temporal information regarding the stroke order helps us reduce the two-dimensional segmentation problem to that of segmenting a linear sequence of strokes. This holds for most documents, where the writer does not move back and forth between multiple regions. Jain et al. [12] proposed a clustering-based scheme, where the individual strokes are initially classified as belonging to text or non-text regions. The strokes are then clustered, based on their spatial and temporal proximities, into various regions. Figure 10 shows an online document and the result of layout analysis from [12].

Fig. 10. Segmentation of online documents: (a) an online document, and (b) the result of segmentation [12].

Online document understanding and handwriting recognition are popular due to the large number of handheld devices that accept pen input. Such data require segmentation algorithms that do not assume strict models of the layout. Grammar-based techniques are very promising for this problem, especially those using stochastic formal grammars for modeling document regions [28, 29].

8 Summary

Document layout and structure understanding often forms the basis on which many document understanding systems are built. One can reliably segment and decipher the structure of many document images with a regular structure and a plain white background, such as technical journals, telephone books, textbooks, etc. However, a generic document layout and structure understanding system that can deal with complex backgrounds and layouts is still an elusive goal. One such application that deals with a large variety of layouts is the document understanding/search module in a digital library. Learning-based approaches have the potential to adapt to such large variations in layout and structure. A complete understanding of the structure of a document is possible only with information about the contents of the document. Hence, it is important to develop an integrated approach combining document layout and structure analysis with the various recognition modules. Robust OCR techniques that work on noisy and degraded documents will go a long way towards structure and layout understanding.

References

1. Nagy, G. and Seth, S.C.: Hierarchical Representation of Optically Scanned Documents. Proceedings of the 7th International Conference on Pattern Recognition, Montreal (1984), 347–349.
2. Ulichney, R.: Digital Halftoning. The MIT Press, Cambridge, Mass. (1987).
3. Fujisawa, H., Nakano, Y. and Kurino, K.: Segmentation Methods for Character Recognition: From Segmentation to Document Structure Analysis. Proceedings of the IEEE 80 (1992), 1079–1092.
4. Haralick, R.M.: Document Image Understanding: Geometric and Logical Layout. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Seattle, WA (1994), 385–390.
5. Jain, A.K. and Yu, B.: Document Representation and its Application to Page Decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), 294–308.
6. Nagy, G.: Twenty Years of Document Image Analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), 38–62.
7. Bagdanov, A.D. and Worring, M.: First Order Gaussian Graphs for Efficient Structure Classification. Pattern Recognition 36 (2003), 1311–1324.
8. Etemad, K., Doermann, D.S. and Chellappa, R.: Multiscale Document Page Segmentation using Soft Decision Integration. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997), 92–96.
9. Jain, A.K. and Bhattacharjee, S.: Text Segmentation using Gabor Filters for Automatic Document Processing. Machine Vision and Applications 5 (1992), 169–184.
10. Jain, A.K. and Zhong, Y.: Page Segmentation using Texture Analysis. Pattern Recognition 29 (1996), 743–770.
11. Fisher, J.L.: Logical Structure Descriptions of Segmented Document Images. Proceedings of the International Conference on Document Analysis and Recognition, Saint-Malo, France (1991), 302–310.
12. Jain, A.K., Namboodiri, A.M. and Subrahmonia, J.: Structure in Online Documents. Proceedings of the International Conference on Document Analysis and Recognition, Seattle, WA (2001), 844–848.
13. Nagy, G., Seth, S. and Viswanathan, M.: A Prototype Document Image-Analysis System for Technical Journals. Computer 25 (1992), 10–22.
14. Kise, K., Sato, A. and Iwata, M.: Segmentation of Page Images using the Area Voronoi Diagram. Computer Vision and Image Understanding 70 (1998), 370–382.
15. O'Gorman, L.: The Document Spectrum for Page Layout Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (1993), 1162–1173.
16. Kopec, G.E. and Chou, P.A.: Document Image Decoding using Markov Source Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1994), 602–617.
17. Baird, H.S., Jones, S.E. and Fortune, S.J.: Image Segmentation by Shape-Directed Covers. Proceedings of the International Conference on Pattern Recognition, Atlantic City, NJ (1990), 820–825.
18. Baird, H.S.: Background Structure in Document Images. Document Image Analysis, World Scientific, Singapore (1994), 17–34.
19. Breuel, T.M.: Two Geometric Algorithms for Layout Analysis. Proceedings of the Fifth International Workshop on Document Analysis Systems, Princeton, NJ (2002), LNCS 2423, 188–199.
20. Pavlidis, T. and Zhou, J.: Page Segmentation by White Streams. Proceedings of the International Conference on Document Analysis and Recognition, Saint-Malo, France (1991), 945–953.
21. Wahl, F., Wong, K. and Casey, R.: Block Segmentation and Text Extraction in Mixed Text/Image Documents. Graphical Models and Image Processing 20 (1982), 375–390.
22. Wu, V., Manmatha, R. and Riseman, E.M.: Finding Text in Images. ACM DL (1997), 3–12.
23. Pavlidis, T. and Zhou, J.: Page Segmentation and Classification. Graphical Models and Image Processing 54 (1992), 484–496.
24. Yamashita, A., Amano, T., Takahashi, I. and Toyokawa, K.: A Model-based Layout Understanding Method for the Document Recognition System. Proceedings of the International Conference on Document Analysis and Recognition, Saint-Malo, France (1991), 130–138.
25. Kreich, J., Luhn, A. and Maderlechner, G.: An Experimental Environment for Model-based Document Analysis. Proceedings of the International Conference on Document Analysis and Recognition, Saint-Malo, France (1991), 50–58.
26. Niyogi, D. and Srihari, S.N.: Knowledge-based Derivation of Document Logical Structure. Proceedings of the International Conference on Document Analysis and Recognition, Montreal, Canada (1995), 472–475.
27. Mao, S. and Kanungo, T.: Empirical Performance Evaluation Methodology and its Application to Page Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001), 242–256.
28. Artières, T.: Poorly Structured Handwritten Documents Segmentation using Continuous Probabilistic Feature Grammars. Workshop on Document Layout Interpretation and its Applications (DLIA2003).
29. Namboodiri, A.M. and Jain, A.K.: Robust Segmentation of Unconstrained On-line Handwritten Documents. Proceedings of the Fourth Indian Conference on Computer Vision, Graphics and Image Processing, Calcutta, India (2004), 165–170.
30. NIST: NIST Scientific and Technical Databases, http://www.nist.gov/srd/.
31. LAMP: Documents and Standards Information, http://documents.cfar.umd.edu/resources/database/