Maletic & Collard 8/5/2005 Adding Structure to Unstructured Text Jonathan I. Maletic Michael L. Collard Department of Computer Science Kent State University Kent, Ohio 44242 jmaletic@cs.kent.edu Department of Computer Science Kent State University Kent, Ohio 44242 collard@cs.kent.edu Abstract An overview of the authors’ research program in document engineering is presented. Underlying techniques are being developed for agile parsing of unstructured and semi-structured text to extract metadata. XML technologies are leveraged in novel ways to support complex querying, analysis, and transformation of large text bases. New methods for difference analysis are being developed to support document evolution and maintenance. Additionally, advanced information retrieval methods, namely latent semantic indexing, in conjunction with clustering techniques are used to extract high level features and concepts from large corpora. 1 Introduction Our research program is centered on the maintenance and reverse engineering of large legacy software systems. To address this difficult problem we must deal with large numbers of text documents. These documents include source code files in various programming languages and dialects, internal documentation normally written in English, external system documentation of various types (e.g., text, diagrams, tables), and possibility user manuals, bug reports, and version histories. These documents are all unstructured or semistructured in nature. We have developed a number of fundamental techniques for querying, analyzing, and transforming documents with the explicit goal of recovering and identifying metadata from unstructured and semi-structured text. The domain of software documents poses many significant problems due to the sheer amount of text along with its heterogeneous nature. That is, commercial software systems are measured in millions (or 10’s of millions) of lines of code. To this end we are investigating efficient and flexible methods that can be applied across different languages and in heterogeneous formats. Figure 1 presents an overview of the problem and what aspects we are addressing. The main research problems we have investigated are related to the representation of unstructured and semi-structured text in XML, parsing and translation methods to go from raw text to XML, and tools for analysis and transformation. Our general approach is to use and leverage XML technologies to support storage and extraction of metadata. However, translating unstructured text into XML requires specialized custom built parsers. We utilize new parsing methods based on flexible grammar specifications that allow us to skip over uninteresting or ill formed text. These parsing methods are very robust and extremely efficient which allow them to be applied to very large text bases in a practical manner. With regards to our research on XML representations of text we have chosen a wholly document view. This is opposed to the more prevalent application of XML for data exchange. We are continuously developing tools and techniques for querying, analyzing, and transforming both raw text and its more abstract XML representation. This is particularly important Page 1 of 5 Maletic & Collard 8/5/2005 in automated feature detection or concept location in raw text. At the raw text level we have successfully used information retrieval methods, namely latent semantic indexing, to cluster document parts and automatically identify highlevel concepts in large document bases. Figure 1. Overview of problem Finally, we have developed a visualization tool to help support the analysis of large amounts of source code documents. This tool is quite generic with regards to input data and can be applied to any type of text document and other types of data. This paper is organized to detail the work we’ve done on a number of major research problems. Section 2 discusses our work with parsing methods; section 3 is on representation issues. Section 4 deals with the analysis tools we’ve develop to work on raw text and section 5 deals with analysis and translation tools for XML representations. A short description of our information visualization work is in section 6. We feel that our techniques can be directly applied to other problem domains that have unstructured text documents and in section 7 we briefly describe some proposed research directions along these lines. should be. On a purely textual level a straightforward lexical approach such as regular expressions can be used. However, robust and efficient regular expressions can be difficult to write especially when determining the matching context. To address this one can make use of current compiler technology however existing compiler-centric parsers have difficulties in the preservation of the original (source-code) text especially with regard to white space, comments, and preprocessor directives. Parsers take a high level AST (Abstract Syntax Tree) view of a program and do not consider these items to be of importance. In addition they are not robust and cannot handle code that is either incomplete or has compilation problems. In order to analyze all aspects of a source code document we have developed a robust and efficient translator that parses unprocessed C/C++/Java source code and generates srcML, our XML representation of source code (Collard et al. 2003; Maletic et al. 2004) . Our translator preserves all of the source-code text and is able to work with code fragments, e.g., an individual statement. It is very robust and can handle the typical unstable state of source code during development. Unlike typical parsers, a top-down approach is used that allows for it to be used as part of a stream or pipeline. For srcML this means that the translator can be used a source for SAX processing. The robustness of our translator is due to a selective parsing approach based on the concept of Island Grammars (Moonen 2001). In this approach “islands”, i.e., patterns of tokens of specific interest can be discovered even when surrounded by “water”, i.e., tokens not of interest. The concept was implemented using the LL(k) compiler generator ANTLR. The top-down parsing of ANTLR was extended to provide stream parsing where XML tokens were inserted into the stream of text tokens as soon as a markup element is identified. This provides for low latency when used as a source for SAX processing. 3 XML Representations 2 Agile Parsing Methods Analysis begins with the extraction of lexical, structural, syntactical, and documentary information from documents (i.e., source code files). Unfortunately, this is more difficult than it From its original application to text, XML is now used to represent a data of all types. The wide variety of applications of XML has led to two major categories of application: document-oriented and data-oriented. Document-oriented XML uses Page 2 of 5 Maletic & Collard 8/5/2005 elements to organize and provide context to text. It has been mainly applied to text documents, e.g., XHTML, DocBook, etc. Mixed content is typically used – where elements contain both text and sub elements. The original document text is preserved (possibly including white space) in its original document order. A full range of XML technologies may than be used, but in particular the schema languages DTD and RelaxNG are preferred over XML Schema. XML Schema is not used due to its inability to handle mixed content and less of a need for data-typing of text elements. Data-oriented XML stores the textual information in attributes, or in pure elements (i.e., no mixed content). All types of XML technologies may be used, but the deficiencies of DTD and the need to express strong data-type relationships leads to a preference for use of XML Schema or RelaxNG. We have found a document-oriented approach particularly useful for software engineering applications involving source code. In our srcML representation (SouRce-Code Markup Language) (Collard et al. 2003; Collard and Maletic 2004; Collard et al. 2002) XML is used to augment C, C++, and Java source code with syntactic information from the parse tree to add explicit structure and metadata to program source code. Comments, preprocessing information and formatting are preserved and identified for use by other maintenance tools. The srcML representation is transparent to the original source code text, i.e., there exists a 1-1 mapping between the text of the source-code document and the equivalent srcML document. Source code can be converted from plain text to the srcML representation and back without the loss of any textual information including white space. Locations in the source code can be addressed using XML addressing languages such as XPath and XPointer. In addition to the syntactic metadata that the srcML elements provide, additional metadata can be embedded as attributes in the syntactic elements, or stored externally with links into the srcML document. In practice the format has proven itself to be very lightweight. Application to large source code projects, (e.g., Linux kernel), has shown that the srcML representation is on average only 3.5 times the size of the original plain text. This is unlike data-oriented XML representations of source code whose representations are hundreds and even thousands of times larger than the original text, and which often lose essential document information. The speed of our translator is currently at 11,000 lines of source code per second. The srcML representation has been extended to simultaneously represent multiple versions of a source-code document. This representation, srcDiff, (Collard 2004; Maletic and Collard 2004) is a source-code meta-difference format Differences between versions of the document are represented in an XML format based on srcML. The original and modified source code is integrated into a single srcML-based representation with the differences marked in additional XML elements. Both the original and modified versions of the document can be directly extracted from the srcDiff representation. Locations in the both versions and their relationship to the changes can be addressed using XML addressing languages such as XPath and XPointer. This allows direct support for a syntactic description of the differences. Queries may be performed on the difference information involving location of source-code changes, characteristics of the changes, etc. Modeling of differences at this level allows for the validation and transformation of differences. Currently, our srcDiff tool translates at approximately five times the speed of diff. For a project with 60,000 lines of source code it requires less than 14 minutes. This is in sharp contrast to all semantic and most syntactic differencing tools which take this amount of time on a small file. 4 IR and Unstructured Text We have successfully applied an advanced information retrieval method, namely latent semantic indexing, to a number of problems (Maletic and Marcus 2001; Maletic and Valluri 1999; Marcus and Maletic 2001; 2003; Marcus et al. 2004) dealing with unstructured, heterogeneous, text documents. This novel work deals with identifying abstract concepts and features from large sets of documents. Latent Semantic Indexing (LSI) (Landauer et al. 1998) is a machine-learning model that induces representations of the meaning of words by analyzing the relation between words and passages in large bodies of text. As a model, LSI’s most impressive achievements have been in human Page 3 of 5 Maletic & Collard 8/5/2005 language acquisition simulations and in modeling of high-level comprehension phenomena like metaphor understanding, causal inferences and judgments of similarity. For complete details on LSI see (Deerwester et al. 1990). LSI was originally developed in the context of information retrieval as a way of overcoming problems with polysemy and synonymy that occurred with vector space model (VSM) (Salton and McGill 1983) approaches. One of the most successful applications of SVD in information retrieval is the Google search engine (www.google.com). Most importantly however, is that LSI does not utilize a predefined grammar or vocabulary. As such it can be easily applied different types of unstructured text documents in different languages or formats. Many other information retrieval (IR) and natural language processing (NLP) techniques require a detailed grammar to extract any meaningful information. These grammars are often expensive to construct or brittle. While some of these approaches are more accurate than LSI they cannot compete with regards to its flexibility and low cost of application. 5 XML Applications The first major application of srcML was to perform queries on source code documents. A combination of the srcML format and common XML tools were used to extract information from source-code documents with results comparable to other heavyweight (full parser-based) parsing approaches according to results obtained using a standard C++ fact extraction benchmark (Collard et al. 2003) . Queries can be performed on source code in a realistic state, i.e., incomplete code, code fragments, etc. The queries are written in XPath and a command-line XPath tool was used to query the source code. The transparency of our approach makes it ideal for document transformation. Transformation of source code text can be performed at the XML level (Collard and Maletic 2004) with the use of XPath to select the parts of the document that need to be transformed. These non-intrusive transformations of source code can be performed at the XML level using XSLT, DOM, SAX, etc. This allows changes to specific elements in the source code e.g., insertion or removal of an individual statement. As mentioned previously, the srcML representation allows for the addition of attributes for embedded metadata. However, storing metadata externally is a more flexible approach. In our work (Collard 2003; Collard 2004) associations between entities and regions of source code are represented as an XML model. This model uses the XML linking language XLink for the integration of source models with the source code. Multiple source models, from highly abstract to low-level source-code associations, can be integrated with source code into a single query or transformation. 6 Visualization of Documents Information visualization techniques are being investigated and applied to visualize the complex abstract relationships and characteristics of large software systems (Maletic et al. 2002; Maletic et al. 2003; Marcus et al. 2003). Our current research prototype, sv3D, was developed specifically as a visualization front-end and is completely independent of any underlying data acquisition or analysis tool. The system accepts simple XML input and be applied to text documents. sv3D allows the user to simultaneously visualize multiple attributes of multiple documents. It supports object level manipulation (rotation, scaling, etc.), filtering, and user selectable mappings of visual features to attributes. 7 Proposed Research Our research program has given us a large amount of experience and expertise in dealing with and extracting metadata from unstructured text. We feel that many of the fundamental techniques we developed will directly apply to the specific business needs and challenges of industry. Our interest is to identify some of these problems and investigate the applicability of our techniques. Our techniques are very efficient and scalable to large corpora. The parsing techniques are novel and provide a means to add structure to unstructured documents in various domains. 8 Bio Sketches Dr. Jonathan I. Maletic is an Associate Professor of Computer Science at Kent State University. He has authored nearly 60 published Page 4 of 5 Maletic & Collard 8/5/2005 papers and is an internationally recognized researcher in software engineering and document engineering. Dr. Maletic is on the Steering Committee of the ACM Symposium on Document Engineering. He completed the Ph.D. in 1995 from Wayne State University. His web site is at www.cs.kent.edu/~jmaletic/. Dr. Michael L. Collard recently graduated (2004) from Kent State University with the Ph.D. He is currently a lecturer within the Department of Computer Science. Dr. Collard’s research interests are in document-oriented approaches to single and multiple versions of source code. He is a member of the Program Committee for the 5th ACM Symposium on Document Engineering (DocEng) 2005. See www.cs.kent.edu/~collard/. References Collard, M. L., (2003), "An Infrastructure to Support Meta-Differencing and Refactoring of Source Code", in Proceedings of 18th IEEE International Conference on Automated Software Engineering (ASE '03), October 6-10, pp. 377-380. Collard, M. L., (2004), Meta-Differencing: An Infrastructure for Source Code Difference Analysis, Kent State University, Kent, Ohio USA, Ph.D. Dissertation Thesis. Collard, M. L., Kagdi, H. H., and Maletic, J. I., (2003), "An XML-Based Lightweight C++ Fact Extractor", in Proceedings of 11th IEEE International Workshop on Program Comprehension (IWPC'03), Portland, OR, May 10-11, pp. 134-143. Collard, M. L. and Maletic, J. I., (2004), "DocumentOriented Source Code Transformation using XML", in Proceedings of 1st International Workshop on Software Evolution Transformation (SET'04), Delft, The Netherlands, Nov. 9, pp. 11-14. Collard, M. L., Maletic, J. I., and Marcus, A., (2002), "Supporting Document and Data Views of Source Code", in Proceedings of ACM Symposium on Document Engineering (DocEng’02), McLean VA, November 8-9, pp. 34-41. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., (1990), "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, pp. 391407. Landauer, T. K., Foltz, P. W., and Laham, D., (1998), "An Introduction to Latent Semantic Analysis", Discourse Processes, vol. 25, no. 2&3, pp. 259-284. Maletic, J. I. and Collard, M. L., (2004), "Supporting Source Code Difference Analysis", in Proceedings of IEEE International Conference on Software Maintenance (ICSM'04), Chicago, Illinois, September 11-17, pp. 210-219. Maletic, J. I., Collard, M. L., and Kagdi, H. H., (2004), "Leveraging XML Technologies in Developing Program Analysis Tools", in Proceedings of 4th International Workshop on Adoption-Centric Software Engineering (ACSE'04), Edinburgh, Scotland, May 25, pp. 80-85. Maletic, J. I. and Marcus, A., (2001), "Supporting Program Comprehension Using Semantic and Structural Information", in Proceedings of 23rd International Conference on Software Engineering (ICSE'01), Toronto, CA, May 12-19, pp. 103-112. Maletic, J. I., Marcus, A., and Collard, M. L., (2002), "A Task Oriented View of Software Visualization", in Proceedings of 1st IEEE Workshop of Visualizing Software for Understanding and Analysis (VISSOFT'02), Paris, France, June 26, pp. 32-40. Maletic, J. I., Marcus, A., and Feng, L., (2003), "Source Viewer 3D (sv3D) - A Framework for Software Visualization", in Proceedings of 25th IEEE/ACM International Conference on Software Engineering (ICSE'03), Portland, OR, May 3-10, pp. 812-813. Maletic, J. I. and Valluri, N., (1999), "Automatic Software Clustering via Latent Semantic Analysis", in Proceedings of 14th IEEE International Conference on Automated Software Engineering (ASE'99), Cocoa Beach Florida, Oct., pp. 251-254. Marcus, A., Feng, L., and Maletic, J. I., (2003), "3D Representations for Software Visualization", in Proceedings of 1st ACM Symposium on Software Visualization (SoftVis'03), San Diego, CA, June 1113, pp. 27-36. Marcus, A. and Maletic, J. I., (2001), "Identification of High-Level Concept Clones in Source Code", in Proceedings of Automated Software Engineering (ASE'01), San Diego, CA, Nov. 26-29, pp. 107-114. Marcus, A. and Maletic, J. I., (2003), "Recovering Documentation-to-Source-Code Traceability Links using Latent Semantic Indexing", in Proceedings of 25th IEEE/ACM International Conference on Software Engineering (ICSE'03), Portland, OR, May 3-10, pp. 125-137. Marcus, A., Sergeyev, A., Rajlich, V., and Maletic, J. I., (2004), "Concept Location using Latent Semantic Analysis", in Proceedings of 11th IEEE Working Conference on Reverse Engineering (WCRE'04), Delft, The Netherlands, Nov. 9-12, pp. 214-223. Moonen, L., (2001), "Generating Robust Parsers using Island Grammars", in Proceedings of 8th IEEE Working Conference on Reverse Engineering (WCRE'01), Suttgart, Germany, Oct. 2-5, pp. 13-24. Salton, G. and McGill, M.,(1983),Introduction to Modern Information Retrival, McGraw-Hill. Page 5 of 5