File name: Lot2_TSD_ICADC_v010

ICADC Technical Specification Document

Contents

Contents
Tables/Figures
1 Introduction
1.1 Purpose of this document
1.2 Scope of the document
1.3 Definitions, Acronyms and Abbreviations
1.4 References
2 ICADC introduction
2.1 Key architectural requirements
2.2 FEP architecture
2.3 ICADC architecture
2.4 XSL/XSLT transformation engine introduction
2.5 ICADC persistence layer and indexes
2.5.1 Lucene concepts
2.6 ICADC templating engine in detail
2.6.1 Results List templates
2.6.2 Document Detail templates
2.7 Migration path
2.7.1 Old custom tags meaning and replacement format
2.7.2 Database migration
2.7.3 Migration of the CALLERS configuration file
2.7.4 Migration of HTML page code to XHTML
2.8 Migration and testing
2.9 Future migration to ICA2

Invitation to tender N° AO – 10017-annexes

Tables/Figures

Table 1 - List of significant terms
Figure 1 - FEP architecture
Figure 2 – ICADC architecture
Figure 3 – Basic XSP processing flow
Figure 4 – Adding logic sheets on basic XSP processing flow
Figure 5 – Sample XML template
Figure 6 – Example of fields and related indexing method
Figure 7 – Offline operations on the Lucene Index
Figure 8 – Example of an entry in the Lucene Index configuration file
Figure 9 – Example of sitemap configuration file
Figure 10 – Basic example for Results List templates
Figure 11 – Basic example for Document Detail templates
Figure 12 – Migration steps
Figure 13 – Database relationships and data transfer
Figure 14 – New engine database access
Figure 15 – DBS-ICA mapping configuration
Figure 16 – Caller entry in the old configuration file
Figure 17 – Caller entry in the new configuration file
Figure 18 – ICA2 alternative architecture

1 Introduction

1.1 Purpose of this document

The purpose of this document is to outline a technical solution for handling the display of dynamic content on the current implementation of the ICA (ICADC). To cover the required functionality, this document focuses on describing the solution that replaces the phased-out FEP application that produced dynamic content for CORDIS.

The term FEP (Filtered Entry Point) describes functionality where the content of the CORDIS databases is displayed by various services of the CORDIS web site. In common usage, this term refers to browseable content and to search functionality implemented using the CGI search technique on DBS.

1.2 Scope of the document

This document explores an alternative architecture for handling the display of dynamic content for CORDIS from a technical point of view. It also presents an analysis of the impact of this architecture on CORDIS as a whole and suggests future migration steps.

1.3 Definitions, Acronyms and Abbreviations

The following table presents the most significant terms used in this document:

Term        Definition
ICA         Integrated CORDIS Architecture.
ICA2        Integrated CORDIS Architecture 2.
ICADC       Integrated CORDIS Architecture - Dynamic Content (application).
FEP         Filtered Entry Point.
CGI         Common Gateway Interface.
DBS         The Fulcrum database server on CORDIS.
CMS         Content Management System.
XML         Extensible Mark-up Language. XML is a W3C recommendation for creating special-purpose mark-up languages. It is a simplified subset of SGML, capable of describing many different kinds of data. Its primary purpose is to facilitate the sharing of structured text and information across the Internet. Languages based on XML (for example, RDF, RSS, MathML, XSIL and SVG) are themselves described in a formal way, allowing programs to modify and validate documents in these languages without prior knowledge of their form.
XSL         Extensible Stylesheet Language. XSL is a set of language technologies for defining XML document transformation and presentation.
XSLT        XSL Transformations. XSLT is an XML mark-up language used for transforming XML documents. It is the XML transformation language part of the XSL specification (the other parts being XSL-FO and XPath).
SAX         Serial Access parser for XML or Simple API for XML. SAX is a common interface implemented for many different XML parsers, just as JDBC is a common interface implemented for many different relational databases.
SAX Event   SAX is an event-driven interface in which an application supplies the parser with the callback event handlers that are invoked when certain parsing events occur. These events (called SAX Events) provide all the information an XML-compliant application needs. We can leverage the work a SAX parser does by encoding the sequence of events.
UR          User requirement.
Table 1 - List of significant terms

1.4 References

[R.1] Dynamic content management on the ICA (FEP replacement), Lot2_RPT_Dynamic content_management_on_the_ICA_v010.doc
[R.2] IDS Technical specification document, IDS_TSD1.00.doc
[R.3] CORDIS/IDS Functional Specification Document, IDS_R4_FSD1 00.doc
[R.4] Java Virtual Machine, http://www.java.com
[R.5] Apache Tomcat, http://jakarta.apache.org
[R.6] Oracle RDBMS, http://www.oracle.com
[R.7] Hibernate - Relational Persistence for Idiomatic Java, http://www.hibernate.org
[R.8] Lucene, http://jakarta.apache.org
[R.9] Cocoon, http://cocoon.apache.org
[R.10] Quartz scheduler, http://www.opensymphony.com/quartz
[R.11] JTidy - HTML to XHTML converter, http://jtidy.sourceforge.net/index.html
[R.12] ORO home page, http://jakarta.apache.org/oro/

2 ICADC introduction

The ICADC application replaces the phased-out Fulcrum-based FEP, providing a mechanism to generate dynamic content based on data stored in the ICA repository.

Based on [R.1], this document starts with a detailed review of the FEP application, then presents the ICADC application and the migration path, so that the information stored in templates and in the query definitions (CALLERS) can be reused where possible.

2.1 Key architectural requirements

Since the ICADC application is a central part of the CORDIS infrastructure, replacing the FEP application, a number of key requirements were set to facilitate its insertion into the ICA-driven architecture while taking into account the future migration to the ICA2.
The main architectural requirements for the ICADC application were:
− “it must provide at least the same functions as the FEP application”
− “it must use the ICA content repository as its unique data source”
− “it must be based on a standard transformation technology such as XSL/XSLT”
− “it must take into account the future migration to the ICA2 and then use similar components where possible in order to facilitate future insertion into that content infrastructure”
− “it must re-use the existing template definitions or provide a simple conversion mechanism”

2.2 FEP architecture

The FEP application provided a mechanism to generate dynamic content using a Perl CGI script. The figure below presents its architecture:

[Figure 1 - FEP architecture. Elements shown: Web browser (Request/Response), Apache HTTP Server, srchidadb CGI script, Templates + config file, Data access, DBS Data store (Fulcrum search engine), CPS Database, IDS files]

The components of this architecture are:
− Apache HTTP Server – the web server that hosts the CGI script
− Template files – files containing proprietary tags that are interpreted by the CGI script
− Configuration file – specific configuration files for different types of requests
− DBS Data store – the search engine based on Fulcrum
− CPS Database – the database that builds the dissemination files
− IDS files – dissemination files that are the input for indexing the Fulcrum search engine

A request to the CGI script triggers a set of actions:
− Parameter identification:
  1. identify parameters specified as part of the request
  2. identify parameters from the caller entry in the configuration file
  3. identify necessary parameters that were not specified in the previous steps from the global section of the configuration file
− Template parsing, in order to identify the full list of the fields that need to be loaded
− Data loading, from the Fulcrum search engine
− Response sent back to the browser, based on the template and the data loaded

Before any runtime call to the FEP application, the DBS is indexed with data parsed from dissemination files. A script from the CPS system produces the dissemination files. Each file has a similar structure to the corresponding DBS table.

2.3 ICADC architecture

The ICADC application provides a mechanism to generate dynamic content based on data stored in the ICA repository. The figure below presents its architecture:

[Figure 2 – ICADC architecture. Elements shown: Web browser (Request/Response), Apache Tomcat HTTP Server, Cocoon, Business Layer, Templates + config files, Hibernate Persistence Framework, JDBC, ICA Database, CPS Database, Lucene Engine, Lucene Index]

The components of the new architecture are based on standard frameworks:
− Apache Jakarta Tomcat Web server – the Web container based on Java technology.
− Apache Cocoon – a web development framework built around the concepts of separation of concerns and component-based Web development. This framework handles the logic of the application for HTTP requests, configuration files and template processing.
− Hibernate – object/relational persistence framework. It provides the “mapping” functionality.
− Lucene – this component provides a search mechanism comparable to Fulcrum, but tightly coupled with other modules in order to provide an integrated solution.
− ICA Database – an Oracle database.
− Lucene Search Index – this constitutes the “specific” data store for Lucene. This data structure is stored in a file system.
− CPS – the content production data store.
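
Since the ICADC must provide at least the same functions as the FEP application, the parameter identification order listed in section 2.2 (request, then caller entry, then global section) is worth illustrating. The following is only a minimal sketch: the class name, method name and parameter names are hypothetical, and it assumes that a value found in an earlier step is never overridden by a later one, which the text implies for step 3 but does not state explicitly for step 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: ParameterResolver and all parameter names are hypothetical.
class ParameterResolver {

    /**
     * Merges the three parameter sources. Values from the request win over
     * values from the caller entry, which win over the global section
     * (assumed precedence, see the lead-in).
     */
    static Map<String, String> resolve(Map<String, String> request,
                                       Map<String, String> callerEntry,
                                       Map<String, String> globalSection) {
        // Start with the global section: it only fills what nothing else provides.
        Map<String, String> effective = new LinkedHashMap<>(globalSection);
        // Caller entry values replace global values with the same name.
        effective.putAll(callerEntry);
        // Request values replace everything else.
        effective.putAll(request);
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> request = Map.of("LANGUAGE", "FR");
        Map<String, String> caller = Map.of("LANGUAGE", "DE", "TABLE", "PROJ");
        Map<String, String> global = Map.of("LANGUAGE", "EN", "PAGE_SIZE", "20");
        System.out.println(resolve(request, caller, global));
    }
}
```

With these sample inputs, LANGUAGE comes from the request, TABLE from the caller entry and PAGE_SIZE from the global section.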

2.4 XSL/XSLT transformation engine introduction

To achieve full extensibility, the ICADC templating engine is able to use XSLT, which improves several aspects of the quality of the solution when compared to the FEP application:
− separation of content from presentation,
− code reuse,
− scalability.

The templates are well-formed XML files that contain custom tags. These templates are processed by the Apache Cocoon framework, which drives the entire transformation cycle.

Apache Cocoon uses an XML dialect known as XSP (XML Server Pages) to drive the process. An XSP page is an XML file that contains embedded Java code. Apache Cocoon reads the XSP pages using the Server Pages Generator (the standard Cocoon Generator). The latter executes the embedded logic, combines the output with the XML content stored in the rest of the file and processes it using a SAX-based parser. The resulting SAX events are sent to the next component in the pipeline for processing.

The next step in the pipeline can be an XSLT transformation that transforms the XML output of XSP to another XML format (the framework is flexible enough to accommodate different pipeline configurations). The last step of the pipeline serializes the content in XHTML format so that it can be interpreted by the Web browser that acts as client. The diagram below shows this pipeline configuration:

[Figure 3 – Basic XSP processing flow. Elements shown: XSP, compilation (Java compiler), Generator, SAX events, XSLT transformation, serialization to XHTML]

When an XSP page is processed, it is actually transformed into a Java object. Apache Cocoon does this by creating a Java file, compiling it and executing it.

Classically, the XSP page contains Java code inside, which makes the code extremely difficult to develop and to maintain. To avoid that, XSP offers the possibility to use logic sheets.
A logic sheet is a special kind of XSL style sheet whose output is an XSP file. An XSP logic sheet is a tag library that defines a set of custom XML tags which can be used within an XSP program to insert whole blocks of code into the file. Apache Cocoon comes with several predefined taglibs that can replace well-known artefacts of Java coding logic.

[Figure 4 – Adding logic sheets on basic XSP processing flow. Elements shown: XSP, logic sheets, compilation (Java compiler), Generator, SAX events, XSLT transformation, serialization to XHTML]

This approach has been used to develop custom style sheets that provide the required functionality to support the custom tags included in the ICADC templates.

During the transformation from HTML to XHTML, a human operator may move any part of the HTML pages to an XSLT file. For example, if the header or the footer needs to be common for a set of pages, then XSLT is a good way to handle it.

In order to have a solution that does not depend on any specific transformation implementation, the ICADC keeps these concepts while using a custom implementation that is tailored to CORDIS needs. For the sake of compatibility we will keep the term XSP.

Below is an example of a template file in XML format:

[Figure 5 – Sample XML template; the template content is not reproduced in this copy]

As the example shows, in order to make Apache Cocoon “understand” the custom ICADC tags it is necessary to specify that the XML document uses the “ICADC namespace” to locate the logic sheet domain.

The full details about the interaction between the ICADC custom tags are provided in the section that describes the ICADC templating engine.
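
The common header and footer case mentioned above can be illustrated outside Cocoon with the XSLT support that ships with the JDK (JAXP, `javax.xml.transform`). This is only a sketch under stated assumptions: the `page` and `content` element names, the header and footer text and the class name are invented for illustration, and the code does not reproduce the Cocoon pipeline or the ICADC logic sheets.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch only: a style sheet that wraps page content in a shared header and footer.
class SharedLayoutSketch {

    // Hypothetical style sheet; element names and texts are assumptions.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/page'>"
      + "<html><body>"
      + "<div id='header'>Common header</div>"
      + "<xsl:copy-of select='content/node()'/>"
      + "<div id='footer'>Common footer</div>"
      + "</body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the shared layout to one page and returns the serialized result. */
    static String render(String pageXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(pageXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so that callers do not need to handle a checked exception.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("<page><content><p>Results</p></content></page>"));
    }
}
```

Because the header and footer live only in the style sheet, every page rendered through it picks up a change to them without any edit to the page itself.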

2.5 ICADC persistence layer and indexes

As shown in the previous diagrams, the persistence layer is based on:
− the ICA repository, basically used for document level views, and
− Lucene Indexes, used for result list views.

Lucene Indexes store metadata, providing a very fast search engine as well as all the “summary” fields required for result list generation.

2.5.1 Lucene concepts

The concepts of Lucene may be outlined as:
− An index contains a sequence of documents
− A document is a sequence of fields
− A field is a named sequence of terms
− A term is a text string, keyed by field name

Documents are the primary retrievable units from a Lucene query. The fields that define the document have the same names as the ICA mappings referred to in the templates. Each Lucene Index is identified by an ICA category and a language.

The properties of the fields in Lucene may be:
− Stored = non-inverted, content retrievable. The original text is available in the documents returned from a search.
− Indexed = inverted, searchable. This property makes the field searchable.
− Tokenized = broken into tokens. The text added to the field is run through an analyzer and broken into relevant pieces. This makes sense only for indexed fields.

Stored fields are handy for immediate access to the original text available from a search, such as a database primary key or filename. Stored fields can dramatically increase the index size, so they must be used wisely. Indexed field information is stored in an efficient manner, such that the same term in the same field name across multiple documents is only stored once, with pointers to the documents that contain it.

Lucene has predefined methods for the possible combinations of the attributes described above:
− Keyword -- Indexed and stored, but not tokenized. Keyword fields are useful for data like filenames, part numbers, primary keys, and other text that needs to stay intact as is.
− Text -- Indexed and tokenized. The text is also stored if added as a String, but not stored if added as a Reader. Our solution always adds the text as a String.
− UnIndexed -- Only stored. These fields are not searchable.
− UnStored -- Indexed and tokenized, but not stored. These fields are ideal for text that needs to be searchable but whose original text is either maintained elsewhere or not needed for immediate display in search results.

To understand the solution, let us take some business fields that may appear in a document and see which method needs to be applied to each:

Field             Method      Stored   Indexed   Tokenized
RCN               UnIndexed   yes      no        no
Title             Text        yes      yes       yes
Document detail   UnStored    no       yes       yes
Country           Keyword     yes      yes       no

Figure 6 – Example of fields and related indexing method

− The RCN field from any ICA category represents the primary key used to uniquely identify a document. It is not necessary to index it. It is used only to create the link from the results list page to the document level detail page.
− The Title field is necessary both for displaying details about the document in the results list and for searching.
− The Document detail field is used only for searching. This field is not displayed in the results list.
− The Country field cannot be broken apart into tokens. For example, if a country is “Czech Republic”, then the match needs to be exact on the complete name and not on the pieces “Czech” and “Republic”. This method is necessary to eliminate any partial match.

The fields explained above do not refer to a specific category. They simply illustrate a practical example for each kind of indexing method.

The Lucene Index is tightly related to the ICA. Initially the Lucene Index is created from ICA data retrieved using batch scripts. The Lucene Index is stored in the same file system where the application is installed.
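
The combinations of Stored, Indexed and Tokenized behind the four field methods can be summarised in a small model. The sketch below deliberately does not use the Lucene API: the enum, its method and the whitespace-based tokenizer are hypothetical stand-ins (a real analyzer does more than lower-casing and splitting on spaces), and the sample values are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: models the four field methods without calling Lucene.
enum FieldMethod {
    KEYWORD(true, true, false),
    TEXT(true, true, true),
    UNINDEXED(true, false, false),
    UNSTORED(false, true, true);

    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldMethod(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    /** Terms a search could match for this value; empty when the field is not indexed. */
    List<String> searchableTerms(String value) {
        List<String> terms = new ArrayList<>();
        if (!indexed) {
            return terms;                 // UnIndexed: retrievable but not searchable
        }
        if (!tokenized) {
            terms.add(value);             // Keyword: the whole value is a single term
            return terms;
        }
        // Crude stand-in for an analyzer: lower-case and split on whitespace.
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }
}

class FieldMethodDemo {
    public static void main(String[] args) {
        System.out.println(FieldMethod.KEYWORD.searchableTerms("Czech Republic"));
        System.out.println(FieldMethod.TEXT.searchableTerms("Czech Republic"));
        System.out.println(FieldMethod.UNINDEXED.searchableTerms("12345"));
    }
}
```

The Keyword case keeps “Czech Republic” as one term, which is why a Country field indexed this way only matches on the complete name.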

As the diagram below shows, the batch script that operates on the Lucene Index may be started directly from the command prompt or by the system scheduler. The indexing task is resource-consuming, so it is better to trigger it during the hours with less overall system load.

[Figure 7 – Offline operations on the Lucene Index. Elements shown: Start, Scheduler, Command prompt, Batch script, Business Layer, Hibernate (Java API), Lucene (Java API), ICA (Oracle), Lucene Index]

Each category has a separate index for each language. If the category has only one language available and the language is not specified, then it is considered to be English by default.

A configuration file is used to store the location and the name of the indexes per language, and the fields that define the index documents for each category. Below is an example:

[Figure 8 – Example of an entry in the Lucene Index configuration file; the entry content is not reproduced in this copy]

The configuration file contains a global section that defines the default location for the root of the index structure. By default, the path to an index is obtained by concatenating:

default_root_location + category_name + language_name

If a different location is required for a specific category, or for one language of a category, the location can be redefined for that specific case by adding the location inside the language element.

The batch script that creates or updates the Lucene Index expects to retrieve data from the business layer that sits on top of the Hibernate persistence layer. The business layer must provide:

1. Full list of records for a category.
   Input:
   - category name
   - a language identifier
   Output:
   - list of records, with field names exactly as described in the Lucene Index configuration file for the related category

2. List of records that need to be updated.
   Input:
   - category name
   - a language identifier
   - date of last update
   Output:
   - list of records, with field names exactly as described in the Lucene Index configuration file for the related category

When the full index is created from scratch, the batch script automatically creates the directory structure where the index is stored. In order to update the index, the batch script expects an existing index that respects the document field structure described in the configuration file. When a document in the Lucene Index needs to be updated, it is in fact deleted and re-added.

2.6 ICADC templating engine in detail

The biggest challenges for this component are to:
− provide the rich set of features previously available on the FEP application
− minimize the migration effort from existing templates
− support an open XSL/XSLT based technology
− minimize the development effort when migrating to the ICA2

The templates have three major functionalities:
1. Show the results list after a search operation
2. Show the document level detail
3. Show an error about an empty results list

The third functionality is used when either of the first two produces an empty results list.

The Apache Cocoon engine allows functionality similar to the MVC design pattern, but with a higher level of flexibility. The templating engine was developed in two stages, in order to support the template migration steps:
− The first version provided complete support of the old FEP style tags, with a limited set of extra capabilities.
− The second version extended the functionality with features provided by XSL/XSLT technologies.

For new templates, the engine can provide functionality immediately. For the old templates, a testing period and extra operator attention are required, because the migration of templates from HTML to XHTML may generate interface issues that need to be verified manually.

The base architectures for both modes of operation are exactly the same.
The only layer that exposes a different behaviour is the presentation layer.

Apache Cocoon has a rich set of tools for publishing web documents, and while XSP and Generators provide a lot of functionality, they still mix content and logic to a certain degree. The concept of Action was created to fill that gap. Although the Cocoon Sitemap provides a mechanism to select the pipeline at run time, sometimes there is a need to adjust the pipeline based on runtime parameters, or even on the contents of the Request parameters. Without the use of Actions this would make the sitemap almost incomprehensible.

In our case, an Action is the proper place for request logic processing. The Action does not produce any display data. In fact, it is the only component that can apply dynamic logic to choose the page that is used for rendering.

[Figure 9 – Example of sitemap configuration file; the sitemap content is not reproduced in this copy]

As the example above shows, the application engine can use both implementations in parallel. The migration may be done in small steps and without affecting the production version of the site.

The Action takes care of:
1. checking the request parameters
2. identifying the operation that needs to be executed
3. if the operation is a search, then it:
   a. reads the specific request parameters
   b. asks the Business Layer to query the Lucene layer
   c. gets the reference to the Hits (Lucene search operation results)
   d. identifies the template(s) that need to be processed
   e. prepares the request parameters
4. if the operation is “show a document level detail”, then it:
   a. reads the specific request parameters
   b. asks the Business Layer to retrieve the detail data for the requested object from the Hibernate layer
   c. gets the reference to the document details
   d. identifies the template(s) that need to be processed
   e. prepares the request parameters

The second step, after the Action, is the view processing. Different views may be available for this step. The selection of the view to use depends on the migration level. This level is defined by a new parameter named MIGRATION_LEVEL in the specific CALLER configuration entry. The accepted values for this parameter are:
− reader – The templates may be in HTML format, but the tags must be XML compatible. This value is the default if the parameter does not explicitly appear in the CALLERS configuration file.
− XSPOnly – The templates are expected to be XSP (and therefore XML compatible), but without an additional XSLT file for transformation.
− XSPandXSLT – This value is expected to be used for newly developed templates. It offers extensibility beyond XSPOnly and is the recommended option.

As the sitemap configuration example in Figure 9 shows, the first option is to use a Reader in order to provide intermediate testing facilities. This kind of component offers direct manipulation of templates and skips the XML features, but it is necessary in order to verify that the database migration was done successfully and that the request parameters and the base logic of the templates are handled correctly.

The second option shown in the example in Figure 9 proves that the migration of the old template was done in a fully functional way.

The logic of the templates is included in the Action handler and is the same for all the steps of the migration. In this way, the effort to support different forms of templates is reduced to a minimum. Also, the testers are able to correctly identify the source of any error, depending on how the interface is implemented.

As we know, the templates engine is capable of generating multiple views of the stored data.
The type of view that needs to be shown is specified by a set of parameters that may be identified from the request parameters or from the specific CALLER entry in the configuration file. The templates engine uses only one entry to handle both kinds of views and the related templates.

In the case of an empty results list, or of other errors caused by a wrong query for any of these pages, the engine is automatically redirected to the template that shows the empty results. This is supposed to be a template, but it does not contain any special tag. Because of that, it may be read as a simple text/html page. It does not require special migration work, but if it becomes necessary in the future, a mechanism may be added to support custom tags.

2.6.1 Results List templates

The first action that a user may perform using this engine is to make a search and obtain a results list. The starting point for such an action may be a static page that contains a form where the user can enter and organize different parameters for the search. Another way is to have a predefined link from a page with hard-coded parameters. In both situations the “CALLER” parameter must be specified.

Each set of templates is strictly tied to its CALLER entry, which has at least the following parameters:
− Table name – defines the category
− Template path – defines the path to the template

Having the category and the path to the templates, the engine has the minimum information needed to identify the templates and to render them. As explained in the previous chapter, the interpretation is made inside the Action and the data is prepared on the request as a reference. The view layer only needs to get the reference to the specific hits object and to extract data for rendering.

Below is a standard set of tags that may appear in Results List templates.
The example uses the XSP format with the specific namespace, the same format that is intended to be the standard for new templates.

[Figure 10 – Basic example for Results List templates; the template content is not reproduced in this copy]

As the previous example shows, the structure of the custom tags is quite simple.
− The TOTALDOCS tag is used to display the number of documents found.
− The RESULTS tag is used to prepare the data for the loop that starts with the BODY tag. The elements inside BODY represent the detail that needs to be shown for each record found by the search operation. For each record, the template is expected to show the position in the results list (SEQNO), the value of each field (specified in a VAL tag) and the link to the document detail (created by the DOCLINK tag).
− The other two tags, PRVGROUP and NXTGROUP, provide the link to the previous or next page of records for the current results list.

The table name and the field names specified in the template need to match exactly the names of the ICA mappings described in the Lucene Index configuration file, as described before. Only the mappings described in this configuration file are implemented in the Lucene Index and are available for searching or results list rendering.

2.6.2 Document Detail templates

This kind of template is used when a document detail view is requested. The starting point for this kind of page may be a link from the Results List page or a static link.
The following parameters are required in order to show a page:
− Category name – identified from the specific CALLER entry, specified on the request
− RCN – the unique ID of a record
− Language – if missing, English is used as the default

As for the Results List, the path and the identifier of the templates are also necessary, but these are expected to be found in the specific entry of the CALLER configuration file. Some of the tags used in Document Detail templates show the position in the Results List or the link to the next document in the Results List.

Figure 11 shows a standard set of tags that may appear in document detail templates. The example uses the XSP format with the specific namespace, the same format that is standard for the new templates. A document detail usually shows only the values of simple fields of the category entry. In some situations a list of records related to the current record can also be shown. To obtain this list, a special tag, PERGROUP, defines the relation between the base category and the sub-category that provides the set of related records.

[Example markup not preserved]
Figure 11 – Basic example for Document Detail templates

As the example in Figure 11 shows, a document detail template may show the simple properties of a document and also a list of related records from another subcategory. As in the Results List templates, the VAL tag shows the value of the given field. The position of the document in the Results List that linked to this view is provided by the DOCNO tag.
The PERGROUP tag prepares the list of related records from the subcategory. The rule is that the field from the slave (subcategory) table must match the field from the master table, where the master table is the table identified by the category of the Document Detail. This subcategory list behaves much like the RESULTS tag in the Results List templates: all the fields are cycled using the BODY tag as delimiter. The PRVDOC and NXTDOC tags create the links to the previous and next document in the Results List.

2.7 Migration path

Migration of the templates and configuration files (CALLER) from the FEP application to the ICADC needs to take the following aspects into consideration:
− Functionality must be compatible
− The migration effort must be minimized

To achieve this, as much as possible of the interface described in the template files must be re-used. A mechanism has been implemented to “understand” the old templates and migrate them to a new, flexible set of tags based on XML. The migration has to be done in steps, with an intermediate level described later. In addition, the database used for input data must provide access without modification. The CPS and ICA databases are considered to have a similar structure, so few modifications to the attribute structure of the template tags are expected.

The old FEP templates look like normal HTML pages, but contain custom tags delimited by the character sequences “~#” at the beginning and “#~” at the end.

Apache Cocoon is an XML-based publishing framework for server-side web applications. Version 2 of Cocoon introduces the concept of a pipeline to handle requests, with each component of the pipeline specializing in a particular operation. To use the full power of this framework, the old templates must be transformed from HTML to XHTML.
The old custom tags contain references to the DBS. These references must be migrated to be compatible with the ICA database, a step that requires extensive testing. With this in mind, the migration has two main steps:

Original (HTML + old custom tags) -> [RegExp – ORO] -> Intermediate (HTML + new XML custom tags) -> [JTidy] -> Final (XHTML + new XML custom tags)
Figure 12 – Migration steps

The first step involves:
− Transformation of the old “~#” and “#~” delimited tags to the new tags based on XML notation
− Update of formatter parameters
− Update of database referrers
− Verification of persistence layer implementation
− Verification of URL matching implementation
− Verification of tags handling implementation
− Verification of templates that need to be migrated

The second step involves:
− Transformation of the templates to be XHTML compatible
− Extension of the normal functionality with XSLT where possible

The tools used for the template migration process are based on standard Cocoon components. ORO is used mainly for the matchers, where complex rules can be defined to identify the correct pipeline to execute. JTidy is used as a generator, to create XHTML automatically from any HTML input (file, link, etc.). The templates have to be migrated offline in order to keep the framework’s response time on requests to a minimum.

2.7.1 Old custom tags meaning and replacement format

The first migration step consists of parsing the old templates (HTML structure plus old custom tags), extracting the old tags and replacing them with new XML-compatible tags. The tool that performs this operation will be based on the ORO libraries, which provide regular expression functionality. In general the old custom tags have few parameters, since the functionality they provide is atomic. The new XML format of the custom tags will contain the ICA namespace, in preparation for the second step.
The best way to see how these tags will be migrated is to look at some of them. In the results list templates, the most important ones are:

TOTALDOCS
Old format: ~#TOTALDOCS#~
New format:
Description: when interpreted, this tag is replaced by the total number of results found.

RESULTS
Old format: ~#RESULTS EN_CONT#~ and ~#/RESULTS EN_CONT#~
New format: and
Description: defines the start of the results list loop. The additional parameter defines the table from which the data is extracted.
Observation: the previous table name will be migrated from the old Fulcrum reference to the new mapping related to the ICA database.

BODY
Old format: ~#BODY#~ and ~#/BODY#~
New format: and
Description: marks the code that is cycled in order to show one record of the results list started with RESULTS.

SEQNO
Old format: ~#SEQNO#~
New format:
Description: the tag is replaced by the sequence number (order) of the record in the set of retrieved records.

VAL
Old format: ~#VAL 6 EN_NEWS.EN_RLTNS LNF=NWS LNC=NEWSLINK_EN_C#~
New format: table=" NEWS" field=" RLTNS"
Description: outputs the value of a field of a record from the results list. The parameters that appear in this tag are:
• format – specifies how the field value is formatted
• table – specifies the table from which the information is loaded (may be absent)
• field – specifies the field
• lnc – defines the caller used to create a link (may be absent)
• lnf – defines the field used to create a link (may be absent)
If lnc and lnf are present, the tag creates a link to another location. Technically, the table defined in the related caller and the field used to create the link form a join relation with the main table and main field, which identifies the label and the location of the link.
Observation: the table name and the field name will be migrated from the old reference to the Fulcrum engine to the new mapping related to the ICA database.
DOCLINK
Old format: ~# DOCLINK #~
New format:
Description: provides the link to the document detail referred to by the current record of the results list.

PRVGROUP
Old format: ~# PRVGROUP #~
New format:
Description: provides the link to the previous set of records of the results list, if the current set is not the first.

NXTGROUP
Old format: ~# NXTGROUP #~
New format:
Description: provides the link to the next set of records of the results list, if the current set is not the last.

The other important kind of template is the one that describes the document details. Except for the VAL tag, which has the same meaning, its tags are all new:

DOCNO
Old format: ~#DOCNO#~
New format:
Description: the tag is replaced by the sequence number (order) of the record in the set of retrieved records.

PERGROUP
Old format: ~#PERGROUP table2.field10?table1.field3#~ and ~#/PERGROUP table2.field10?table1.field3#~
New format: and
Description: defines a loop that iterates over the set of records obtained from table2 where the value of table2.field10 equals the value of table1.field3. Here table1 is the table detailed in the current template.
Observation: the table name and the field name will be migrated from the old reference to the Fulcrum engine to the new mapping related to the ICA database.

PRVDOC
Old format: ~#PRVDOC#~
New format:
Description: the tag is replaced by a hyperlink to the previous record of the set (in the result list that was used to access the current record).

NXTDOC
Old format: ~#NXTDOC#~
New format:
Description: the tag is replaced by a hyperlink to the next record of the set (in the result list that was used to access the current record).
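The conversion of the delimited tags described above can be illustrated with a small regular-expression sketch. It is written in Python for brevity, whereas the actual tool is stated to be based on the ORO libraries. The `ica:` prefix, the lower-case element names and the attribute layout are assumptions made for illustration only, since the real new-format tags are not reproduced here. Stripping a two-letter language prefix such as `EN_` is likewise only a stand-in for the configuration-driven DBS-to-ICA mapping. Tags the sketch does not know (for example PERGROUP) are left untouched.

```python
import re

# Old FEP tags are delimited by "~#" ... "#~"; a leading "/" marks a closing tag.
OLD_TAG = re.compile(r"~#\s*(/?)([A-Z]+):?\s*(.*?)\s*#~")

NS = "ica"  # hypothetical namespace prefix, not taken from the specification
EMPTY_TAGS = {"TOTALDOCS", "SEQNO", "DOCLINK", "PRVGROUP", "NXTGROUP",
              "DOCNO", "PRVDOC", "NXTDOC"}


def strip_language(name):
    """Assumed stand-in for the mapping file: EN_NEWS -> NEWS."""
    return re.sub(r"^[A-Z]{2}_", "", name)


def convert_val(args):
    """Turn 'format [table.]field [KEY=value ...]' into XML attributes."""
    fmt, ref, *extras = args.split()
    table, _, field = ref.rpartition(".")
    attrs = [("format", fmt)]
    if table:  # the table part may be absent
        attrs.append(("table", strip_language(table)))
    attrs.append(("field", strip_language(field)))
    for extra in extras:  # optional LNF= / LNC= link parameters
        key, _, value = extra.partition("=")
        attrs.append((key.lower(), value))
    rendered = " ".join(f'{key}="{value}"' for key, value in attrs)
    return f"<{NS}:val {rendered}/>"


def convert(match):
    """Map one old tag to its (hypothetical) XML replacement."""
    closing, name, args = match.groups()
    element = f"{NS}:{name.lower()}"
    if name in EMPTY_TAGS:
        return f"<{element}/>"
    if name in ("RESULTS", "BODY"):
        if closing:
            return f"</{element}>"
        if name == "RESULTS":
            return f'<{element} table="{strip_language(args)}">'
        return f"<{element}>"
    if name == "VAL":
        return convert_val(args)
    return match.group(0)  # tag not covered by this sketch: leave as found


def migrate(template):
    """Replace every known old tag in a template string."""
    return OLD_TAG.sub(convert, template)


print(migrate("~#VAL 6 EN_NEWS.EN_RLTNS LNF=NWS LNC=NEWSLINK_EN_C#~"))
```

Returning unknown tags unchanged keeps the sketch safe to run on a whole template: anything it cannot convert remains visible for manual review instead of being silently dropped.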
For both types of template, some extra tags with independent functionality may also appear:

PASSVAR
Old format: ~#PASSVAR: PGA#~
New format:
Description: the tag is replaced with the value of the parameter named by the identifier, as found on the request.

FILELINK
Old format: ~#FILELINK filename#~
New format:
Description: the tag is replaced with the content of the file specified in the identifier attribute.

NONE
Old format: ~#NONE#~
New format:
Description: the tag is migrated for historical reasons and is replaced by an empty string.

2.7.2 Database migration

Some of the tags described in the previous section take identifiers of database tables or fields as parameters. The old FEP templating engine used references to the Fulcrum DBS database. The ICADC engine uses a relative mapping rather than a direct reference to a table name or field name.

To understand the database mapping used to migrate to the new DB, it is necessary to identify the major databases or information structures in the system and the actual data transfer steps:

CPS (Oracle) -> Dissemination Files -> DBS (Fulcrum)
CPS (Oracle) -> ICA (Oracle)
Figure 13 – Database relationships and data transfer

Where:
− CPS – the initial Oracle database, used for update operations.
− DBS – the search engine database built on Fulcrum. The old template engine uses this database. Its information is accessible via a limited SQL; for example, tables cannot be joined.
− Dissemination Files – files generated from CPS as input for DBS.
− ICA – the new database, updated from the same initial CPS database. This is the database used by the ICADC application.

To update the database references from DBS to ICA, it is necessary to understand the intermediate mapping steps, starting with the mapping references of the FEP templates (see Figure 4):
− Templates to DBS.
This creates the list of mappings needed to find the ICA correspondence
− DBS – Dissemination Files
− Dissemination Files – CPS
− CPS – ICA

The database references in the old templates have a simple syntax:

TABLE_NAME.FIELD_NAME

This notation in the FEP templates refers to the old, simple Fulcrum flat tables. It could not be re-engineered to address the new ICA database directly, because of the complexity behind any entity data model. To allow database access, the mapping logic is hidden by a business layer that offers data retrieval using a syntax similar to the old DBS reference.

Apache Tomcat: New Templates Engine -> Business Layer
Business Layer -> Hibernate (Java API) -> ICA (Oracle)
Business Layer -> Lucene (Java API) -> Lucene Index
Figure 14 – New engine database access

As shown above, the new ICADC templating engine uses frameworks to access stored data. Whereas the old Fulcrum engine used the same source for both search and full data retrieval, the new engine separates the two as follows:
− Lucene for fast search functionality (Lucene creates its own index structure, which is persisted on the file system)
− Hibernate to access the ICA Oracle RDBMS structure.

The business layer hides the specific logic implemented by Lucene and Hibernate and allows data access that is functionally similar to the old Fulcrum mappings, in order to minimize the template migration effort.

The tool used to update the templates will use an XML configuration file to translate each old Fulcrum reference into a new, ICA-based one. The differences between the new mapping and the old one concern the handling of language: the language is no longer part of the category name. The new engine “knows” automatically how to retrieve data based on the general category name and the language. The root element of this file is categories.
For every category there is a category element with three attributes:
− dbs – the name of the old DBS category
− ica – the name of the ICA category
− language – the language identifier of the ICA category corresponding to the old DBS reference

Each category element contains one mapping per field, in elements named field with the attributes:
− dbs – the name of the old DBS field
− ica – the name of the ICA field

[Example markup not preserved]
Figure 15 – DBS-ICA mapping configuration

As seen above, an old EN_TABLE1 DBS table name will be migrated to a TABLE1 category, and an old DE_TABLE1 DBS table will be migrated to the same TABLE1 category in the ICADC application.

To identify the interface language, the new system looks for it in several contexts, in the following order of priority:
1. The UPL request parameter
2. The UPL parameter of the specific CALLER entry in the configuration file
3. The DBS-to-ICA mapping configuration file
4. The global section of the CALLERS configuration file
5. If UPL is not defined in any of the previous contexts, the interface language is taken to be English (EN)

2.7.3 Migration of the CALLERS configuration file

The FEP templating engine used a configuration file holding specific settings for each possible caller, plus a global section with the default values used when a caller’s own parameters are not specified. This configuration file contains a very large list of callers. For the ICADC engine, only a subset of the initial list needs to be migrated dynamically, so only the callers that need to be migrated will appear in the new configuration file. Along with this operation, the files related to each migrated caller entry will be extracted into a new structure, in order to reduce the size and complexity of FEP’s file structure.
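As an aside on the DBS-ICA mapping file and the language lookup order given in section 2.7.2, which the caller migration also relies on for its table and field references, the following Python sketch shows one way they could be read. The sample XML follows the categories / category / field structure described there. The field names EN_FIELD1 and DE_FIELD1, the function names, and the idea of passing each context as a dictionary are hypothetical and are not taken from the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping content; only the element and attribute names
# (categories, category, field, dbs, ica, language) come from the text.
MAPPING_XML = """
<categories>
  <category dbs="EN_TABLE1" ica="TABLE1" language="EN">
    <field dbs="EN_FIELD1" ica="FIELD1"/>
  </category>
  <category dbs="DE_TABLE1" ica="TABLE1" language="DE">
    <field dbs="DE_FIELD1" ica="FIELD1"/>
  </category>
</categories>
"""


def load_mapping(xml_text):
    """Index the mapping file by old DBS category name."""
    mapping = {}
    root = ET.fromstring(xml_text.strip())
    for category in root.findall("category"):
        fields = {f.get("dbs"): f.get("ica") for f in category.findall("field")}
        mapping[category.get("dbs")] = (
            category.get("ica"), category.get("language"), fields)
    return mapping


def translate(reference, mapping):
    """Translate an old TABLE_NAME.FIELD_NAME reference.

    Returns (ICA category, ICA field, language).
    """
    table, field = reference.split(".")
    ica_category, language, fields = mapping[table]
    return ica_category, fields[field], language


def resolve_language(request, caller, mapping_language, global_section):
    """Apply the priority order: request, CALLER entry, mapping, global, EN."""
    candidates = (
        request.get("UPL"),
        caller.get("UPL"),
        mapping_language,
        global_section.get("UPL"),
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return "EN"


mapping = load_mapping(MAPPING_XML)
print(translate("DE_TABLE1.DE_FIELD1", mapping))
```

Both the English and the German DBS tables resolve to the same TABLE1 category, with the language carried separately, which is the behaviour the mapping description calls for.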
The files that will be migrated are:
− Starting point search forms (if available)
− Templates for results list
− Templates for document level details
− Files referred to by the FILELINK tag

Take a sample entry from the old CALLERS configuration file:

[MSS_NEWS_FR_FR]
TABLENAME=FR_NEWS
TEMPLATEPREFIX=MSS/FR/FR/
#SEARCH_PAGE=xxxx
SEARCH_TYPE=advanced
ACTION=R
DOC=1
RECORDS_DISPLAYED=10
RL_TMPL_TERM=FR_NEWS
LANGUAGE=FRENCH
DOC_TMPL_TERM=FR_NEWS
QM_EN_PGA_A=MS-FR C
USR_SORT=EN_QVD_A CHAR DESC
………….
Figure 16 – Caller entry in the old configuration file

This entry will be migrated to an XML format whose main element is CALLER, with a name attribute holding the name of the caller from the old configuration file. The child elements take the names of the keys from the old configuration file.

NEWS MSS/FR/FR/ R 10 FR_NEWS FRENCH FR_NEWS MS-FR C QVD_A CHAR DESC ……………………
Figure 17 – Caller entry in the new configuration file

During the migration, only the parameters that are still necessary are migrated. For example, SEARCH_TYPE refers to an older search method that was replaced over time by a more advanced engine. The reference to the caller’s related table is also migrated, using the mechanism described in the previous section, as are any other fields that reference tables or fields (for example, EN_QVD_A in the USR_SORT parameter is migrated to QVD_A). The commented-out field from the old file (#SEARCH_PAGE=xxxx) does not appear in the new one. The global section of the configuration file is migrated in the same way as any caller entry, except that its main element is named GLOBAL.

2.7.4 Migration of HTML page code to XHTML

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. It can be used as a tool for cleaning up malformed and faulty HTML.
This parser checks the validity of the HTML code input by end-users and automatically tries to correct it. JTidy reads through the input file and, if it finds any mismatched or missing end tags, corrects them and outputs a well-formed XML document. JTidy does not generate a cleaned-up version when it encounters problems that it cannot be sure how to handle. The tool can therefore be used to automate the migration, but it requires operator attention: during the migration it generates errors or warnings, depending on the situation, and someone needs to analyze these events to check that the migration was performed successfully.

A few examples of how JTidy works:
− Missing or mismatched end tags are detected and corrected

<h1>heading
<h2>subheading</h3>

It will be mapped to

<h1>heading</h1>
<h2>subheading</h2>

− End tags in the wrong order are corrected

here is aspecial paragraph.

It will be mapped to

here is a special paragraph.

− Recovers from mixed up tags

<i><h1>heading</h1></i>
<p>new paragraph <b>bold text
<p>some more bold text

It will be mapped to

<h1><i>heading</i></h1>
<p>new paragraph <b>bold text</b>
<p><b>some more bold text</b>

− Getting the <hr> in the right place:

<h1><hr>heading</h1>
<h2>sub<hr>heading</h2>

It will be mapped to

<hr>
<h1>heading</h1>
<h2>sub</h2>
<hr>
<h2>heading</h2>

− Adding the missing "/" in end tags for anchors:

<a href="#refs">References<a>

It will be mapped to

<a href="#refs">References</a>

− Missing quotes around attribute values are added
− Unknown/proprietary attributes are reported
− Tags lacking a terminating '>' are spotted

The major limitations of JTidy are:
− It has limited support for XML
− It cannot recognize CDATA sections
− It cannot recognize DTD subsets

These limitations will not affect the FEP templates, since the old HTML template files are simple and contain no advanced tags. However, useful as it is, the tool may introduce design issues because of the way historical HTML interpreters behaved, and it requires operator attention. For this reason, extensive testing is envisaged.

2.8 Migration and testing

As described before, the migration from the FEP cannot be done directly to the final standard of templates. If all aspects of the engine were tested simultaneously, there would be a risk of not correctly identifying the source of an error. To avoid that, a two-step procedure has been devised.

The first step allows testing of the components that replace the functionality of the old ones:
− Migration of database references from DBS to ICA
− Custom tags migration (from old format to new format)
− Configuration file migration
− Request parameters interpretation for Document Detail
− Direct access to the database for Document Detail
− Lucene index creation
− Request parameters interpretation for Results List
− Search functionalities and results list processing
− Lucene index updating

The second step allows testing of extensibility to new features:
− Templates transformation from HTML to XHTML
− Extensibility using XSLT
− Integration with other external components

After the first step of migration, the engine will provide functionality similar to that of the FEP application.
The proof that the system was migrated successfully is that, after the first step, the new engine works exactly like the old system. This step can be done fully automatically, without operator attention.

The second step requires operator attention. The tasks isolated in this step require modifications or improvements that cannot be done fully automatically. Each CALLER that was migrated successfully will have its MIGRATION_LEVEL parameter configured to activate the extended functionality. For new templates, the second level is the default; the first step is needed only to test the migration of old templates from the old engine. In this way, the testing period is kept to a minimum.

2.9 Future migration to ICA2

The current architecture of this engine is modular and allows future improvements. To anticipate the major improvement scheduled for the CORDIS architecture, the migration to ICA2, it is important to understand the development effort involved.

Components shown: Web browser (Request / Response); HTTP Server; Apache Tomcat; Cocoon; Templates + config files; Business Layer; Hibernate Layer; Lucene Engine; ICA2 Connector Interface; ICA2 Content Services; ICA Database; Lucene Index
Figure 18 – ICA2 alternative architecture

As seen above, migrating to the new architecture requires re-engineering only the database layer. The remaining components stay the same.