Title/Summary: Searching artifacts across SCA domain
Student: Wojtek (Wojciech) Janiszewski
Student e-mail: wojtek.janiszewski cuthere gmail cuthere com
Student Major: Computer Science
Student Degree: Master
Student Graduation: 2009
Organization: Apache Software Foundation
Apache Tuscany, an implementation of Service Component Architecture (SCA), allows developers to create distributed applications. SCA applications are packaged as a set of various files (composites, scripts, Java classes, WSDL, XSD, etc.), called a contribution. A contribution can be a directory or a JAR file, which is then contributed to the SCA domain. Once installed, a contribution can be browsed manually on the filesystem, which can be inconvenient and time-consuming. The Apache Tuscany SCA domain manager web application lacks a search feature that would make it possible to browse and filter SCA objects quickly.
Apache Lucene is a powerful search engine library written in Java. Through its minimalistic API, it allows users to index data and search it. The simplicity and generality of this open source library make it possible to search any kind of data in various environments. Its advanced query syntax and good performance make it a great choice for building a reliable search subsystem.
The goal is to provide a facility for searching and browsing artifacts in an SCA domain. This goal could be achieved by integrating the Apache Lucene library with Apache Tuscany. With this solution, users would be able to search for artifacts using an improved SCA domain manager web application.
2. Detailed description
The implementation of the search feature should cover three main areas: indexing, searching and presentation. This separation gives us modularity, which enables component reuse and makes testing easier.
Indexing would be backed by Apache Lucene indexing mechanisms.
Indexing should be performed in two phases:
Phase one: gathering basic data
Every file in a contribution should be parsed and indexed. Moreover, files contained in every known archive format (JAR, ZIP) should be indexed as well. The parser for each file would be aware of its file type in order to extract the best data for indexing. Internally, a Document object would be created and the appropriate fields would be filled in. A unique identifier should be assigned to each newly created document.
Phase two: reference completion
For every indexed document, the previously created index will be searched to find usages of that document. The elements found would be stored with the current document's data. For example, for an indexed Java class we should search for occurrences of its name in composite files. The identifiers found, as well as their friendly names, would be stored in the indexed document representing the Java class.
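The two phases can be illustrated with a minimal sketch. It uses a plain in-memory map instead of Apache Lucene, and every name in it (TwoPhaseIndexSketch, Entry, usedBy and so on) is a hypothetical choice made for illustration, not part of Tuscany or Lucene. Usage detection is reduced to a substring test on composite files, which is an assumption; the real parser would be type-aware.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the index; a real implementation
// would store these entries as Apache Lucene documents.
class TwoPhaseIndexSketch {

    // One indexed artifact. All field names are illustrative assumptions.
    static class Entry {
        final int id;               // unique identifier assigned in phase one
        final String path;          // location inside the contribution
        final String friendlyName;  // e.g. a class name or a composite name
        final String content;       // text extracted by the parser
        final List<Integer> usedBy = new ArrayList<>(); // filled in phase two

        Entry(int id, String path, String friendlyName, String content) {
            this.id = id;
            this.path = path;
            this.friendlyName = friendlyName;
            this.content = content;
        }
    }

    private final Map<Integer, Entry> entries = new LinkedHashMap<>();
    private int nextId = 1;

    // Phase one: store the basic data and assign a unique identifier.
    Entry add(String path, String friendlyName, String content) {
        Entry entry = new Entry(nextId++, path, friendlyName, content);
        entries.put(entry.id, entry);
        return entry;
    }

    // Phase two: for every entry, look through the entries gathered in
    // phase one for composite files that mention its friendly name.
    void completeReferences() {
        for (Entry target : entries.values()) {
            for (Entry candidate : entries.values()) {
                if (candidate != target
                        && candidate.path.endsWith(".composite")
                        && candidate.content.contains(target.friendlyName)) {
                    target.usedBy.add(candidate.id);
                }
            }
        }
    }
}
```

Running phase two only after all files have been added means that usages are found regardless of the order in which the files were scanned.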
Index document model
Each indexed item would have several attributes. Bold items would be document fields for searching purposes. Italic is used for attributes used for internal purposes.
Full path for contribution object.
Specific to the file type, such as the QName of a Java class or the QName of a composite. For other files it could simply be the filename.
Literal content for text files. For non-readable files, e.g. Java classes, some names such as method names could be extracted and used in this field.
Contribution which the file belongs to.
Archive which the file belongs to.
Contains names of items which are used by the current item.
Contains names of items which use the current item. Generally, it is the reverse link for a reference.
Used to store objects extracted from a composite file.
All of the above fields, to support non-filter queries.
Unique across the SCA domain, to identify a document/domain object.
Links to reference documents.
Links to usage documents.
The above model shows how documents would be linked for search purposes. Examples of references which could occur can be found on the diagram.
Searching would be backed by Apache Lucene search engine.
A custom search API for Apache Tuscany would be available via an SCA component. Such a component can be reused in various scenarios, e.g. it can be exposed for other purposes via one of the Apache Tuscany bindings. In this project we would like to use this component as a feed for the web-based UI.
Generally, there would be two operations exposed by this component (more could be introduced, e.g. for administration purposes).
Fetch by phrase - search phrases are similar to those used in Google. Lucene query syntax would be used and various filters could be entered (based on the fields described in 2.1.1 Indexing). The user could filter by:
- document name
- document friendly name
- document content
- contribution which it belongs to
- items declared in composite files
- none (all document fields would be used to search)
More query syntax elements would be used, such as:
- logic operators
- regular expressions
Fetch by item - getting an item and its references. The item to fetch is identified by its internal identifier. This fetching method would be used in navigation based on hyperlinks, not search queries.
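The two operations described above could take roughly the following shape. This is only a sketch under stated assumptions: the interface name ArtifactSearch, the method names and the field keys "id", "name" and "contribution" are hypothetical, and the matching is a case-insensitive substring test over an in-memory list, which is far simpler than real Lucene query syntax. Only the "field:term" filter form is borrowed from Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SearchApiSketch {

    // Hypothetical shape of the two operations; all names are assumptions.
    interface ArtifactSearch {
        List<Map<String, String>> fetchByPhrase(String query);
        Map<String, String> fetchByItem(String id);
    }

    // Toy implementation over documents held as field-name -> value maps.
    static class InMemorySearch implements ArtifactSearch {
        private final List<Map<String, String>> documents;

        InMemorySearch(List<Map<String, String>> documents) {
            this.documents = documents;
        }

        // Accepts "field:term" to filter on one field, or a bare "term"
        // to match against every field of a document.
        public List<Map<String, String>> fetchByPhrase(String query) {
            int colon = query.indexOf(':');
            String field = colon < 0 ? null : query.substring(0, colon);
            String term = query.substring(colon + 1).toLowerCase();
            List<Map<String, String>> hits = new ArrayList<>();
            for (Map<String, String> doc : documents) {
                if (matches(doc, field, term)) {
                    hits.add(doc);
                }
            }
            return hits;
        }

        // Looks a document up by its internal identifier; null if absent.
        public Map<String, String> fetchByItem(String id) {
            for (Map<String, String> doc : documents) {
                if (id.equals(doc.get("id"))) {
                    return doc;
                }
            }
            return null;
        }

        private static boolean matches(Map<String, String> doc, String field, String term) {
            if (field != null) {
                String value = doc.get(field);
                return value != null && value.toLowerCase().contains(term);
            }
            for (String value : doc.values()) {
                if (value.toLowerCase().contains(term)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Helper that builds one illustrative document.
    static Map<String, String> doc(String id, String name, String contribution) {
        Map<String, String> fields = new HashMap<>();
        fields.put("id", id);
        fields.put("name", name);
        fields.put("contribution", contribution);
        return fields;
    }
}
```

Keeping both operations behind one small interface is what would allow the same component to serve the web UI and any other client reached through a Tuscany binding.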
Navigation could be performed in two ways:
1. By using a search box where the user can type a query, for the "fetch by phrase" search method
2. By using links to items, where the user can navigate through references (the "fetch by item" method). Such links could be found in several places:
- search start page - with links to available contributions
- result element view containing links to parents and children
Additionally, after implementing the project core, some usability features could be added:
- artifact types
- indexed names
- most popular searches
- query syntax
The display layout would be common to both "fetch by phrase" and "fetch by item". Every search would be displayed as a list of results. For long result lists, paging would be applied. Furthermore, a sort feature (based on various criteria) would improve navigation through the results list.
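Sorting followed by paging could be sketched as below. The class and method names are hypothetical, and the 1-based page numbering is an assumption made for the example; the proposal itself does not fix these details.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ResultPagingSketch {

    // Sorts a copy of the results and returns the requested page.
    // Pages are numbered from 1; a page past the end is an empty list.
    static <T> List<T> page(List<T> results, Comparator<? super T> order,
                            int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be positive");
        }
        List<T> sorted = new ArrayList<>(results); // leave the input untouched
        sorted.sort(order);
        int from = (pageNumber - 1) * pageSize;
        if (from >= sorted.size()) {
            return Collections.emptyList();
        }
        int to = Math.min(from + pageSize, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

Passing the comparator as a parameter keeps the sort criterion (name, contribution, relevance and so on) a decision of the caller, which matches the idea of sorting "based on various criteria".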
View for each search result element should contain:
- highlighted phrase which matches search query
- preview link (if item is readable)
- link to parent contribution
- links to runtime nodes (fetched from contribution)
- links to direct parents (composite, component, binding etc.)
- links to children elements
The following image shows example navigation through the search UI. It contains 5 web pages which can be reached in various flows. Red arrows show which page would be generated after clicking a link. Purple is used for comments.
Contribution scanner, parser and analyzer
A module which scans contributions, analyzes their artifacts and feeds the Apache Lucene index. See 2.1.1 Indexing for details. Appropriate JUnit tests should be introduced.
A module which exposes search features via a simple API available through an SCA component. See 2.1.2 Searching for details. Appropriate JUnit tests should be introduced.
SCA Domain manager web application extension
New pages, scripts, actions etc. which would handle the UI described in 2.1.3 Presentation. Appropriate JUnit tests should be introduced.
Module which tests integration of project deliverables with Apache Tuscany.
User and developer documentation on Apache Tuscany web page.
2.3 Architectural outline
The contribution scanner, parser and analyzer will be available to Tuscany users as an additional module which, if added to the classpath, would be started automatically. This module will fetch the list of contributions by reading workspace.xml. This file would be constantly monitored so that changed contribution entries can be reindexed when needed. The created index would be registered in the internal structures of Apache Tuscany for further access by the search component.
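Deciding which contributions need reindexing after workspace.xml changes could look like the sketch below. It assumes the file has already been parsed into a map from contribution URI to location; that representation, the class name and the method name are all hypothetical, and the actual format of workspace.xml is not modelled here.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ReindexPlanSketch {

    // Compares two snapshots of the contribution list (URI -> location)
    // and returns the URIs that were added, removed or relocated.
    static Set<String> contributionsToReindex(Map<String, String> before,
                                              Map<String, String> after) {
        Set<String> changed = new TreeSet<>();
        // Added entries and entries whose location differs.
        for (Map.Entry<String, String> entry : after.entrySet()) {
            if (!entry.getValue().equals(before.get(entry.getKey()))) {
                changed.add(entry.getKey());
            }
        }
        // Entries that disappeared and must be dropped from the index.
        for (String uri : before.keySet()) {
            if (!after.containsKey(uri)) {
                changed.add(uri);
            }
        }
        return changed;
    }
}
```

Comparing snapshots limits the work to the affected contributions instead of rebuilding the whole index on every change of the file.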
The search component module will be started automatically if added to the classpath. Indexed data would be obtained from Apache Tuscany internal structures. Physically, it could be two separate Maven modules: one for the search operations and one for exposing them as a component.
The SCA Domain manager web application extension will access the search component via its default binding. In special cases administrative operations would be invoked, e.g. a reindex request when the user adds or deletes a contribution (although this is not necessary if the contribution scanner monitors workspace.xml often).
Before May 23
Proposal review and discussions, prototyping, getting familiar with advanced aspects of related technologies.
May 23 - June 30 (~5 weeks)
Implementation of Contribution scanner, parser and analyzer.
July 1 - July 11 (1.5 weeks)
Implementation of Search component.
Implementation of Integration tests.
Submitting mid-term evaluation.
July 13 - August 9 (~4 weeks)
Implementation of SCA Domain manager web application extension.
August 10 - August 17 (1 week)
Extra week in case of delay.
Writing documentation for project. Code review.
Submitting final evaluation