Title/Summary: Develop a 'NoSQL' Datastore component for Apache Cassandra, CouchDB, Hadoop/HBase

Student: Eranda Sooriyabandara

Student e-mail: 070468d AT gmail DOT com

Student Major: Computer Science

Student Degree: Undergraduate

Student Graduation: October 2011

Organization: Apache Software Foundation

Assigned Mentor: Jean-Sebastien Delfino

Abstract: 

Apache Tuscany provides a comprehensive infrastructure that simplifies developing and managing Service Oriented Architecture (SOA) solutions based on the Service Component Architecture (SCA) standard. SCA abstracts business functions as components and encourages business people and solution providers to use them as building blocks for a business solution without needing to know much about the underlying infrastructure.

'NoSQL' (Not Only SQL) databases are a modern class of databases that differ from classic relational database management systems in several ways: they may not require fixed table schemas, they avoid join operations, and they scale horizontally. They also typically expose an API for manipulating data instead of Structured Query Language (SQL). Apache Cassandra, CouchDB, Hadoop/HBase and the AppEngine Datastore are some examples of 'NoSQL' databases.

My ultimate goal in this project is to create portable SCA datastore components over a number of 'NoSQL' databases such as Apache Cassandra, CouchDB and Hadoop/HBase, using Java. The main idea is to hide the database-specific API of each 'NoSQL' database behind a REST datastore interface that people can use without worrying about the underlying database.

Implementation Plan:

Implementing the SCA datastore components requires considering the following attributes:

  • Service
  • Reference
  • Property
  • Intent Policies
  • Implementation

My task in this project is therefore to get a clear understanding of these attributes and implement them as SCA components. There are two components per database: the first is the REST datastore interface component and the second is the wrapped database component.

Service: 

The major function of the REST datastore interface component is to give the user access to a 'NoSQL' database without worrying about the underlying database. The 'service' of this component describes a generic service interface for storing and manipulating data in all of the 'NoSQL' databases. Before implementing the interface we need to define the REST datastore interface operations that all the datastore components will share. This needs to be done carefully, since some concepts are specific to one database, for example the SuperColumnFamily in Apache Cassandra.

The 'service' of each wrapper component describes a database-specific service interface for storing and manipulating data in the corresponding 'NoSQL' database. These interfaces vary with each database's API.
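
As a rough illustration of what a generic service interface could look like, here is a minimal Java sketch. The names DatastoreService and InMemoryDatastore and the three operations are assumptions made for this example only; the real operation set still has to be agreed on. The in-memory class merely stands in for a wrapped database so that the sketch runs on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical generic service interface shared by all datastore components.
interface DatastoreService {
    void put(String collection, String key, String value);
    Optional<String> get(String collection, String key);
    boolean delete(String collection, String key);
}

// In-memory stand-in for a wrapped database; a real component would
// translate these calls into the API of the underlying 'NoSQL' database.
class InMemoryDatastore implements DatastoreService {
    private final Map<String, Map<String, String>> data = new HashMap<>();

    @Override
    public void put(String collection, String key, String value) {
        data.computeIfAbsent(collection, c -> new HashMap<>()).put(key, value);
    }

    @Override
    public Optional<String> get(String collection, String key) {
        Map<String, String> entries = data.get(collection);
        return entries == null ? Optional.empty() : Optional.ofNullable(entries.get(key));
    }

    @Override
    public boolean delete(String collection, String key) {
        // Returns true only if a value was actually removed.
        Map<String, String> entries = data.get(collection);
        return entries != null && entries.remove(key) != null;
    }
}
```

Database-specific concepts such as a Cassandra super column do not fit this shape directly, which is why the generic operations need to be chosen with care.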

Reference:

In the reference we need to create an interface that describes the component's dependencies. The reference of the REST datastore interface component points to the service interface of the wrapped database component. Wrapped database components do not have references.
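
The dependency can be pictured with a small plain-Java sketch. CassandraWrapperService, FakeCassandraWrapper and RestDatastoreComponent are hypothetical names, and the constructor argument stands in for the reference. In an SCA runtime the reference would presumably be declared with an SCA annotation and wired in the composite; that wiring is left out here so the example runs without Tuscany.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical database-specific service interface of a wrapped component.
interface CassandraWrapperService {
    void insert(String columnFamily, String rowKey, String column, String value);
    String read(String columnFamily, String rowKey, String column);
}

// In-memory stand-in for the wrapper so the sketch runs without a database.
class FakeCassandraWrapper implements CassandraWrapperService {
    private final Map<String, String> cells = new HashMap<>();

    @Override
    public void insert(String columnFamily, String rowKey, String column, String value) {
        cells.put(columnFamily + "/" + rowKey + "/" + column, value);
    }

    @Override
    public String read(String columnFamily, String rowKey, String column) {
        return cells.get(columnFamily + "/" + rowKey + "/" + column);
    }
}

// The interface component depends only on the wrapper's service interface.
class RestDatastoreComponent {
    private final CassandraWrapperService wrapper; // plays the role of the reference

    RestDatastoreComponent(CassandraWrapperService wrapper) {
        this.wrapper = wrapper;
    }

    void put(String collection, String key, String value) {
        // Assumption for this sketch: each value lives in a single column named "value".
        wrapper.insert(collection, key, "value", value);
    }

    String get(String collection, String key) {
        return wrapper.read(collection, key, "value");
    }
}
```

The wrapper class itself holds no reference to another component, matching the statement that wrapped database components have none.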

Property:

Properties define the configuration parameters that describe the behaviour of the datastore components, for example the concurrency controls in the datastore components. These parameters can be set in a configuration file, which is an XML or a text file. The configuration file may differ between 'NoSQL' datastores, and a deep analysis of each DBMS is needed to find its configuration parameters.
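
To make the idea of a text configuration file concrete, the sketch below reads a few parameters with java.util.Properties. The class name DatastoreConfig, the keys (datastore.host, datastore.port, datastore.maxConnections) and the default values are all assumptions for illustration; the actual parameters depend on the analysis of each DBMS.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the configuration of one datastore component.
class DatastoreConfig {
    final String host;
    final int port;
    final int maxConnections;

    private DatastoreConfig(String host, int port, int maxConnections) {
        this.host = host;
        this.port = port;
        this.maxConnections = maxConnections;
    }

    // Parses key=value text; keys that are absent fall back to illustrative defaults.
    static DatastoreConfig parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new DatastoreConfig(
                props.getProperty("datastore.host", "localhost").trim(),
                Integer.parseInt(props.getProperty("datastore.port", "9160").trim()),
                Integer.parseInt(props.getProperty("datastore.maxConnections", "10").trim()));
    }
}
```

A file containing only datastore.host and datastore.maxConnections would then leave the port at its default.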

Intent Policies:

  • Implementation policies:

This will be a transaction-based implementation and needs to keep a log of each transaction. A logging function may be included in the DBMS itself, but here we need a separate log to see whether each and every transaction that invokes the service interface ends up as a successful transaction.
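
One possible shape for such a separate log is a thin wrapper that records the outcome of every call made through the service interface. TransactionLog and its record method are hypothetical names, and keeping entries in an in-memory list is a simplification; a real component would more likely write to a file or a logging framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical transaction log kept by the component, independent of the DBMS.
class TransactionLog {
    private final List<String> entries = new ArrayList<>();

    // Runs the call, records whether it succeeded, and passes on the result or the failure.
    <T> T record(String operation, Supplier<T> call) {
        try {
            T result = call.get();
            entries.add(operation + " SUCCESS");
            return result;
        } catch (RuntimeException e) {
            entries.add(operation + " FAILED: " + e.getMessage());
            throw e;
        }
    }

    // Returns a copy so callers cannot alter the log.
    List<String> entries() {
        return new ArrayList<>(entries);
    }
}
```

A call such as log.record("get users/u1", () -> store.get("users", "u1")) would then leave one entry per invocation, whether or not it succeeded.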

Implementation:

The components will be implemented in Java. The logical task of each component is as follows:

  • The REST datastore interface component mediates the transaction to the wrapped database component and returns the results to the user.
  • The wrapped database component wraps the 'NoSQL' database as an SCA component.
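
The mediation step can be sketched as a mapping from REST-style requests to operations on the wrapped component. Everything here is an assumption for illustration: the names KeyValueStore, MapStore and RestMediator, the choice of GET/PUT/DELETE, and the status strings returned. No HTTP server is involved, so the sketch only shows the dispatch logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical service interface of a wrapped database component.
interface KeyValueStore {
    void put(String key, String value);
    String get(String key);
    void delete(String key);
}

// In-memory stand-in for the wrapped database.
class MapStore implements KeyValueStore {
    private final Map<String, String> data = new HashMap<>();
    @Override public void put(String key, String value) { data.put(key, value); }
    @Override public String get(String key) { return data.get(key); }
    @Override public void delete(String key) { data.remove(key); }
}

// Mediates REST-style requests to the wrapped component and returns the result.
class RestMediator {
    private final KeyValueStore store;

    RestMediator(KeyValueStore store) {
        this.store = store;
    }

    // Returns an HTTP-like status code, followed by the stored value for a successful GET.
    String handle(String method, String path, String body) {
        String key = path.startsWith("/") ? path.substring(1) : path;
        switch (method) {
            case "GET": {
                String value = store.get(key);
                return value == null ? "404" : "200 " + value;
            }
            case "PUT":
                store.put(key, body);
                return "204";
            case "DELETE":
                store.delete(key);
                return "204";
            default:
                return "405";
        }
    }
}
```

Swapping MapStore for a Cassandra, CouchDB or HBase wrapper would leave RestMediator unchanged, which is the point of separating the two components.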

[Figure: a sample of how the components work together]

The implementation described above is based on my own knowledge and on the ideas Jean-Sebastien came up with. Further discussion is needed to resolve the conflicts in the component design.

Deliverables:

  1. The REST interface component.
  2. Components that wrap the Apache Cassandra, CouchDB and Hadoop/HBase databases.
  3. A functionality testing framework.
  4. Documentation and a tutorial for the new components.

Time-line:

April 25 - May 23

  • Continue studying
    • How Tuscany works
    • How to create SCA components, by reading and implementing sample SCA components.
  • Discuss problems, ideas and conflicts with the mentor and other Tuscany community members.
  • Define a sample scenario for the implementation over the various databases.
  • Use that sample scenario to identify the APIs of the databases.
  • Put the database-independent parts of the scenario in Tuscany and mock up the database access (identify the different commands).
  • Contact the Apache Cassandra, CouchDB and Hadoop/HBase communities if anything is unclear.

May 24 - July 10

  • Decide on the API for accessing and manipulating data in the NoSQL datastore components.
  • Start implementing the datastore components
    • Stage 1: Implement the REST interface component (abstract model).
    • Stage 2: Implement the component for Apache Cassandra and modify the REST interface component to support Apache Cassandra.
      • Run functional tests for the component using the REST interface component.

July 11

  • Mid-term evaluation of the project.

July 12 - August 15

  • Continue implementing the datastore components
    • Stage 3: Implement the component for CouchDB and modify the REST interface component to support CouchDB.
      • Run functional tests for the component using the REST interface component.
    • Stage 4: Implement the component for Hadoop/HBase and modify the REST interface component to support Hadoop/HBase.
      • Run functional tests for the component using the REST interface component.
  • Write documentation and a tutorial for the new components using a well-known use-case scenario.

August 16 - August 22

  • Make the final adjustments to all the deliverables for the submission.

August 26

  • Final evaluation deadline.

Community Interactions:

Following an open-source model of communication, I would like to interact with the community via:

  • JIRA issue tracking system
  • Apache Tuscany mailing-list
  • IRC channel (#tuscany)
  • Private chats on GTalk or Skype

Using these channels, I would like to keep my project fully open to the community and draw on the valuable ideas of each and every community member.

Biography:

I am Eranda Sooriyabandara, a final-year student in the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka. As I am very interested in databases, I have gained experience working with databases such as Apache Derby, MySQL, PostgreSQL and Oracle, and with Apache Cassandra as a 'NoSQL' database. I also have knowledge of Service Oriented Architecture and related topics such as web services and SOAP, since I completed a six-month internship at an SOA middleware company.

I want to take part in this project because it is a great chance to learn about 'NoSQL' databases such as Apache Cassandra, CouchDB, Hadoop/HBase and the AppEngine Datastore, and to gain experience with the Service Component Architecture, which is a fairly new technology to me but one I would like to learn further while contributing to Apache Tuscany. Working with an experienced community is also a big opportunity for me to learn new technologies from the best.