Implement RethinkDB datastore module for Apache Gora
Project: Implement RethinkDB datastore module
Project: Apache Gora
Issue: https://issues.apache.org/jira/browse/GORA-655
Potential Mentors: Kevin Ratnasekera, Chanaka Balasuriya, Furkan KAMACI
About Me:
I am Rumesh Perera (nrumeshperera@gmail.com), a master's student in computer science at the University of Stuttgart, Germany. I am an open source enthusiast and use a large number of open source tools daily for academic and personal work. I see GSoC as an opportunity to participate in one of the prominent open source projects at the Apache Software Foundation.
Project:
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores, distributed in-memory key-value stores, in-memory data grids, in-memory caches, distributed multi-model stores and hybrid in-memory architectures. Gora also enables analysis of data with extensive Apache Hadoop MapReduce, Apache Spark, Apache Flink, and Apache Pig support.
RethinkDB is a popular document store that is used extensively in modern real-time applications. A detailed technical comparison between RethinkDB and MongoDB as document persistence solutions is available in [1]. RethinkDB offers a number of benefits because it was developed with both developer-oriented and operations-oriented concerns in mind [2]. Apache Gora currently supports MongoDB and CouchDB as document stores; this project aims to extend the datastore support to RethinkDB.
Approach:
Apache Gora provides a key-value based abstraction to access and query data irrespective of the underlying data model of the physical database. One can therefore use the simple Gora key-value abstractions rather than worrying about the data model of the backend database. To make this possible, each backend needs a custom implementation that translates between the Apache Gora abstractions and the abstractions of the physical database. The standard way to do this is to write a datastore, as described in [3]. The core base classes to extend for a particular backend are DataStoreBase<K, T>, QueryBase<K, T> and ResultBase<K, T>. DataStoreBase<K, T> has the core methods for backend connection management, data bean persistence and retrieval, and query execution. QueryBase<K, T> and ResultBase<K, T> are smaller entities that support the query functionality of DataStoreBase<K, T>; they are in-memory data structures that represent a physical database query and its result set. ReQL, the RethinkDB query language, is a very expressive language for manipulating the documents stored in the database.
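As an illustration of the translation step described above, the following is a minimal sketch of how a keyed data bean might be turned into a document and back. It does not use the Gora or RethinkDB APIs: the class DocumentMappingSketch, its methods, and the field names are hypothetical, and the bean is simplified to a map of field names to values. The only RethinkDB fact it relies on is that a table's default primary key field is named "id".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: maps bean field names to document field names and back. */
final class DocumentMappingSketch {

    // RethinkDB uses "id" as the default primary key field of a table.
    static final String PRIMARY_KEY = "id";

    // Bean field name -> document field name, in declaration order.
    private final Map<String, String> fieldToDocField = new LinkedHashMap<>();

    /** Registers one field mapping and returns this object for chaining. */
    DocumentMappingSketch map(String beanField, String docField) {
        fieldToDocField.put(beanField, docField);
        return this;
    }

    /** Gora side to database side: builds a document from a key and bean fields. */
    Map<String, Object> toDocument(String key, Map<String, Object> beanFields) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(PRIMARY_KEY, key);
        for (Map.Entry<String, String> e : fieldToDocField.entrySet()) {
            // Fields that are absent from the bean are simply not written.
            if (beanFields.containsKey(e.getKey())) {
                doc.put(e.getValue(), beanFields.get(e.getKey()));
            }
        }
        return doc;
    }

    /** Database side to Gora side: recovers the mapped bean fields from a document. */
    Map<String, Object> toBeanFields(Map<String, Object> doc) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fieldToDocField.entrySet()) {
            if (doc.containsKey(e.getValue())) {
                fields.put(e.getKey(), doc.get(e.getValue()));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        DocumentMappingSketch mapping = new DocumentMappingSketch()
                .map("name", "full_name")
                .map("salary", "salary");

        Map<String, Object> bean = new LinkedHashMap<>();
        bean.put("name", "Alice");
        bean.put("salary", 4200);

        Map<String, Object> doc = mapping.toDocument("emp-1", bean);
        System.out.println(doc);
        System.out.println(mapping.toBeanFields(doc));
    }
}
```

In an actual datastore the mapping would be read from the mapping file and the bean would be an Avro-generated class, so this sketch only shows the shape of the two-way translation.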
The RethinkDB Java driver has extensive support for ReQL and can be used as a ReQL query builder. The transition from Gora queries to ReQL queries should therefore require minimal effort, and the implementation is expected to be straightforward.
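To make the query translation more concrete, here is a small sketch that renders a Gora-style key range as ReQL text written in the style of the Java driver. It is only an illustration: the class KeyRangeToReqlSketch and its method are hypothetical, the output is a string rather than a driver query object, and it assumes that both the start key and the end key of a Gora query are inclusive. Because ReQL's between command excludes the right bound by default, the sketch closes it explicitly; the exact driver calls should be checked against the RethinkDB Java driver documentation.

```java
/** Hypothetical sketch: renders a key-range query as ReQL text. */
final class KeyRangeToReqlSketch {

    /**
     * Renders a key range over a table as ReQL text.
     * A null start or end key means the range is unbounded on that side.
     */
    static String toReql(String table, String startKey, String endKey) {
        String base = "r.table(" + quote(table) + ")";
        if (startKey == null && endKey == null) {
            return base; // no key restriction: read the whole table
        }
        if (startKey != null && startKey.equals(endKey)) {
            return base + ".get(" + quote(startKey) + ")"; // single-key lookup
        }
        String lower = (startKey == null) ? "r.minval()" : quote(startKey);
        String upper = (endKey == null) ? "r.maxval()" : quote(endKey);
        // between() is left-closed and right-open by default, so the right
        // bound is closed here to keep the end key inclusive (an assumption).
        return base + ".between(" + lower + ", " + upper + ")"
                + ".optArg(\"right_bound\", \"closed\")";
    }

    /** Wraps a value in double quotes, escaping backslashes and quotes. */
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toReql("employee", null, null));
        System.out.println(toReql("employee", "emp-1", "emp-1"));
        System.out.println(toReql("employee", "emp-1", "emp-9"));
    }
}
```

A real implementation would build the query with the driver's own builder objects instead of strings, and would also need to handle field projection and limits.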
Deliverables:
RethinkDB compatibility for Apache Gora will comprise:
- RethinkDB datastore module.
- Integration tests for the RethinkDB datastore (datastore and workload).
- RethinkDB documentation for the Apache Gora website.
- Weekly reports plus blog posts on project activity.
Timeline:
Activities | Time Period |
Community Bonding Period: Set up the development environment; research the RethinkDB document data model; research the RethinkDB Java driver API; set up a RethinkDB test server and a POC of Java driver client code to connect, persist and query data; participate in the mailing list and address JIRA tickets for code contributions; design the mapping from Avro beans to the RethinkDB document model, that is, the mapping file gora-rethinkdb-mapping.xml. | May 04 - June 01 |
Coding Period 1: Create the initial Maven module structure; extend the DataStoreBase<K, T> class methods, customized for the RethinkDB backend. This will include: logic to parse the mapping file and create an in-memory object model for the RethinkDB mapping; logic to externalize RethinkDB Java client properties (e.g. host, port) to gora.properties and to parse and maintain them in memory; logic for the DataStore methods initialize, get, put, create schema, delete schema. | June 01 - June 29 |
Evaluation 1 | June 29 - July 03 |
Coding Period 2: Extend the DataStoreBase<K, T> class methods for RethinkDB query execution; logic to convert Apache Gora key-value based queries to ReQL queries and vice versa; logic for the DataStore methods execute query, partition query, delete by query. | July 03 - July 27 |
Evaluation 2 | July 27 - July 31 |
Coding Period 3: Integration tests for the RethinkDB datastore; RethinkDB documentation for the Apache Gora website. | July 31 - August 31 |
Evaluation 3 | August 31 - September 7 |
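One of the Coding Period 1 items is externalizing the RethinkDB Java client properties, such as host and port, to gora.properties. A minimal sketch of that parsing step could look like the following. The property keys gora.rethinkdb.host and gora.rethinkdb.port and the class RethinkDbParametersSketch are hypothetical names chosen for illustration; the default port 28015 is RethinkDB's default client driver port.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical sketch: holds connection settings read from gora.properties. */
final class RethinkDbParametersSketch {

    // Hypothetical property keys; the real module may choose different names.
    static final String HOST_KEY = "gora.rethinkdb.host";
    static final String PORT_KEY = "gora.rethinkdb.port";

    static final String DEFAULT_HOST = "localhost";
    static final int DEFAULT_PORT = 28015; // RethinkDB default client driver port

    final String host;
    final int port;

    private RethinkDbParametersSketch(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Reads host and port, falling back to defaults when a key is absent. */
    static RethinkDbParametersSketch load(Properties props) {
        String host = props.getProperty(HOST_KEY, DEFAULT_HOST).trim();
        String rawPort = props.getProperty(PORT_KEY, Integer.toString(DEFAULT_PORT)).trim();
        int port;
        try {
            port = Integer.parseInt(rawPort);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(PORT_KEY + " is not a number: " + rawPort, e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException(PORT_KEY + " is out of range: " + port);
        }
        return new RethinkDbParametersSketch(host, port);
    }

    public static void main(String[] args) throws IOException {
        // Stands in for the contents of a gora.properties file.
        String fileContents = "gora.rethinkdb.host=db.example.org\n"
                + "gora.rethinkdb.port=28016\n";
        Properties props = new Properties();
        props.load(new StringReader(fileContents));

        RethinkDbParametersSketch params = load(props);
        System.out.println(params.host + ":" + params.port);
    }
}
```

Validating the values once at load time keeps malformed configuration from surfacing later as a connection error inside the datastore's initialize method.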
Commitment:
I plan to allocate 30-40 hours per week during the GSoC period, which I believe is adequate for the successful completion of the project. I intend to take GSoC as a starting point for entering an open source community and to continue contributing in the long run, even after the project is complete.
Community Engagement:
I have already discussed and reviewed my proposal with the potential mentors and have additionally addressed the following JIRA tickets with PRs.
- https://issues.apache.org/jira/browse/GORA-653 (PR: https://github.com/apache/gora/pull/212)
- https://issues.apache.org/jira/browse/GORA-654 (PR: https://github.com/apache/gora/pull/211)
References:
[1] https://rethinkdb.com/docs/comparison-tables/
[2] https://rethinkdb.com/blog/mongodb-biased-comparison/
[3] https://cwiki.apache.org/confluence/display/GORA/Writing+a+new+DataStore+for+Gora+HOW_TO