
Add MongoDB to Tajo Storage - Proposal

...

Apache Tajo (Future of Data Warehouse) is a robust, relational, distributed big data warehouse system for Apache Hadoop. Tajo is designed to provide low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) processing on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources (http://tajo.apache.org/). Tajo currently embeds HDFS, S3, OpenStack, HBase, and RDBMS storage plugins, so users can connect those data sources to Apache Tajo (http://tajo.apache.org/docs/current/storage_plugins/overview.html).

MongoDB is an open-source, cross-platform, document-oriented NoSQL database. MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON). Like other NoSQL databases, it supports dynamic schema design, allowing documents in a collection to have different fields and structures.
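To illustrate what a dynamic schema means for a relational engine, here is a minimal sketch in plain Java. It does not use the MongoDB driver or any Tajo API: documents are modelled as ordinary maps, and the class name `DynamicSchemaSketch`, the method `toRow`, and the sample field names are all hypothetical. It only shows one possible way to project documents with differing fields onto a fixed list of columns, filling missing fields with null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: documents are plain maps, not BSON objects.
public class DynamicSchemaSketch {

    // Project one document onto a fixed, ordered list of column names.
    // A field that is missing from the document becomes null in the row.
    static List<Object> toRow(Map<String, Object> document, List<String> columns) {
        List<Object> row = new ArrayList<>();
        for (String column : columns) {
            row.add(document.get(column));
        }
        return row;
    }

    public static void main(String[] args) {
        // Two documents from the same (imaginary) collection with different fields.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("name", "alice");
        first.put("age", 30);

        Map<String, Object> second = new LinkedHashMap<>();
        second.put("name", "bob");
        second.put("email", "bob@example.com");

        // Assumed relational view of the collection.
        List<String> columns = Arrays.asList("name", "age", "email");

        System.out.println(toRow(first, columns));
        System.out.println(toRow(second, columns));
    }
}
```

Treating a missing field as null is only one possible choice; how the actual plugin maps documents to table columns is a design decision for the project.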

As mentioned in the first paragraph, Apache Tajo embeds several storage plugins (https://github.com/apache/tajo/tree/master/tajo-storage). This project proposes to add a MongoDB storage plugin to tajo-storage. Implementing the new module tajo-storage-mongodb (the storage plugin for MongoDB) will be the major part of the project.

...

Besides implementing the tajo-storage-mongodb module, it will be necessary to update the configuration modules to support MongoDB. The project task is to implement the modules above.

Here is an abstract diagram that describes the module implementation.

...

(Please note that this is a very abstract diagram. For example, these classes may implement interfaces that are not shown in the diagram.)


Image Added

4. Timeline 

On the advice of the mentors (Jihoon Son, JaeHwa Jung), I have already set up the development environment in an Ubuntu virtual machine. The IDE used is IntelliJ IDEA. I am going to follow the schedule below during the community bonding period and the coding period.

Community Bonding Period : Maintaining regular discussions with the mentors and working on the material and guidance they provide. Going through all the storage drivers again and studying their architecture. Discussing the most suitable architecture for the MongoDB storage driver with the mentors.

There will be around 4 weeks from the start of coding (23rd May) until the start of mid-term evaluations (20th June).

Week 1 : Finalize the architecture and complete the class structure. Create dummy classes and methods without writing the actual implementation. Suitable names for classes, attributes, and methods will be decided at this step. This will be very helpful during implementation.

Week 2 : Implementing the actual code for the MongodbConnectionInfo class and checking connectivity with MongoDB.

Week 3 : Implementing Fragment and TableSpace and testing them. It will also be necessary to implement the required functions in supporting components to achieve this task.

Week 4 : Implementing Scanner and testing the ability to read from a MongoDB database.
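As a rough idea of what the connection information step in Week 2 could involve, the following is a minimal sketch that parses a simple `mongodb://host:port/database` URI using only the Java standard library. The class name `MongodbConnectionInfoSketch`, its fields, and the `fromUri` method are assumptions made for illustration and are not the actual design of the MongodbConnectionInfo class. The sketch does not open a connection; checking real connectivity would require the MongoDB Java driver, which is not shown here.

```java
import java.net.URI;

// Hypothetical sketch: holds the parts of a single-host MongoDB URI.
public class MongodbConnectionInfoSketch {

    // 27017 is the default port a MongoDB server listens on.
    static final int DEFAULT_PORT = 27017;

    final String host;
    final int port;
    final String database; // null when the URI names no database

    private MongodbConnectionInfoSketch(String host, int port, String database) {
        this.host = host;
        this.port = port;
        this.database = database;
    }

    // Parse a URI such as mongodb://localhost:27017/sales.
    // Multi-host (replica set) connection strings are not handled by this sketch.
    static MongodbConnectionInfoSketch fromUri(String uriText) {
        URI uri = URI.create(uriText);
        if (!"mongodb".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Expected a mongodb:// URI: " + uriText);
        }
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("Missing host in URI: " + uriText);
        }
        // getPort() returns -1 when the URI does not specify a port.
        int port = uri.getPort() == -1 ? DEFAULT_PORT : uri.getPort();
        // The path is "/database"; strip the leading slash if a name is present.
        String path = uri.getPath();
        String database = (path == null || path.length() <= 1) ? null : path.substring(1);
        return new MongodbConnectionInfoSketch(host, port, database);
    }

    public static void main(String[] args) {
        MongodbConnectionInfoSketch info = fromUri("mongodb://localhost/sales");
        System.out.println(info.host + ":" + info.port + "/" + info.database);
    }
}
```

Keeping parsing separate from connecting would let the parsing logic be unit tested without a running MongoDB instance.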


There will be around 7 weeks from the mid-term evaluations (28th June) to the suggested 'pencil down' date (15th August).

Week 1 : Fix any issues with the current implementation. Test the Scanner.

Week 2 : Implementing the Appender.

Week 3 : Testing the Appender and the complete tajo-storage-mongodb module. Start writing the documentation.

Week 4 : Completing the document “MongoDB integration” in Tajo docs.

Week 5 : Testing all the functionality of the driver and creating documentation on the architecture of the module.

Week 6 : Fix bugs and improve the quality of the code.

Week 7 : Kept free as a buffer in case of an emergency.

In addition, I will blog regularly about my work on my personal blog throughout the working period of the program.

...

...

My main interest is in C++, because it was the language in which I learned programming, but I also have good experience with Java. Further, I have self-studied MongoDB. I strongly believe I have the skill set required to complete this project, and I will be glad to research and study any other technologies the project requires.

8. Other Commitments

  • Semester 5 End Exams - 11th of July to 25th of July.
    • I believe it will not be a big issue and I will be able to continue the project during this time.
  • Part Time Tutoring - I do part time tutoring, 6 hrs per week.

...