Apache Software Foundation
Abstract: The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. Apache Spark is a fast and general engine for large-scale data processing; it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Gora should support a backend for Apache Spark.
Additional info: http://www.furkankamaci.com/
Spark Backend Support for Gora
GsoC 2015 Proposal
Istanbul Technical University
Graduate School of Science, Engineering and Technology
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. In Avro, the beans to hold the data and the RPC interfaces are defined using a JSON schema. For mapping the data beans to data store specific settings, Gora depends on mapping files, which are specific to each data store. Unlike other OTD (Object-to-Datastore) mapping implementations, in Gora the mapping from data bean to data store specific schema is explicit. This has the advantage that, when using data stores such as HBase and Cassandra, you always know how the values are persisted.
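For illustration, a Gora data bean might be declared with an Avro JSON schema along the following lines (the record and field names here are hypothetical, not taken from an existing Gora module):

```json
{
  "type": "record",
  "name": "Pageview",
  "namespace": "org.example.gora.generated",
  "fields": [
    {"name": "url",       "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "ip",        "type": "string"}
  ]
}
```

Gora's compiler turns such a schema into a persistent Java bean, and a per-datastore mapping file then states explicitly where each field lives (e.g. which HBase column family).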
Gora has a modular architecture: each data store supported by Gora has its own module, such as gora-hbase, gora-cassandra, and gora-sql.
Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data store specific mappings and built-in Apache Hadoop support.
The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows:
• Data Persistence: Persisting objects to column stores such as HBase, Cassandra and Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
• Data Access: An easy-to-use, Java-friendly common API for accessing the data regardless of its location.
• Indexing: Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
• Analysis: Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive and Cascading.
• MapReduce support: Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
Gora differs from current solutions in that:
• Gora is specifically focused on NoSQL data stores, but also has limited support for SQL databases.
• The main use case for Gora is to access/analyze big data using Hadoop.
• Gora uses Avro for bean definition, not byte code enhancement or annotations.
• Object-to-data store mappings are backend specific, so that the full data model can be utilized.
• Gora is simple since it ignores complex SQL mappings.
• Gora will support persistence, indexing and analysis of data, using Pig, Lucene, Hive, etc.
Apache Gora currently supports these datastores:
• Apache Accumulo
• Apache Cassandra
• Amazon DynamoDB
• Apache HBase
• Apache Solr
On the other hand, Spark is an Apache project advertised as “lightning fast cluster computing”. It has a thriving open-source community and is the most active Apache project at the moment.
Spark provides a faster and more general data processing platform: it lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open source engine for sorting a petabyte.
There are also some alternatives to Apache Spark, e.g. Apache Tez.
GSoC helps people get involved with open source projects, and this proposal targets issue GORA-386, which aims to develop a Spark backend for Gora.
DEFINITION OF THE PROBLEM
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
Even though Spark is far more powerful than the MapReduce engine Gora currently supports, there is no Spark backend for Gora. This proposal aims to develop a solution for Gora to support Spark.
Gora already has MapReduce support. However, it lacks a generic abstraction layer that would allow substituting other execution engines for MapReduce.
The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.
Original MapReduce executed jobs in a simple but rigid structure: a processing or transform step (“map”), a synchronization step (“shuffle”), and a step to combine results from all the nodes in a cluster (“reduce”). If you wanted to do something complicated, you had to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely. Complex, multi-stage applications were distressingly slow.
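The rigid three-step structure can be illustrated with a toy in-memory word count in plain Java (no Hadoop dependency, purely for illustration), where each phase is made explicit and the reduce phase cannot begin until the shuffle has grouped every pair:

```java
import java.util.*;

public class MapShuffleReduce {

    // "map": emit a (word, 1) pair for every word in every input line
    public static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // "shuffle": group all values by key before any reducer may start
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // "reduce": combine the values of each key into a single result
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("spark is fast", "gora is a framework");
        // prints {a=1, fast=1, framework=1, gora=1, is=2, spark=1}
        System.out.println(reduce(shuffle(map(input))));
    }
}
```

Chaining several such jobs means repeating the whole pipeline, writing intermediate results out and reading them back in, which is exactly the latency problem described above.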
An alternative approach is to let programmers construct complex, multi-step directed acyclic graphs (DAGs) of work that must be done, and to execute those DAGs all at once, not step by step. This eliminates the costly synchronization required by MapReduce and makes applications much easier to build. Prior research on DAG engines includes Dryad, a Microsoft Research project used internally at Microsoft for its Bing search engine and other hosted services.
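A minimal sketch of the DAG idea in plain Java: the whole graph of stages is submitted at once and executed in topological order, so a stage runs as soon as its dependencies finish, with no barrier between separate jobs. The tiny scheduler and stage names below are illustrative only, not Spark's actual scheduler:

```java
import java.util.*;

public class DagScheduler {

    // edges: stage -> stages that depend on its output
    public static List<String> topologicalOrder(Map<String, List<String>> edges) {
        Map<String, Integer> indegree = new HashMap<>();
        edges.keySet().forEach(s -> indegree.putIfAbsent(s, 0));
        edges.values().forEach(ds -> ds.forEach(d -> indegree.merge(d, 1, Integer::sum)));

        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((s, deg) -> { if (deg == 0) ready.add(s); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String stage = ready.poll();
            order.add(stage);                       // "run" the stage here
            for (String next : edges.getOrDefault(stage, List.of()))
                if (indegree.merge(next, -1, Integer::sum) == 0) ready.add(next);
        }
        return order;
    }

    public static void main(String[] args) {
        // read -> filter -> join <- read2 ; join -> write
        Map<String, List<String>> dag = new LinkedHashMap<>();
        dag.put("read",   List.of("filter"));
        dag.put("read2",  List.of("join"));
        dag.put("filter", List.of("join"));
        dag.put("join",   List.of("write"));
        System.out.println(topologicalOrder(dag));
    }
}
```

Note that "read" and "read2" have no dependency on each other, so a real DAG engine would run them concurrently instead of as two sequential jobs.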
Spark builds on those ideas but adds some important innovative features. For example, Spark supports in-memory data sharing across DAGs, so that different jobs can work with the same data at very high speed. It even allows cyclic data flows. As a result, Spark handles iterative graph algorithms (think social network analysis), machine learning and stream processing extremely well. Those have been cumbersome to build on MapReduce and even on other DAG engines. They are very popular applications in the Hadoop ecosystem, so simplicity and performance matter.
To support Spark within Gora, Spark's own data format, the Resilient Distributed Dataset (RDD), should be supported.
The following behaviors are specific to Hadoop's implementation rather than to the idea of MapReduce in the abstract:
• Mappers and Reducers always use key-value pairs as input and output.
• A Reducer reduces values per key only.
• A Mapper or Reducer may emit 0, 1 or more key-value pairs for every input.
• Mappers and Reducers may emit any arbitrary keys or values, not just subsets or transformations of those in the input.
• Mapper and Reducer objects have a lifecycle that spans many map() and reduce() calls. They support a setup() and cleanup() method, which can be used to take actions before or after a batch of records is processed.
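The lifecycle behavior in the last bullet can be sketched with a deliberately simplified, hypothetical mapper interface (not Hadoop's real API), where setup() and cleanup() bracket many map() calls and each map() call may emit zero or more pairs:

```java
import java.util.*;

public class MapperLifecycle {

    // Simplified stand-in for Hadoop's Mapper: one object, many map()
    // calls, with setup()/cleanup() around the whole batch of records.
    public interface SimpleMapper<IN, K, V> {
        default void setup() {}
        void map(IN record, List<Map.Entry<K, V>> out);  // may emit 0..n pairs
        default void cleanup() {}
    }

    public static <IN, K, V> List<Map.Entry<K, V>> runBatch(
            SimpleMapper<IN, K, V> mapper, List<IN> records) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        mapper.setup();                       // once, before any record
        for (IN record : records) mapper.map(record, out);
        mapper.cleanup();                     // once, after the last record
        return out;
    }

    public static void main(String[] args) {
        // A mapper that emits one pair per non-empty record and none otherwise.
        SimpleMapper<String, String, Integer> lengths = (record, out) -> {
            if (!record.isEmpty()) out.add(Map.entry(record, record.length()));
        };
        // prints [gora=4, spark=5] -- the empty record emitted nothing
        System.out.println(runBatch(lengths, List.of("gora", "", "spark")));
    }
}
```

A Spark backend has to reproduce these semantics (key-value pairs, per-key reduction, 0..n emissions, and the setup/cleanup lifecycle) on top of RDD operations for existing Gora MapReduce users to migrate cleanly.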
Suggested Steps for Proposed Method
1) Gora Input Format to RDD Transformation: Gora has its own input/output formats, while Spark uses RDDs. Gora input formats should be transformed into the RDD format.
2) Generic Abstraction Layer Backend: Gora was developed for Hadoop MapReduce and does not have an execution-engine abstraction. The necessary infrastructure should be changed to properly handle a Spark backend.
3) Data Storage via GoraInputmap: Gora's internal classes should be updated so that data can be read/written both as it is done now and in the Spark style.
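The abstraction layer in step 2 could, as a rough sketch, be a small engine-neutral contract that both the existing MapReduce backend and a new Spark backend implement. The names below (GoraEngine, LocalEngine) are hypothetical and do not exist in Gora today; the point is only that client code stays the same while the engine is swapped underneath:

```java
import java.util.*;
import java.util.function.Function;

public class EngineAbstraction {

    // Hypothetical engine-neutral contract: each backend decides how a
    // collection of Gora records is transformed and collected.
    public interface GoraEngine {
        <IN, OUT> List<OUT> run(List<IN> records, Function<IN, OUT> transform);
    }

    // A trivial "local" engine standing in for the MapReduce backend.
    public static class LocalEngine implements GoraEngine {
        public <IN, OUT> List<OUT> run(List<IN> records, Function<IN, OUT> transform) {
            List<OUT> out = new ArrayList<>();
            for (IN r : records) out.add(transform.apply(r));
            return out;
        }
    }

    // A Spark backend would implement the same contract, internally
    // wrapping the records in an RDD instead of a plain Java list.

    public static void main(String[] args) {
        GoraEngine engine = new LocalEngine();   // swappable per backend
        List<Integer> lengths = engine.run(List.of("hbase", "cassandra"), String::length);
        System.out.println(lengths);             // prints [5, 9]
    }
}
```

For step 1 itself, one plausible route (an assumption, to be validated during the project) is to reuse Gora's existing Hadoop input format as the bridge, since Spark can construct RDDs from Hadoop InputFormat implementations.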
SCHEDULE & TIMELINE
The suggested schedule and timeline is as follows:
1) Analyzing The Problem (1 Week)
a) Problem will be analyzed with more detail.
2) Gora Input Format to RDD Transformation (4 weeks)
a) Gora has its own input/output formats and Spark uses RDDs. Gora input formats should be transformed into the RDD format, and GoraInputFormat should be abstracted (2 weeks).
b) The Spark implementation should be built on the new GoraInputFormat (2 weeks).
3) Generic Abstraction Layer Backend (2 weeks)
a) Gora was developed for Hadoop MapReduce and does not have an execution-engine abstraction. The necessary infrastructure should be changed to properly handle a Spark backend.
4) Data Storage via GoraInputmap (3 weeks)
a) Gora’s internal classes should be updated so that data can be read/written both as it is done now and in the Spark style.
5) Test (1 week)
a) Implementation tests will be written and run.
6) Documentation (1 week)
a) Documentation will be prepared.