The Graph Package (Angrapa)
The graph package, called Angrapa, is an large-scale graph data management framework for analytical processing. It is still in heavy development. Angrapa will employ massive parallelism on Hadoop, and It aims to achieve the scalability for processing tera bytes or peta bytes graph data. Angrapa will be used in a variety of scientific and industrial areas, such as data mining, machine learning, information retrieval, bioinformatics, and social networks, required to process large-scale graph data.
The Main Goal
- Easy APIs familiar to graph features
- Storing techniques and the data communication method (i.e., BSP) without deterioration of graph data locality
An Overview of the Angrapa
The architecture of angrapa is similar to that of MapReduce except it is founded on the BSP. It consists of one master and many walkers. One master corresponds to a jobtracker of MapReduce, and many walkers correspond to task trackers. Like MapReduce, angrapa will be carried out on data sets on HDFS or HBase.
For processing, programs written in angrapa will be automatically parallelized and executed on walkers. The processing of angrapa consists of a sequence of parallel supersteps. Each superstep is also subdivided into three ordered phases, consisting of local computation in each walker; communications among the walkers, leading to transfers of intermediate data between walkers; and a barrier synchronization, waiting for all of the communications to complete.
In the local computation phase, each walker polls a vertex v1 - initially, it is given by a user or the program, or it can be obtained from intermediate results transmitted from other walkers after the first superstep. - from a local queue, and it starts to traverse from v1 on graph data that reside in a local storage. During this phase, walker enqueues additional vertices to be processed into the local queue. However, if some inserted vertex v2 do not reside in the local storage, the vertex v2 is removed from the local queue and inserted to the global queue that keeps vertices to be transmitted into other walkers at the next communication phase.
In the communication phase, each walker communicates intermediate data about vertices that reside in local but are necessary to be computed with other vertices that reside in another walkers.
In the barrier synchronization phase, the master controls all of the walkers with zookeepers. If there are no intermediate data to be processed, the program will finish, and it is the end of one superstep. Otherwise, next superstep will start with intermediate data transmitted among walkers.
We assume that given a large-scale graph data and a start vertex, we find all shortest paths from the start vertex to every vertices on the given graph data. In such a case, the program written in angrapa starts from the start vertex. During the local computation phase, each time walker dequeues a vertex from the local queue, each walker enqueues its adjacent vertices with paths from the start vertex to the current dequeued vertex. Then, an enqueued vertex is checked if it resides in the local partition. If not, it is picked to the global queue. (To be described in more detail)
For example, we assume that given a subgraph s, we find a set of subgraphs isomorphic to the given subgraph s. In such a case, angrapa will be very desirable because each walker can independently begin with some start vertex matched to the subgraph s. The program will needs only one superstep if the subgraph s is small enough to be within a partition. (To be described in more detail)Please update http://people.apache.org/~edwardyoon/site/hama_graph_tutorial.html