Shortest Paths Example
We are going to delve into a full Giraph example using the single source shortest paths algorithm. The actual code is part of the examples included in Giraph SimpleShortestPathsVertex. We will also define a very simple input and output format for the graph.
Be aware of the version 0.1 or 0.2 - in the later one the class SimpleShortestPathsVertex does not have a main() method, so you have to run it via the ToolRunner.
First, let’s start with the VertexInputFormat. In this example, we have extended VertexInputFormat to produce our own simple JSON based input format. VertexInputFormat is very similar to the Hadoop InputFormat. The relevant code snippet is below:
The idea is to split the graph into manageable parts to distribute across the Giraph workers. The first thing that happens is that getSplits() is called by the master and then the workers will process the InputSplit objects with the VertexReader to load their portion of the graph into memory. In this example, we use composition to internally use the TextVertexInputFormat to do most work for us of generating one split per input file in the directory. Then, the SimpleShortestPathsVertexReader can read line by line, one vertex per line. We implemented a very simple description of the graph, a json array that contains the vertex id, a vertex value, then a json array of edges (each of which are a json array of destination vertex id and edge value). For example, [2,100,[[3,200]]] specified the vertex id = 2, vertex value = 100, and there is a single edge (destination vertex 3 with edge value of 200). Also, one more thing to note is that we do load the vertex values but will override them in our application on superstep 0.
We also have to have an output format to store our graph. Hence we have the SimpleShortestPathsVertexOutputFormat.
The output format basically does the inverse of the input format, writing the vertices out as json arrays.
The next part of the code we focus on is the actual computation executed by every vertex:
In the superstep 0, all the vertices initiatlize their vertex values to the maximum value (unreachable). Then, the source vertex will propagate the cost of going to its neighbors. In subsequent supersteps, all vertices will propagate the minimum cost of getting to its neightbors until the application is complete. At this point, all the vertex values reflect the cost of going there from the source vertex.
The final bit of code to discuss here is starting up the job:
Implementing the Tool interface allows us to start Hadoop jobs from ToolRunner. The run method sets up the Giraph job with the appropriate vertex class (computation), vertex input / output formats, workers, etc. These configurations can also be set with generic Hadoop options as well. This is not the only way to start up a Hadoop job, but it’s a simple way.
Great! Now we need some input data. You can create your own, or you can download some example data from shortestPathsInputGraph. Untar the example data with the following command:
Then upload the graph to HDFS:
Running the example on a hadoop instance:
Now let’s check the results.