Giraph implementation of Nutch LinkRank Algorithm

Author

Ikhtiyor Akhmedov

Motivation

LinkRank algorithm implemented in Apache Nutch relies on MapReduce on top of Apache Hadoop, however recent studies show that graph processing on MapReduce can lead sub-optimal performance and usability issues1. Apache Giraph is good option for processing large graphs and instead of implementing PageRank type of algorithms every time for large graphs can be used Apache Giraph. As mentioned in GIRAPH-584 this can save a bit of code for Apache Nutch project also.

Project Description

LinkRank contains several MapReduce steps which should be implemented converted to BSP.

Counter job. Counter job is simplest in implementation. This job will count number of inlinks and outlinks in whole graph(edges), which is bit tricky in terms of BSP. Since, if we put counting as a superstep in compute() process, every node (or vertex) should send number of outlinks to every other node and then all nodes in next superstep should sum all incoming messages. This will create some overhead to network. Instead we can create two separate BSP jobs. First in its superstep all nodes will send number of outlinks and sum combiner and aggregator will be used, which decreases network overhead.

Initializer job. Will initialize nodes with default value for inlink score, depending on how Counter job is implemented this can be first or second supertstep. This superstep is straightforward and all nodes will set their respective default value.

Inverter job. Inverter will invert all outlinks as inlink. Inverter job in BSP consists from 2 supersteps. After finishing Initializer superstep (next jobs will be written as superstep if converted to BSP) 1st superstep of Inverter is sending outlink to its respective destination as a message and remove all outlinks. 2nd superstep of Inverter will get messages and add to its outlinks list (edges).

Analyzer job. Analyzer will be executed as long as it reaches iteration limit or it doesn’t have any outlinks and votes for halt.

Loop detector job. Loop detector is not explicit part of LinkRank job, but users can define whether loops should be considered or not. In that case they will need separate BSP job which detects and eliminates reciprocal links and cycles. Loop detector is also straightforward in Giraph framework, since Giraph is mainly for handling graph processing. Each node will send all outlinks to their neighbors in form of nodelist and in next supersteps incoming messages will be forwarded again to neighbors by adding itself to the nodelist(path) or by voting to halt if they’re already there which means cycle detected. After given iterations which is depth of cycles, all nodes will check their list of incoming messages for inclusion of themselves, if they found match will be included to loopDb list. This is very similar to Semi-Clustering which is described in Google’s Pregel paper, where vertices send semi clusters to neighbors by adding themselves, calculating score and sending to their neighbors high ranked semi clusters. Loop detector can be used for other graph algorithms also.

Implementation details

As implemented in Apache Nutch, we can write below pseudo-code for LinkRank on top of Giraph.

pseudo_linkrank.py

def compute():
 # counting number of links
 if superstep is 0:
   sendMessageToAll(vertex.edgeCount())

 numLinks = 0
 # below superstep should be optimized using aggregators
 if superstep is 1:
   for msg in messages:
     numLinks += msg
   vertex.setInlinkScore(default)

 # inverter 2,3
 elif superstep is 2:
   for e in vertex.edges():
     sendMessageTo(e.destination, e)
   vertex.clearAllEdges()
 elif superstep is 3:
   # get links and add them to edges which inverts whole graph
   for msg in messages:
     vertex.addEdge(msg)

 # analyzer
 elif superstep > 3 and superstep < iterationCount:
   totalInlinkScore = 1.f/numLinks
   if vertex.edgeCount() is 0:
     voteForHalt()
     return
   for e in vertex.edges():
     totalInlinkScore += e.destination.totalInlinkScore
   linkrankScore = (1 - dampingFactor) +
                   (dampingFactor * totalInlinkScore)

   vertex.totalInlinkScore = linkrankScore

Project timeline

The project represents for me a full time commitment (40+ hours/week), not intending to do anything else during the summer (besides school activities until the 10th of June). Project is divided into 2 sub tasks: LinkRank and Loop detector. Timeline created by adding 1-2 days reserve for each step for unexpected cases and hopefully project can finish earlier.

28 May - 10 June.

Prototyping and documenting implementation. Writing first draft of usage for Apache Giraph users.

11 June - 2 July.

Implementation period of each MapReduce jobs with BSP.

3 July - 10 July.

Preparing ground truth for tests and testing on real environment.

11 July - 18 July.

Bug fix and refactoring the code. Starting of profiling for further optimization.

19 July - 26 July.

Optimization of source codes depending on their implementation dependent part and checking for possible optimizations in algorithm dependent part.

27 July - 1 August.

Testing, profiling and bug fixing.

2 August - 16 August.

Working on Loop detector and adding configurations for LinkRank for considering loops.

17 August - 24 August.

Testing and bug fixing.

25 August - 2 September

Preparing for final release. Refactoring. Adding missing documentation for users.

3 September - 23 September

Finding and fixing any missing bugs, mistakes, comments and etc,.

References

1. Pregel: A System for Large-Scale Graph Processing, Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn,.link

2. Processing over a billion edges on Apache Giraph, Avery Ching. link

3. LinkRank.java

4. NUTCH-635

5. GIRAPH-584

Child pages

GSoC 2013 - Giraph Implementation of Nutch LinkRank Algorithm (2)