Giraph implementation of Nutch LinkRank Algorithm - Ahmet Emre Aladağ

Introduction

Current Development

[a] https://github.com/AGMLab/giraph/

[b] https://github.com/AGMLab/giraph/tree/trunk/giraph-examples/src/main/java/org/apache/giraph/examples/LinkRank

[c] https://github.com/AGMLab/giraph/blob/trunk/giraph-examples/src/test/java/org/apache/giraph/examples/LinkRankVertexTest.java

[d] https://issues.apache.org/jira/browse/GIRAPH-584

Problem

The LinkRank scoring mechanism in Apache Nutch 1.x currently works in a pure MapReduce pattern, and Apache Nutch itself is not optimized for graph-processing operations. Score calculation could therefore be more efficient if it were done by a graph-processing library that runs on the Bulk Synchronous Parallel (BSP) model. In addition, Apache Nutch 2.x, whose architecture differs slightly from that of 1.x, lacks LinkRank scoring altogether. So rather than porting LinkRank to the new architecture, a cross-version solution would be preferable.
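To make the Bulk Synchronous Parallel idea concrete, here is a minimal sketch of a vertex-centric rank iteration in plain Python. It is not the Nutch LinkRank code and does not use the Giraph API. The function name `bsp_rank`, the damping factor of 0.85, the fixed number of supersteps and the PageRank-style update rule are all assumptions chosen for illustration. Each pass of the outer loop plays the role of one superstep: every vertex sends a share of its score along its out-links, and new scores are computed only after all messages have been collected.

```python
from collections import defaultdict


def bsp_rank(out_links, supersteps=20, damping=0.85):
    """Illustrative superstep-based rank iteration (PageRank-style).

    out_links maps each page to the list of pages it links to.
    All names and parameter values here are hypothetical.
    """
    # Collect every vertex, including pages that only appear as link targets.
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    n = len(vertices)
    scores = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        # Message phase: each vertex splits its score evenly over its out-links.
        inbox = defaultdict(float)
        for v in vertices:
            targets = out_links.get(v, ())
            if targets:
                share = scores[v] / len(targets)
                for t in targets:
                    inbox[t] += share
        # Barrier: new scores depend only on messages from this superstep.
        scores = {v: (1.0 - damping) / n + damping * inbox[v] for v in vertices}
    return scores


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(bsp_rank(graph, supersteps=100).items()):
        print(page, round(score, 4))
```

In this sketch a page without out-links simply sends no messages, so its score is not redistributed. A real implementation would have to decide how to handle such pages.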

...

|| Stage || Date Range || Task || Status ||
| 1 | 18.03.2013 - 27.03.2013 | Practice crawling with Nutch 1.x and 2.x and searching through SolrCloud | DONE |
| 2 | 28.03.2013 - 01.04.2013 | Practice Hadoop and MapReduce | DONE |
| 3 | 02.04.2013 - 06.04.2013 | Read on WebGraph, LinkRank and the ScoringFilter mechanism | DONE |
| 4 | 07.04.2013 - 15.04.2013 | Write and debug sample scoring plugins for Nutch 1.x and 2.x | DONE |
| 5 | 16.04.2013 - 24.04.2013 | Practice with Giraph. Write sample PageRank code from scratch and modify it | DONE |
| 6 | 06.05.2013 - 01.06.2013 | Run PageRank on sample graphs, practice more with Giraph | DONE |
| | Milestone 1 | Discovering & Learning | |
| 7 | 03.06.2013 - 07.06.2013 | Design graph metadata | DONE |
| 8 | 10.06.2013 - 14.06.2013 | Duplicate Link Removal | DONE |
| 9 | 17.06.2013 - 21.06.2013 | Design input/output pipeline and serialization | DONE |
| 10 | 24.06.2013 - 28.06.2013 | Write tests to make sure it's working properly | DONE |
| | Milestone 2 | Generic LinkRank with Giraph | |
| 11 | 01.07.2013 - 05.07.2013 | Read more on Nutch 1.x plugin mechanism | IN PROGRESS |
| 12 | 08.07.2013 - 11.07.2013 | Write Nutch 1.x proxy plugin | |
| 13 | 15.07.2013 - 19.07.2013 | Test Nutch 1.x - Giraph Integration | |
| 14 | 22.07.2013 - 26.07.2013 | Make sure the original LinkRank and mine produce the same results | |
| | Milestone 3 | Nutch 1.x Integration | |
| 15 | 28.07.2013 - 02.08.2013 | Learn how to use Gora for accessing the scores in Nutch 2.x | |
| 16 | 05.08.2013 - 09.08.2013 | Read more on Nutch 2.x plugin mechanism | |
| 17 | 12.08.2013 - 16.08.2013 | Write Nutch 2.x proxy plugin | |
| 18 | 19.08.2013 - 30.08.2013 | Test Nutch 2.x - Giraph Integration | |
| | Milestone 4 | Nutch 2.x Integration | |
| 19 | 02.09.2013 - 06.09.2013 | Loop Elimination | |
| 20 | 09.09.2013 - 13.09.2013 | Testing Loop Elimination | |
| 21 | 16.09.2013 - 20.09.2013 | Community Testing & Review & Writing Report | |
| 22 | 23.09.2013 | Final Deliverable | |
| | Milestone 5 | Final Milestone | |

Future Work

Improvements on LinkRank: similar and better versions of LinkRank.

 

...

I'm Ahmet Emre Aladağ, a fourth-semester PhD student at Boğaziçi University, Istanbul, Turkey. My research interests are Complex Network Analysis (ranking algorithms, influence networks, information spread, finding the most influential person/page) and Information Retrieval (crawling, search engines, ranking web pages via graph-theoretic measures and pattern recognition methods given implicit feedback). I have taken Complex Networks, Information Retrieval, Artificial Intelligence and Machine Learning courses, which are related to this project.

During my Master's (GPA 4.00), I had one conference publication [1] on Visualization of Protein Interaction Networks and two journal publications in the highly reputable Oxford journal Bioinformatics on clustering, aligning and visualizing protein interaction networks [2] [3]. I have also taken Advanced Algorithms on Graphs and Parallel Programming courses.

I did my (non-GSoC) internship at the Pardus Linux project (which was also involved in GSoC), where I developed a Package-Content Search Engine and a Multi-System Installation system for Pardus Linux. I have been a Linux user and free software contributor since 2006. I have contributed several Django applications and developed open source projects on GitHub/Bitbucket. I have mostly used Python, Java and some C for my projects.

...

Currently, I work for an R&D company, where my role is to develop an efficient and precise ranking algorithm. We will be using Nutch 2.x and its to-be-implemented LinkRank scoring, so the company supports me in contributing to the Nutch community. I will be working on this project during my working hours at the office and also at home. My company and our partners have been contributors to the Nutch project for some years. Moreover, my PhD research area is detecting the most important person/page in a network, so it will be very convenient and enjoyable for me to work on this project. Contributing to a project of the Apache Software Foundation is an honour for me.

[1] A.E. Aladag, C. Erten, M. Sozdinler, "An integrated model for visualizing biclusters from gene expression data and PPI networks", Proc. International Symposium on Biocomputing, no. 24, 2010.

[2] A.E. Aladag, C. Erten, M. Sozdinler, "Reliability Oriented Bioinformatic Networks Visualization", Bioinformatics, vol. 27, pp. 1583-1584, 2011.

[3] A.E. Aladag, C. Erten, "SPINAL: Scalable Protein Interaction Network Alignment", Bioinformatics, vol. 29, pp. 917-924, 2013.