Documentation

All the documentation of this project lives in Librdfa-rdf4j documentation.

Code

The code produced during GSoC 2018 lives on GitHub, in the ASF repository.

Proposal: ANY23-295  Implement ability to use librdfa

Description

In 2012, the Any23 community decided to migrate from its own RDFa parser implementation to Semargl [1], as discussed in [2]. Semargl is a modular framework for crawling linked data from structured documents [9], and it provides an RDFa parser compatible with RDF4J through an integration module [3]. Since that issue [4] was closed, Semargl has been the official RDFa parser for Any23.

Lewis John McGibbney reopened the discussion by proposing to test librdfa [5], a C/C++ library which claims to be ‘The fastest RDFa processor on the Internet’ and supports RDFa 1.0 and 1.1 in many varieties, such as XML+RDFa and XHTML+RDFa. The idea was to evaluate what kind of performance boost Any23 could achieve in parsing RDFa by using a native parser, and how well librdfa would integrate with Any23.

In this context, the present proposal aims to accomplish that objective and to provide a seamless integration between Any23 and the librdfa parser, making it possible to conduct a fair performance comparison between Semargl and librdfa within Any23.

Student

Julio Caguano

Mentor

Lewis John McGibbney

JIRA Issue

https://issues.apache.org/jira/browse/ANY23-295

Full Proposal

Proposal Title: Integrate and evaluate the librdfa RDFa parser in Any23 via JNI (Java Native Interface) [10].

Student Name: Julio Caguano

Student Email: julio.caguanob@gmail.com

JIRA Issues: https://issues.apache.org/jira/browse/ANY23-295 (Implement ability to use librdfa)

Project Deliverables

  • New standalone module with a new RDFa parser compatible with RDF4J using librdfa.

  • JNI bridge to librdfa including interfaces and middleware utilities.

  • Unit tests for the new librdfa module

  • Benchmark tests comparing Semargl and librdfa.

  • Self-maintaining Any23 website documentation which will visualize integration test results as well as Any23 compliance against the RDFa test suite at http://rdfa.info/test-suite/

Detailed description

Anything to Triples (Any23) is a library, a web service and a command line tool that extracts structured data in RDF format from a variety of Web documents. Currently it supports the following input formats: RDF/XML, Turtle, Notation 3, RDFa ... [6]. As explained in the initial description of this document, the Any23 community would like to test new and potentially more efficient mechanisms for processing data. This proposal specifically covers the RDFa format and how it is parsed within Any23, putting forward the integration and evaluation of a new RDFa parser based on the C/C++ library librdfa. The integration is intended to be done via JNI, using a set of interfaces and middleware utilities which will be documented and evaluated.

Scope for the project


This project covers the implementation of a new RDFa parser for Any23, which serves as a wrapper for the librdfa library. The project also includes an evaluation phase to measure the improvements or drawbacks of using such a parser as the main Any23 RDFa processor.

Design

The implementation will rely on the pre-existing parser infrastructure of Any23, which is provided by RDF4J, and will use JNI as the integration mechanism for librdfa. The development of the project will be divided into three phases.

 

Implementation Approach

Bridge: This phase will tackle the communication issues between Any23 and librdfa and will be mainly focused on:

  • Loading librdfa binaries into Java.

  • Sending data streams from Java to librdfa (Documents’ content).

  • Sending parsing configurations to librdfa (ParserConfig parameters - RDF4J).

  • Handling and throwing exceptions.

  • Retrieving statements (triples) from librdfa to Java.

The translation from Java objects to C structures and vice versa will probably be implemented with Protobuf [7] or a similar tool, but this may change depending on the issues that arise during development.
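As an illustration of the bridge responsibilities listed above, the following is a minimal sketch of what the Java side of a JNI bridge could look like. The class names, the `parseNative` signature and the callback interface are hypothetical and are not taken from librdfa or from the project code; only the `native` keyword and `System.loadLibrary` are standard Java.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback that native code would invoke once per extracted triple. */
interface TripleCallback {
    void onTriple(String subject, String predicate, String object);
}

/** Collects the triples reported through the callback into a list. */
final class TripleCollector implements TripleCallback {
    private final List<String[]> triples = new ArrayList<>();

    @Override
    public void onTriple(String subject, String predicate, String object) {
        triples.add(new String[] {subject, predicate, object});
    }

    List<String[]> triples() {
        return triples;
    }
}

/** Hypothetical Java side of the bridge; the native method is an assumption. */
final class LibrdfaBridgeSketch {
    // Assumed native entry point: it would be implemented in C on top of librdfa
    // and call back into Java for every statement it produces.
    private static native void parseNative(byte[] document, String baseUri, TripleCallback callback);

    /** Tries to load the native library; returns false instead of throwing if it is missing. */
    static boolean tryLoad(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    /** Sends the document bytes to the native side and returns the collected triples. */
    static List<String[]> parse(byte[] document, String baseUri) {
        TripleCollector collector = new TripleCollector();
        parseNative(document, baseUri, collector);
        return collector.triples();
    }
}
```

Calling `parse` only works once a matching native library has been built and loaded; without it the JVM raises `UnsatisfiedLinkError`, which is why the loading step is kept separate and reports failure as a boolean.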

Wrapper: This phase focuses on fulfilling the RDF4J interfaces in order to guarantee compatibility with the existing parsers and other components of Any23. This phase will mainly deal with:

  • Implement the necessary RDF4J interfaces and base classes (e.g. RDFParser, RDFParserFactory).

  • Configure the project to work correctly with SPI.
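The factory-based lookup behind the wrapper phase can be sketched with simplified stand-ins. The interfaces below are hypothetical and are not RDF4J types; they only mirror the idea that a factory advertises a format and hands out parser instances. With SPI, the list of factories would typically come from `java.util.ServiceLoader`, which discovers implementations named in a `META-INF/services/` file whose name is the fully qualified interface name.

```java
import java.util.List;
import java.util.Optional;

/** Simplified stand-in for a parser; not an RDF4J type. */
interface SketchParser {
    String describe();
}

/** Simplified stand-in for a parser factory that advertises the format it handles. */
interface SketchParserFactory {
    String formatName();

    SketchParser newParser();
}

/** Hypothetical factory for a librdfa-backed parser. */
final class LibrdfaFactorySketch implements SketchParserFactory {
    @Override
    public String formatName() {
        return "RDFa";
    }

    @Override
    public SketchParser newParser() {
        // Placeholder parser; a real one would delegate to the native bridge.
        return () -> "librdfa-backed parser (placeholder)";
    }
}

final class ParserLookupSketch {
    /** Returns a parser from the first factory registered for the format, ignoring case. */
    static Optional<SketchParser> parserFor(String format, List<SketchParserFactory> factories) {
        return factories.stream()
                .filter(factory -> factory.formatName().equalsIgnoreCase(format))
                .findFirst()
                .map(SketchParserFactory::newParser);
    }
}
```

Because callers only ask for a format, swapping one RDFa implementation for another would not require changes on the calling side, which is the compatibility property this phase is after.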

Evaluation: This phase will compare the performance of the new parser with respect to the existing one. The main activities to be executed are:

  • Define a dataset of RDFa documents.

  • Measure triples extraction time for Any23 with the existing parser.

  • Measure triples extraction time for Any23 with librdfa.

  • Compare, analyze and share the results.
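The measurement steps above could be prototyped with a small timing harness like the one below. It is only a sketch: the two "parsers" are stand-ins that count `property=` attributes, not Semargl or librdfa, and all names are hypothetical. For real measurements a dedicated harness such as JMH would likely give more reliable numbers.

```java
import java.util.List;
import java.util.function.ToIntFunction;

final class ParserBenchmarkSketch {
    /** Stand-in for triple extraction: counts non-overlapping occurrences of a marker. */
    static int countOccurrences(String text, String needle) {
        int count = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) {
            count++;
        }
        return count;
    }

    /**
     * Runs the parser over all documents for the given number of rounds.
     * Returns {best elapsed nanoseconds, triples reported in the last round}.
     */
    static long[] measure(ToIntFunction<String> parser, List<String> documents, int rounds) {
        if (rounds < 1) {
            throw new IllegalArgumentException("rounds must be at least 1");
        }
        long best = Long.MAX_VALUE;
        long triples = 0;
        for (int round = 0; round < rounds; round++) {
            triples = 0;
            long start = System.nanoTime();
            for (String document : documents) {
                triples += parser.applyAsInt(document);
            }
            best = Math.min(best, System.nanoTime() - start);
        }
        return new long[] {best, triples};
    }

    public static void main(String[] args) {
        List<String> documents = List.of(
                "<p property=\"dc:title\">A</p>",
                "<p property=\"dc:creator\">B</p>");
        // Both stand-ins do the same work; real code would plug in the two parsers.
        ToIntFunction<String> baseline = d -> countOccurrences(d, "property=");
        ToIntFunction<String> candidate = d -> countOccurrences(d, "property=");
        System.out.println("baseline  ns: " + measure(baseline, documents, 5)[0]);
        System.out.println("candidate ns: " + measure(candidate, documents, 5)[0]);
    }
}
```

Taking the best of several rounds reduces the influence of JVM warm-up, and comparing the reported triple counts is a cheap check that both parsers did equivalent work on the same dataset.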

Finally, it is worth mentioning that every component coded during each phase will be accompanied by corresponding documentation in the Wiki [8].


Time Frame

 

Time Period              Expected Outcome

March 01 - April 23      Understanding the task and preparing proposal
April 24 - April 30      Community bonding
May 01 - June 10         Phase 1: Bridge
June 11 - June 15        GSoC Evaluation 1
June 16 - June 29        Phase 2: Wrapper
June 30 - July 8         Phase 3: Evaluation
July 9 - July 13         GSoC Evaluation 2
July 14 - July 25        Camera-ready documentation and sharing results with the community
July 26 - August 5       Receive feedback and fix minor issues
August 6 - August 14     GSoC Final evaluation


About Myself

I am Julio Caguano, an undergraduate student of Computer Science at the University of Cuenca in Ecuador. I’m currently in my final year of college.

I got started with web technologies a couple of years ago during my college courses, where I discovered the Linked Data and Semantic Web initiatives. Since then I have been playing with RDF and SPARQL in some of my college assignments. I would like to deepen my knowledge of these technologies because their importance has been growing steadily in recent years. I also look forward to continuing my studies in these research areas in a postgraduate course.

I consider that I have a pretty good knowledge of the Java language and related technologies (e.g. Maven, SPI). I have also taken some classes related to C/C++ at college and have played with the language on my own. So I feel confident enough to work on this project and meet the proposed goals.

My main motivation for applying to this project is my background in Linked Data technologies: I have used Any23 in the past and I liked how it works. I also found the code pretty comprehensible and readable. On top of that, I have always liked integration challenges. I find them interesting because you have to push yourself out of your comfort zone, learn new technologies and learn how to interact with them.


Commitment

I estimate I could dedicate 25 hours per week to this project during the coding period (including weekends and midweek free time). This could be increased depending on the progress of the project or the suggestions of my mentor. I would split my time between my studies and this project, which hopefully will not be a problem considering that the project takes place at the beginning of my school semester, when the assignment load is small. In addition, I will post a weekly report in the GSoC section of the project's wiki in order to share my progress on the planned tasks.

References

[1] https://github.com/semarglproject/semargl 

[2] http://markmail.org/thread/wn3fxkwozc3zkfqc 

[3] https://github.com/semarglproject/semargl-rdf4j 

[4] https://issues.apache.org/jira/browse/ANY23-137 

[5] https://github.com/rdfa/librdfa 

[6] http://any23.apache.org/ 

[7] https://github.com/google/protobuf 

[8] https://cwiki.apache.org/confluence/display/ANY23/CSoC+2018 

[9] https://github.com/semarglproject/semargl 

[10] https://es.wikipedia.org/wiki/Java_Native_Interface 

Project Reports

29/04/2018

Project description

Get started with the Any23 source code and start working on ANY23-231.

Review of Previous Actions

N/A

Objectives

Currently, the Any23 REST API for JSON has some issues regarding indentation and syntax. Make the JSON reporting output pretty-printed.

Future Actions

Discuss with the community and create a patch for the issue.

Mentors Comments

06/05/2018

Project description

Work on ANY23-231 and add a new output format (JSON-LD) to Any23.

Review of Previous Actions

N/A

Objectives

Submit a PR for ANY23-231: fix formatting issues in the JSON writer and add a JSON-LD writer.

Future Actions

Close ANY23-231.

Mentors Comments

13/05/2018

Project description

Close ANY23-231 and research tools for JNI development.

Review of Previous Actions

N/A

Objectives

Update the Any23 documentation about the JSON-LD format, close ANY23-231 and research JNI tools: SWIG, JavaCPP.

Future Actions

Start working on the Librdfa - Any23 bridge.

Mentors Comments

20/05/2018

Project description

Understand JNI and investigate Maven for build automation.

Review of Previous Actions

ANY23-231 was merged into the development branch.

Objectives

Get a broad understanding of JNI and build a small example to see the Java/C interaction. In addition, JNI requires some commands to be run from the console; these have to be automated with Maven so that they fit into the build pipeline that Any23 already uses.

Future Actions

Build and install librdfa. Choose a tool for the bridge between librdfa and Any23; we can use JNI, JNA, JavaCPP, or SWIG.

Mentors Comments

27/05/2018

Project description

Build and install librdfa.

Review of Previous Actions

N/A

Objectives

Install librdfa and become familiar with the code base. Write a small C program that exercises the pipeline librdfa uses to parse an XHTML file into triples.

Future Actions

Start working on the Librdfa - Any23 bridge.

Mentors Comments

03/06/2018

Project description

Use JNI for communication between librdfa and Any23.

Review of Previous Actions

N/A

Objectives

Work on the communication between librdfa and Any23.

Future Actions

Connect Any23 with librdfa.

Mentors Comments

10/06/2018

Project description

Develop callbacks for the interaction between C (librdfa) and Java.

Review of Previous Actions

N/A

Objectives

Implement basic Java/C interfaces.

Future Actions

Integrate the librdfa bridge with Any23.

Mentors Comments

17/06/2018

Project description

Version of the bridge between Java and C.

Review of Previous Actions

N/A

Objectives

Find a workaround for complex types.

Future Actions

Integrate the code with Maven.

Mentors Comments

24/06/2018

Project description

Integrate the project with the Maven build.

Review of Previous Actions

N/A

Objectives

Set up a build pipeline with Maven and ease the integration with Any23.

Future Actions

Integrate the code with Any23.

Mentors Comments

01/07/2018

Project description

Implement librdfa with Rio

Review of Previous Actions

N/A

Objectives

Use the default API for parsing RDF in RDF4J.

Future Actions

Implement a module for parsing RDFa in RDF4J.

Mentors Comments

08/07/2018

Project description

Implement librdfa with Rio, tests, and benchmarking

Review of Previous Actions

I found some memory issues that I am still working on.

Objectives

Use the default API for parsing RDF in RDF4J.

Future Actions

Integrate the librdfa-rdf4j module with Any23. More tests need to be added and a broader benchmarking analysis is needed. I am using semargl-rdf4j as the baseline.

Mentors Comments

15/07/2018

Project description

Extractor for librdfa

Review of Previous Actions

N/A

Objectives

Integrate Any23 with librdfa-rdf4j.

Future Actions

Write tests for the new Any23 extractor and test all functionality.

Mentors Comments

22/07/2018

Project description

Integration of librdfa-rdf4j with ANY23

Review of Previous Actions

testBasic() is failing and needs to be fixed.

Objectives

Write tests for the new Any23 extractor and test all functionality.

Future Actions

Complete the integration and make the librdfa extractor configurable.

Mentors Comments

29/07/2018

Project description

I fixed a memory problem that I found in librdfa-rdf4j while writing the tests. I also added tests for the RDFa 1.0/1.1 extractors, since librdfa supports both. Finally, the librdfa extractor is now configurable.

Review of Previous Actions

Objectives

Complete the integration and make the librdfa extractor configurable.

Future Actions

Write documentation.

Mentors Comments

05/08/2018

Project description

Write documentation and provide a final PR. The PR includes the librdfa extractor and the bridge between librdfa and Java.

Review of Previous Actions

Objectives

Write documentation and provide final pull request.

Future Actions

Make changes according to mentor suggestions.

Mentors Comments

12/08/2018

Project description

Submit the final changes reviewed by my mentor: corrections to the code and documentation, following my mentor's suggestions, which I will deliver as the result of GSoC.

Review of Previous Actions

Objectives

Make changes according to mentor suggestions.

Future Actions

Mentors Comments






