Most of the text and original code on this page come from WritingPluginExample. It has been updated to work with the trunk as of revision 506842, and to add unit testing.
Consider this as a plugin example: we want to be able to recommend specific web pages for given search terms. For this example we'll assume we're indexing this site. As you may have noticed, there are a number of pages that talk about plugins. If someone searches for the term "plugins", we want to recommend that they start at the PluginCentral page, but we also want to return all the normal hits in the expected ranking. We'll separate the search results page into a section of recommendations and a section with the normal search results.
You go through your site and add meta tags to pages, listing the terms they should be recommended for. The tags look something like this:
<meta name="recommended" content="plugins" />
To do this we need to write a plugin that extends three different extension points. We need to extend the HTML parser (HtmlParseFilter) to get the recommended terms out of the meta tags. The IndexingFilter needs to be extended to add a recommended field to the index. The QueryFilter needs to be extended to add the ability to search against the new field in the index.
Start by downloading the Nutch source code. Once you've got that, make sure it compiles as is before you make any changes. You should be able to compile it by running ant from the directory you downloaded the source to. If you have trouble, you can write to one of the Mailing Lists.
Use the source code for the plugins distributed with Nutch as a reference. They're in [!YourCheckoutDir]/src/plugin.
For the example we're going to assume that this plugin is something we want to contribute back to the Nutch community, so we're going to use the directory/package structure of "org/apache/nutch". If you're writing a plugin solely for the use of your organization, you'd want to replace that with something like "org/my_organization/nutch".
You're going to need to create a directory inside the plugin directory with the name of your plugin ('recommended' in this case), and inside that directory you need the following:
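The parsing step can be illustrated outside Nutch. The following is a minimal sketch using only the Java standard library. The class name `RecommendedMetaSketch` and its regular expression are assumptions made for illustration. The actual plugin described on this page relies on the meta tags that Nutch has already parsed, not on a regex.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls the content of a
// <meta name="recommended" content="..."> tag out of an HTML string.
public class RecommendedMetaSketch {

    // Simplification: expects the name attribute before the content attribute.
    private static final Pattern RECOMMENDED = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"recommended\"\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static Optional<String> findRecommendation(String html) {
        Matcher matcher = RECOMMENDED.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name=\"recommended\" content=\"plugins\" />"
                + "</head><body></body></html>";
        System.out.println(findRecommendation(page).orElse("no recommendation"));
    }
}
```

The term found this way is what the indexing and query extensions work with later on.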
A plugin.xml file that tells Nutch about your plugin.
A build.xml file that tells ant how to build your plugin.
The source code of your plugin in the directory structure recommended/src/java/org/apache/nutch/parse/recommended/[Source_Here].
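If you prefer to script the setup, the layout used in this example can be created with a few lines of Java. This is only a sketch: the class name `PluginSkeletonSketch` is hypothetical, and it creates empty plugin.xml and build.xml files for you to fill in with the contents shown on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: creates the empty directory layout
// for the 'recommended' plugin under a given plugin directory.
public class PluginSkeletonSketch {

    static Path createSkeleton(Path pluginDir) {
        try {
            Path root = pluginDir.resolve("recommended");
            // Java sources live under the package path used in this example.
            Files.createDirectories(root.resolve("src/java/org/apache/nutch/parse/recommended"));
            Files.createFile(root.resolve("plugin.xml"));
            Files.createFile(root.resolve("build.xml"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for [YourCheckoutDir]/src/plugin.
        Path pluginDir = Files.createTempDirectory("nutch-plugin-sketch");
        Path root = createSkeleton(pluginDir);
        try (Stream<Path> paths = Files.walk(root)) {
            paths.forEach(System.out::println);
        }
    }
}
```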
Your plugin.xml file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="recommended"
   name="Recommended Parser/Filter"
   version="0.0.1"
   provider-name="nutch.org">

   <runtime>
      <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
      <library name="recommended.jar">
         <export name="*"/>
      </library>
   </runtime>

   <!-- The RecommendedParser extends the HtmlParseFilter to grab the contents of
        any recommended meta tags -->
   <extension id="org.apache.nutch.parse.recommended.recommendedfilter"
              name="Recommended Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RecommendedParser"
                      class="org.apache.nutch.parse.recommended.RecommendedParser"/>
   </extension>

   <!-- The RecommendedIndexer extends the IndexingFilter in order to add the contents
        of the recommended meta tags (as found by the RecommendedParser) to the lucene
        index. -->
   <extension id="org.apache.nutch.parse.recommended.recommendedindexer"
              name="Recommended identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="RecommendedIndexer"
                      class="org.apache.nutch.parse.recommended.RecommendedIndexer"/>
   </extension>

   <!-- The RecommendedQueryFilter gets called when you perform a search. It runs a
        search for the user's query against the recommended fields. In order to add
        this to the list of filters that gets run by default, you have to use
        "fields=DEFAULT". -->
   <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
              name="Recommended Search Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="RecommendedQueryFilter"
                      class="org.apache.nutch.parse.recommended.RecommendedQueryFilter">
         <parameter name="fields" value="recommended"/>
      </implementation>
   </extension>

</plugin>
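Before building, it can help to confirm that plugin.xml is well-formed and declares the extension points you expect. The sketch below does that with the JDK's DOM parser on a trimmed-down copy of the descriptor. The class name `PluginXmlSketch` is hypothetical, and Nutch itself reads the file in its own way.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sanity check, not part of Nutch: lists the "point" attribute
// of every <extension> element in a plugin descriptor.
public class PluginXmlSketch {

    static List<String> extensionPoints(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList extensions = doc.getElementsByTagName("extension");
            List<String> points = new ArrayList<>();
            for (int i = 0; i < extensions.getLength(); i++) {
                points.add(((Element) extensions.item(i)).getAttribute("point"));
            }
            return points;
        } catch (Exception e) {
            // Parsing fails here when the descriptor is not well-formed XML.
            throw new IllegalArgumentException("plugin.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed-down descriptor: only the extension elements are kept.
        String xml = "<plugin id=\"recommended\">"
                + "<extension id=\"a\" point=\"org.apache.nutch.parse.HtmlParseFilter\"/>"
                + "<extension id=\"b\" point=\"org.apache.nutch.indexer.IndexingFilter\"/>"
                + "<extension id=\"c\" point=\"org.apache.nutch.searcher.QueryFilter\"/>"
                + "</plugin>";
        for (String point : extensionPoints(xml)) {
            System.out.println(point);
        }
    }
}
```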
In its simplest form, the build.xml file looks like this:
<?xml version="1.0"?>

<project name="recommended" default="jar">

  <import file="../build-plugin.xml"/>

</project>
For Nutch-1.0 write the following:
<?xml version="1.0"?>

<project name="recommended" default="jar-core">

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-xml"/>
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-xml/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-xml"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
    <ant target="deploy" inheritall="false" dir="../protocol-file"/>
  </target>

  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="data/recommended.html" todir="${build.test}/data"/>

</project>
Save this file in the directory [!YourCheckoutDir]/src/plugin/recommended.
NOTE: Nutch-1.0 users, make sure that you save all your Java files in this directory: C:\nutch-1.0\src\plugin\recommended\src\java\org\apache\nutch\parse\recommended
This is the source code for the HTML Parser extension. It tries to grab the contents of the recommended meta tag and add them to the document being parsed. In the source code directory, create a file called RecommendedParser.java and add this as the contents:
package org.apache.nutch.parse.recommended;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;

// Nutch imports
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.protocol.Content;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// W3C imports
import org.w3c.dom.DocumentFragment;

public class RecommendedParser implements HtmlParseFilter {

  private static final Log LOG = LogFactory.getLog(RecommendedParser.class.getName());

  private Configuration conf;

  /** The Recommended meta data attribute name */
  public static final String META_RECOMMENDED_NAME = "Recommended";

  /**
   * Scan the HTML document looking for a recommended meta tag.
   */
  public Parse filter(Content content, Parse parse,
    HTMLMetaTags metaTags, DocumentFragment doc) {
    // Trying to find the document's recommended term
    String recommendation = null;

    Properties generalMetaTags = metaTags.getGeneralTags();

    for (Enumeration tagNames = generalMetaTags.propertyNames(); tagNames.hasMoreElements(); ) {
      if (tagNames.nextElement().equals("recommended")) {
        recommendation = generalMetaTags.getProperty("recommended");
        LOG.info("Found a Recommendation for " + recommendation);
      }
    }

    if (recommendation == null) {
      LOG.info("No Recommendation");
    } else {
      LOG.info("Adding Recommendation for " + recommendation);
      // Store the term in the content metadata so the indexing filter can read it
      parse.getData().getContentMeta().set(META_RECOMMENDED_NAME, recommendation);
    }

    return parse;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
The following is the code for the Indexing Filter extension. If the document being indexed had a recommended meta tag, this extension adds a lucene text field to the index called "recommended" with the content of that meta tag. Create a file called RecommendedIndexer.java in the source code directory:
package org.apache.nutch.parse.recommended;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Nutch imports
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.parse.Parse;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

// Lucene imports
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Document;

public class RecommendedIndexer implements IndexingFilter {

  public static final Log LOG = LogFactory.getLog(RecommendedIndexer.class.getName());

  private Configuration conf;

  public RecommendedIndexer() {
  }

  public Document filter(Document doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks)
    throws IndexingException {

    // Read the term that RecommendedParser stored in the metadata
    String recommendation = parse.getData().getMeta("Recommended");

    if (recommendation != null) {
      Field recommendedField =
        new Field("recommended", recommendation,
          Field.Store.YES, Field.Index.UN_TOKENIZED);
      recommendedField.setBoost(5.0f);
      doc.add(recommendedField);
      LOG.info("Added " + recommendation + " to the recommended Field");
    }

    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
Note that the field is UN_TOKENIZED because we don't want the recommended tag to be cut up by a tokenizer. Change to TOKENIZED if you want to be able to search on parts of the tag, for example to put multiple recommended terms in one tag.
The QueryFilter gets called when the user does a search. We're bumping up the boost for the recommended field in order to increase its influence on the search results.
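To see why the choice matters, here is a simplified, Lucene-free sketch of the difference. The class `TokenizationSketch` and its splitting rule (lower-case, then split on whitespace and commas) are assumptions for illustration only. A real Lucene analyzer does more than this.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration, not Lucene code: compares the index terms
// produced for a field value with and without tokenization.
public class TokenizationSketch {

    /** Un-tokenized: the whole field value is indexed as a single term. */
    static List<String> untokenizedTerms(String value) {
        return Collections.singletonList(value);
    }

    /** Tokenized (simplified): lower-case, then split on whitespace and commas. */
    static List<String> tokenizedTerms(String value) {
        return Arrays.stream(value.toLowerCase(Locale.ROOT).split("[\\s,]+"))
                .filter(term -> !term.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A meta tag that packs two recommended terms into one value.
        String metaContent = "plugins, extensions";

        // The single un-tokenized term is the full string, so "plugins" alone is not a term.
        System.out.println(untokenizedTerms(metaContent).contains("plugins"));

        // After tokenization each word is its own term.
        System.out.println(tokenizedTerms(metaContent).contains("plugins"));
    }
}
```

With one term per meta tag, as in this example, the un-tokenized field is enough. Several terms in one tag only become individually searchable once the value is tokenized.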
package org.apache.nutch.parse.recommended;

import org.apache.nutch.searcher.FieldQueryFilter;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class RecommendedQueryFilter extends FieldQueryFilter {

  private static final Log LOG = LogFactory.getLog(RecommendedQueryFilter.class.getName());

  public RecommendedQueryFilter() {
    // Search the "recommended" field with a boost of 5
    super("recommended", 5f);
    LOG.info("Added a recommended query");
  }
}
For installing ant on Windows, refer to the ant documentation.
In order to build the plugin, or Nutch itself, you'll need ant. If you're using Mac OS you can easily get it via fink. Let's get junit while we're at it.
fink install ant ant-junit junit
In order to build it, change to your plugin's directory where you saved the build.xml file (probably [!YourCheckoutDir]/src/plugin/recommended), and simply type
ant
Hopefully you'll get a long string of text, followed by a message telling you of a successful build.
In order for ant to compile and deploy your plugin on the global build you need to edit the src/plugin/build.xml file (NOT the build.xml in the root of your checkout directory). You'll see a number of lines that look like
<ant dir="[plugin-name]" target="deploy" />
Edit this block to add a line for your plugin before the </target> tag.
<ant dir="recommended" target="deploy" />
Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl your parser and index filter should get used.
You'll need to run 'ant war' to compile a new ROOT.war file. Once you've deployed that, your query filter should get used when searches are performed.
We'll need to create two files for unit testing: a page we'll do the testing against, and a class to do the testing with. Again, let's assume your plugin directory is [!YourCheckoutDir]/src/plugin and that your test plugin is under that directory. Create the directory recommended/data, and under it make a new file called recommended.html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>recommended</title>
  <meta name="generator" content="TextMate http://macromates.com/">
  <meta name="author" content="Ricardo J. Méndez">
  <meta name="recommended" content="recommended-content"/>
  <!-- Date: 2007-02-12 -->
</head>
<body>
  Recommended meta tag test.
</body>
</html>
This file contains the meta tag we're currently parsing for, with the value recommended-content. After that gratuitous bit of free publicity for my current favorite editor, let's move on to the testing class.
Create a new tree structure, this time for the test code, for example recommended/src/test/org/apache/nutch/parse/recommended/[Test_Source_Here]. There you'll create a file called TestRecommendedParser.java.
package org.apache.nutch.parse.recommended;

import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.protocol.Content;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

import java.io.*;

import junit.framework.TestCase;

/*
 * Loads test page recommended.html and verifies that the recommended
 * meta tag has recommended-content as its value.
 */
public class TestRecommendedParser extends TestCase {

  private static final File testDir =
    new File(System.getProperty("test.data"));

  public void testPages() throws Exception {
    pageTest(new File(testDir, "recommended.html"), "http://foo.com/",
      "recommended-content");
  }

  public void pageTest(File file, String url, String recommendation)
    throws Exception {

    String contentType = "text/html";

    // Read the whole test page into memory
    InputStream in = new FileInputStream(file);
    ByteArrayOutputStream out = new ByteArrayOutputStream((int)file.length());
    byte[] buffer = new byte[1024];
    int i;
    while ((i = in.read(buffer)) != -1) {
      out.write(buffer, 0, i);
    }
    in.close();
    byte[] bytes = out.toByteArray();

    Configuration conf = NutchConfiguration.create();

    Content content =
      new Content(url, url, bytes, contentType, new Metadata(), conf);
    Parse parse = new ParseUtil(conf).parseByExtensionId("parse-html", content);

    Metadata metadata = parse.getData().getContentMeta();
    assertEquals(recommendation, metadata.get("Recommended"));
    // Compare by value: != on strings would only compare object references
    assertFalse("somesillycontent".equals(metadata.get("Recommended")));
  }
}
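As you can see, this code first parses the document, looks for the Recommended entry in the content metadata (which RecommendedParser saved there), and verifies that it is set to the value recommended-content.
Now add some lines to the build.xml file located in the [!YourCheckoutDir]/src/plugin/recommended directory, so that at a minimum its contents are: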
<?xml version="1.0"?>

<project name="recommended" default="jar">

  <import file="../build-plugin.xml"/>

  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="data/recommended.html" todir="${build.test}/data"/>

</project>
These lines will copy the test data to the proper directory for testing.
To run the test case, simply move back to your plugin's root directory and execute
ant test
In order to get Nutch to use your plugin, you need to edit your conf/nutch-site.xml file and add in a block like this:
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin id names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
You'll want to edit the regular expression so that it includes the id of your plugin.
<value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
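A quick way to check an edited expression is to test plugin ids against it with plain Java regular expressions. This sketch assumes that a plugin is included when its whole id matches the plugin.includes value. The class name `PluginIncludesSketch` is hypothetical and the code does not call into Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical check, not part of Nutch: tests whether a plugin id would be
// selected by a plugin.includes expression, assuming a full-string match.
public class PluginIncludesSketch {

    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String before = "nutch-extensionpoints|protocol-http|urlfilter-regex"
                + "|parse-(text|html)|index-basic|query-(basic|site|url)";
        String after = "recommended|protocol-http|urlfilter-regex"
                + "|parse-(text|html|js)|index-basic|query-(basic|site|url)"
                + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

        // The plugin id from plugin.xml is "recommended".
        System.out.println("before: " + isIncluded(before, "recommended"));
        System.out.println("after:  " + isIncluded(after, "recommended"));
        System.out.println("after, parse-html: " + isIncluded(after, "parse-html"));
    }
}
```

The id to test is the one given in the id attribute of the plugin element in plugin.xml.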
<<< See also: HowToContribute
<<< PluginCentral