The Pig tutorial shows you how to run two Pig scripts in local mode and hadoop mode.
The Pig tutorial file (pigtutorial.tar.gz, or the tutorial/pigtutorial.tar.gz file in the Pig distribution) includes the Pig JAR file (pig.jar) and the tutorial files (tutorial.jar, Pig scripts, log files). These files work with Hadoop 0.18 and provide everything you need to run the Pig scripts. To get started, follow these basic steps:
Make sure your run-time environment includes the following:
To install Pig, do the following:
$ tar -xzf pigtutorial.tar.gz
To run the Pig scripts in local mode, do the following:
$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig
$ ls -l script1-local-results.txt
$ cat script1-local-results.txt
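Beyond `cat`, the results file can be inspected programmatically. The following is a minimal Python sketch, not part of the tutorial distribution. It assumes each row is tab-separated (the PigStorage default) and that the columns follow the order of the final FOREACH in Script 1: hour, ngram, score, count, mean. The names `ResultRow`, `parse_results`, and `top_by_score` are hypothetical helpers introduced only for illustration.

```python
from typing import Iterable, List, NamedTuple


class ResultRow(NamedTuple):
    """One output row; the column order is an assumption based on Script 1."""
    hour: str
    ngram: str
    score: float
    count: int
    mean: float


def parse_results(lines: Iterable[str]) -> List[ResultRow]:
    """Parse tab-separated rows of (hour, ngram, score, count, mean)."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        hour, ngram, score, count, mean = line.split("\t")
        rows.append(ResultRow(hour, ngram, float(score), int(count), float(mean)))
    return rows


def top_by_score(rows: List[ResultRow], n: int = 5) -> List[ResultRow]:
    """Return the n rows with the highest score, highest first."""
    return sorted(rows, key=lambda r: r.score, reverse=True)[:n]
```

For example, passing an open handle on script1-local-results.txt to `parse_results` and the result to `top_by_score` would list the highest-scoring phrases, provided the file matches the assumed layout.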
To run the Pig scripts in hadoop (mapreduce) mode, do the following:
$ hadoop fs -copyFromLocal excite.log.bz2 .
$ java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pig
$ hadoop fs -ls script1-hadoop-results
$ hadoop fs -cat 'script1-hadoop-results/*' | less
The contents of the Pig tutorial file (pigtutorial.tar.gz) are described here.
File | Description |
---|---|
pig.jar | Pig JAR file |
tutorial.jar | User-defined functions (UDFs) and Java classes |
script1-local.pig | Pig Script 1, Query Phrase Popularity (local mode) |
script1-hadoop.pig | Pig Script 1, Query Phrase Popularity (Hadoop cluster) |
script2-local.pig | Pig Script 2, Temporal Query Phrase Popularity (local mode) |
script2-hadoop.pig | Pig Script 2, Temporal Query Phrase Popularity (Hadoop cluster) |
excite-small.log | Log file, Excite search engine (local mode) |
excite.log.bz2 | Log file, Excite search engine (Hadoop cluster) |
A better-documented version of script1-local.pig can be found at https://cwiki.apache.org/confluence/download/attachments/27822259/script1-local-with-added-documentation.pig . It includes comments showing samples from each intermediate relation.
The user-defined functions (UDFs) are described here.
UDF | Description |
---|---|
ExtractHour | Extracts the hour from the record. |
NGramGenerator | Composes n-grams from the set of words. |
NonURLDetector | Removes the record if the query field is empty or a URL. |
ScoreGenerator | Calculates a "popularity" score for the n-gram. |
ToLower | Changes the query field to lowercase. |
TutorialUtil | Divides the query string into a set of words. |
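To make the n-gram step concrete, here is a minimal Python sketch of what lowercasing, word splitting, and n-gram composition could look like. It is not the code in tutorial.jar. The whitespace-based splitting and the maximum n-gram size of 2 are assumptions, and `split_words` and `ngrams` are hypothetical names.

```python
from typing import List, Set


def split_words(query: str) -> List[str]:
    """Lowercase the query and split it on whitespace (a simplification)."""
    return query.lower().split()


def ngrams(words: List[str], max_size: int = 2) -> Set[str]:
    """Return every run of 1..max_size consecutive words, joined by spaces.

    max_size=2 is an assumed limit, not taken from the tutorial text.
    """
    result = set()
    for size in range(1, max_size + 1):
        # The last valid start index is len(words) - size.
        for start in range(len(words) - size + 1):
            result.add(" ".join(words[start:start + size]))
    return result
```

Under these assumptions, the query "New York City" yields the unigrams new, york, and city, plus the bigrams "new york" and "york city". Returning a set mirrors the role of FLATTEN followed by DISTINCT in the scripts, where each n-gram counts at most once per user and hour.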
The Query Phrase Popularity script (script1-local.pig or script1-hadoop.pig) processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day.
The script is shown here:
REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
ngramed2 = DISTINCT ngramed1;
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage();
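The grouping, scoring, filtering, and ordering steps of this script can be emulated on in-memory data. The Python sketch below is illustrative only. It assumes the "popularity" score is the number of standard deviations by which an hour's count exceeds the n-gram's mean count across hours, which the text above does not spell out. The function name `score_ngrams` and the use of the population standard deviation are likewise assumptions.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Tuple


def score_ngrams(
    hour_counts: Dict[Tuple[str, str], int], threshold: float = 2.0
) -> List[Tuple[str, str, float, int, float]]:
    """Emulate GROUP BY ngram, scoring, FILTER score > threshold, and ORDER.

    hour_counts maps (ngram, hour) to a count, like hour_frequency2.
    Returns rows of (hour, ngram, score, count, mean).
    """
    by_ngram = defaultdict(list)
    for (ngram, hour), count in hour_counts.items():
        by_ngram[ngram].append((hour, count))

    rows = []
    for ngram, entries in by_ngram.items():
        counts = [count for _, count in entries]
        mean = sum(counts) / len(counts)
        # Population standard deviation (assumed definition).
        sd = sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
        if sd == 0:
            continue  # constant counts: no hour stands out
        for hour, count in entries:
            score = (count - mean) / sd
            if score > threshold:
                rows.append((hour, ngram, score, count, mean))

    # Order by hour, then score, as in the ORDER statement.
    rows.sort(key=lambda r: (r[0], r[2]))
    return rows
```

With this definition an n-gram needs counts for at least six distinct hours before any single hour can score above 2.0, so phrases seen in only a few hours are filtered out.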
The Temporal Query Phrase Popularity script (script2-local.pig or script2-hadoop.pig) processes a search query log file from the Excite search engine and compares the frequency of occurrence of search phrases across two time periods separated by twelve hours.
The script is shown here:
REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
ngramed2 = DISTINCT ngramed1;
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram, $1 as hour, $2 as count;
hour00 = FILTER hour_frequency2 BY hour eq '00';
hour12 = FILTER hour_frequency3 BY hour eq '12';
same = JOIN hour00 BY $0, hour12 BY $0;
same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as ngram, $2 as count00, $5 as count12;
STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
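The final FILTER, JOIN, and FOREACH steps amount to an inner join on the n-gram between the counts for hour 00 and hour 12. A minimal Python sketch of that join follows. The function name `compare_hours` is hypothetical, and the sorted output is a convenience of the sketch, since the Pig script does not order its join result.

```python
from typing import Dict, List, Tuple


def compare_hours(
    hour_counts: Dict[Tuple[str, str], int], first: str = "00", second: str = "12"
) -> List[Tuple[str, int, int]]:
    """Inner-join per-hour n-gram counts for two hours.

    hour_counts maps (ngram, hour) to a count, like hour_frequency2.
    Returns rows of (ngram, count in first hour, count in second hour).
    """
    # Equivalent of the two FILTER statements.
    first_counts = {ng: c for (ng, hour), c in hour_counts.items() if hour == first}
    second_counts = {ng: c for (ng, hour), c in hour_counts.items() if hour == second}

    # Equivalent of JOIN ... BY ngram: keep n-grams present in both hours.
    shared = first_counts.keys() & second_counts.keys()

    # Sorted only to make the output deterministic.
    return [(ng, first_counts[ng], second_counts[ng]) for ng in sorted(shared)]
```

N-grams that appear in only one of the two hours are dropped, which matches inner-join semantics.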