- Q: What should I do if I see "Failed to create DataStorage"?
- Q: How can I pass a specific hadoop configuration parameter to Pig?
- Q: I already registered my LoadFunc/StoreFunc jars with a "register" statement, so why do I still get a "Class Not Found" exception?
- Q: How can I load data using Unicode control characters as delimiters?
- Q: How do I control the number of mappers?
- Q: How do I make my Pig jobs run on a specified number of reducers?
- Q: Can I do a numerical comparison while filtering?
- Q: Does Pig support regular expressions?
- Q: How do I prevent failure if some records don't have the needed number of columns?
- Q: Is there any difference between `==` and `eq` for numeric comparisons?
- Q: Is there an easy way for me to figure out how many rows exist in a dataset from its alias?
- Q: Does Pig allow grouping on expressions?
- Q: Is there a way to check if a map is empty?
- Q: I load data from a directory which contains different files. How do I find out where the data comes from?
- Q: How can I calculate a percentage (partial aggregate / total aggregate)?
- Q: How can I pass a parameter containing spaces to a Pig script?
Q: What should I do if I see "Failed to create DataStorage"?
This usually happens when you connect to a Hadoop cluster that is not running the standard Apache Hadoop 0.20.2 release. Pig bundles the standard Hadoop 0.20.2 jars in its release. To connect to a cluster running a different Hadoop version, replace the bundled Hadoop 0.20.2 jars with compatible jars. You can try the following:
- Run "ant".
- Copy the Hadoop jars from your Hadoop installation over ivy/lib/Pig/hadoop-core-0.20.2.jar and ivy/lib/Pig/hadoop-test-0.20.2.jar.
- Run "ant" again.
- Copy pig.jar over pig-*-core.jar.
Other tricks are also possible. You can use "bin/pig -secretDebugCmd" to inspect the command line that Pig runs, and make sure you are using the right version of Hadoop.
This issue will be solved in Pig 0.9.1 and beyond.
Q: How can I pass a specific hadoop configuration parameter to Pig?
There are several places where you can pass a Hadoop configuration parameter to Pig. They are listed here from high priority to low priority (a setting in a higher-priority place overrides the same setting in a lower-priority place):
1. set command
2. -P properties_file
3. pig.properties
4. Java system property/environment variable
5. Hadoop configuration file (hadoop-site.xml/core-site.xml/hdfs-site.xml/mapred-site.xml) or Pig-specific Hadoop configuration file (pig-cluster-hadoop-site.xml)
Both 3 and 5 require the configuration file to be on the classpath.
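As a rough illustration of the highest-priority option, a `set` statement near the top of a script could look like the sketch below. The property and value are chosen only as an example and are not taken from the original answer.

```pig
-- Assumption: io.sort.mb and the value 200 are illustrative only.
-- A value given with `set` in the script takes precedence over the same key
-- supplied through a properties file or a Hadoop configuration file.
set io.sort.mb 200;
```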
Q: I already registered my LoadFunc/StoreFunc jars with a "register" statement, so why do I still get a "Class Not Found" exception?
Try putting your jars in PIG_CLASSPATH as well. "register" guarantees that your jar is shipped to the backend. In the frontend, however, you still need to put the jars on the classpath by setting the "PIG_CLASSPATH" environment variable.
Q: How can I load data using Unicode control characters as delimiters?
The first parameter to PigStorage is the dataset name; the second is a regular expression that describes the delimiter. Pig uses `String.split(regex, -1)` to extract fields from lines. See `java.util.regex.Pattern` for more information on how to use special characters in a regex.
If you are loading a file which contains Ctrl+A as separators, you can specify this to PigStorage using the Unicode notation.
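For instance, Ctrl+A is the code point U+0001, so a load statement along these lines should work. The file name `input.dat` and the field names are hypothetical.

```pig
-- Assumption: 'input.dat' is a hypothetical file whose fields are separated by Ctrl+A.
-- '\u0001' is the Unicode escape for the Ctrl+A control character.
A = LOAD 'input.dat' USING PigStorage('\u0001') AS (x, y, z);
```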
Q: How do I control the number of mappers?
The number of mappers is determined by your InputFormat. If you are using PigStorage, FileInputFormat allocates at least one mapper for each file. If a file is large, FileInputFormat splits it into smaller chunks. You can control this process with two Hadoop settings: "mapred.min.split.size" and "mapred.max.split.size". In addition, after the InputFormat tells Pig about all the splits, Pig tries to combine small input splits into one mapper. This process can be controlled by "pig.noSplitCombination" and "pig.maxCombinedSplitSize".
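The settings named above can be sketched in a script as follows. All byte values and the input path are assumptions chosen for illustration, not recommendations.

```pig
-- Assumption: all sizes below are illustrative values in bytes.
set mapred.min.split.size 134217728;     -- lower bound of 128 MB per split
set mapred.max.split.size 268435456;     -- upper bound of 256 MB per split
set pig.maxCombinedSplitSize 268435456;  -- combine small splits up to 256 MB per mapper

-- Hypothetical tab-separated input; the number of mappers follows from the splits.
A = LOAD 'input' USING PigStorage('\t');
```

Larger split sizes generally mean fewer mappers, and smaller split sizes mean more.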
Q: How do I make my Pig jobs run on a specified number of reducers?
You can achieve this with the PARALLEL clause. For example:
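A minimal sketch, assuming a hypothetical input file `myfile` with three fields and an arbitrary reducer count of 18:

```pig
-- Assumption: 'myfile' and the count 18 are placeholders.
A = LOAD 'myfile' AS (t, u, v);
-- PARALLEL applies to operators that start a reduce phase, such as GROUP.
B = GROUP A BY t PARALLEL 18;
```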
Besides the PARALLEL clause, you can also use the "set default_parallel" statement in a Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism. Before Pig 0.8, Pig used only 1 reducer if none of these values was set. Since Pig 0.8, the default is no longer 1 but a number calculated by a simple heuristic, as a safeguard.
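The script-wide default can be sketched like this. The file name and reducer counts are assumptions for illustration.

```pig
-- Assumption: 'myfile' and the numbers 20 and 5 are placeholders.
set default_parallel 20;

A = LOAD 'myfile' AS (t, u, v);
B = GROUP A BY t;              -- no PARALLEL clause, so the default of 20 applies
C = ORDER A BY u PARALLEL 5;   -- an explicit PARALLEL clause overrides the default
```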
More details can be found at http://pig.apache.org/docs/r0.9.0/perf.html#parallel.
Q: Can I do a numerical comparison while filtering?
Yes, you can choose between numerical and string comparison. For numerical comparisons use the operators `==`, `!=`, `<`, etc., and for string comparisons use `eq`, `neq`, etc.
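A small sketch of a numerical filter, assuming a hypothetical file `people.txt` with a name and an integer age:

```pig
-- Assumption: 'people.txt' and its schema are hypothetical.
A = LOAD 'people.txt' AS (name:chararray, age:int);
-- Keep records whose age is at least 18 and not equal to 65.
B = FILTER A BY age >= 18 AND age != 65;
```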
Q: Does Pig support regular expressions?
Pig does support regular expression matching via the `matches` keyword. It uses java.util.regex matches which means your pattern has to match the entire string (e.g. if your string is `"hi fred"` and you want to find `"fred"` you have to give a pattern of `".*fred"` not `"fred"`).
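A short sketch of the `matches` keyword in a filter, assuming a hypothetical file `people.txt` with a `name` field:

```pig
-- Assumption: 'people.txt' and its schema are hypothetical.
A = LOAD 'people.txt' AS (name:chararray, age:int);
-- The pattern must cover the whole string: '.*fred' keeps names ending in "fred".
B = FILTER A BY name matches '.*fred';
-- To find "fred" anywhere in the string, pad both sides.
C = FILTER A BY name matches '.*fred.*';
```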
Q: How do I prevent failure if some records don't have the needed number of columns?
You can filter away those records by including the following in your Pig program:
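One possible sketch, assuming a hypothetical tab-separated file `foo` and assuming the built-in `ARITY` function is available in your Pig version (it is an older built-in, so check the documentation for your release):

```pig
-- Assumption: 'foo' is a hypothetical tab-separated input file.
A = LOAD 'foo' USING PigStorage('\t');
-- ARITY(*) returns the number of fields in the record;
-- keeping only records with at least 5 fields drops the short ones.
B = FILTER A BY ARITY(*) >= 5;
```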
This code would drop all records that have fewer than five (5) columns.
Q: Is there any difference between `==` and `eq` for numeric comparisons?
There is no difference when using integers. However, `11.0` and `11` will be equal with `==` but not with `eq`.
Q: Is there an easy way for me to figure out how many rows exist in a dataset from its alias?
You can run the following set of commands, which are equivalent to `SELECT COUNT(*)` in SQL:
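A minimal sketch, assuming a hypothetical input file `mytestfile.txt`:

```pig
-- Assumption: 'mytestfile.txt' is a hypothetical input file.
a = LOAD 'mytestfile.txt';
-- Put every record into a single group so it can be counted as one bag.
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a);
DUMP c;
```

Note that `COUNT` skips records whose first field is null. If those records should be counted too, `COUNT_STAR` may be the better choice in Pig versions that provide it.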
Q: Does Pig allow grouping on expressions?
Pig allows grouping on expressions. For example:
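A minimal sketch, assuming a hypothetical file `mytestfile.txt` with three integer fields:

```pig
-- Assumption: 'mytestfile.txt' and its schema are hypothetical.
a = LOAD 'mytestfile.txt' AS (x:int, y:int, z:int);
-- The group key is the value of the expression x + y.
b = GROUP a BY (x + y);
```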
If the grouping is based on constants, the result is the same as GROUP ALL except the group-id is replaced by the constant.
Q: Is there a way to check if a map is empty?
In Pig 2.0 you can test the existence of values in a map using the null construct:
`m#'key' is not null`
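In a script, this test could be used in a filter as sketched below. The input name `data` and the key name are assumptions.

```pig
-- Assumption: 'data' is a hypothetical input with a single map-typed field m.
A = LOAD 'data' AS (m:map[]);
-- Keep only records whose map has a non-null value for 'key'.
B = FILTER A BY m#'key' IS NOT NULL;
```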
Q: I load data from a directory which contains different files. How do I find out where the data comes from?
You can write a LoadFunc that appends the file name to each tuple you load.
Here is the LoadFunc:
In Pig 0.8/0.9.0/0.9.1, you need to set "pig.splitCombination" to false for PigStorageWithInputPath to work correctly. Pig 0.9.2 fixes the issue.
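Once such a loader is compiled and packaged, using it might look like the sketch below. The jar name, the directory, and the schema (with the file path as the last field) are all assumptions; adjust them to match your own loader.

```pig
-- Assumption: 'myloaders.jar' is a hypothetical jar containing PigStorageWithInputPath.
REGISTER myloaders.jar;
-- Only needed on the affected versions (0.8, 0.9.0, 0.9.1).
set pig.splitCombination 'false';

-- Assumption: the loader appends the input path as the last field of each tuple.
A = LOAD '/directory/of/files/*' USING PigStorageWithInputPath()
    AS (field1:chararray, field2:int, input_path:chararray);
```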
Q: How can I calculate a percentage (partial aggregate / total aggregate)?
The challenge here is to get the total aggregate into the same statement as the partial aggregate. The key is to cast the relation for the total aggregate to a scalar:
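A sketch of this approach, assuming a hypothetical file `sample.txt` with two integer fields and a Pig version that supports using a single-row relation as a scalar:

```pig
-- Assumption: 'sample.txt' and its schema are hypothetical.
A = LOAD 'sample.txt' AS (x:int, y:int);

-- Total aggregate (the denominator): one row holding the sum over all records.
B = GROUP A ALL;
C = FOREACH B GENERATE SUM(A.y) AS total;

-- Partial aggregate per x, divided by the total; C.total is used as a scalar.
D = GROUP A BY x;
E = FOREACH D GENERATE group AS x,
        (double)SUM(A.y) / (double)C.total AS percentage;
```

The casts to `double` avoid integer division, which would otherwise truncate most ratios to 0.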