DUE TO SPAM, SIGN-UP IS DISABLED. Goto Selfserve wiki signup and request an account.
...
For more information on Pig and Spark projects, see hereReferences.
Motivation
The main motivation for enabling Pig to run on Spark is to:
...
Test | Comment | Owner | ETA | Status |
TestJoin | Joining empty tuples fails | Kelly | 1 day | ✔ |
TestPigContext, TestGrunt | PIG-4295. Deserializing UDF classpath config fails in Spark because it’s thread local. | Kelly | 3 days | ✔ |
TestProjectRange | PIG-4297 Range expressions fail with groupby | Kelly | 2 days | ✔ |
TestAssert | FilterConverter issue? | 1 day | ||
TestLocationInPhysicalPlan | 1 day |
Other Tests
ETA: 1 week
Several tests are failing due to either:
...
Investigate
And fix if needed
Feature | Comment | Owner | ETA | Status |
Streaming | Test are passing, but we need confirmation that this feature works. | 1 day |
Spark Unit Tests
Few Spark engine specific unit tests have been written so far (for features that have Spark specific implementations). Following is a partial list of what we need to add. Need to update this list as we add more Spark specific code. We should also add tests for POConverter implementations.
Test | Comment | Owner | ETA | Status |
TestSparkLauncher | ||||
TestSparkPigStats | ||||
TestSecondarySortSpark | Kelly | ✔ |
Enhance Test Infrastructure
...
Performance Optimizations
Feature | Comment | Owner | ETA | Status |
Split / MultiQuery using RDD.cache() | ||||
In Group + Foreach aggregations, use aggregateByKey or reduceByKey for much better performance | For example: COUNT or DISTINCT aggregation inside nested foreach is handled by Pig code. We should use Spark to do in more efficiently | |||
Compute optimal Shuffle Parallelism | Currently we let Spark pick the default | |||
Combiner support for “Group+ForEach” | ||||
Multiple GROUP BYs on the same data set can avoid multiple shuffles. | See MultiQueryPackager | |||
Switch to Kryo for Spark data serialization | Are all Pig serializable classes compatible with Kryo ? | |||
FR Join | ||||
Merge Join (including sparse merge join) | ||||
Skew Join | ||||
Merge CoGroup | ||||
Collected CoGroup |
Note that there is ongoing work in Spark SQL to support specialized joins: See SPARK-2211. As an example, support for merge join is in Spark SQL in Spark 1.4 (SPARK-2213 and SPARK-7165). This implies that Spark community will not be adding support for these joins in Spark Core library.
...