NOTE: This Wiki is obsolete as of November 2016 and is retained for reference only.


This page documents the design and internals of Spark's Java API and is intended for those developing Spark itself; if you are a user and want to learn to use Spark from Java, please see the Java programming guide.

...

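Transformations like `map` return different RDD types depending on the type of function that's applied to the RDD's elements. Overloading on Scala's Function1 cannot express this: after type erasure, all three overloads below take the same parameter type, so the overloading is ambiguous and won't work: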

    def map[R](f: Function1[T, R]): JavaRDD[R]
    def map(f: Function1[T, Double]): JavaDoubleRDD
    def map[K, V](f: Function1[T, (K, V)]): JavaPairRDD[K, V]
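
To make the erasure problem concrete, the following is a minimal sketch rather than Spark code. The object `ErasureDemo` and its method names are hypothetical. It uses reflection to show that two parameters differing only in their type arguments have the same runtime type, which is why the argument alone cannot select an overload:

```scala
object ErasureDemo {
  // Hypothetical methods whose parameters differ only in type arguments.
  // They have distinct names here so the demo itself stays unambiguous.
  def mapToAny(f: Function1[String, Any]): String = "any"
  def mapToDouble(f: Function1[String, Double]): String = "double"

  // Look up the erased (runtime) type of a method's first parameter.
  def erasedParam(name: String): Class[_] =
    getClass.getMethods.find(_.getName == name).get.getParameterTypes.head

  def main(args: Array[String]): Unit = {
    // Both lines print the same class: the type arguments are gone.
    println(erasedParam("mapToAny"))
    println(erasedParam("mapToDouble"))
  }
}
```

Both lookups report `scala.Function1`, so from the JVM's point of view the three `map` signatures above accept the same argument type.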

Instead, we define a hierarchy of Java function classes that allows methods like `map` to be properly overloaded:

    def map[R](f: Function[T, R]): JavaRDD[R]
    def map[R](f: DoubleFunction[T]): JavaDoubleRDD
    def map[K2, V2](f: PairFunction[T, K2, V2]): JavaPairRDD[K2, V2]
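
The effect of using distinct function classes can be sketched as follows. This is an illustration under assumptions, not Spark's implementation: `MiniFunction`, `MiniDoubleFunction`, `MiniPairFunction`, and `MiniRDD` are hypothetical stand-ins, and each overload returns a string naming the RDD type it would produce instead of building a real RDD.

```scala
object OverloadDemo {
  // Hypothetical stand-ins for the function classes: three unrelated types.
  abstract class MiniFunction[T, R] { def call(t: T): R }
  abstract class MiniDoubleFunction[T] { def call(t: T): Double }
  abstract class MiniPairFunction[T, K, V] { def call(t: T): (K, V) }

  // Hypothetical stand-in for the RDD wrapper. The parameter types stay
  // distinct after erasure, so the three overloads can coexist.
  class MiniRDD[T](val elems: Seq[T]) {
    def map[R](f: MiniFunction[T, R]): String = "JavaRDD"
    def map(f: MiniDoubleFunction[T]): String = "JavaDoubleRDD"
    def map[K2, V2](f: MiniPairFunction[T, K2, V2]): String = "JavaPairRDD"
  }

  val rdd = new MiniRDD(Seq("a", "bb"))

  // The class of the argument alone selects the overload.
  val viaFunction: String =
    rdd.map(new MiniFunction[String, Int] { def call(s: String): Int = s.length })
  val viaDouble: String =
    rdd.map(new MiniDoubleFunction[String] { def call(s: String): Double = s.length.toDouble })
  val viaPair: String =
    rdd.map(new MiniPairFunction[String, String, Int] {
      def call(s: String): (String, Int) = (s, s.length)
    })
}
```

Because the overloads are distinguished by the function's class rather than by its type arguments, a caller never has to disambiguate by hand.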

This works because PairFunction, DoubleFunction, and Function aren't subclasses of each other, so their erased types remain distinct. Instead, they are siblings in the following class hierarchy:

    AbstractFunction1 (scala.runtime)
        WrappedFunction1 (org.apache.spark.api.java.function)
            DoubleFunction (org.apache.spark.api.java.function)
            PairFlatMapFunction (org.apache.spark.api.java.function)
            PairFunction (org.apache.spark.api.java.function)
            DoubleFlatMapFunction (org.apache.spark.api.java.function)
            Function (org.apache.spark.api.java.function)

    AbstractFunction2 (scala.runtime)
        WrappedFunction2 (org.apache.spark.api.java.function)
            Function2 (org.apache.spark.api.java.function)
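
One way such a hierarchy can be wired up is sketched below. It assumes, without confirmation from this page, that the wrapper class exposes an abstract `call` method and forwards Scala's `apply` to it. All `Sketch...` names are hypothetical and do not reproduce the real classes.

```scala
import scala.runtime.AbstractFunction1

object HierarchyDemo {
  // Assumed design: users implement `call`, and `apply` forwards to it,
  // so every subclass is also an ordinary Scala Function1.
  abstract class SketchWrappedFunction1[T, R] extends AbstractFunction1[T, R] {
    def call(t: T): R
    final def apply(t: T): R = call(t)
  }

  // Siblings under the wrapper: neither is a subclass of the other.
  abstract class SketchFunction[T, R] extends SketchWrappedFunction1[T, R]
  abstract class SketchDoubleFunction[T] extends SketchWrappedFunction1[T, Double]

  val strLen = new SketchFunction[String, Int] {
    def call(s: String): Int = s.length
  }

  // The wrapper can be passed wherever a Scala function is expected.
  val lengths: Seq[Int] = Seq("a", "bb", "ccc").map(strLen)
}
```

Extending `AbstractFunction1` is what lets the Scala side pass these objects directly to methods such as `rdd.map`, while sibling subclasses keep the overloads distinguishable.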

...

Here's an excerpt from JavaRDDLike.scala, showing how we generate dummy manifests. In this example, the user-defined PairFunction has keyType() and valueType() methods that return the ClassManifests of its key and value types, K2 and V2. To produce the Tuple2[K2, V2] ClassManifest, we just cast the ClassManifest for AnyRef (Object):

    /**
     * Return a new RDD by applying a function to all elements of this RDD.
     */
    def map[K2, V2](f: PairFunction[T, K2, V2]): JavaPairRDD[K2, V2] = {
      // Dummy manifest: the AnyRef (Object) manifest cast to the tuple type.
      def cm = implicitly[ClassManifest[AnyRef]].asInstanceOf[ClassManifest[Tuple2[K2, V2]]]
      // The key and value manifests come from the user-defined function itself.
      new JavaPairRDD(rdd.map(f)(cm))(f.keyType(), f.valueType())
    }
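
The cast in the excerpt can be tried in isolation. The sketch below is hedged: it uses `ClassTag`, which replaced `ClassManifest` in later Scala versions, and the names `DummyManifestDemo` and `fakeTupleTag` are hypothetical. It shows that the cast succeeds at runtime but that the resulting tag still describes `Object`, not `Tuple2`.

```scala
import scala.reflect.ClassTag

object DummyManifestDemo {
  // Cast the AnyRef tag to the tuple type. Erasure makes the cast
  // unchecked, so it cannot fail at this point.
  def fakeTupleTag[K, V]: ClassTag[(K, V)] =
    implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[(K, V)]]

  val fake: ClassTag[(String, Int)] = fakeTupleTag[String, Int]
  // For comparison, the tag the compiler would materialize on its own.
  val real: ClassTag[(String, Int)] = implicitly[ClassTag[(String, Int)]]

  def main(args: Array[String]): Unit = {
    println(fake.runtimeClass) // class java.lang.Object
    println(real.runtimeClass) // class scala.Tuple2
  }
}
```

Since the dummy tag reports `Object`, any array built from it is an `Object` array. The trick is therefore only reasonable where the consuming code stays generic and never needs the precise element class.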

...