S4 applications can now be deployed on YARN. This provides easy deployment and automatic failover!

S4 integration has been tested with Hadoop/YARN 2.0.2-alpha, the latest release available. The code is currently available in the S4-25 branch and not yet merged into the main branch.


YARN is a refactoring of Hadoop that allows the deployment of various kinds of applications in addition to MapReduce applications. It delegates resource management and job scheduling to separate daemons:

  • a global ResourceManager (RM)
  • a NodeManager (NM) per host

The ResourceManager contains an applications manager and a scheduler. Each application defines an ApplicationMaster, which drives the deployment of the actual application by scheduling resource allocation through the ResourceManager.

Using YARN provides access to computing resources as a service, as well as monitoring and failover capabilities.

See YARN architecture for more information.

Projecting S4 on YARN

We integrated S4 with YARN by preserving the S4 deployment model and projecting it onto YARN.

As usual, S4 nodes are coordinated through ZooKeeper and applications are downloaded by the S4 nodes.

However, instead of relying on scripts or daemons to start S4 nodes, we take advantage of YARN to start them.

YARN automatically restarts failed nodes, which provides automatic failover.

S4 provides a command line client for easily deploying S4 applications on YARN: s4 yarn.
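For instance, a deployment command looks like the following (all values here are placeholders; the flags are the ones used in the walkthrough below):

    s4Dir$ ./s4 yarn -cluster=myCluster -flp=10000 -nbPartitions=2 -s4r=hdfs://localhost:9000/myApp.s4r -zk=localhost:2181 -num_containers=2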

Deployment flow

HDFS is assumed to be the underlying file system.

The S4ApplicationMaster is responsible for finding available resources, preparing the environment and starting S4 nodes.

The S4YarnClient prepares the S4 deployment context by creating a logical cluster with the S4 application metadata.

The following figure illustrates the deployment process of a single application with YARN:

Example: Deploying Twitter example on YARN

This example will guide you through a deployment on a local YARN cluster, but you could also use an existing YARN infrastructure. We will actually deploy two applications: the twitter trending application and the twitter adapter application.


We assume that:

  • you have built an S4 distribution, as well as the s4-yarn project:
    s4Dir$ ./gradlew install -DskipTests
    s4Dir$ ./gradlew s4-yarn:installApp
  • you have a YARN distribution already built. See this tutorial for instance.
  • you have yarn-site.xml, core-site.xml, mapred-site.xml and hdfs-site.xml correctly configured and available in the /path/to/hadoop-2.0.2-alpha/conf directory.
  • in particular, you should make sure the yarn.scheduler.minimum-allocation-mb property is configured correctly, since it is used to compute the maximum number of active applications per user, through the formula max((int)ceil((clusterResourceMemory / minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity), 1). With the default settings on a single machine with less than 10GB of RAM, this gives max((int)ceil((clusterResourceMemory / 1024) * 0.1 * 1), 1) = 1: only one active application!
  • you have built and packaged the twitter-counter and twitter-adapter apps:
    s4Dir$ ./s4 s4r -a=org.apache.s4.example.twitter.TwitterCounterApp -b=`pwd`/test-apps/twitter-counter/build.gradle twitter-counter
    s4Dir$ ./s4 s4r -a=org.apache.s4.example.twitter.TwitterInputAdapter -b=`pwd`/test-apps/twitter-adapter/build.gradle twitter-adapter
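As a sanity check, the capacity formula above can be evaluated numerically. The sketch below uses plain awk with hypothetical values (8192 MB of cluster memory, the default 1024 MB minimum allocation, maxAMResourcePerQueuePercent = 0.1 and absoluteMaxCapacity = 1) and confirms that such a configuration allows only one active application per user:

```shell
awk 'BEGIN {
  # hypothetical single-machine cluster: 8 GB of memory, default minimum allocation
  clusterResourceMemory = 8192
  minimumAllocation = 1024
  maxAMResourcePerQueuePercent = 0.1
  absoluteMaxCapacity = 1
  n = (clusterResourceMemory / minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity
  m = (n == int(n)) ? n : int(n) + 1   # (int)ceil(n)
  if (m < 1) m = 1                     # max(..., 1)
  print m
}'
# prints 1
```

Since this walkthrough deploys two applications, a limit of one active application would leave the second one stuck in the ACCEPTED state, hence the importance of checking this property.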

Start YARN cluster

  1. Initialize shell environment
    cd /path/to/hadoop-2.0.2-alpha
    yarnDir$ export HADOOP_DEV_HOME=`pwd`
    yarnDir$ export YARN_HOME=${HADOOP_DEV_HOME}
    yarnDir$ export HADOOP_CONF_DIR=${HADOOP_DEV_HOME}/conf
    yarnDir$ export YARN_CONF_DIR=${HADOOP_DEV_HOME}/conf
  2. Optionally clean HDFS
    yarnDir$ rm -Rf /private/tmp/hadoop-/
    yarnDir$ bin/hadoop namenode -format
  3. Start YARN daemons
    yarnDir$ sbin/hadoop-daemon.sh start namenode
    yarnDir$ sbin/hadoop-daemon.sh start datanode
    yarnDir$ sbin/yarn-daemon.sh start resourcemanager
    yarnDir$ sbin/yarn-daemon.sh start nodemanager
    yarnDir$ sbin/mr-jobhistory-daemon.sh start historyserver

Prepare environment

  1. Copy S4R files to HDFS (in this example, we just put them at the root of the file system)
    yarnDir$ bin/hadoop fs -copyFromLocal /s4-directory/test-apps/twitter-counter/build/libs/twitter-counter.s4r /
    yarnDir$ bin/hadoop fs -copyFromLocal /s4-directory/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r /
  2. Start a ZooKeeper instance (or reuse one) - you could use a new shell
    s4Dir$ ./s4 zkServer
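To verify that ZooKeeper is up before deploying, you can send it the standard ruok four-letter command (assuming it listens on the default port 2181, and that nc is available):

    s4Dir$ echo ruok | nc localhost 2181
    imok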


Deploy S4 applications

  1. Deploy twitter-counter application
    s4Dir$ export HADOOP_CONF_DIR=/path/to/hadoop-2.0.2-alpha/conf
    s4Dir$ ./s4 yarn -cluster=counter -flp=10000 -nbPartitions=2 -s4r=hdfs://localhost:9000/twitter-counter.s4r -zk=localhost:2181 -num_containers=2
    --> you should see the monitoring information from the S4YarnClient
  2. Deploy twitter-adapter application (new shell)
    s4Dir$ export HADOOP_CONF_DIR=/path/to/hadoop-2.0.2-alpha/conf
    s4Dir$ ./s4 yarn -cluster=adapter -flp=11000 -nbPartitions=1 -s4r=hdfs://localhost:9000/twitter-adapter.s4r -zk=localhost:2181 -num_containers=1 -p=s4.adapter.output.stream=RawStatus,twitter4j.debug=<true|false>,twitter4j.user=<yourUser>,twitter4j.password=<yourPassword>


Applications will go from ACCEPTED state to RUNNING state (that may take a moment). You can check from the shell output or from the web console, typically at http://localhost:8088.
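You can also follow the application state from the command line, assuming the YARN environment configured in the previous sections (the application CLI lists the submitted applications along with their current state):

    yarnDir$ bin/yarn application -list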

If you check the logs of the containers from the web console, you will be able to see the output of the S4 nodes, as in the twitter example from the S4 walkthrough. (You might need to browse a few; the container names are not currently very explicit.) One of the S4 nodes will dump the top 10 topics into a local file, whose path is given in the log with something like: [Timer-TopNTopicPE] INFO o.a.s4.example.twitter.TopNTopicPE - Wrote top 10 topics to file []
You may simply tail -f this file to view the top trending topics (if that's what you really want).
