Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

What is it?

S4 is a general-purpose,near real-time, distributed, decentralized, scalable, event-driven, modular platform that allows programmers to easily implement applications for processing continuous unbounded streams of data.

S4 0.5.0 (aka 'piper') is a complete refactoring of the previous version of S4 released on github.

It grounds on the same concepts (partitioning inspired by map-reduce, actors-like distribution model), but with the following objectives:

  • cleaner and simpler API
  • robust configuration through statically defined modules
  • cleaner architecture
  • robust codebase
  • easier to develop S4 apps, to test, and to use the platform

We added the following core features:

  • TCP-based communications
  • state recovery through a flexible checkpointing mechanism
  • inter-cluster/app communications through a pub-sub model
  • dynamic application deployment
  • toolset for easily starting S4 nodes, testing, packaging, deploying and monitoring S4 apps

What are the cool features?

Symmetrical deployment:

  • keys are homogeneously sparsed over the cluster: helps balance the load, especially for fine grained partitioning
  • dynamic load balancing can be easily implemented

Modular design:

  • both the platform and the applications are built by dependency injection, and configured through independent modules.
  • makes it easy to customize the system according to specific requirements

Dynamic and loose coupling of S4 applications:

  • through a pub-sub mechanism
  • makes it easy to:
    • assemble subsystems into larger systems
    • reuse applications
    • separate pre-processing
    • provision, control and update subsystems independently

Fault tolerant

  • Fail-over mechanism for high availability
  • Checkpointing and recovery mechanism for minimizing state loss

Pure Java: statically typed, easy to understand, to refactor, and to extend

How does it work?

Some definitions

Platform

  • S4 provides a runtime distributed platform that handles communication, scheduling and distribution across containers.
  • Distributed containers are called S4 nodes
  • S4 nodes are deployed on S4 clusters
  • S4 clusters define named ensembles of S4 nodes, with a fixed size
  • The size of an S4 cluster corresponds to the number of logical partitions (sometimes referred to as tasks)

Applications

  • Users develop applications and deploy them on S4 clusters
  • Applications are built from:
    • Processing elements (PEs)
    • Streams that interconnect PEs
  • PEs communicate asynchronously by sending events on streams.
  • Events are dispatched to nodes according to their key

External streams are a special kind of stream that:

  • send events outside of the application
  • receive events from external sources
  • can interconnect and assemble applications into larger systems.

Adapters are S4 applications that can convert external streams into streams of S4 events. Since adapters are also S4 applications, they can be scaled easily.

A hierarchical perspective on S4

The following diagram sums-up the key concepts in a hierarchical fashion:

Where can I find more information?

  • The website is a good starting point.
  • The wiki currently contains the most up-to-date information: general information (this page), configuration, examples.
  • Questions can be asked through the mailing lists
  • The source code is available throught git, here are instructions for fetching the code.
  • A nice set of slides was used for a presentation at Stanford in November 2011.
  • The driving ideas are detailed in a conference publication from KDCloud'11 (joint workshop with ICDM'11)