What is it?

S4 is a general-purpose, near real-time, distributed, decentralized, scalable, event-driven, modular platform that allows programmers to easily implement applications for processing continuous unbounded streams of data.

S4 0.5.0 (aka 'piper') is a complete refactoring of the previous version of S4 released on GitHub.

It builds on the same concepts (partitioning inspired by MapReduce, an actor-like distribution model), but with the following objectives:

cleaner and simpler API
robust configuration through statically defined modules
cleaner architecture
robust codebase
easier to develop S4 apps, to test, and to use the platform

We added the following core features:

TCP-based communications
state recovery through a flexible checkpointing mechanism
inter-cluster/app communications through a pub-sub model
dynamic application deployment
toolset for easily starting S4 nodes, testing, packaging, deploying and monitoring S4 apps

What are the cool features?

Symmetrical deployment:

keys are spread evenly across the cluster, which helps balance the load, especially for fine-grained partitioning
dynamic load balancing can be easily implemented

Modular design:

both the platform and the applications are built by dependency injection, and configured through independent modules.
makes it easy to customize the system according to specific requirements
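
As a rough illustration of configuration through independent modules, here is a minimal plain-Java sketch of dependency injection by hand. Every name in it (Emitter, CommModule, SketchNode, and so on) is hypothetical and chosen for this example; it does not reproduce S4's module API, only the general idea that a component receives its collaborators from a module rather than constructing them itself.

```java
// All types below are hypothetical; they only illustrate module-based wiring.
interface Emitter {
    String send(String event);
}

final class TcpLikeEmitter implements Emitter {
    @Override
    public String send(String event) {
        return "tcp:" + event;
    }
}

final class LoggingEmitter implements Emitter {
    @Override
    public String send(String event) {
        return "log:" + event;
    }
}

// A "module" here is just an object that decides which implementation to provide.
interface CommModule {
    Emitter emitter();
}

final class DefaultCommModule implements CommModule {
    @Override
    public Emitter emitter() {
        return new TcpLikeEmitter();
    }
}

final class DebugCommModule implements CommModule {
    @Override
    public Emitter emitter() {
        return new LoggingEmitter();
    }
}

// The node is handed its emitter by the module instead of creating one itself.
final class SketchNode {
    private final Emitter emitter;

    SketchNode(CommModule module) {
        this.emitter = module.emitter();
    }

    String process(String event) {
        return emitter.send(event);
    }
}

final class ModularDesignSketch {
    public static void main(String[] args) {
        // Swapping the module changes behavior without touching SketchNode.
        System.out.println(new SketchNode(new DefaultCommModule()).process("e1"));
        System.out.println(new SketchNode(new DebugCommModule()).process("e1"));
    }
}
```

Customizing the system then amounts to supplying a different module, which is the property the modular design aims for.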

Dynamic and loose coupling of S4 applications:

through a pub-sub mechanism
makes it easy to:
- assemble subsystems into larger systems
- reuse applications
- separate pre-processing
- provision, control and update subsystems independently
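
The loose coupling described above can be sketched with a tiny in-memory publish-subscribe broker. This is an assumption-laden illustration: StreamBroker and the stream name "cleanedWords" are hypothetical, everything runs in a single process, and S4's actual inter-cluster mechanism is not shown here. The point is only that producer and consumer share a stream name and nothing else.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory broker; real applications would communicate over the network.
final class StreamBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Registers a consumer for a named stream.
    void subscribe(String streamName, Consumer<String> consumer) {
        subscribers.computeIfAbsent(streamName, k -> new ArrayList<>()).add(consumer);
    }

    // Delivers the event to every subscriber and returns how many received it.
    int publish(String streamName, String event) {
        List<Consumer<String>> targets =
                subscribers.getOrDefault(streamName, Collections.<Consumer<String>>emptyList());
        for (Consumer<String> target : targets) {
            target.accept(event);
        }
        return targets.size();
    }
}

final class PubSubSketch {
    public static void main(String[] args) {
        StreamBroker broker = new StreamBroker();
        List<String> received = new ArrayList<>();

        // A downstream "counting" application subscribes by stream name only.
        broker.subscribe("cleanedWords", received::add);

        // A separate "pre-processing" application publishes to the same name.
        broker.publish("cleanedWords", "Hello".toLowerCase());

        System.out.println(received); // prints [hello]
    }
}
```

Because neither side references the other, either application could be replaced, reused, or updated independently as long as the stream name and event format stay the same.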

Fault tolerance:

Fail-over mechanism for high availability
Checkpointing and recovery mechanism for minimizing state loss
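
To make the checkpoint-and-recover idea concrete, the following sketch snapshots a small piece of per-key state with standard Java serialization and restores it afterwards. CounterState and CheckpointSketch are hypothetical names, and the in-memory byte array stands in for whatever storage backend a real deployment would use; this is not S4's checkpointing API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical per-key state, similar in spirit to what a processing element might hold.
final class CounterState implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String key;
    private long count;

    CounterState(String key) {
        this.key = key;
    }

    void increment() {
        count++;
    }

    String key() {
        return key;
    }

    long count() {
        return count;
    }
}

final class CheckpointSketch {
    // Serializes the state into a snapshot (a byte array stands in for durable storage).
    static byte[] checkpoint(CounterState state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds the state from a snapshot, as a replacement node would after a failure.
    static CounterState recover(byte[] snapshot) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (CounterState) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CounterState state = new CounterState("user42");
        state.increment();
        state.increment();
        byte[] snapshot = checkpoint(state);

        state.increment(); // processed after the checkpoint, so lost if the node fails now

        CounterState restored = recover(snapshot);
        System.out.println(restored.key() + " = " + restored.count()); // prints user42 = 2
    }
}
```

The example also shows why checkpointing minimizes rather than eliminates state loss: updates made after the last snapshot are not recovered.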

Pure Java: statically typed, easy to understand, to refactor, and to extend

How does it work?

Some definitions

Platform

S4 provides a distributed runtime platform that handles communication, scheduling and distribution across containers.
Distributed containers are called S4 nodes
S4 nodes are deployed on S4 clusters
S4 clusters define named ensembles of S4 nodes, with a fixed size
The size of an S4 cluster corresponds to the number of logical partitions (sometimes referred to as tasks)

Applications

Users develop applications and deploy them on S4 clusters
Applications are built from:
- Processing elements (PEs)
- Streams that interconnect PEs

PEs communicate asynchronously by sending events on streams.
Events are dispatched to nodes according to their key
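
One common way to dispatch events by key is to hash the key and take it modulo the number of logical partitions, so that all events with the same key land on the same partition. The sketch below assumes exactly that; the source does not specify S4's hashing scheme, and KeyPartitioningSketch and partitionFor are hypothetical names.

```java
final class KeyPartitioningSketch {
    // Maps a key to one of partitionCount logical partitions (assumed hash-modulo scheme).
    static int partitionFor(String key, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int clusterSize = 4; // a fixed cluster size means a fixed number of partitions
        for (String key : new String[] {"apple", "banana", "apple"}) {
            System.out.println(key + " -> partition " + partitionFor(key, clusterSize));
        }
    }
}
```

Under this assumption, both "apple" events are routed to the same partition, which is what lets a single processing element instance accumulate state for that key.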

External streams are a special kind of stream that:

send events outside of the application
receive events from external sources
can interconnect and assemble applications into larger systems.

Adapters are S4 applications that can convert external streams into streams of S4 events. Since adapters are also S4 applications, they can be scaled easily.
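
The role of an adapter can be sketched as a conversion step from raw external input to keyed events. In this illustration the external input is assumed to be text lines of the form "key,payload"; SketchEvent and CsvLineAdapter are hypothetical names, and no S4 classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical event type carrying the key used for dispatching plus a payload.
final class SketchEvent {
    final String key;
    final String payload;

    SketchEvent(String key, String payload) {
        this.key = key;
        this.payload = payload;
    }
}

final class CsvLineAdapter {
    // Converts raw "key,payload" lines into events, skipping lines without a usable key.
    static List<SketchEvent> convert(List<String> rawLines) {
        List<SketchEvent> events = new ArrayList<>();
        for (String line : rawLines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // malformed: no separator
            }
            String key = line.substring(0, comma).trim();
            if (key.isEmpty()) {
                continue; // malformed: empty key
            }
            events.add(new SketchEvent(key, line.substring(comma + 1).trim()));
        }
        return events;
    }

    public static void main(String[] args) {
        List<String> external = Arrays.asList("user42,click", "garbage", "user7, view");
        for (SketchEvent event : convert(external)) {
            System.out.println(event.key + " : " + event.payload);
        }
    }
}
```

Since the conversion is stateless per line, several adapter instances could process disjoint parts of the input in parallel, which is consistent with the scaling remark above.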

A hierarchical perspective on S4

The following diagram sums up the key concepts in a hierarchical fashion:

Where can I find more information?

The website is a good starting point.
The wiki currently contains the most up-to-date information: general information (this page), configuration, examples.
Questions can be asked through the mailing lists
The source code is available through git; here are instructions for fetching the code.
A nice set of slides was used for a presentation at Stanford in November 2011.
The driving ideas are detailed in a conference publication from KDCloud'11 (joint workshop with ICDM'11)
