S4 has been accepted into the Apache Incubator program. Forums have moved to the Apache mailing list.

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

motivation

S4 fills the gap between complex proprietary systems and batch-oriented open source computing platforms. We aim to develop a high performance computing platform that hides the complexity inherent in parallel processing system from the application programmer.

implementation

The core platform is written in Java. The drivers to read from and write to the platform can be implemented in any language making it possible to integrate with legacy data sources and systems.

open source

S4 was released by Yahoo! Inc. in October 2010 under the Open Source Apache 2.0 license, which allows the user of the software the freedom to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software, under the terms of the license.

overview

proven

S4 has been deployed in production systems at Yahoo! to process thousands of search queries per second.

decentralized

All nodes are symmetric with no centralized service and no single point of failure. This greatly simplifies deployments and cluster configuration changes.

scalable

Throughput increases linearly as additional nodes are added to the cluster. There is no predefined limit on the number of nodes that can be supported.

extensible

Applications can easily be written and deployed using a simple API. Many basic applications for stream processing are available out of the box and more are being written.

cluster management

S4 hides all cluster management tasks using a communication layer built on top of ZooKeeper, a distributed, open-source coordination service for distributed applications.

partial fault-tolerance

When a server in the cluster fails, a stand-by server is automatically activated to take over the tasks. The state of the server may be lost but quickly recovered from new input data.