5outline

Overview

This is the top level section for all Flume NG documentation. Flume NG is a refactoring of Flume and was originally tracked in FLUME-728. From the JIRA's description:

To solve certain known issues and limitations, Flume requires a refactoring of some core classes and systems. This bug is a parent issue to track the development of a "Flume NG" - a poorly named, but necessary refactoring. Subtasks should be added to track individual systems and components.

The following known issues are specifically to be addressed:

This is a large and far reaching set of tasks. The intent is to perform this work in a branch as to not disrupt immediate releases or short term forthcoming releases while still allowing open development in the community.

For reference, we refer to the code branch flume-728 (named for the refactoring JIRA) as "Flume NG." We call the current incarnation of Flume "Flume OG" ("original generation" or the slightly funnier definition, "original gangsta") which corresponds to the code branch trunk and that which was previously released under the 0.9.x stream.

Historically, NG code has been worked on by Arvind Prabhakar, Prasad Mujumdar, and E. Sammer (me). Jon Hsieh, Patrick Hunt, and Henry Robinson have provided help in vetting design. Will McQueen has provided usability and correctness testing. Development is obviously open to all and we'd greatly appreciate anyone who wants to jump in and help!

It goes without saying that NG is based on the fantastic work led by Jon Hsieh and all of the other contributors put into Flume (OG).

Architecture

Flume NG's high level architecture solidifies a few concepts from Flume OG and drastically simplifies others. As our goals state, we are focused on a streamlined codebase that meets the common use cases in a "batteries included," easy to use, easy to extend package. Flume NG retains Flume OG's general approach to data transfer and handling (a N:M push model data transport, where N is big and M is significantly smaller).

The major components of the system are:

Data Delivery Semantics

TODO.

Everything below this line is outdated. In the process of updating things now... -esammer

Notes

These are esammer's raw notes while hacking on NG. There's no guarantee they match the code exactly as they're taken in vim during development and then dropped here for reference. Ideally, they wil be refined over time and integrated into a developer handbook for Flume.

Critical Features

Having spoken to a large number of both potential and current Flume users, the following features seem to be the most important (beyond "transfer this data").

Common Use Cases

Route Java application logs to HDFS

"Full event collection"

Known Issues, Limitations, Concerns

...of the NG branch.

Traits

It's super common that sources and sinks share common attributes / behaviors / configuration parameters. To simplify documentation and ensure feature parity across end points, we define the notion of traits. A trait is a set of shared behaviors and related configuration parameters. Traits act like mixins; any end point that wants to adopt common behavior / configuration (and user understanding) can simply reuse a trait. Note that traits do not prescribe a specific implementation strategy (e.g. inheritance, scala style traits, etc.). This is purely a logical convention (at least at this point).

It's legal to implement traits and not support all options (think UnsupportedOperationException). End points are validated prior to configuration being accepted so they have a chance to consider all options. All traits are theoretically orthogonal although, in practice, some may be correlated.

The following traits exist

Groups (client / server trait param)

In the existing Flume config language there's an incongruence in how logical nodes and sources / sinks are presented. This is largely the side effect of a Way-Back Decision(tm) so I won't go into details. Here's the issue.

User defines:

The problem: The user has to know that collector must be hostname2 or the config fails. Autochains were meant to solve this but the notion of flow is so under-documented it's not clear how to separate sets of compatible agents and collectors. Let's use a naming convention people get.

Clients are always in a group. Servers are always in a group. There's no way (in the built in sources / sinks) to communicate to a specific machine from a client; you always specify a group. The result is the same, but I think it's clearer. The inability to specify hosts simplifies the code in that group resolution must always occur (and is a first class feature).

The same config in "groups" (in pseudo-new-config-slash-old-config):

collector-client { group: "a" }; collector-server { port: 12345; group: "a"; } --> someSink; } host B { collector-server { port: 12345; group: "a"; } --> someSink; } ]]>

Groups probably also support a notion of mode. A mode is one of round-robin, fan-out, or least-loaded. This becomes both fan-out and load balancing across active-active collectors.

Catalog of Sources / Sinks

End point

Traits

thrift-agent

agent, server

avro-agent

agent, server

exec-agent

agent

scribe-agent

agent, server

syslog-agent

agent, server

collector-client

client, queue

collector-server

server

fs

file, queue

hdfs

file, queue

hbase

client, queue

batch

queue

Possible filter-ish thing?

Diagrams

TODO.