This is about tracer design where we record data without necessarily sending it to Zipkin. The top-level use cases include metrics, span aggregation, and sampling to different repositories at different rates.

The overall idea is that data is collected locally based on conditions that overlay normal B3 or similar sampling. A hook, possibly named "firehose", is a function of the propagated context and the accumulated data about a span. Regardless of whether that function processes the data, normal reporting still occurs when the normal "sampled" status is true. Some nuance, notably about mutability, is discussed as an overhead-limiting mechanism.

While the context is Brave (the Java tracer), this could apply to similarly designed tracers in any language, provided you ignore subtle idiomatic and runtime-related differences. For example, similar hooks could enable the same use cases.

Relevant existing features


Before discussing this, we should introduce the features in Brave, or similarly designed tracers, that make some of this easy. Without these parts, the later features could be considered cart-before-the-horse.

Trace IDs are always generated by default

Brave generates trace IDs for log correlation even when not sampled (to Zipkin). While it will likely become possible to disable this in the future, it allows for consistent traces even with late sampling decisions. Technically, if two sub-graphs sample, you can still stitch them together on trace ID, at the expense of lost parenting (i.e. a virtual root node).

Extensible TraceContext propagation

Brave instrumentation propagates a TraceContext object which holds primary data (analogous to B3 or traceparent) and has a place called "extra" for extended data. For example, this extended data area is used to carry details such as an aws trace ID or tracestate, which don't fit in the *normal* place (which only holds primary trace identifiers and sampling status).
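As a rough sketch of how that looks in code (builder calls mirror Brave's TraceContext API, though exact method names can vary by version):

    // Primary identifiers and sampling status live in the normal fields
    TraceContext context = TraceContext.newBuilder()
        .traceId(1L).spanId(2L).sampled(true).build();

    // Extended data, such as an AWS trace ID or a tracestate entry, rides along
    // in "extra" instead. It is hidden by default; propagation plugins and
    // opt-in tools look it up by type.
    List<Object> extendedData = context.extra();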

Extensible and composable propagation plugins

Propagation plugins can place private data into a TraceContext when extracting it from headers, and put it back on outbound requests (injecting). This hides data by default, though there are tools folks can opt into to retrieve values. These propagation plugins can also decorate trace contexts (extra data) as they fork within the process, for example to continue propagating data to children or to stop it. Most importantly, you can compose multiple propagation plugins, for example to write two sets of headers simultaneously for the same trace data.
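A short sketch of the composition point, using Brave's stock ExtraFieldPropagation as the composed plugin (the field name shown is illustrative):

    // Compose plugins so B3 headers and an extra propagated field are carried
    // simultaneously for the same trace data.
    Propagation.Factory factory =
        ExtraFieldPropagation.newFactory(B3Propagation.FACTORY, "x-vcap-request-id");

    Tracing tracing = Tracing.newBuilder()
        .propagationFactory(factory)
        .build();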

The above implies that we can design features which overlay the primary sampling status of a trace, including sampling it somewhere else when it is explicitly not chosen for sampling up front. Not all features discussed require these prerequisites, but they are related.

Use cases

1 hour customer support


The initial request for "firehose mode" came from Yelp, Ascend and other sites that wanted to sustainably support 100% sampling for a limited window of time. For example, if retention were limited to an hour and indexing turned off, you could send data to a different Zipkin cluster at full speed.


Here are some example scenarios:
    * functions that act on span data (yelp internal idea)
    * 100% of "skeletal traces", e.g. only RPC spans (ascend internal idea)
    * server side index disablement (yelp and others with splunk or similar) https://github.com/openzipkin/zipkin/issues/1869
    * issue leading directly to this doc: https://github.com/openzipkin/brave/issues/557

Propagation based sampling overlays


At Netflix, there are concurrent, independent tracing systems. Some need to sample at different rates, and some at 100% limited to a number of network hops. By teasing the normal sampling function apart from the primary trace ID, we can support multiple outputs, potentially at different rates for each.


Here are some example scenarios:
    * edge network samples 100% down 3 hops (netflix internal idea)
    * propagated hint samples 5% of a subgraph relating to services (netflix internal idea)

Path-based service aggregation


A future requirement of Netflix (and a prior feature of Skywalking) was to generate not only service links from trace data, but also full paths. For example, you would know all the paths a request takes from a specific edge HTTP route, which is different than knowing two services have traffic between them. For this data to be accurate, it would need to be aggregated at 100% even if only 0.1% is sampled to Zipkin.


Elaboration on path aggregation: https://docs.google.com/document/d/1inII52RwdjcQ1gZnFWGXVE8l9Eo2jilN6x8aTsD_Mpw/edit#
Notes about link aggregation: https://docs.google.com/document/d/1QulozaBhJemNgy4Db8uIc_1ycSLZufLfNeGW8QQ42vg/edit

Skywalking formerly had this feature, but removed it in favor of simple aggregated links. They removed it because customers were unsure what to do with the data, and the dimensionality of the paths was staggering. The cost to perform aggregation was also a consideration (e.g. how many nodes were needed to do processing and keep up with the backlog at a site like Huawei).

Metrics aggregation


Metrics aggregation typically requires 100% of the data for statistics to be meaningful. While certain things can be extrapolated from a sample rate, request data does not follow a normal distribution either. To get valid 99th-percentile or similar figures, especially from low-volume services, collecting data from all requests is helpful.


While metrics APIs are typically used for this, we've had numerous requests to integrate span metrics, more recently in a way that exports to Micrometer. Here are a few issues around this request:

Data manipulation


When integrating with other sorts of systems, certain dimensions should be low-cardinality and consistent. By providing a way to mutate data prior to reporting, we can leverage other tools to sanitize or otherwise clean the data. This can then go downstream consistently into other sinks such as metrics or service aggregation. Here are a few issues around this request:
    * filtering data: https://github.com/spring-cloud/spring-cloud-sleuth/pull/1037
    * filtering and mutating data server side: https://github.com/spring-cloud/spring-cloud-sleuth/pull/1037

Design


The overall design is to allow recording to alternate sinks given a (TraceContext, MutableSpan) pair. This hook is primarily controlled by TraceContext.sampledLocal: when this flag is set, any handler will receive the data.

The primary side effect is that spans that would usually be of the NoOp type, due to sampling being false, can instead be RealSpan types on account of a local decision. This allows them to collect the data needed to pass to the alternate sinks.
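To make the hook shape concrete, here is a minimal sketch; FirehoseHandler and MetricsHandler are illustrative names, not a settled API:

    // Illustrative listener shape: a function of the propagated context and the
    // accumulated span data. It runs regardless of the remote sampling decision;
    // normal Zipkin reporting still happens whenever "sampled" is true.
    interface FirehoseHandler {
      void handle(TraceContext context, MutableSpan span);
    }

    // Example: a metrics sink that only needs the name and duration
    final class MetricsHandler implements FirehoseHandler {
      @Override public void handle(TraceContext context, MutableSpan span) {
        long durationMicros = span.finishTimestamp() - span.startTimestamp();
        // e.g. registry.timer(span.name()).record(durationMicros, MICROSECONDS);
      }
    }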

  DR: Why "sampledLocal"? That name gives me the idea of something happening only locally that is not propagated downstream.
  AC: The TraceContext type is built with a (primary) sampling decision. sampledLocal means something else is listening for data, so record it. It does not imply remote sampling (hence the name), but it does not preclude it either. This supports metrics, for example, while also supporting any other *secondary* trace fields. Propagation implementations can choose to write several headers with different values by looking at other state.
  AC: we will rename sampledLocal and I'll find/replace here

  DR: Also, is the idea to only ever have up to 2 sinks? Such a flag would let you send spans to a 2nd sink (firehose or whatever), but won't let you have 3. Is that something that might be useful? Not sure about this, maybe it's better to go with the simpler version until someone asks for 2+ sinks.
  AC: This doesn't have any constraints on the sink. You can think of sampledLocal as an OR condition across any local sinks that need data. In fact the test I am working on has 3: normal 0.1%, propagated conditional 5%, and metrics 100%. There's nothing about sampledLocal that precludes 100 listeners if we needed them; it means there is at least one listener. It is added the way it is so as not to introduce any more data when there are none, as sampledLocal shares the same internal bit flag with other things.

How to approach recording of data for sinks such as metrics

Data used by consumers like metrics could be added very late in the span lifecycle. For example, the HTTP route is often added much later than when a span is started. There's a dilemma where we need to make a decision before we know for sure whether a consumer will need the data or not.

Given the complexity of this, and that prior art does not attempt to solve it, we can simplify the problem by focusing on limiting overhead. For example, if we reduce the overhead of recording, it may be fine to record data only to drop it later. We should also be able to know for sure when there are guaranteed to be no consumers of span data; in that case, obviously we should not record anything.

Add a type that allows cheap recording and readback


Brave has a mutable model called MutableSpan which provides for readback. Mutability is cheaper, and the internal design of the type is more efficient than zipkin2.Span. It also allows normalization of data, which can ensure any related sinks have consistent lookup keys. It was named "MutableSpan" rather than "SpanData" because it is more important to highlight the impact of mutability than other factors. In-flight recorded spans are held in a mapping named "PendingSpans" because other terms are not appropriate. For example, pending spans are not necessarily started, so terminology about running or unfinished isn't accurate.

This was implemented in https://github.com/openzipkin/brave/pull/736 and later revised in https://github.com/openzipkin/brave/pull/744, clarifying that the MutableSpan type is held while a span is pending, then delivered to a consumer (initially just a normal span reporter).
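A small sketch of what readback means in practice (exact accessor names on MutableSpan may differ by version):

    MutableSpan span = new MutableSpan();
    span.name("get /users/{userId}");
    span.tag("http.route", "/users/{userId}");

    // Unlike a write-only (NoOp) span, a handler can read back what was written,
    // for example to decide whether to keep, aggregate, or drop the data.
    String name = span.name();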

Conditionally recording even when not sampled to Zipkin

Normal sampling means both that headers are propagated and that data is sent out-of-band, for example in Zipkin's format. In order for other consumers to read back data, it needs to be recorded *even if* the operations are not being sampled in the more typical sense. We'll call this "local sampling", again understanding that normal sampling implies local sampling.

  DR: What does readback data mean?
  AC: Reading data written, for example, reading back tags written. Since spans can be NoOp (tosses data), being explicit about reading data written is mildly important

At the end of the day, we need to ensure a span is not in the NoOp state when we have a situation where 0.1% to Zipkin is false, but 5% to XXX is true: we need to overlay TraceContext.sampled with another decision. In order to achieve a consistent decision, attaching a "sampledLocal" decision to the context is simplest. To support both cherry-picking and subgraph models easily, we make this inherited to children by default, but obviously not propagated downstream across network boundaries.

  DR: > we make this inherited to children by default, but obviously not propagated downstream across network boundaries.
I don't understand this. I thought you wanted to propagate an extra header to inform downstream services that they're being "firehose" sampled.
If you want this to be service bound, then you can't do the 5% to XXX you were saying above since every service would pick a different 5% and you'd end up with broken and useless traces.
  AC: This is a design for data recording and listeners. It does not require a specific header propagation design. The 5% for a service group will work at the root or in subgraphs, for example playback services. It indeed does not work for arbitrary groupings for the reason you mention. However, making it possible to do this does not require one to do it wrong.
  AC: added intro section to clarify this

  DR: Imo the options are 2: if we only want to support X-B3-Sampled and 100%, then you don't need to propagate an extra header and you can use this localSampled flag. If you want the ability to say 1% is X-B3-Sampled and 5% is firehose sampled, then you need to propagate a header to inform downstream services of your decision.
    AC: True, though again sampledLocal does not describe in any way extra propagation mechanisms. In brave, for example, there's an "extra" place to store state about propagated values. The 5% would store data about that decision into "extra" and the propagation impl would handle it. I think what is confusing, possibly, is that we aren't trying to simultaneously design a new header system while dealing with the local data flow. We do know a possible header system, but making that standard wouldn't be smart because it should be proven in practice first. In any case, we'd need to know whether to make "real spans" or not.

  DR: One other thing that you cannot do with a localSampled flag is ensure a request is NOT firehose traced. Let's say it's a very important request (or a request you really don't care about), with a header you could set it to 0 to tell downstream services to not firehose sample it.
  AC: We currently don't have a header propagation model in mind for "not firehose traced", or any concept of header propagation for a secondary decision. I don't think this precludes this, provided you understand that currently, in Brave, we already have an additional state area named "extra" which is what implementations look at to decide what to do. This is intentionally why the design is a function2 of (traceContext, mutableSpan) as well, as you would indeed need to look at the context to see if some extension relevant to you is present.

Triggering a sampledLocal decision

The naive approach to signaling a "sampledLocal" decision is a global or scoped flag. For example, a metrics handler's presence could imply trace contexts always have a "sampledLocal" flag set. On the other hand, propagation implementations can also modify this with no changes to existing code.

For example, sampling overlays can be implemented with Propagation.Factory.decorate, where the decoration adds the sampledLocal flag and propagates hints, read later, about where to send the data. Notably, the latter requires the reporting step to be a function of (TraceContext, MutableSpan), as the TraceContext is what holds the propagated info.

One technical question might be: How do you encode potentially multiple sampling decisions? The only place to achieve this is via TraceContext.extra. Technically, you'd make a mapping object of a key to a decision. For example "playback" -> true, "edge" -> false. This object would be placed into the context and read back for the reporting step.
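For illustration, such a mapping could look like the following; SecondaryDecisions is a hypothetical type, not an existing Brave class:

    // Hypothetical state stored in TraceContext.extra: one decision per listener.
    // Propagation.Factory.decorate(...) would attach it; the reporting hook reads
    // it back through the TraceContext it receives.
    final class SecondaryDecisions {
      final Map<String, Boolean> decisions = new LinkedHashMap<>();

      void put(String key, boolean sampled) { decisions.put(key, sampled); }

      boolean sampled(String key) { return Boolean.TRUE.equals(decisions.get(key)); }
    }

    SecondaryDecisions decisions = new SecondaryDecisions();
    decisions.put("playback", true); // e.g. the 5% subgraph overlay said yes
    decisions.put("edge", false);    // e.g. past the edge network's 3-hop limit
    // any "true" entry implies sampledLocal, so a RealSpan records data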

MutableSpan

An extremely common request is to adjust span data later, for example normalizing field names or values (such as http.route). For this reason, the span should be mutable, allowing the first handler to have side effects that result in consistency, without the overhead of copying data.

On the other hand, some outputs will want completely different data formats, intentionally inconsistent with what goes to Zipkin. For this to work, a "clone" method should be present on MutableSpan, to make a cheap copy which won't mutate the source. This can be used to route specific data downstream to other handlers while leaving the source unaffected, for example to report to Zipkin.
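Sketching both behaviors in one handler; clone() refers to the copy method proposed above, and sendToOtherSink is a hypothetical secondary sink:

    void handle(TraceContext context, MutableSpan span) {
      // in-place mutation: later handlers and the Zipkin reporter see the result
      if (span.name() != null) span.name(span.name().toLowerCase()); // normalize names

      // divergent format: work on a cheap copy so the source stays untouched
      MutableSpan copy = span.clone(); // the proposed copy method
      copy.tag("format", "custom");
      sendToOtherSink(copy); // hypothetical secondary sink
    }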


How to handle "skeletal traces"


Skeletal traces, i.e. ones that only include remote data, are indeed a tough thing. The basic process is that you need to buffer all children until the entry point is complete. Once you reach an "exit span" (e.g. client or producer spans), you can re-parent it to its first "entry span" (server or consumer). Doing this as a separate handler can allow longer-term storage to have all remote spans even if the parenting isn't correct. Provided logs are not stained with parent IDs, they should also be intact.


Note: this design is highly theoretical, so it would need a learning test to prove it works.
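A purely illustrative sketch of that buffering follows; isEntrySpan, isExitSpan and reportSkeletal are hypothetical helpers, and the parent ID setter is assumed:

    // Buffer finished spans per trace until the local entry span completes, then
    // re-parent exit spans (client/producer) onto it and drop the rest.
    final Map<String, List<MutableSpan>> byTrace = new LinkedHashMap<>();

    void handle(TraceContext context, MutableSpan span) {
      List<MutableSpan> spans =
          byTrace.computeIfAbsent(context.traceIdString(), id -> new ArrayList<>());
      spans.add(span);
      if (!isEntrySpan(span)) return; // keep buffering until the entry span finishes

      byTrace.remove(context.traceIdString());
      for (MutableSpan child : spans) {
        if (isExitSpan(child)) {
          child.parentId(context.spanIdString()); // assumed setter: re-parent onto the entry span
          reportSkeletal(child);
        }
      }
      reportSkeletal(span); // the entry span itself
    }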

  DR: This seems a slightly different problem then the other use cases.
  AC: agreed

  DR: I'm not sure how brave works, but at least in py_zipkin you could write a firehose_handler that drops all intermediate spans and fixes the parent_span_id for the outgoing span. Only question is: you've already sent the wrong parent_span_id to your downstream, how do you handle the inconsistency in zipkin-server? Will the client's or the server's parent_id win?
  AC: This is a good point, so there are a couple of approaches to not writing the "incorrect parent ID" downstream. One is to not write a parent ID at all. This has the downside of a possibly missing parent in case the client span is lost. It would be better to do that than write different IDs. Regardless, grouping by entry span is common in APMs, and skeletal traces sort of prove that would work. OTOH, we don't have to implement all possible uses of this.

  DR: Another option would be to allow spans to opt-out of firehose tracing by adding an extra argument to the constructor. If `skip_firehose` is set, you don't generate the span object at all. This would prevent you from doing both metrics and skeletal traces at the same time, but since this document is only talking about adding 1 extra sink, you can't do that anyway.
  AC: This follows from my "original sin" of not describing the existing feature in Brave, which is that implementations can place extra state. But maybe it isn't. We usually don't change the API surface area for features that are not fully fleshed out. For example, extra field (baggage) propagation is an extension, not a primary field in the trace context. Internally, it uses this thing called "extra". Back to the firehose thing: because they use the same data, it wouldn't make sense for localSampled false to override sampled = true. We still need the data to satisfy normal existing tracing. So the only case where we look at this is when sampled is not true. If there were a "firehose" implementation, for example metrics, its job is to set that flag. Otherwise it would default to false, and not record anything. In other words, back to the combination of OR decisions above, we only record data when there is at least one consumer of it.
  A secondary source, though, should not be able to make a primary sampling decision of true impossible. Even if we wanted that, it seems a completely different thing, such as Tracing.noop, which is a "break glass" mechanism to ignore sampling, or a propagation decorator to switch sampling back to false. Through extensions, we wouldn't want to break the semantics of the primary sampling mechanism defined in B3, for example see "true" but not send data, as this would break hierarchy.


Prior art and design inputs


Tracing libraries have had decorators or features for collection of data for a fair amount of time. In fact, most APMs always collect data locally, especially to enable aggregations. Here are some key things that influence the design:

Readback of data


The biggest implication of firehose is the ability to read back data. For example, you need to read the name or a tag to decide whether to keep the data or ignore it. For this to work, data needs to be collected and stored in an intermediate model even when not sampled remotely.
 * Census has a readback API called SpanData (it is an accessor that mirrors the setters on Span)
 * OpenTracing has a contributed extension also named SpanData (an immutable data structure)

Chicken/egg problem of recording data


A chicken/egg problem of readback is: how do you know to record data in the first place? The decision to sample or not can sometimes be a function of input data. For example, in Brave there is an HttpSampler which can decide whether a span should be sampled based on the HTTP request line. How do you get that data to a consumer like metrics?
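For example, Brave's HttpSampler sees only the request, before any span data exists; here is a rough sketch (the exact signature has varied across versions):

    // The decision to sample can be a function of input data (the request line),
    // made before anything has been recorded to a span.
    HttpSampler sampler = new HttpSampler() {
      @Override public <Req> Boolean trySample(HttpAdapter<Req, ?> adapter, Req request) {
        String path = adapter.path(request);
        if (path != null && path.startsWith("/health")) return false; // never trace
        return null; // defer to the trace ID sampler
      }
    };
    // A metrics consumer may still want the method and path even when this returns
    // false, which is why recording is decoupled from the remote sampling decision.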


In some cases, a preliminary builder object collects values which can then enter a sampling decision. OpenTracing has no sampling API, so it is implementation-specific how to decide whether or not to collect data, regardless of whether it is sent remotely.

Sidebar on "stateless" aggregation: is it really stateless?


OpenTracing's java-api-extensions library suggests two observer models: stateful and stateless. Incidentally, both imply accumulation of state, as the SpanData present in both models has readback APIs! You'll notice this by looking carefully at SpanObserver and TracerObserver. For example, while MetricsObserver in the java-metrics project indeed doesn't collect intermediate state (even if maybe it should, as the route is usually added late), initial data was collected as SpanData. The default implementation of SpanData (APIExtensionsSpan) caches all data.


The summary is that, at least in OpenTracing's default contrib library, reporting to other sources implies always recording data. This occurs even when edge cases are not expressly addressed (such as late data about HTTP routes).

OpenCensus and its two local flags: sampleToLocalSpanStore and recordEvents


OpenCensus interestingly has two flags which are often set true in tandem: sampleToLocalSpanStore and recordEvents.
Primarily, these support local "agent-like functionality" such as zPages and latency-based heuristics. However, the relationship between the two properties is not entirely clear, and usually they are hard-coded or code-generated to true.

Ingress/Egress only span metrics


In particular, metrics can be derived from spans about inbound and outbound traffic. For example, both OpenCensus and OpenTracing record data by default on RPC spans. It matters less why the data is consumed: OpenCensus setRecordEvents(true) does so for its zPages functionality, and the OpenTracing java-metrics extension does so for metrics.

Dependency graph generation


Skywalking and other APM systems use propagation values to stream service dependency information. Provided the services are identified consistently, this does not require any request-scoped data except the inbound service name. However, this can only provide links, not complete paths of a service call. For example, to tell that an outbound request is related to an inbound request, you'll need to retain parent-child relationships. This implies that a full subgraph needs to be analyzed, including parent/child identifiers. To get a full subgraph, recording state must be inherited from parent to child.

