Goal

Get together and unravel the synergy and mystique of Zipkin volunteers to build something related to post facto sampling (and something else that we will refine until the day comes)

Date

19-20 May during working hours in CEST (UTC+2)

Location

Sitges, Catalonia (40 min by train from Barcelona center)


Output

We will add notes at the bottom of this document including links to things discussed and any takeaways.

Attendees

This pow-wow is for PPMC and routine contributors (in other words, future committers). It will not be remote-friendly or strictly scheduled, though we may have some ad-hoc video calls.

Attending on-site

  1. Jodrián Culé (Adrian F. Cole)
  2. José Carlos Chávez
  3. Jorg Heymans
  4. Jorge Quilcate
  5. Lance Linder
  6. Bas van Beek (Monday only)

Homework

None yet

Agenda

If any segment is in bold, it will be firmly coordinated for remote folks. Other segments may be loosely scheduled or completely open-ended.

Sunday, May 19

morning: breakfast somewhere in Sitges (possibly Cafeteria Montroig, as it is well rated but not the most popular)
afternoon: work on post facto sampling
evening: explore the town

Monday, May 20


morning: more work on post facto (aka tail, late, after-the-fact) sampling
afternoon: (open)
evening: enjoy the town


Outcomes

TBD


Notes


Intro notes:

Lance at SmartThings

SmartThings in the beginning had a monolithic application which was beginning to be split up. They attempted using some commercial products and ended up working with Zipkin to understand requests. The transport is SQS: we run Zipkin in one region, but SQS reports from many regions. Development is spread across multiple continents and it is not easy to control what data ends up in Zipkin. We are mainly interested in means to control the amount of data; the indexes are currently larger than desired, with common problems such as high-cardinality values (e.g. http.path).


We tell our users how to force a trace when they need something, for example a feature to trigger 100% sampling. We've done things like this, such as capturing certain devices at 100%; in the lower environments everything is sampled at 100%. Some questions come up from familiarity with other tools, like Sumo Logic: people expect those tools to stitch together data in the same way Zipkin does. After-the-fact sampling is interesting because any fixed percentage rate will blow up with certain services. It would be nice to pick some traces for a week, so that people or processes can analyze the data for performance improvements. Before-the-fact sampling can be limiting.
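For reference, a minimal sketch of one way a caller can force a trace through regardless of the configured rate: the B3 debug flag (X-B3-Flags: 1) asks instrumentation to sample and report the request and its children. The URL below is hypothetical, and whether the flag is honored depends on how propagation is configured in the receiving services.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ForceTrace {
      public static void main(String[] args) throws Exception {
        // The B3 debug flag requests that instrumentation sample and report this
        // call (and its children) even if the normal sample rate would drop it.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://device-service.internal/devices/42")) // hypothetical URL
            .header("X-B3-Flags", "1") // B3 debug/force flag
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
      }
    }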


Another area to focus on more is messaging. We don't currently use messaging instrumentation very well; our initial messaging work was Vert.x in our hub. It ends up being consultancy-type work where the first 90% is done by Lance and then the app teams finish it up. We are investigating ways to use Zipkin and SkyWalking together. A couple of the services have hundreds of servers, out of some thousands of servers overall. Licensing fees on a per-host basis are near impossible.

Jorg

Working in an environment with a lot of directorates, which translate into silos with duplication. There are efforts to share work, and this translates into microservices architectures. Most of the applications are WebLogic, and we used to FTP server logs around to understand what is going on. We are in the early stages of an observability approach; this includes some demonstrations of Zipkin diagrams. Currently working with Brave instrumentation, despite a consultant initially recommending OpenTracing. Jorg is working on internal evangelism and rolling out instrumentation such as JMS. He is interested in driving observability from the instrumentation point of view, in Brave. The backend is Elasticsearch with HTTP transport until we get to a level where this is a problem. At that point it would likely be Kafka, as internally there is a movement from WebLogic queuing towards Kafka anyway.

Jorge

Working in consultancy for 10 years, half of that in Norway. How to introduce tracing with a service bus was my first project. Two of my current projects are Kafka related. One is a company that deals with the collection and transport of energy-related data. We are using Elasticsearch as a backend with Kafka (Confluent open source) as the transport (deployed in OpenShift). The current challenge is storage: at 1% sampling we see about 5-10k spans/second, and to see the value we need to get more data. The other project is gambling related; it is in very early stages of tracing.


We may want to be able to collect something like "last trace per device", similar to a gauge as used for last-latency recordings.


People are used to Splunk, and they prefer to use it for as much as they can. It is sometimes hard to offer alternatives to the convenience of having everything in Splunk.


The initial problem with introducing tracing was asking for a resource; for example, it could be easier to use an existing Kafka service than to provision another storage service. The operational problem was about where in the pipeline a process failed. The first work towards that used Kafka Streams, which is working well. A good way to show value could be to show anomaly detection as a part of the pipeline.


José Carlos

Works at Typeform, which is an online forms SaaS. Started using Zipkin as we wanted to understand requests after splitting the monolith. Right now we have about 15 services, most in Go, some Node.js and PHP. We primarily look at bottlenecks and sources of latency; other times it is for production debugging, to understand some failed request. One challenge faced was the lack of PHP and Go instrumentation (now we have a decent amount). We are still lacking messaging instrumentation: we have SNS in front of SQS, and need Go instrumentation for this. Some of the libraries are not instrumentation friendly. We have a fixed sampling rate of 30%, but then a special flag related to VPN access which is 100% sampled. Sometimes we cannot catch everything we want. Another trouble is how to control the sampling rate, e.g. an expression language for sample rates based on endpoint, environment, etc. After-the-fact sampling would also be interesting as it could reduce the amount of uninteresting data we have. Currently two employees contribute to zipkin-go, plus sporadic contributions to other Zipkin libraries. 100% sampling without analysis probably makes less sense unless it is subject to some bounds like an A/B test.
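As a thought experiment only, not an existing Zipkin or Brave feature: a minimal sketch of what rule-based sampling by endpoint and environment could look like. The class name, rules, and rates below are made up for illustration; the first matching rule wins and anything unmatched falls back to the fixed rate already in use.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical rule-based sampler keyed on (environment, path prefix).
    public class RuleBasedSampler {
      private final Map<String, Float> rules = new LinkedHashMap<>();
      private final float defaultRate;

      public RuleBasedSampler(float defaultRate) {
        this.defaultRate = defaultRate;
      }

      public RuleBasedSampler putRule(String environment, String pathPrefix, float rate) {
        rules.put(environment + "|" + pathPrefix, rate);
        return this;
      }

      public boolean isSampled(String environment, String path) {
        for (Map.Entry<String, Float> rule : rules.entrySet()) {
          String[] parts = rule.getKey().split("\\|", 2);
          if (parts[0].equals(environment) && path.startsWith(parts[1])) {
            return ThreadLocalRandom.current().nextFloat() < rule.getValue();
          }
        }
        return ThreadLocalRandom.current().nextFloat() < defaultRate;
      }

      public static void main(String[] args) {
        RuleBasedSampler sampler = new RuleBasedSampler(0.3f) // 30% default, as used today
            .putRule("production", "/forms", 0.05f)           // noisy endpoint: sample less
            .putRule("staging", "/", 1.0f);                    // keep everything in staging
        System.out.println(sampler.isSampled("production", "/forms/abc123"));
      }
    }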


When it comes to payments we have very large latency, e.g. 8 seconds. This is an area of focus. We have one library specific to Stripe payloads that can add annotations based on the returned service call.


Indexing discussion:

There is no problem with ingest at SmartThings; it is more about the amount of data in the indexes and the point where search response times degrade. Due to the nature of compactions, you may need 50% space available.


There was a question about what the largest Cassandra site is. It might be SmartThings, Yelp, or Criteo, depending on whether Criteo is still a Zipkin site. The largest open source tracing site is probably Expedia, even though they aren't using Zipkin.


The index size tends to be the largest source of data problems. With a streaming intermediary and access to the sources of HTTP queries, it could be possible to have an adaptive indexing approach. This doesn't imply we have to solve that on our own.


Kafka Streaming Storage

Lucene and Kafka are both disk-based storage, though in Kafka Streams you can use RocksDB or in-memory state stores. Jorge chose Lucene to satisfy the query API, as it wasn't easy to do searches except by trace ID natively in Kafka. One constraint is that everything is on one node. In Kafka Streams we could do one instance per partition; however, if we did this, a search would have to query across instances. So this uses a global store, which means each node has a full copy of the data. Another challenge was the fact that batching isn't based on trace ID. This added more load to the broker versus if spans were sent from Brave keyed on trace ID instead. This made me wonder if we can't use the batching built into the Kafka client, which is partition aware.
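A minimal sketch, not the actual storage code, of re-keying spans by trace ID with Kafka Streams so that downstream partitioning and client batching line up with traces; the topic names and the trace-ID extraction are assumptions.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;

    public class SpansByTraceId {
      public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // "zipkin-spans" and "zipkin-spans-by-trace" are assumed topic names.
        KStream<String, byte[]> spans =
            builder.stream("zipkin-spans", Consumed.with(Serdes.String(), Serdes.ByteArray()));
        // Re-key each record by its trace ID so all spans of a trace land on the
        // same partition, letting per-partition state stores replace a global store.
        spans.selectKey((key, value) -> extractTraceId(value))
            .to("zipkin-spans-by-trace", Produced.with(Serdes.String(), Serdes.ByteArray()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "zipkin-span-rekey");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
      }

      // Placeholder: a real implementation would decode the span (JSON or proto)
      // and return its trace ID.
      static String extractTraceId(byte[] encodedSpan) {
        return "0000000000000001";
      }
    }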


Some ideas are to make a pure Kafka reporter which only serializes the message and uses async Kafka code. Another is making the generic reporter traceId/localRoot aware.
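A rough sketch of the first idea, assuming spans arrive already encoded: a reporter that sends each span keyed by trace ID, so the Kafka client's built-in partition-aware batching groups a trace's spans without extra broker load. The class and topic names are illustrative, not an existing zipkin-reporter module.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Illustrative trace-ID-keyed reporter: the producer batches per partition
    // asynchronously, so keying by trace ID keeps a trace's spans together.
    public class TraceIdKeyedReporter implements AutoCloseable {
      private final KafkaProducer<String, byte[]> producer;
      private final String topic;

      public TraceIdKeyedReporter(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50); // let the client batch a little
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
      }

      public void report(String traceId, byte[] encodedSpan) {
        // Async send; the callback only logs failures so reporting never blocks the app.
        producer.send(new ProducerRecord<>(topic, traceId, encodedSpan), (metadata, error) -> {
          if (error != null) System.err.println("span send failed: " + error);
        });
      }

      @Override public void close() {
        producer.close();
      }
    }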


Canary analysis

If you do analysis purely on metrics, you may not catch problems in one service caused by another, for example a 400 with an empty body that makes another service fail. Incorporating trace data could make it easier to tell second-order effects. One cheap win could be to include a trace query that matches the tag associated with the canary and "error"; this would pick up on any error in the stack. https://github.com/spinnaker/kayenta
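A minimal sketch of such a query against the Zipkin v2 API; the host and the canary tag value are hypothetical, and "error" on its own matches any span carrying an error tag.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class CanaryErrorQuery {
      public static void main(String[] args) throws Exception {
        // Matches traces that contain both an "error" tag and the hypothetical canary tag.
        String annotationQuery = URLEncoder.encode("error and canary=run-42", StandardCharsets.UTF_8);
        URI uri = URI.create("http://zipkin.internal:9411/api/v2/traces?annotationQuery="
            + annotationQuery + "&lookback=900000&limit=100"); // last 15 minutes, up to 100 traces
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());
        // Any non-empty result means at least one error somewhere in the canary's call stack.
        System.out.println(response.body());
      }
    }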


Notes will run below here as the day progresses. Please don't write anything that shouldn't be publicly visible!

Messaging abstraction

Which transports are important for us to discuss, in terms of diversity?

  • Kafka
  • SQS
  • Kinesis - SmartThings
  • RabbitMQ
  • JMS


MessagingTracing is the bridge between the library in use and the tracing abstraction

Question: Should this split producer and consumer like we have for HTTP?

  • one strain in the HTTP design is about the remote service name

Question: Should we add something to the span customizer to affect the service name?

Question: on MessageProducer: what arguments <P, Chan, Msg> did you choose, and how well is that working? (a sketch of this shape follows the bullets below)

  • P is a delegate to the concrete producer
  • Chan: in JMS it is the destination, but in Kafka it is ProducerRecord
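
To make the generics above concrete, a hypothetical sketch of that shape (not the eventual Brave messaging API, which was still under discussion at this point): P delegates to the concrete producer, Chan is the JMS destination or Kafka ProducerRecord, and Msg is the outgoing message.

    // Hypothetical sketch of the <P, Chan, Msg> shape being discussed; the names and
    // methods are illustrative, not the real brave-instrumentation-messaging API.
    public abstract class MessageProducerHandler<P, Chan, Msg> {
      protected final P delegate; // the concrete producer, e.g. a Kafka or JMS producer

      protected MessageProducerHandler(P delegate) {
        this.delegate = delegate;
      }

      /** Channel name used for the span, e.g. a queue or topic name. */
      protected abstract String channelName(Chan channel);

      /** Writes propagation headers (such as "b3") into the outgoing message. */
      protected abstract void inject(Msg message, String encodedContext);

      /** Performs the actual send via the delegate. */
      protected abstract void doSend(Chan channel, Msg message);

      public void send(Chan channel, Msg message) {
        // A real implementation would start a producer span named after
        // channelName(channel), inject its context, send, then finish the span
        // in the producer's completion callback.
        inject(message, "placeholder-trace-context");
        doSend(channel, message);
      }
    }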


Question: Why do we need the extractor on the producer?

  • this allows you to look for an explicit context embedded in the message.
  • follow-up: so the current trace context is prioritized.


Question: how do we handle start/finish lifecycle?

  • each implementation has its own version of the finish part; this might be implementation specific.
  • adrian: the partition info might be needed from the callback
  • jorg: in the JMS world we have request/reply and the correlation ID; it is possible we'll need to know this on the return path


One idea is that we use only idempotent header formats, meaning that there are no headers that need to be read before being written, e.g. the lowercase "b3" single format.
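For reference, the single "b3" header packs everything into one write-only value, so instrumentation never has to read other headers before writing. A minimal sketch of injecting it into a hypothetical header carrier; the value layout is the documented b3 single format {traceId}-{spanId}-{samplingState}[-{parentSpanId}].

    import java.util.HashMap;
    import java.util.Map;

    public class B3SingleInjector {
      /** The Map stands in for any message's header/attribute carrier. */
      static void inject(Map<String, String> headers, String traceId, String spanId,
          boolean sampled, String parentSpanId) {
        StringBuilder b3 = new StringBuilder(traceId).append('-').append(spanId)
            .append('-').append(sampled ? '1' : '0');
        if (parentSpanId != null) b3.append('-').append(parentSpanId);
        headers.put("b3", b3.toString()); // idempotent: written blindly, never read first
      }

      public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject(headers, "80f198ee56343ba864fe8b2a57d3eff7", "e457b5a2e4d86bd1", true,
            "05e3ac9a4f6e3b90");
        // prints 80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1-05e3ac9a4f6e3b90
        System.out.println(headers.get("b3"));
      }
    }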


sendMessageBatch problem

When sending a message batch, the current trace context could be that of a background thread, and might be null as it is in a polling loop on an in-memory queue. When we start a new span for the batch produce, we run the risk of suppressing traces embedded in the various messages; in other words, we are breaking those traces if we use a single span representing the batch send. One option is to allocate the messages in a batch into buckets based on whether or not they have an embedded parent. The ones without a parent go against the trace for the batch put, and the others continue their respective traces. In the future, if we have a model change that works with multiple parents, we can revisit this part. Meanwhile, if people need to authoritatively represent the batch put, the best way is to add a tag corresponding to a batch identifier, possibly the b3-encoded trace. For example, whether a batch put is represented by a single span or multiple ones, they could all share a common tag representing the batch.
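A rough sketch of the bucketing idea using placeholder types: messages that already carry an embedded "b3" context keep their own traces, the rest are attributed to the batch-send span, and every span involved could share a common batch tag.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Placeholder message type: headers stand in for SQS attributes or Kafka headers.
    record OutgoingMessage(Map<String, String> headers, byte[] body) {}

    public class BatchSendBuckets {
      /** Splits a batch by whether each message already carries an embedded "b3" context. */
      static void bucket(List<OutgoingMessage> batch,
          List<OutgoingMessage> continuesOwnTrace, List<OutgoingMessage> joinsBatchSpan) {
        for (OutgoingMessage message : batch) {
          if (message.headers().containsKey("b3")) {
            continuesOwnTrace.add(message); // embedded parent: don't break its trace
          } else {
            joinsBatchSpan.add(message);    // no parent: attribute it to the batch-send span
          }
        }
      }

      public static void main(String[] args) {
        List<OutgoingMessage> own = new ArrayList<>(), batchOwned = new ArrayList<>();
        bucket(List.of(
            new OutgoingMessage(Map.of("b3", "80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1"), new byte[0]),
            new OutgoingMessage(Map.of(), new byte[0])), own, batchOwned);
        // A shared tag (e.g. a hypothetical "messaging.batch_id") could then be added to
        // every span involved so the batch can be found however it was modeled.
        System.out.println(own.size() + " continue their traces, " + batchOwned.size()
            + " join the batch span");
      }
    }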


partitioned message trace

People have over time mentioned concerns about the length of messaging traces, particularly in stream processing. We need to revisit how to partition/break these, or create helpers to do that if people want to. It is worthwhile to look at what can be done in UI code prior to this.


Q/A

JC: How do you handle instrumentation of the messaging vs propagation?

Lance: There's already a place to stash metadata in SQS, so we treat this part separately. We have a wrapper around the SQS client.

Jorg: will the messaging pattern apply to gRPC? Multiple replies, for example; these could go on for hours.

Lance: we have some analytic things that write out to a stream. It isn't bidirectional, but it is important to analyze the stream as a unit; for example, it could be running for hours.


Mesh

Bas: Let's say you have a request coming into the service through the sidecar, but you are doing consecutive calls to the next service without the agent. What's the impact of this?

Bas: If you defer tracing to the proxy, you still need to address propagation. Do we also show a server call, or prefer the one inbound from the sidecar? For example, we could tune away the server span from the app.

Lance: if the latency is so short, is there a need to wrap a trace around it?

JC: when we want to do more sophisticated things like sampling at the client level or local spans in a trace, we have to be conscious about the integration of this data.

Bas: if you think about the control plane, it can address the sampling policy pushed to the sidecars, but this would only affect service spans (e.g. not locally originated ones such as scheduling).

Bas: we should take the point of view of the user's concerns and what their experience is likely to be. For example, it could simply be that they have no choice to do local activity (propagation only). What would be the options and drawbacks, and how would the user perceive these?

JC: we use HAProxy for every service, and I've not had the need for traces to describe this behavior. If I am debugging I may want to turn it on, but otherwise it is transparent.

Bas: What about developers who do not want to be concerned at all, and where would the mesh come in? The story sold to mesh consumers is that functions like tracing are taken out of the app by putting them into the mesh.

Jorg: in terms of added value, should the mesh do all the reporting then?

JC: The idea is that the mesh is its own traced service, so it controls the timing and the reporting of data.

Bas: one option could be to add an agent or some mix-in service to push the data into the mesh for control of trace reporting. A benefit of doing it through the mesh could also be consistency of data, e.g. data filled for a single request (the process and the sidecar) in the same message.

Bas: we need a proper story on how Envoy works nicely together with a control plane, where tracing is harder than metrics.

Adrian: are metrics even working well with mesh anyway? (there is an integration problem between mesh-controlled and app-controlled data)

Bas: certain types of metrics will integrate better, for example latency metrics on service points.


One simple win could be hints to suggest a call is coming from some proxy, and which tracing model we prefer to use (e.g. no RPC spans).

JC: why would you skip the more sophisticated instrumentation which lives in the app?

bas: world view. Some world views may prefer less data that is in a consistent view, even with uninstrumented apps.


adrian: we could heuristically tell from the app whether a proxy was there and report it as a tag. Anyway, as long as the pipeline can read it, we are good.

bas: we could also teach the proxies to add tags relating to the role they are serving. At the analysis level we can see if there was an instrumented app or not.

adrian: this ties into the proxy_for endpoint which we've discussed in the past

bas: specifically the proxy can know precisely the service it is acting on based on the routing table it is given.

adrian: I like the idea of being able to send data to two places simultaneously: one has the complete mesh opinion and the other has the app's. This would unwind tension about dropping data devs add due to mesh policy.



3 Comments

  1. note: unless we have an actual venue, having 7 people will be tricky, as will video calls. That said, it would be a good problem to have if 7 of us can make it.

  2. I don't think it will be easy to get a venue in Sitges. We have one in Barcelona (Typeform), but of course that will require people staying in Sitges to move to BCN. Up to the attendees.

    1. Let's play it by ear. Public transport to Typeform is easy enough if you have room there. Meanwhile, a cafe etc. should be a nice environment and fun if it fits.