Goal

This is the second time we are meeting at LINE in Fukuoka. While our focus is GA planning (or trigger pulling) for the Lens UI, we have enough time and a diverse enough group to do more. When we meet, we'll decide how best to use the time with the folks present. Likely, we'll have secondary outputs from collaboration on Haystack and Armeria. Quite possibly, we'll also do some instrumentation work, considering that was an area of focus last time in Fukuoka.

Date

17 - 19 April 2019 during working hours in JST (UTC +9)

Location

LINE Fukuoka 株式会社

Folks attending will need to coordinate offline

Output

We will add notes at the bottom of this document including links to things discussed and any takeaways.

Attendees

The scope of this workshop assumes attendees are intimately familiar with Zipkin and have some understanding of Haystack and Armeria.

Attending in person is welcome, but on-site space is constrained. Remote folks can join via Gitter and attend a Zoom video call.

Attending on-site

  1. Koji, LINE
  2. Huy, LINE
  3. Igarashi, LINE
  4. Anuraag, Infostellar
  5. Adrian Cole, Pivotal
  6. Jason Bulicek, Haystack
  7. Tommy Ludwig, Pivotal

Max people on-site is 8!

Homework

Please review the following before the meeting

https://github.com/openzipkin/zipkin/tree/master/zipkin-lens

https://github.com/ExpediaDotCom/haystack-ui

https://github.com/line/armeria

The case of the "Favorite trace"

Over the last several years folks have thought of various ways to favorite a trace, e.g. archiving it past the TTL, possibly into another storage. This was brought up again by Daniele, so the topic needs a refresh so that we can decide what to do about it.

Agenda

We will start together at 10am, not earlier, to allow local folks to do their work. Out of town guests can of course do coffee etc.

If a segment is in bold, it will be firmly coordinated for remote folks. Other segments may be loosely coordinated or completely open-ended.

Wednesday, April 17

10:00am  introductions
10:30am  The case of the "Favorite trace" (daniele)
??       Meet LINE SRE team
12:00pm  RAMEN
1:30pm   Pseudo-tags and query time aggregation (adrian)
4:00pm   Jitpack and other developments in branch testing (bsideup)
5:30pm   Decide what to do tomorrow
7:00pm   LINE meetup

Thursday, April 18

10:00am  over capacity and what to do about it, wiki 2481 (logic32 might make it)
11:00am  new and notable on haystack by jason
12:00pm  RAMEN
6:00pm   Funeral party for classic UI (also beer)

Friday, April 19

10:00am  how or when to differentiate nodes in the same service in the trace graph
12:00pm  RAMEN
2:00pm   meet SRE team
5:30pm   round up
6:00pm   craft beer tour (also beer)


Outcomes


Notes


Favorite Trace by Daniele

Favorite trace

Right now zipkin only stores data for a week.

The main use case: people copy/paste the url into Jira, and within a week the link is broken due to expiry.

tommy: even if someone didn't post a link, they would send it in chat and maybe the data is gone by the time it would be read.


common problems are additional infrastructure, how to mount it in the UI, and how to query for favorite traces amongst normal ones.


daniele's recommendation is to add an ARCHIVE_URL setting and a UI element that POSTs the archived trace to it. The POST returns the full link, including the trace ID, and that is the link people access. (See the sketch after the bullet list below.)

  • allows you to decouple the functionality from the implementation (e.g. the second zipkin could have a different storage setup)
  • probably doesn't need a separate cluster, rather a different keyspace or index with a much longer TTL (e.g. 1 year)
  • opts out of weaving the UI into searching both places, which avoids query complexity
  • nothing requires search to be enabled either, which keeps the impact minimal
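
A minimal sketch of how this could look from the Lens side, assuming ARCHIVE_URL points at a second Zipkin whose standard POST /api/v2/spans endpoint is enabled; the function and constant names here are illustrative, not a decided design:

    // Sketch only: ARCHIVE_URL is the proposed setting, pointing at a second Zipkin with a long TTL.
    // Note: the archive server needs CORS configured for this origin, as discussed below.
    const ARCHIVE_URL = 'https://zipkin-archive.example.com'; // hypothetical

    async function archiveTrace(traceId: string): Promise<string> {
      // Guard against double-archiving by checking whether the trace already exists.
      const existing = await fetch(`${ARCHIVE_URL}/api/v2/trace/${traceId}`);
      if (existing.ok) return `${ARCHIVE_URL}/zipkin/traces/${traceId}`;

      // Read the spans from the live Zipkin, then write them to the archive.
      const spans = await fetch(`/api/v2/trace/${traceId}`).then(r => r.json());
      const res = await fetch(`${ARCHIVE_URL}/api/v2/spans`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(spans),
      });
      if (!res.ok) throw new Error(`archive failed: ${res.status}`); // surface an instructive error in the UI

      // The durable link people paste into Jira, chat, etc.
      return `${ARCHIVE_URL}/zipkin/traces/${traceId}`;
    }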


tommy: curious how often search would even be used on the archive storage.

daniele: maybe you lost the link?

tommy: I think there could be some feedback about which url to go to

daniele: we already have two urls at yelp (sampled and firehose), so that's not a big deal for us

tommy: any issues on this? have people been surprised?

daniele: because the firehose has the search screens disabled, it is less confusing.

jason: we have 3-day retention, then Athena queries for the long term.. rarely used in practice, e.g. every other month. This is complicated because they have to use AWS infrastructure to do the search. a second url could be more attractive.

adrian: you can use any storage for the archive.. might even be cheap for SaaS-based options if retention matches.

daniele: is there a problem with ES and a large number of daily indexes?

adrian: not sure

tommy: we could revisit this index feature.


daniele: there is a write access concern, both access control and CORS

adrian: in the setup guide we'd have to say CORS access and such are needed, exactly the same as browser tracing

tommy: we need instructive error messages rather than failing cryptically.


adrian: have you hacked this yet

daniele: not in a formal way


tommy: do we do this only in lens?

adrian: lens only please (carrot)

daniele: no problem if only lens


adrian: how do we design this? is it only on the trace view or the trace list?

jason: I think it is less helpful to allow clicking several traces at the same time. it might encourage people to save all of them.

tommy: initial feeling is trace view screen

jason: would you have an unarchive button?

adrian: there's no UI for deleting, so can't

daniele: I don't think there are many issues with multiple clicks


do we tag the trace on POST to make it easier to look up?

adrian: maybe we tag with the JIRA issue, but then it's a problem if tagged with the wrong one

tommy: maybe it could help, but maybe it would be burdensome. this is something we can park for now.

daniele: ideally we could feature-flag everything


magesh: this is commonly requested in haystack, and we are considering a scatter/gather query


TODO: check how this would affect elasticsearch. Concretely, would 365 daily indexes with a small amount of data be a problem?

TODO: double-check the UI dedupes at read time to ensure multiple archiving doesn't mess things up

  • we may be able to guard on POST with a HEAD or GET on the same trace ID


rag: another option to running a separate service is to embed the trace content in a query parameter for the UI to render. Coupled with a url shortener, this could be an easy solution with less to maintain than a separate zipkin service. The UI would just need to unpack that URL.

adrian: some disclaimers might be needed so people don't literally use bit.ly unless the data is really not private. also we might want to re-render the trace to be more efficient in size (removing redundant tags etc)
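
A rough sketch of rag's idea, assuming Lens gained a route that renders spans passed in the URL itself; the route name, the trimming step, and the base64 encoding are all assumptions:

    // Sketch: pack a (trimmed) trace into the URL so no archive storage is needed.
    interface Span { traceId: string; id: string; tags?: Record<string, string>; [key: string]: unknown; }

    function shareableUrl(spans: Span[]): string {
      // Crude stand-in for "re-render the trace to be more efficient in size": drop tags entirely.
      const trimmed = spans.map(({ tags, ...rest }) => rest);
      // Note: btoa only handles Latin-1; a real version would UTF-8 encode and probably compress.
      const encoded = btoa(JSON.stringify(trimmed));
      // Hypothetical route that renders from the query parameter instead of storage.
      return `/zipkin/traces/embedded?data=${encodeURIComponent(encoded)}`;
    }

    function readEmbeddedTrace(url: string): Span[] {
      const data = new URL(url, window.location.origin).searchParams.get('data');
      return data ? JSON.parse(atob(data)) : [];
    }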


How to deal with the poison trace problem

One common concern with distributed tracing is that an instrumentation problem can poison the centralized trace repository, and this can not only affect the service in question, but anyone querying data that includes that service.


tommy: in rakuten we had a thread context leak in one service that would keep a trace open for hours, resulting in thousands of spans

huy: even in normal usage we can have redis making so many calls that the UI doesn't group or summarize similar spans

adrian: we recently had complaints about a trace too large to load https://github.com/apache/incubator-zipkin/issues/2496

jason: we collapse by default because of this reason. the UIs are limited in the ability to zoom/search to a specific place of interest in a trace.

adrian: delete functionality would be nice, but it wouldn't help with the zombie service which keeps writing the same trace ID until restarted (then it would be a different persistent trace ID)


adrian: we originally moved logic from the server to javascript in order to show bugs in instrumentation, or to allow overlays that correct things. Right now, we don't do any soft erroring. For super large traces we could set an absolute limit, provided we have an incremental json parser so that we can avoid loading something too big. The problem space is mainly around the trace list page, as we wouldn't want a poison trace to break it.

tommy: we have a problem coming up with a good number because the limitation is sometimes about what the machine is capable of

adrian: since 10k spans is a common problem size, we could consider a threshold around that.
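
For illustration, the kind of guard being discussed might look like this in Lens; the 10k threshold and the onSoftError hook are assumptions drawn from the conversation, not decided behavior:

    // Sketch: refuse to fully render traces above a span threshold and surface a soft error instead.
    const MAX_RENDERABLE_SPANS = 10_000; // assumption based on the discussion above

    async function loadTrace(traceId: string, onSoftError: (message: string) => void) {
      const spans: any[] = await fetch(`/api/v2/trace/${traceId}`).then(r => r.json());
      if (spans.length > MAX_RENDERABLE_SPANS) {
        // Without an incremental JSON parser we still pay the download/parse cost,
        // but at least the trace view doesn't hang trying to lay out 100k spans.
        onSoftError(`Trace ${traceId} has ${spans.length} spans; showing the first ${MAX_RENDERABLE_SPANS}.`);
        return spans.slice(0, MAX_RENDERABLE_SPANS);
      }
      return spans;
    }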

jason: we have problems with people not doing parent IDs properly, so we make a fake root span (zipkin does this too)


One issue is we currently don't have any way to show soft errors in the trace list; we just load and correct data, or hang. We may need to work on soft errors independently before applying them to poison traces.


Pseudo-tags and query time aggregation (adrian)

We want to gather some use cases and determine if doing this is a good idea at all.

How do we display this extra information aggregated from the queried data?

This can be done without changing instrumentation because we can do it after-the-fact on the tracing data. Alternatively, we could just reference some external data that exists for this already like we do for linking to a log viewer.

Does this aggregated data get added to the retrieved data as a tag (even though the tag does not exist in the underlying data store)?

We're having trouble discussing this without concrete use cases, so we are going to table the discussion for now and maybe come back to it later in the workshop if we gather the use cases, so we can discuss more fruitfully.


tommy: some of this might change when we have grafana plugin, also we don't have very concrete use cases. let's park until things change


Grafana Plugin

rag: maybe hard to integrate with grafana datasource, maybe better to embed the lens ui

tommy: "exemplar traces" seem to have integrated the view between grafana and zipkin

huy: we have metrics and the timeline, so you could click in the metrics timeline and get to a trace query view. not necessary to have the trace view inside.


the "all dots problem" if we are doing 100% tracing, instead of low rate sampling, you will have a lot of dots and this will impede the ability to render individual tracing, possibly the performance.


tommy: we need to have a component that will select traces from the buckets that correspond to the same dimensions used in the heat graph.


adrian: one of the data integration problems might be matching the data generated by metrics instrumentation with zipkin data (Assuming metrics weren't derived from spans)

For example, a 90th percentile histogram query might look like this, broken down by uri, method and instance. There could be some misses here due to uri normalization or a missing instance tag.

          "targets": [
            {
              "expr": "histogram_quantile(0.9, sum(rate(http_server_requests_seconds_bucket[1m])) by (uri, method, instance, le))",
              "format": "time_series",
              "intervalFactor": 2,
              "legendFormat": "{{method}} {{instance}} {{uri}}",
              "refId": "A",
              "step": 20
            }
          ],


huy: we don't necessarily have to match generic metrics. If you see a spike, at worst you could get the timerange or service (with no other dimensions).

tommy: maybe you could even just click in a vertical time box corresponding with a timeframe, and jump to the trace view screen.

rag: grafana already lets you add links to dashboards, could we use this?


tommy: it would be ideal if (a) we had a zipkin panel and (b) the labels aligned with api queries. It is hard to reason about how aligned the data would be when there is nothing people can try. also, many clients won't have metrics anyway.

TODO: follow up with JBD to see what, if anything, happened with this https://medium.com/observability/want-to-debug-latency-7aa48ecbe8f7


SPIKE IDEA: create a tabular panel for grafana which has a row for each trace matching the timeframe; clicking a row jumps to the zipkin trace view.
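
A sketch of what that spike could fetch, using the existing GET /api/v2/traces parameters; the row shape and the panel wiring are made up for illustration:

    // Sketch: turn the panel's time range into a Zipkin query and emit one row per trace.
    interface TraceRow { traceId: string; durationMs: number; spanCount: number; link: string; }

    async function tracesForPanel(zipkinBaseUrl: string, serviceName: string, fromMs: number, toMs: number): Promise<TraceRow[]> {
      const params = new URLSearchParams({
        serviceName,
        endTs: String(toMs),              // Zipkin expects epoch millis here
        lookback: String(toMs - fromMs),  // ...and a lookback window in millis
        limit: '20',
      });
      const traces: any[][] = await fetch(`${zipkinBaseUrl}/api/v2/traces?${params}`).then(r => r.json());
      return traces.map(spans => {
        const root = spans.find(s => !s.parentId) ?? spans[0];
        return {
          traceId: root.traceId,
          durationMs: (root.duration ?? 0) / 1000, // span durations are microseconds
          spanCount: spans.length,
          link: `${zipkinBaseUrl}/zipkin/traces/${root.traceId}`, // clicking jumps to the trace view
        };
      });
    }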


Jitpack and stuff by Sergei


jitpack is a package repository for JVM and android projects that builds git projects on demand into jars

maven central has some problems: you will never see branch builds, and it can take time for artifacts to become visible (15m is better than it used to be)

jitpack allows you to publish artifacts with no upload configuration at all (in gradle). it does this by scanning for files made by the install tasks

you can substitute a git hash or branch-SNAPSHOT for an artifact version (it automatically builds source jars as well)

  • the group id is the name of the fork
  • the deployments survive deletion of the branch


you can also classify by module


the main thing is it only builds requested versions, and there is a direct relationship to the source.


note: if you request a version before it is built, the miss can be cached. to fix this, log in, delete it, and build again.

there are some features you can use to do light tests before publishing.


tommy: how does this help with docker? before I would run the jar itself, so this would be very straightforward.

sergei: you can make a slight change to the default image to use the jitpack binary instead. alternatively, you could make an image that downloads from jitpack on demand.


adrian: what about resource limitations? we have hit some things on jcenter

sergei: it is free for oss, and there aren't really any limitations. jitpack might actually like to have a higher-profile project like zipkin.

sergei: there are some service-side factors that de-risk depending on them. they are on s3, and they have features to redirect until a build is done. you can also be explicit with a yaml file. the docs are small but cover most things you'd need. you may run into some cache problems; just be aware that there are fixes for that. the developers working on it are very responsive.


adrian: I think this should be helpful for source release verification, and for verifying common changes work (e.g. infra changes needed upstream for downstream projects like zipkin-aws, -gcp, -dependencies)


Collector surge and error handling by Craig


Right now there is a semaphore that hard drops messages when over a concurrency level, which is an attempt to not overload zipkin-server.


we are using enqueueing to avoid blocking callers on the storage path of a POST request.


with https://github.com/apache/incubator-zipkin/pull/2502 we went from 11% drop per day to 2%
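
For readers unfamiliar with the mechanism, an illustrative sketch (TypeScript, not the actual zipkin-server Java) of the difference between hard-dropping at a concurrency level and enqueueing with a bounded queue:

    // Illustrative only: the real change is in zipkin-server; see the PR linked above.
    class BoundedStore {
      private inFlight = 0;
      private queue: Array<() => void> = [];

      constructor(private maxInFlight: number, private maxQueue: number) {}

      // Old behavior: hard drop anything over the concurrency level.
      tryStoreOrDrop(store: () => Promise<void>): boolean {
        if (this.inFlight >= this.maxInFlight) return false; // message dropped
        this.run(store);
        return true;
      }

      // New behavior: enqueue instead of blocking the caller, dropping only when the queue is also full.
      storeOrEnqueue(store: () => Promise<void>): boolean {
        if (this.inFlight < this.maxInFlight) { this.run(store); return true; }
        if (this.queue.length >= this.maxQueue) return false; // still dropped, but far less often
        this.queue.push(() => this.run(store));
        return true;
      }

      private run(store: () => Promise<void>) {
        this.inFlight++;
        store().finally(() => { this.inFlight--; this.queue.shift()?.(); });
      }
    }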


we bump up the 10k span limit regularly due to deeply nested microservice/database calls; this limit can cause large gaps and missing roots in traces


Adrian: do you know the largest trace you have?

Craig: we don't know because ES can't retrieve more than 10k spans (edit: have seen 100k spans in a legitimate trace)


Craig: we sometimes load a collection of items into a bundle.  To avoid the 10k span limit issue we fork the individual items off into new traces (so we have a high-level "collection" view and a more detailed item view) and link them together with source/childTraceId tags. It would be nice if the UI recognized trace links.
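
Roughly what Craig describes, shown as span data; the source/childTraceId tag names follow his description and are a site convention, not a Zipkin standard, and the example values are made up:

    // The "collection" trace carries a tag pointing at the forked item trace...
    const collectionSpan = {
      traceId: 'a1b2c3d4e5f60718',
      id: '29a0b1c2d3e4f506',
      name: 'process-bundle',
      localEndpoint: { serviceName: 'bundle-loader' },
      tags: { childTraceId: 'f0e1d2c3b4a59687' }, // a UI that recognized this could render a trace link
    };

    // ...and the item trace points back, keeping each trace under the 10k span limit.
    const itemRootSpan = {
      traceId: 'f0e1d2c3b4a59687',
      id: '1122334455667788',
      name: 'process-item',
      localEndpoint: { serviceName: 'item-worker' },
      tags: { sourceTraceId: 'a1b2c3d4e5f60718' },
    };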

Adrian: we also sometimes talk about a summary view.


Craig: bulk thread pool is the one that's usually overwhelmed, and this PR is only on the write path.


Huy: we have some thoughts about backpressure and a circuit breaker-related approach we might try.

Craig: one problem we have is that we are still dropping spans even though we aren't hitting the queue limit. This was due to an exception in elasticsearch.. we could implement retry logic.


New and Notable in Haystack

new is alerting which is wired into the UI

you can also deploy components independently (like UI with only service graph)


alerts are classified strong or weak and dispatched to slack or email


also new are range operators on key/value pairs. right now zipkin has a range only for duration. after we rolled this out, we had requests for tags to whitelist, such as depth of the request in the stack. other examples are currency or retry count.


adrian: were there any devils in the details about the range indexing in elasticsearch?

jason: nothing comes to mind


We are also looking into easier deployment mechanisms and cost attribution. Cost attribution would focus on cpu and storage costs.

Also open sourcing blobs.


huy: what is Svc duration?

jason: service duration. it will be the percentage of the total trace data (there is another for the operation duration)


huy: is the adaptive alerting useful? some people say mostly false alarms

jason: we are early stages, but we have a separate UI that will show what has been detected as an anomaly with a thumbsup|thumbsdown

tommy: presumably it will improve quality over time with the feedback. will be nice to see the quality improvement numbers


How or when to differentiate nodes in the same service in the trace graph


>> how do you see inter-host communication if all traces are categorized under one service?

>> in zipkin how would I see inter hosts in the timeline?

>> it seems like a lot of clustered software, like say cockroachdb publish itself as one service in traces, which I sort of get, but you lose all the inter-node visibility in the timeline

>> well I'm tracing CDB so yeah it would all be CDB

>> but what I mean is, I can't visualize inter-node communication in the timeline

>> I've tried both jaeger and zipkin and both display service name but none break down actual nodes or threads even

<< because the shards are not identifiable and everything uses the same service name too

>> right

>> is there a way to visualize nodes in the timeline without changing the service name to match node names?

>> yeah if there was a common way to identify an actor like host port or thread id

<< I think there are some UI things to consider about when to zoom into ip/port level detail, but literally we are today having a UX workshop so good timing

>> can you share a trace that shows this problem (json button)

https://gist.github.com/kellabyte/c07bc34b4155231743c61edf5b977f42


Overview of the issue is that if there is a lot of activity within a service, we can't differentiate visually as all UI clues are service-scoped, not a smaller scope like container ID or host/port.


huy: localEndpoint could have an identity field added, which would help with containers that might not be easy to identify based on host/port.

adrian:  htrace did something like this with a delimited field https://github.com/apache/incubator-retired-htrace/blob/2ce9d3b25a49d371a7b48e389b56d50a0164c8a0/htrace-core4/src/main/java/org/apache/htrace/core/TracerId.java

sometimes they only use service/ip
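
A sketch of huy's suggestion; the identity field does not exist in the Zipkin model today, so this is only what such data could look like:

    // Hypothetical: today localEndpoint only has serviceName, ipv4, ipv6 and port.
    const spanWithIdentity = {
      traceId: '86154a4ba6e91385',
      id: '4d1e00c0db9010db',
      name: 'get /rows',
      localEndpoint: {
        serviceName: 'cockroachdb',
        ipv4: '10.1.2.3',
        port: 26257,
        // Proposed addition: a stable actor identity (e.g. container or node ID)
        // for cases where host/port isn't enough to tell nodes apart.
        identity: 'cockroachdb-node-2',
      },
    };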


adrian: thread id could be hard to do in the trace data, we usually look at logs for that, correlated with trace/span ID

tommy: we should punt thread id until that's more requested, and focus on the data we can use (host and port)

adrian: there have been people adding tags about the abstract resource, such as an amazon instance ID, and census has a work in progress cataloging common keys for this


tommy: we should have a heuristic for adding ip/port information, but be aware that in doing so we could confuse users (like: why does trace A not include the ip/port while trace B does?). A default of 1 (aka a monolith service) might not be common enough to help.

adrian: should we make users click on every trace

tommy: make it possible is probably our first goal.
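
One way the heuristic could work in the UI, as a sketch; the label format and the single-endpoint special case are assumptions:

    // Sketch: only append host/port to a node label when one service name maps to multiple endpoints.
    interface Endpoint { serviceName?: string; ipv4?: string; ipv6?: string; port?: number; }

    function nodeLabels(endpoints: Endpoint[]): Map<Endpoint, string> {
      const byService = new Map<string, Endpoint[]>();
      for (const ep of endpoints) {
        const name = ep.serviceName ?? 'unknown';
        byService.set(name, [...(byService.get(name) ?? []), ep]);
      }
      const labels = new Map<Endpoint, string>();
      for (const [name, eps] of byService) {
        for (const ep of eps) {
          // One endpoint per service (the monolith case): keep the plain name to avoid confusing users.
          labels.set(ep, eps.length === 1 ? name : `${name} (${ep.ipv4 ?? ep.ipv6 ?? '?'}:${ep.port ?? '?'})`);
        }
      }
      return labels;
    }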


here's the issue to follow-up on https://github.com/apache/incubator-zipkin/issues/2512


SRE meeting

promgen creates metrics configuration on demand and this becomes a selection in a shared grafana.

this includes an alerts screen and the ability to manage them (rules are written in promql)

alertmanager posts to slack, and there is some effort to format the data appropriately. In slack, alertmanager will reply to the alert with graphs that could help. chatops is being considered, too


prometheus is designed to be simple, with a focus on recent information. thanos adds a query layer across multiple prometheus servers. it does this with a sidecar architecture which copies data from a prometheus server into its object store for query purposes.

biggest pain points were around querying for large ranges, s3 integration, and data that overlaps between recent and historical data.

one use case is about servers that have different retention, like one server holds a month for one service, but another holds a lot less for others.

the goal is to keep a couple of weeks online and anything after that in s3 buckets. problems in execution relate to the youth of the project and time constraints.


matsuzaki: first SRE at LINE and working on LINE shop


The LINE service graph is complex (a large number of services); they want to filter based on their own service.

(huy) a lot of the reason the service graph is complex is ips and hostnames appearing in service names

(jason) you can filter based on service in the vizceral service graph


huy - how do you use Zipkin in troubleshooting?

start with promgen, look at the alert and then look at dashboards, for example shop-overview. If there is a spike, you look for the service name and then look for a dashboard based on the service involved. Next would be to look at zipkin for that service. In a high-load situation, many times the trace you need has been sampled out.


what's the minimum time you would need if you wanted all traces?

15m should be fine


how does 100% work

the application sends to two places: one at the normal rate and the other at 100%


tommy - in order to know if a solution is feasible for 100%, we'd want to know how much data that could be.


There is one case with the taiwan team: they don't know metrics, so they use traces. later they use aggregation in ES to get metrics.


when do you create dashboards?

some have an overview dashboard, and then have different dashboards for common drill-down purposes. We have some conventions to group things by exporter, for example armeria.


one problem is that labels don't match between systems like metrics and traces; it is an organizational problem to agree on terms. For example, how would you correlate Grafana data with what would be in zipkin? sometimes the mapping doesn't work out 1:1.

huy - if we agree on some low-level terms like service and operation, we could get value.

this adds pressure to derive metrics from span data (where relevant), as there's then no mapping concern at that level.

https://developers.soundcloud.com/blog/using-kubernetes-pod-metadata-to-improve-zipkin-traces is an example of data cleanup


rag: do people use the service graph

tommy: at a previous job we used it; it was handy to see the call counts between services even if the data is sampled. This was how we found a problem where one service was creating a lot of junk spans and overloading the system.


huy: how was your experience at google with tracing

rag: there was no global dapper UI at the time, so you needed to know the trace ID from logs or chrome tools. some percentage of traces were broken, but it was not hard to find another trace ID that worked. we only used it for latency troubleshooting, not errors. a success case was tuning the implementation of map tiles, which could have been parallelized.

Duration sort chat

one question that comes up often is why the duration query doesn't rank over time.. meaning if the query is for the last day, why isn't the longest duration sent back? The main problem is in the indexing and how that works. In SQL we can do an arbitrary order by, but not currently in ES and C*. When we have the grafana plugin, it might be possible to help people answer the "what is the longest duration" question with metrics, and then jump into zipkin with that number pre-populated.

UI-related priorities

What's a reasonable expectation to give to people who ask: is grafana going to land first, or deleting the old UI? Since most of the work falls on Igarashi, and he wants to delete the old UI first, we'll tell folks that Grafana is next after deleting the old UI. To tighten the time gap, we'll try to give and resolve UI-related feedback and help with topics such as documentation.

Zipkin Grafana plugin

We are discussing with the LINE team whether a trace table corresponding to the window of the grafana dashboard would be helpful to the SRE team. The main goal is to move more quickly or accurately from a graph view into zipkin.


we are thinking about an element that exists beside the normal latency graph and returns a table view.

another is to annotate the existing graph.

another is to have a link at the top of grafana that says zipkin and sends the formatted query to a zipkin url.


least amount of work is the link approach.
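
A sketch of the link approach, assuming the Lens search screen accepts the same parameter names as /api/v2/traces (worth verifying before building on this):

    // Sketch: build a Zipkin query URL from the Grafana dashboard's current time range.
    function zipkinLinkFor(zipkinBaseUrl: string, serviceName: string, rangeFromMs: number, rangeToMs: number): string {
      const params = new URLSearchParams({
        serviceName,
        endTs: String(rangeToMs),
        lookback: String(rangeToMs - rangeFromMs),
        limit: '10',
      });
      // Assumption: the Lens search screen reads these from the URL.
      return `${zipkinBaseUrl}/zipkin/?${params}`;
    }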


all options have a data-matching concern: whether any labels correspond to zipkin services or tags.

  • at LINE there's no expectation of the names being the same
  • one idea is to have a function specified in config.json to allow some mappings (see the sketch below)
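
What the config.json idea might look like as a mapping hook; none of these keys or names exist in Lens today, they are purely illustrative:

    // Hypothetical mapping from Prometheus-style labels to a Zipkin query, since at LINE
    // there is no expectation of the names matching.
    interface GrafanaLabels { job?: string; instance?: string; uri?: string; method?: string; }
    interface ZipkinQueryHints { serviceName?: string; tags?: Record<string, string>; }

    // In practice this might be referenced by name from config.json rather than inlined there.
    function mapLabels(labels: GrafanaLabels): ZipkinQueryHints {
      return {
        serviceName: labels.job?.replace(/^prod-/, ''),        // e.g. strip an environment prefix
        tags: labels.uri ? { 'http.path': labels.uri } : {},   // tag names are assumptions
      };
    }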


simple-json-datasource could be helpful in reducing the work involved. one prototype keeps track of prom alerts to group into an outage later. this was implemented by a django app that serves the same endpoints defined by simple-json-datasource. on the grafana side, you have to install the datasource and configure it. this is visualized as a line or a range.


the manual way of creating annotations allows for simple data with tags, visualized as a vertical line. There is nothing to visualize something with a Y value.


one integration approach is a mixed query, where zipkin is mixed with a prometheus plugin.

the future of grafana integration will be go-based datasource plugins; some already are in go (like the stackdriver plugin)




Notes will run below here as the day progresses. Please don't write anything that shouldn't be publicly visible!


9 Comments

  1. Igarashi seems interested in something we could do with the dependency page. Some ideas:

    • How we could split unconnected graphs into different entry points (UI side)
    • Carry the search context and apply it on the dependency page, showing only what relates to the current search context (UI side and backend side)
    • On-demand dependency analysis (backend side + batch side)
  2. Another idea:

    • How could we componentize the current zipkin lens so that it can be embedded in another UI (grafana?, or our in-house observability system etc)
    • Grafana plugin
  3. > How could we componentize the current zipkin lens so that it can be embedded in another UI (grafana?, or our in-house observability system etc)

    Being able to pull in metrics from my metrics system would be great. However it'll probably be very hard matching span names to generic metrics + dimensions.


    1. agree on the topic. the related "Pseudo-tags and query time aggregation" is one place we can discuss this.. for example, can you decorate in a pseudo-tag with statistical information, similar to how haystack will weave in a "blob" tag, except at query time rather than ingest.

      we will probably openly discuss things like this and which types of options are low-hanging

  4. Adrian Cole , I moved the "over capacity" time back to 10:30.  It turns out I misused a time conversion tool, and 10:30 am is actually 7:30 pm local.  So I should be able to make it (smile)

    Also, since I won't be able to make the "case of the favorite trace", I do want to make sure #1884 doesn't get forgotten (wink)

    1. We discussed "case of the favorite trace" today, and as the design doc (Favorite trace) mentions, uploading the JSON is the way most people achieve something like this today, but for the reasons mentioned in the doc, some users would like more. For external troubleshooting (e.g. Zipkin users reporting issues to Zipkin maintainers), uploading JSON is still very useful. I don't think JSON upload will be going away anytime soon, and we made sure it is available in the new Zipkin Lens UI. See https://github.com/apache/incubator-zipkin/issues/2350

      1. Thanks for sharing that!

  5. Logic32 I made this 7pm local assuming it is better for you

    1. Should work as well as 7:30.  Thanks for the heads up!