Goal


Contributors from Expedia, Hotels.com and HomeAway are hosting a workshop to see how their OSS projects (Haystack, Pitchfork and Adaptive Alerting) and the Zipkin projects can complement each other. The goal is a clarified set of solutions that serves a broader distributed tracing/observability community of users.

Special thanks to Expedia for sponsoring travel expenses, especially as most guests are volunteers. Many thanks to the volunteers, some of whom had to use PTO or forgo income to participate.

Date

15-17 Jan 2019 during working hours in IST (UTC +5:30)

Location

Expedia, Gurgaon, India

Folks attending will receive details offline.

Output

We will add notes at the bottom of this document including links to things discussed and any takeaways.

Attendees

The scope of this workshop assumes attendees are contributors to Zipkin or Expedia OSS products.

Attending in person is welcome, but on-site space is constrained. Remote folks can join via Gitter and attend a TBD video call.

Attending on-site

  1. Magesh Chandramouli, Expedia (organizer)

  2. Willie Wheeler, Expedia, Adaptive Alerting

  3. Adrian Cole, Pivotal, Zipkin

  4. Brian Devins-Suresh

  5. Daniele Rolando

  6. José Carlos Chávez

  7. Raja Sundaram Ganesan

  8. Jorge Quilcate

  9. Tommy Ludwig

  10. Abhishek Srivastava, Haystack Dev Team
  11. Ashish Aggarwal, Haystack Dev Team
  12. Ayan Sen, Haystack Dev Team
  13. Jason Bulicek, Haystack Dev Team
  14. Jeff Baker, Haystack Dev Team
  15. Kapil Rastogi, Haystack Dev Team
  16. Keshav Peswani, Haystack Dev Team
  17. Shreya Sharma, Haystack Dev Team
  18. Eduardo Solis, Homeaway

  19. Daniel Martins Albuquerque, Hotels.com

  20. We are way over max!

Homework

Please review the following before the meeting:

haystack-ui can work against a stock Zipkin install as an alternate UI. This doesn't support all features, though, so we can discuss companion APIs or other practices for those who want to bolt Zipkin together with other Haystack components such as anomaly detection: https://github.com/ExpediaDotCom/haystack-ui/wiki/Configuring-Subsystem-Connectors#using-haystack-ui-as-replacement-for-zipkin-ui

pitchfork can take existing Zipkin ingress from sources such as py_zipkin or brave (Sleuth) and tee it to both a Zipkin destination (which could even be X-Ray, as we support that!) and Haystack. Essentially, those who can afford it can copy data to two places; see the sketch below. https://github.com/HotelsDotCom/pitchfork
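For those who want to try this before the workshop, the usual change on the application side is just pointing the existing Zipkin HTTP sender at Pitchfork. A minimal Java sketch (host and service names are made up; how Pitchfork fans out to Zipkin, X-Ray or Haystack is configured on the Pitchfork side, see its README):

import brave.Tracing;
import zipkin2.Span;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.okhttp3.OkHttpSender;

public class ReportViaPitchfork {
  public static void main(String[] args) {
    // Pitchfork exposes the standard Zipkin v2 endpoint, so existing Brave/Sleuth
    // or py_zipkin apps only need to change the reporting URL (hypothetical host).
    OkHttpSender sender =
        OkHttpSender.create("http://pitchfork.example.com:9411/api/v2/spans");
    AsyncReporter<Span> reporter = AsyncReporter.create(sender);

    Tracing tracing = Tracing.newBuilder()
        .localServiceName("checkout-service") // hypothetical service name
        .spanReporter(reporter)
        .build();

    // ... instrument as usual; spans now flow through Pitchfork, which can tee
    // them to a Zipkin-compatible destination and to Haystack's Kafka/Kinesis.

    tracing.close();
    reporter.close();
    sender.close();
  }
}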


haystack-servicegraph can also process Zipkin data into (service, operation) aggregated links.

Agenda

If any segment is in bold, it will be firmly coordinated for remote folks. Other segments may be lax or completely open-ended.

Tuesday, Jan 15

9:30 am - Introductions
10:00 am - Haystack Architecture (45 mins) - Ashish
11:15 am - Break
11:30 am - Adaptive Alerting + Alert Manager overview (45 mins) - Willie Wheeler
12:45 pm - Haystack Trends (30 mins) - Ayan
1:00 pm - Lunch
2:00 pm - Zipkin - Haystack Opportunities (question) - Adrian

Wednesday, Jan 16

9:30-10:30 am - RNN (algorithm for anomaly detection) - Keshav Peswani
10:30-11:00 am - Blobs (request/response collection) - Keshav Peswani
11:00-11:30 am - Graph networks: ML predictions and anomaly detection based on the Haystack Service Graph (see https://arxiv.org/abs/1806.01261 for more information) - Willie Wheeler
11:30 am-12:30 pm - Kafka Streams instrumentation - Jorge
12:30 pm - Rapid topics
12:31 pm - Graph anomalies - Willie
2:00-3:00 pm - Collaboration opportunities between Zipkin, Haystack, Pitchfork and Adaptive Alerting - Adrian Cole
3:00-3:30 pm - Discussion: what people do about high-cardinality data emitted in trace data, intended for user labels (besides asking people to stop doing that) - Adrian Cole
5:30 pm - Round up

Thursday, Jan 17

TBD team outing

Outcomes


Notes

Notes will run below here as the day progresses. Please don't write anything that shouldn't be publicly visible!

Introductions

Tommy Ludwig - 2 years on Zipkin. Previous job was working on tracing and metrics for Rakuten Travel (largest travel site in Japan). Now working at Pivotal.

  • get questions about alerting or pulling metrics out of zipkin data. hear stories and how to answer
  • which features work on sampled data vs not

Eduardo - HomeAway, previously PayPal. Worked a bit on moving data into Zipkin. Now works with Expedia Group on Haystack.

  • learn more about developing in the open

Adrian - no 

  • intermingled architectures for easy onboarding of each other's tech

Willie Wheeler - works in time series anomaly detection; formerly statistics, now looking into machine learning.

  • learning about distributed tracing
  • understand how adaptive alerting can help out
  • automating diagnostics

Daniele - drolando - works for Yelp, the maintainers of the Python Zipkin libraries, for about 3 years

  • learning how we use adaptive alerting and anomaly detection for surfacing interesting traces

Jeff Baker - Amazon → Haystack; would have killed for this versus the collection of metrics at Amazon at the time, which took 4-5 years to learn about.

  • what do we need to do better for better open source

Shreya - technical manager for Haystack; onboarding of consumers onto Haystack internally and externally. Before this, a developer for 5 years, but loves this role!

  • the pulse of what the OSS community is thinking, for example a roadmap

Ashish - in expedia close to 6 years, observability about 2 years.

  • how can we improve on open source, this is our first big try
  • understand different asks and challenges from consumers and how to work together

Raghavendra - worked on trending and distributed systems, now in the haystack team

  • know more about how api design and system design applies to haystack

Keshav (keyshav) - dev on the Haystack team for almost 2 years; works on Blobs and RNN

  • integrating the haystack model with zipkin data

Kapil - haystack team for about 2 years

  • how to add more value in the coming years based on what people are working on

Abhishek - Haystack UI for about 2 years; formerly an operations analyst

  • improve overall and how to improve UI integration with other systems

Jason - on Haystack since about 2 years ago; started with mocks for the current UI

  • How can we do OSS things better
  • how do we make integration better

Brian - Zipkin for about 2.5 years, used in two iterations: the dealer.com platform, then Cox Automotive, the latter with an AWS X-Ray backend. One of two owners of zipkin-aws.

  • learn how haystack integrates with zipkin
  • clean integration for the AWS ecosystem

Ayan - 6 years at Expedia; started in tracing in 2013, making a Kafka collector then hacking to make UUIDs work instead of 64-bit IDs. This led to Haystack from 2015.

  • process around open sourcing
  • reflection on haystack open source process

Jorge - jeqo - data integration projects in Norway, contributor to Kafka and Zipkin (instrumentation)

  • understanding gaps between formats
  • deployment challenges and how to reduce friction
  • interested in the analytical part

Raja - zeagord - Zipkin for the past two years, contributing for the last year. Works at Ascend; bootstrapped metrics and tracing across 6 countries and different clouds: AWS, GCP, on-prem.

  • understand how we develop complementary features
  • interested in the adaptive alerting
  • pitchfork and its roadmap

José Carlos - jcchavez - Typeform in Barcelona runs Zipkin; also a contributor, mostly instrumentation. Team has been looking at adaptive alerting for the last 3 months.

  • learning about adaptive alerting
  • synergy in collaboration

Daniel - worldtiki - Hotels.com looked into Zipkin about 1.5 years ago, when trying to understand bottlenecks. Then we learned about Haystack and integrating with it.

  • how to work better between haystack and zipkin teams
  • how to get advanced features like Haystack trends when sampling at only a 1% rate

Magesh - distributed systems for a long time and streaming for 6-7 years. Tracing in Expedia started in 2010, but it wasn't distributed; that header propagation is what the UUID is now integrated with. Initially SQL Server in 2011-12, then Splunk, which was better but teams had different indexes. That led to a desire to reduce time to answers. In 2013 tried "black box" tracing (Zipkin); in 2014-15 made Doppler, a hacked Zipkin that worked with the IDs etc. already in use, which allowed trend analysis and then led to funding of Haystack.

  • how to help existing zipkin sites ease into new features like alerting, trending, request/response blobs (our most widely used feature for diagnosing business data problems)
  • how to standardize features for open source considering they may have different requirements internally
  • discuss how to increase accessibility of the codebase, whether that is repository layout or code structure.

Ankit Chaudhary - infrastructure side of Haystack for a couple of months; deployment-related work: Elasticsearch, Terraform, etc.

  • understand more deeply how Haystack can be used inside and outside Expedia
  • key takeaways for faster deployment


01-15

The entire Expedia Haystack suite was originally written on GitHub. It was not 'open sourced' from an internal repository. It is used cross-brand, for example by VRBO.

Haystack Architecture with focus on Tracing (Ashish)

Thanks to Zipkin for setting the right tone for the community; we are excited to collaborate and make things better.

Around 2014 they tried Zipkin, including changing code to make it work internally, coupled with other tools developed in-house.

  • users wanted a single screen
  • same data can be used for different purposes, like trending

The main thinking was that an integrated system would be easier to write from scratch at the time.

There are close to 2000 microservices, and not everyone is onboarded yet. Services can push data directly from the app or use an agent (sidecar) that forwards data to Kafka. This also decouples producers from Kafka, allowing for example a switch to Kinesis (also used in Expedia). The Haystack agent supports pull or push. Most want to push and don't want an agent running; some want to do an HTTP POST. Lambda also uses HTTP POST.


Kafka is the backbone: regardless of path, data ends up in Kafka. Protobuf was chosen over Avro for lightness. There is one Kafka topic for the proto span format, with many consumers of it.


4 major features: traces, trends, pipes and service graph, with blobs possibly later. This decoupling makes partial deployment possible, e.g. components do not have inter-dependencies. Anomaly detection also uses the same Kafka.


haystack-traces is most like Zipkin

  • the indexer groups spans by trace ID for a period of time; on timeout it pushes to Elasticsearch
  • trends is a little different as it both reads from and writes to Kafka


Indexing was originally challenging as each tag was uniquely indexed, which caused problems due to cardinality and the quality of the data. Now we have switched to whitelisting keys, which is global configuration: if a key is whitelisted, it is whitelisted for everyone. Data and indexes are kept for 4 days.


1.0 was very tied to Cassandra, except for indexing of tag keys. It is difficult to add nodes to a very big cluster, so they are considering whether it makes sense to use other backends such as AWS. This would keep Cassandra available but make others available too. The backend uses timestamped IDs to allow multiple writes to the same trace ID, and supports different compression techniques.


Case in point: a trace with 1000 spans ends up mostly identical, as most of the index fields are the same. Since the source field is just the trace ID, it ends up as small data.


The reader delegates a trace ID fetch to the backend driver (Cassandra or another backend), which is a gRPC service. Reader code is collocated with the indexer in the same codebase. If you need to implement only one direction, you implement a no-op for the other, for example the read path.
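To make the read/write split concrete, here is a rough sketch of the shape described above. The names are hypothetical (this is not the actual Haystack gRPC contract); the point is that a backend used only for ingest can ship a no-op read path:

import java.util.Collections;
import java.util.List;

// Hypothetical storage-backend contract illustrating the reader/indexer split.
interface SpanStoreBackend {
  /** Indexer path: append spans for a trace, keyed by (traceId, write timestamp). */
  void write(String traceId, byte[] compressedSpans);

  /** Reader path: fetch all writes for a trace ID. */
  List<byte[]> read(String traceId);
}

// A deployment that only ingests implements the read path as a no-op.
final class WriteOnlyBackend implements SpanStoreBackend {
  @Override public void write(String traceId, byte[] compressedSpans) {
    // push to Cassandra (or another store); timestamped IDs allow multiple
    // writes to the same trace ID
  }

  @Override public List<byte[]> read(String traceId) {
    return Collections.emptyList(); // no-op: this deployment never serves reads
  }
}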


The Traces UI was inspired by Splunk (the timeline view), and then came the timeline view which is the more standard trace graph.


--- Trends

trends are based on service and operation name


The Trends subsystem is there to allow multiple users of the same data. Query is provided by Metrictank, a Graphite drop-in replacement. Orbitz originally wrote Graphite and understood there were scaling problems. Metrictank is stream based, so it allows anomaly detection. Its backend is based on Elasticsearch + Cassandra. Grafana open sourced it, based on a Facebook paper.


--- Service Graph

Includes node-finder and graph-builder.


In HomeAway data is cleaned upstream; there's no distinct stage for cleaning in Expedia.


node-finder remembers the pattern used for ID sharing between services. Caching is implemented with Kafka.


--- Anomaly Detection

It is decoupled: in other words, Haystack can use other detection systems, and anomaly detection can be used without Haystack.


--- Pipes

This allows you to push span data to other consumers such as AWS Athena.


--- Collector (optional)

Kinesis is heavily used; HTTP POST for Lambda and those who don't like agents; Pitchfork for Zipkin consumers.


--- Haystack-agent (optional)

  • helps with Kafka upgrading (version updates are hard to push into user code)
  • helps with broken user config
  • separates out infra and the schema registry
    • e.g. look up the topic for spans; the result is Kafka or Kinesis
    • allows easy shifting of data

Demo (magesh)

A Vizceral-based view shows interactions between services. It is one of the dashboards used by ops teams, where red is a hint of a problem, not necessarily an outage.

universal search bar is similar to splunk

service graph node detail gives a breakdown with statistics between links

The trace view has a two-label overlay. If a call is remote, service labels are stacked with a client and server overlay. It is understood the server's name is more authoritative vs. what the client calls it. Without regard to ID sharing, client and server spans are merged and stacked into the same row.


Q: Willie: can we see the intermediates, such as proxies or Hystrix, between client and server?

A: magesh: we may retain data and present it accordingly, though this data can confuse users.

A: Ayan: what was mentioned about a third service name (proxied_for) could help


the tags "request" and "response" hyperlink to the blobs of these items. This is not open sourced yet, and not tracing based, but it is easier to find within the context of a trace.

Trends gives you metrics for every (service, operation) pair regardless of whether it is remote or not. Not much on messaging yet.


Anomaly detection is on the raw data; only if there is a subscription is there an alert (currently Slack and email, e.g. to PagerDuty).


Deeper dive by Ayan

Trends has two components: transformer and aggregator


The transformer converts spans into metric points in the metrics 2.0 format (Grafana spec).

Data is a key, a time point and tags (the tags are operation and service name).

The key used in Kafka partitioning includes service and operation name, which allows more balance vs. just the service name. The service name alone, especially an entry point, could be too heavy.
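A small sketch of that keying decision (the delimiter and helper names are illustrative, not the actual Haystack format): using service plus operation as the Kafka record key spreads a hot entry-point service across partitions, where keying by service alone would not.

import org.apache.kafka.clients.producer.ProducerRecord;

public class MetricPointKeys {
  // Hypothetical key format: service and operation together balance partitions
  // better than the service name alone.
  static String partitionKey(String serviceName, String operationName) {
    return serviceName + "|" + operationName;
  }

  static ProducerRecord<String, byte[]> metricPointRecord(
      String topic, String service, String operation, byte[] metricPoint) {
    return new ProducerRecord<>(topic, partitionKey(service, operation), metricPoint);
  }
}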


aggregator aggregates these metricpoints

  • metrictank
  • adaptive alerting
  • it has no domain knowledge of whether data is span or node based

Aggregations

  • service operation
  • service count
  • success (based on the tag named "error")


Trends

All apps that read from and write to Kafka are KStreams apps. Time is always based on span.startTime and buckets are 5 minutes; ending the bucket is implicit. Windowing is used for out-of-order messages. A second 5-minute window is available for late data, but once closed it is committed and writes are not possible afterwards. This is a Metrictank limitation (no writing afterwards). Metrics on the UI and anomaly detection are effectively 5m latent. The windowing logic is custom, based on HdrHistogram, which was faster and more configurable than Codahale, which was also available at the time.
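A rough Kafka Streams sketch of this kind of 5-minute bucketing with a grace period for late data (topic names and the simple count aggregate are illustrative; the real Haystack logic is custom and HdrHistogram based, and uses span.startTime via a timestamp extractor):

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TrendsWindowingSketch {
  public static Topology build() {
    StreamsBuilder builder = new StreamsBuilder();

    // Hypothetical topic: key is "service|operation", value is span duration in
    // micros, with record time taken from span.startTime at ingest.
    KStream<String, Long> durations =
        builder.stream("span-durations", Consumed.with(Serdes.String(), Serdes.Long()));

    durations
        .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
        // 5-minute buckets; one more 5-minute grace window accepts late or
        // out-of-order data, after which the bucket is committed for good
        // (mirroring the "no writes afterwards" Metrictank limitation).
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(5)))
        .count(Materialized.with(Serdes.String(), Serdes.Long()))
        .toStream((windowedKey, count) ->
            windowedKey.key() + "@" + windowedKey.window().start())
        .to("trend-counts", Produced.with(Serdes.String(), Serdes.Long()));

    return builder.build();
  }
}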


Node Finder

This needs to wait for the receiving edge to occur (a 5 second timeout between when the client sends and the server receives), then writes to Kafka. The in-memory graph uses KSQL. This can be actively queried until the topic's retention expires (24 hours), but it is snapshotted every 15 minutes. A state store is used for crash protection.


Aside: zipkin based dependency linker for Kafka


Sidebar on partitioning

Teams want to see only their own services. Another way is to partition based on entry point path, for example an edge gateway endpoint. While it is a challenge to propagate this (vs. a timeout when a trace completes), it can help attribute graphs to the right team. A weight on the edge could also be added to emphasize an important service (e.g. tier).


Another option in HomeAway is to show a bubble-based representation, which is a nice way to show traffic 1 hop downstream. Large bubbles correspond to higher counts of traced requests. There's also another internal tool which is a grid-based heat map (Vector is its name) with success percentage over time.

Adaptive Alerting and Simplifying Ops

We need to challenge thinking as a lot of money is at stake.


--- Key Challenges

  • Too many alerts


--- Adaptive Alerting

Adaptive Alerting improves MTTD with accurate detection on time series data. It supports ML-based approaches, including automated model selection. Most models use a constant-threshold approach (like disk space), but another model type is variable thresholds.


aa-metrics reads from a Kafka topic then pushes to a mapper. The model service knows the available models and detectors; it annotates the metric with the appropriate detector. Both aa-metrics and mapped-metrics are KStreams apps. The manager classifies these as anomalous or not and, if so, puts them onto an anomalies topic. This process only classifies: it doesn't do any alerting. Additional managers based on external detectors will come in the future.
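A stripped-down sketch of that mapper/manager flow (type and method names are hypothetical, not the real Adaptive Alerting classes): the mapper annotates each metric with the detectors that apply, and the manager only classifies and publishes anomalies; alerting and subscriptions live elsewhere.

import java.util.List;

// Hypothetical shapes illustrating the flow described above.
public class AdaptiveAlertingFlowSketch {

  interface MetricPoint { String key(); double value(); long epochSeconds(); }

  interface ModelService {             // knows the available models/detectors
    List<String> detectorIdsFor(MetricPoint point);
  }

  interface Detector {                 // constant threshold, variable threshold, ...
    boolean isAnomalous(MetricPoint point);
  }

  interface DetectorRegistry { Detector byId(String detectorId); }

  interface AnomalyTopic {             // stand-in for the Kafka "anomalies" topic
    void publish(MetricPoint point, String detectorId);
  }

  // mapper stage: annotate the metric with each relevant detector;
  // manager stage: classify only and, if anomalous, publish. No alerting here.
  static void process(ModelService models, DetectorRegistry registry,
                      AnomalyTopic anomalies, MetricPoint point) {
    for (String detectorId : models.detectorIdsFor(point)) {
      if (registry.byId(detectorId).isAnomalous(point)) {
        anomalies.publish(point, detectorId);
      }
    }
  }
}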


They take training data and ship it to S3, where algorithms can be executed, maybe by Athena; a trained model can then be sent to the manager for classification.


Model selection and optimization is used to pick the right family, e.g. count for constant streams of data. There might be multiple families that match, so there could be some means to choose the best model for the metric points. One example is time series forecasting (possible because the data is all labeled). Another: an actual bookings drop would be fed back into the system so that it can learn how to detect patterns leading to dropped bookings. Feedback on classifications could be thresholds, e.g. a 10% anomaly rate might indicate a bad model choice.


Runtime performs detection, training trains the model, and model selection automates model selection. After model selection is done there's one topic per detector type, with a possible exception for the built-in ones.


The mappers send mapped metrics to only the detectors relevant to the data. For example, if the data isn't fit for constant threshold, the constant-threshold detector won't get data for bookings. Not everything needs to be mediated through the model service; an external detector could be mapped as well.


--- Simplifying Operations

When something goes wrong, it notifies. if it gets worse, there's another notification, diagnosis and fix attempt.

concrete playbook: when there's a bookings drop caused by bot traffic, get rid of the bots by routing etc.


Jeff: a lot of the time, when an error occurs it has effects downstream. I might not want to be bothered by the downstream alerts. If that cascades down, it would be nice for those leaf recipients to know that something is underway, so that they have knowledge but no direct need to take action.


Simple-minded dedupe: same email subject. An alternative is roll-up by a higher level context.


Haystack generates metrics, but other sources do too, and they end up in Graphite. Graphite streams to adaptive alerting for classification. It is important to know that these anomalies could be high volume. Vector diagnosis looks at the classifications to compare them against each other and against other data such as Splunk queries etc., producing a stream of actionable signal. This high-detail, actionable signal is sent to Armor.


Armor can produce a deployment (roll back, for example), bot treatment, and alerts. The goal for alerts is to reduce them to infrequent ones that should be looked at individually. An idea would be a code red, which could disable releases etc.


Currently, in Haystack, there's an alert manager which has the prior version of the functionality discussed here. Currently operations implement playbooks to do "vector diagnostics" manually. An experiment could be using the service graph as a way to understand what's going on. A system outage will have a pattern and clustering to it on the way to the outage. We want to look at this pattern and see if we could have predicted the cause. Another idea is to predict potential failures from a single node.

Image recognition can be a good model: it can classify a dog regardless of which direction it faces and noise in the image. This style of approach can be used on time series and graph datapoints to identify a state in the system.


Q: what is the long-term plan?

We want to be able to use Haystack as a provider of data for vector diagnostics. It would be nice to have resiliency mechanisms visible in the path of the data; that's nice to know when considering remediations.


A nested query is currently possible. Also, an out-of-band process can clean the data and add link-based tags (e.g. client.service, server.service). Another concern is whether you can peel unrelated traffic off the graph.


Would it be possible to somehow let a user get access to the raw data frames used by service graph snapshots?


A completely stitched service graph is an option, but that's only available for individual trace IDs.


Would like to train on Expedia data, but local behaviour, for example Hystrix, is implicit. Can I tell that when one thing broke it caused a cascade? For example, 17 patterns of failures; as the services change I don't need to train on more data. Some means to accomplish this could be data-collection in nature, like a state flag relating to a circuit breaker or a retry annotation.


An automated incident report could be a feature that helps collect metadata about incidents, secondarily for training purposes.

01-16


RNN

The Adaptive Alerting proof of concept needs an automated option (deep learning or otherwise) as there are too many metrics to manually understand anomalies, especially over time. Some facets that an algorithm must address include time-based patterns (e.g. busy during mornings) as well as holidays.


--- LSTM (Long short term memory)

LSTM units are the units of a recurrent neural network (RNN). For example: what are the current trends, considering the long-term ones, with the goal of prediction.

This is appropriate as time series data is by definition focused on time.


--- Anomaly detection

An anomaly is when there is a difference between a predicted value and an actual one. A multi-variate example is contextual anomaly detection. For example, bookings dropping overall vs. per client type or marketing channel; these can collectively determine context.
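As a deliberately simple illustration of "anomaly = difference between predicted and actual", the sketch below flags a point when the prediction error exceeds k standard deviations of recent errors; the LSTM discussed here would be one way to supply the predicted value (all names and the error rule are illustrative, not the Haystack implementation):

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch: flag a point when |actual - predicted| exceeds k standard
// deviations of the recent prediction errors. Any forecaster (LSTM, ARIMA, ...)
// can supply the predicted value.
public class PredictionErrorDetector {
  private final Deque<Double> recentErrors = new ArrayDeque<>();
  private final int window;   // e.g. 288 five-minute points = one day
  private final double k;     // sensitivity, e.g. 3.0

  public PredictionErrorDetector(int window, double k) {
    this.window = window;
    this.k = k;
  }

  public boolean isAnomalous(double predicted, double actual) {
    double error = Math.abs(actual - predicted);
    boolean anomalous = recentErrors.size() >= window && error > k * stdDev();
    recentErrors.addLast(error);
    if (recentErrors.size() > window) recentErrors.removeFirst();
    return anomalous;
  }

  private double stdDev() {
    double mean = recentErrors.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    double var = recentErrors.stream()
        .mapToDouble(e -> (e - mean) * (e - mean)).average().orElse(0);
    return Math.sqrt(var);
  }
}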


--- Methods

Thresholds are easy to explain, e.g. a metric crossing a numeric constant like 100. However, this constant itself needs tuning periodically.

Stdev, IQR and z-score are good for understanding short-term behaviour, but do not take the long term into consideration.

Stats models (ARIMA/STL) need to be fitted per time series and retuned occasionally, and are also limited to a single dimension.

  • bookings and traffic follow the same pattern, but you cannot use the same model against both time series; they have to be trained individually


--- issue

Monitor regularly spaced time series for anomalies: failure count, duration.

Goal: fully automate to forecast accurately, detect deviations from the baseline, and learn without manual training/tuning.


repeat(train → forecast → detect → intervene)

  • operators, when receiving an alert (intervene), can up- or down-rank the alert depending on whether it was helpful or not. This is the feedback that results in a repeat of the training.

Example (Magesh): there may be no booking for hours; how does this handle lack of data? A: the model attempts to predict the next booking based on the void understood from the last duration. Q: how does this result in daily accuracy? A: the long-term window is how long it is trained on (e.g. 3 months or 1 year of data fed through); the short term is how long it waits to process the prediction. This short term might be windowed to a day in the example of infrequent data. A: because there is so little data, you wouldn't outage-alert, as zero values are usual. However you might want to know if there is a change in the volume. You can re-scale the metric to a coarser grain to help with low volume. You might want to have multiple scales of data, e.g. to know that 1hr is not good but daily is still fine.


Example with hotels: 1 minute is OK, but hours are not OK. In this case you can be bleeding a small amount of bookings every minute but not know it is a problem if you weren't looking at the hour.


The differences in anomaly count per period (5m, 1hr, 1 day) etc. can themselves be feedback used to retrain the model. Another angle is that if people always thumbs-down the 5m alerts, the system can learn that that period is not useful.


A trained model initialised with 30 days of data (M1) can be retrained on a shorter interval like 7 days.


--- architecture

Anomaly detection is a KStreams app. There's a UI to display the anomalies.

A cron job takes metrics into S3, which triggers a lambda that decides whether training should occur or not. A revised model will be processed with another lambda and also stored in S3.


The same thing could alternatively work with pipeline.io; right now the infrastructure choices are Expedia specific, but they aren't necessary to achieve the same functionality.


--- demo

Alerts are Slack notifications, only for tier-1 services. The alert UI shows actual values versus predicted ones. In the future it could look at traces for the same duration as the anomaly. Human intervention (thumbs up, thumbs down) can take time to integrate into the model, as "normal" may have changed. Muting is another consideration in this case. The first implementation may not implement muting, but you can always forward alerts to a system which can (e.g. PagerDuty).


Q (Raja): can you see historical anomalies? In the Haystack UI you can, and expected future functionality is a 30-day history.


Q (Shreya): how do we handle anomalies that happen with high cardinality, like 2000 similar anomalies?

A (Magesh): relational data is expected to help resolve these, e.g. propagating causal information to reduce the noise. The system may raise on everything, but another layer like Vector can reduce these. This allows the source anomalies to be simpler (single dimension), as graph (the component in alert manager) can combine these signals for classification.


model selection is more likely to be about the type of metric vs the value of it (ex service/operation). however, the model itself is partitioned by the value itself.


Can you monitor decreases in values as well? For example, a backend reducing its need to make calls could initially be an anomaly.


Blobs

Tracing is done in the framework itself, but for binary data we allow developers to use a different API to record the blob associated with the current span. For example, this will result in an S3 upload and a tag with the corresponding URL.


There are a few ways to do this; one way is a library. The internal code is a blob store API with some properties shared with the span, for example trace identifiers and timestamp. There is some conditional code that tries not to tag a span unless the upload succeeds.
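A rough sketch of what such a library can look like (the names are hypothetical; the internal Haystack blob code is not open sourced yet): upload first, and only tag the span if the upload succeeded.

import brave.Span;
import brave.propagation.TraceContext;

// Hypothetical API shape; the internal Haystack blob library differs.
interface BlobStore {
  /** Uploads the payload (e.g. to S3) and returns its URL, or null on failure. */
  String upload(TraceContext context, String contentType, byte[] payload);
}

final class BlobWriter {
  private final BlobStore store;

  BlobWriter(BlobStore store) {
    this.store = store;
  }

  /** Records a request or response body against the current span. */
  void record(Span span, String tagName, String contentType, byte[] payload) {
    String url = store.upload(span.context(), contentType, payload);
    if (url != null) {              // conditional code: only tag on success
      span.tag(tagName, url);       // e.g. tagName = "request" or "response"
    }
  }
}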


Q: how do you decide whether to record blobs or not? A: there is a debug header that, when present, triggers recording like this. But services like bookings can also record regardless, via configuration. Use cases like canary recording are also done.


Q (JC): what type of data is recorded? A: it takes data like JSON or XML, and a content-type header can clarify.

Q (Jorge): what about privacy? A: there is a place where you can say what will be scrubbed, e.g. based on object type it can run a custom redactor. One can also run another scrubber on the other side. There are two things we scrub, span tags and blobs; the former is open sourced now.


Idea 1 is to provide a separate API; when A calls B we know the context. It can make a new span for the tags or update an existing one. Right now OpenTracing has no explicit capacity to update a span after it is finished (flush in Brave). This might lead to arbitrarily creating child spans or delaying finish, both being bad options. Another option is to pre-emptively mark the spans and not tag conditionally on the outcome of the post.


This could be used to fix some data recording problems, e.g. if a tag is larger than a specific amount, make it a "blob" value which dispatches the command to the blob service itself. The challenge remains to decide which APIs own what.


Tommy: at Rakuten we just logged the request/response with trace IDs, then we just navigate to the logs from the trace UI.

Magesh: we used to do this, too. There were just some issues like truncation, as sometimes the responses are > 15 MiB, so we moved those to URLs in logs, but then you had to hop from trace → log → S3, so Blobs was a simplification of this process.

Kafka Tracing

In the beginning there was no viable way to trace across Kafka Streams, as headers weren't available across phases. As of Kafka 1.1 people can use headers, so it is possible to trace. The thinking then changes to how to model tracing. There were no hooks available to automatically instrument, so the first phase was just producer to consumer. However, this didn't show transformations. The next step was to add hooks to wrap transformations such that they are decorated in new spans. This doesn't show all activities, only the ones that are explicitly instrumented.
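For reference, this is roughly what the hook approach looks like with Brave's Kafka Streams instrumentation (the topology is a placeholder and method names are from memory, so treat this as a sketch rather than the exact API): client wrappers continue the trace via headers, and the module also offers wrappers so selected processors show up as their own spans.

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

import brave.Tracing;
import brave.kafka.streams.KafkaStreamsTracing;

public class TracedStreamsApp {
  public static KafkaStreams build(Tracing tracing, Properties streamsConfig) {
    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("input-topic").to("output-topic"); // placeholder topology

    Topology topology = builder.build();

    // Wraps producers/consumers so spans continue across phases via headers.
    // Individual transformations only get their own spans where the topology
    // uses the instrumented wrappers, matching "only what is explicitly
    // instrumented shows up".
    KafkaStreamsTracing kafkaStreamsTracing = KafkaStreamsTracing.create(tracing);
    return kafkaStreamsTracing.kafkaStreams(topology, streamsConfig);
  }
}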


Eduardo: in our case, we use OpenTracing, which creates one big span with log statements (annotations) for each stage.

Adrian: one constraint of the log approach is that not all backends actually support "logs"; for example, in Stackdriver these are converted to timestamp-based keys, which isn't attractive.


Not every transform or map should be traced, so it is important to be able to tune the tracing topology.


Magesh: how does the context come in? how does it propagate?

Jorge: the context is header based.


What about the interceptors: should they be in Brave?

The configuration of connectors needs to construct its instance of Brave somehow. The interceptors project bootstraps this, and we can consider options for those who need to externalize config independently. There were other requests for initializing zipkin-reporter with a URL-based syntax, which is helpful for those who have custom CLI/bootstrapping concerns.


Pitchfork

2 years ago Hotels.com moved to AWS one component at a time, implying lots of WAN traffic. This slowed things down and made them fragile. We needed to know more about things like whether calls are sequential or parallel. Grafana and Splunk were used; however, it was hard to tell what's going on by looking at timestamps. Also there were not many people who knew the end-to-end behaviour. We started with Zipkin, but only one application, notably with Hystrix method executions. We could then see where calls were sequential vs. parallel. Then other teams found they had similar problems, which led to organic growth of the Zipkin installation/instrumentation.

Finally, we learned about Haystack; we didn't know it existed before. One benefit of Haystack, the main one, was that Hotels.com calls end up in Expedia systems, so the traces could continue for the whole journey. The problem was that we had a lot of investment in Brave, Sleuth etc., so we wanted a bridge or proxy to join the data. Once we did this, we noticed that folks like Eduardo at HomeAway also had similar needs for converting data from the Zipkin model into the Haystack model. We then considered that this could be helpful for open source in general (it took 6 months to get out there). Now, it pretends to be Zipkin, accepting all formats, and can convert and push to a Haystack Kafka or Kinesis queue. It can also do Kafka-to-Kafka conversion. There's no roadmap, which is why attending this meeting is helpful.


Magesh: on opportunities: 1. if there's a Thrift or other implementation of Zipkin, having Pitchfork allows trends to run in parallel. 2. if someone is already sending Kafka into Zipkin, they can use Pitchfork to transform and send to trends via a KStreams app.

Adrian: this is a good opportunity for the Zipkin meeting in March, because "rosetta stone" processes are a proven interesting idea.


We primarily use HTTP as it is also easier for users to test locally. In HomeAway, cleaning up of names is done separately, and it could also be added as a bolt-on to Pitchfork.

Adrian: Kafka adds significant support load, so we can consider to what degree we can hide it.

Jeff: do the Terraform scripts help?

Adrian: the problem is they are also unlikely to run Terraform. We have users who won't even run Docker.

Control plane

An experimental admin app for throttling services sending data to Haystack. The 3 stages are: measure, throttle, scale.

Another way to describe this is: measure, recommender, actuator (which implements the rule).


Measure ends up being an attribution task, for example which service is involved. A rule is pushed to an agent to throttle traffic. Agents can also request more flow, which could end in a scale event.


Attribution: service, operation cardinality, span count and size, blob count and size.

Health is broken down by subsystem: trends, traces, service graph, collector; also by infra like Kafka.


The next step is to add throttling. Throttling is an onboarding decision. The rule is controlled by the control plane but implemented in the agent. This takes the instance count into consideration. It will also take whitelisted tags into account.


The point of the limits is abuse prevention, not sampling.


Pitchfork may also be a scaling factor; some factors, such as the number of Kafka partitions, may have a less obvious workflow implication.


Opportunities

Haystack gets simpler startup

Zipkin gets Adaptive Alerting


Minutes of meeting


Regarding sampling and metrics

  • The SDK approach to trace data collection for X-Ray gives a traces/sec budget; a low-volume service is fine
  • The main problem with sampling is that there are so many inputs. When looking at individual metrics, you look at an operation without the other calls
  • One approach: have a big enough bucket; another is to have 100% collection of data even if sampling
  • Jose: https://people.mpi-sws.org/~jcmace/papers/lascasas2018weighted.pdf - regarding sampling and metrics
  • Keshav: MCMC sampling - if you have distributed sampling, it tends towards the actual value
    1. https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo
    2. issue - It's never a normal distribution
    3. You would need representative sampling - not just 5% of whole group
  • Adrian: Why not capture metrics locally (e.g. inside the agent)?
    • Adrian: Store durations, etc. in the tracer, aggregate locally
      • If the tracer is there for 100% of calls, you can do 100% recording
    • Magesh: Haystack agent was implemented not just for tracing, but also metrics and business data
      • By moving it to the agent, we can choose to enable sampling without compromising trends
    • Ayan: in the context of gathering spans for more than just tracing, allowing aggregation locally may affect other subsystems
      • If we start sampling trace data, that data is only useful for traces, maybe trends, maybe service graph, but not great for long term analytics
      • Must work with the assumption, "If you send more data, you'll get more out of our systems"
    • Ashish: Strip what you need at the source, then do aggregation. Data flowing through Kafka is huge.
    • Adrian: There are always complaints about budget - lower costs = community growth
    • Ayan: Take everything into Kafka? - if storage cost is the problem, don't store it
      • Still pay computation and transfer costs
      • Leveraging the "group by trace ID" buffer during collection
    • Magesh: If we do the eager transform at client level, disable transformation process on server side
      • With that approach, you decouple one part of the system where it happens
      • Clients have to understand that if you enable sampling at that point, it affects traces and offline analytics
        • Ayan: if you take 2% of data, that 2% has to be random.
          • "Is it storage cost or cpu cost that is a bigger problem?"
        • Adrian: When people complain, their executor queue is too full and they don't know what to do.
      • Magesh: Telemetry data offers more than just tracing. As long as data value doesn't change, it makes sense to move aggregation to the client
      • Jeff: What if we choose to send lightweight spans?
        • Adrian: Strip down to the error tag, span ID, and parent ID
        • Magesh: If you turn sampling on, label the tag as "sampled" and adding tags becomes a no-op
          • Jeff: Extend to throttling - if they're throttled, make their spans lightweight
          • Ashish: Even if you strip off tags, events matter
        • Adrian: At Twitter, everything was the same 5 spans; metrics were coherent, and you could quickly decide if you could afford to trace at a certain level
        • Ashish: See LPAS (lodging traces with thousands of spans) where they have low amounts of tags. Lightweight spans don't help here
          • Magesh: that problem won't go away with skeletal spans in LPAS; most sampling will occur at shopping and not booking. It won't solve the problem but it will certainly improve things
            • "Churn more at the same cpu cost"
            • Storage cost will improve
          • Ashish: Does it make sense to send the whitelisted tags
          • Ayan: we apply compression and dedupe
            • Adrian: if we do too much compression, we lose what they sent
    • Magesh: How can we make trends and alerting work with zipkin without causing issues with existing sampling
      • As long as dataset makes it to our Kafka, the zipkin community should be able to utilize trends and alerts
        • Adrian: contributors can test it; having an HTTP endpoint would be the easiest way
          • Use a transition in Haystack UI
        • Ayan: Emit metric points to a separate Kafka topic
        • Adrian: If you have configuration set, you can redundantly send data to both Zipkin and Haystack
        • Ayan: What's the point of zipkin and haystack having different data models?
          • Adrian: Zipkin tags are strings, not a list but a dictionary
    • Magesh: What are the next steps for getting trends/service graph/anomaly pipeline working separately, so if someone spins up haystack-ui and changes to zipkin connector, there should be no difference
      • Adrian: Bundle the stripped down metric points
      • Magesh: UI should come up with zipkin connector,
      • Adrian: Call it bale
      • Magesh: http://metrics20.org/
        • Take span data, make it simple into metrics 2.0, send it out and we can start trending it.
        • Integrating becomes simpler and less risky if we can add it in the brave client code
        • Test it somewhere that uses zipkin (maybe Homeaway)
    • Eduardo: The proposal is: take the Zipkin reporter; one sends to Zipkin with sampling, and the other gets simplified and sent to an HTTP collection endpoint, going straight into the aggregator (see the sketch below)
      • Magesh: The only real piece that needs to be written is the HTTP collection
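A minimal sketch of that proposal (URLs and the exact "simplified" shape are assumptions): a composite reporter keeps the normal sampled path to Zipkin while sending stripped-down spans to a separate collection endpoint that feeds the aggregator. Note that getting 100% of spans onto the second path would additionally require recording spans the Zipkin path does not sample, which is part of what the HomeAway spike described later in these notes needs to prove out.

import zipkin2.Span;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.Reporter;
import zipkin2.reporter.okhttp3.OkHttpSender;

public class DualReportingSketch {
  public static Reporter<Span> create() {
    // Normal trace data to Zipkin (hypothetical host).
    Reporter<Span> zipkin = AsyncReporter.create(
        OkHttpSender.create("http://zipkin.example.com:9411/api/v2/spans"));

    // Simplified ("skeletal") spans to a hypothetical collection endpoint that
    // feeds the trends aggregator.
    Reporter<Span> trends = AsyncReporter.create(
        OkHttpSender.create("http://trends-collector.example.com/api/v2/spans"));

    return span -> {
      zipkin.report(span);
      trends.report(skeletal(span));
    };
  }

  /** Keeps identifiers, timing and the error tag; drops everything else. */
  static Span skeletal(Span in) {
    Span.Builder out = Span.newBuilder()
        .traceId(in.traceId()).parentId(in.parentId()).id(in.id())
        .kind(in.kind()).name(in.name())
        .localEndpoint(in.localEndpoint()).remoteEndpoint(in.remoteEndpoint())
        .timestamp(in.timestampAsLong()).duration(in.durationAsLong());
    String error = in.tags().get("error");
    if (error != null) out.putTag("error", error);
    return out.build();
  }
}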


Ease of use


* Adrian: On the story of ease of use at Zipkin -

    * People (who were leaving Zipkin) would fail on the way to setting up tracing, which fails to give you feedback

    * Not UX issues, but rather being able to use

        * Introduced JSON format for better troubleshooting

        * Row limiting to prevent blowing up

    * Explain spinning up architecture so it 'feels' easy

    * To prevent support channels blowing up, only promise a subset of features

    * Ensure consumers avoid a new deployment for each service

    * Give a limited amount of choices

    * Keep jars small

    * Avoid too many endpoint changes

        * Mono-deployment

    * Get better at answering questions that are similar - avoiding unnecessary cognitive burden

    * Include screenshots of what the UI will look like with the test/stub data

    * Besides custom servers, Kafka is the next time-sink

        * Changed KafkaSender's toString for ease of use

    * Closed PR for explanations, e.g. bring in Kafka - https://github.com/openzipkin/sleuth-webmvc-example/pull/8

    * The whole idea of a single bundle is to be able to step back people into a working config, so they can move forward one step at a time

    * Smoke test circle-ci - https://github.com/openzipkin/zipkin-php-example/blob/master/.circleci/config.yml

        * Haystack UI to utilize this, maybe check for node endpoints returning data

    * Explore example apps with dummy micro services from different languages

* Ashish/Ayan: Why avoid deploying with Kafka, as we can encapsulate it and open an HTTP endpoint

    * Does it make sense to have Cassie with data baked in, rather than using fake spans

    * Jeff: Should the UI have functionality to post a span to the UI

        * Jason/Magesh: Why not just have apply_compose/docker compose send initial spans

        * Magesh: you don't need apply_compose. Yaml file + docker compose. Take the same pieces at integration and put it at the top level

        * Tommy: Running locally vs deploying to real environment

* Adrian:

    * Avoid caching right off the bat if there are fewer than 3 services, because it's likely that you're playing around

    * (Currently 5 minutes on Haystack UI - LoaderBackedCache)

    * Use "real" fake data

        * JSON button on the UI

            * Makes it easier to discover problems while starting up

* Willie: Would running AA make more sense in the JVM than docker compose?

    * Adrian: people don't want to go through an "integration" process. The simplest story will probably be the best

        * Start simple, save fancy stuff, trust that people doing the fancy stuff are still around


Tommy: one thing a lot of sources do is aggregate locally and export at an interval to the metrics service, which then re-aggregates.

Magesh: when we used to aggregate locally with Codahale, the aggregation format wasn't usable.

Eduardo: one advantage of span metrics, even when you are already collecting HTTP latency, is that heterogeneous sources don't create compatible data.

Magesh: this was one reason for us to move into metrics 2.0 format.


Magesh: if we try to send two things, we may have some network rules to change to allow the data to flow between VPCs. We can start with a sample app.

Homeaway spike

Eduardo and Adrian will spike some integration between Brave and Pitchfork to solve the sampled-data analysis problem. Because they are a relatively homogeneous site (Sleuth + Brave), the feedback loop should be short.


We will do 3 patterns:

  • sample app reports normal sampling + 100% skeletal spans to pitchfork /api/v2/spans
  • sample app reports normal sampling + 100% skeletal spans to haystack http collector format
  • sample app reports normal sampling + 100% metricpoints  to pitchfork /metricpoints







6 Comments

  1. Haystack Dev Team

    • Abhishek Srivastava
    • Ashish Aggarwal
    • Ayan Sen
    • Jason Bulicek
    • Jeff Baker
    • Kapil Rastogi
    • Keshav Peswani
    • Shreya Sharma

    HomeAway :  Eduardo Solis

    Hotels.com : Daniel Martins Albuquerque


  2. Thanks, Magesh. I updated and also gave you access

  3. My understanding is that the Zipkin team is keen to help create analytic capabilities on top of trace data. I'm interested to hear thoughts on the possibilities here. The ones that come immediately to mind are

    • anomaly detection on telemetry time series
    • service graph analysis (e.g. identifying higher-level patterns in the way that time series anomalies are distributed across service graph nodes and edges)
      • graph anomalies - generated when the higher-level patterns are themselves anomalous
      • graph-driven diagnostics - understanding what graph anomalies tell us about the current state of the system as well as near-future state

    Let me know if the topics above are interesting to the larger group. Also please share other items worth discussing.


      1. SGTM! please carve a slot on the agenda!

  4. I suggest we start with a Haystack architecture overview by Ashish, followed by specific components (trends and service graph) by Ayan, Adaptive Alerting / Alert Management by Willie and Recurrent Neural Networks by Keshav - so everyone has a good baseline on Haystack and features. We can also go over our roadmap (if needed). We can do a similar walkthrough of Zipkin by Team Zipkin. We can then move into the session suggested by Willie to see how Haystack can help Zipkin site users.