Introduction and Scope

  • At LINE, we have a few hundred engineers on dozens of teams managing dozens of services. 1.5 engineers work on distributed tracing. Almost all services are in Java; some are in Erlang.
  • At LINE, we spend < 5% of our infra cost and < 1% of our human cost maintaining our observability stack, which includes:
    • Metrics (~50 million metrics; built in-house using opentsdb / mysql)
    • Logging (built in-house using elasticsearch)
    • Tracing (using openzipkin)
  • Our primary integration point with zipkin is armeria, which has native support for instrumenting server and client requests (a minimal wiring sketch follows this list).
    • Some users use spring sleuth, and some use envoy to send spans
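
As a rough illustration of the armeria integration, the sketch below wires Brave's Tracing into an armeria server and client. It is a minimal example rather than our production setup: the class names (BraveService, BraveClient) assume a recent armeria release with the brave module (older releases used HttpTracingService / HttpTracingClient), and spans are reported over plain HTTP here, whereas in production we use the custom Thrift reporter described under System Overview.

    import brave.Tracing;
    import com.linecorp.armeria.client.WebClient;
    import com.linecorp.armeria.client.brave.BraveClient;
    import com.linecorp.armeria.common.HttpResponse;
    import com.linecorp.armeria.server.Server;
    import com.linecorp.armeria.server.brave.BraveService;
    import zipkin2.reporter.AsyncReporter;
    import zipkin2.reporter.urlconnection.URLConnectionSender;

    public class TracingBootstrap {
      public static void main(String[] args) {
        // Illustration only: report spans over HTTP to a zipkin collector.
        AsyncReporter<zipkin2.Span> reporter = AsyncReporter.create(
            URLConnectionSender.create("http://zipkin.example.com:9411/api/v2/spans"));

        Tracing tracing = Tracing.newBuilder()
            .localServiceName("bot-frontend-service") // free-form service name (see "Service name")
            .spanReporter(reporter)
            .build();

        // Server side: every request handled by this server produces a server span.
        Server server = Server.builder()
            .http(8080)
            .service("/hello", (ctx, req) -> HttpResponse.of("hello"))
            .decorator(BraveService.newDecorator(tracing))
            .build();
        server.start().join();

        // Client side: outbound calls are recorded as client spans and join the trace.
        WebClient backend = WebClient.builder("http://backend.example.com:8080")
            .decorator(BraveClient.newDecorator(tracing))
            .build();
        backend.get("/api"); // traced outbound call
      }
    }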

System Overview

  • Instrumentation

    • Use Brave for instrumentation.

    • Most instrumentation happens within armeria, which instruments server and client requests.

    • Other standard brave adapters used include mysql.

    • Custom brave adapters have been implemented for data stores like redis, mongo.

    • A custom reporter uses armeria RPC-over-HTTP/2 (Thrift) to send spans (sketched below)
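
    The reporter itself is internal to LINE; the sketch below only suggests the general shape such a reporter could take. SpanCollectorService is a hypothetical stand-in for the Thrift-generated stub (the real IDL is not published on this page), the endpoint URI is made up, and the code assumes a zipkin version that still ships the v1 Thrift codec (SpanBytesEncoder.THRIFT). A real reporter would also batch spans rather than issue one RPC per span.

        import com.linecorp.armeria.client.Clients;
        import java.nio.ByteBuffer;
        import java.util.Collections;
        import zipkin2.Span;
        import zipkin2.codec.SpanBytesEncoder;
        import zipkin2.reporter.Reporter;

        /** Hypothetical stand-in for the Thrift-generated client stub. */
        interface SpanCollectorService {
          interface Iface {
            void sendEncodedSpans(ByteBuffer encodedSpans) throws Exception;
          }
        }

        /** Rough shape of a reporter that ships spans over armeria Thrift RPC. */
        final class ThriftRpcReporter implements Reporter<Span> {
          // "tbinary+h2c" = Thrift binary serialization over cleartext HTTP/2.
          private final SpanCollectorService.Iface collector = Clients.newClient(
              "tbinary+h2c://monitoring-api.example.com:8080/collect", // hypothetical endpoint
              SpanCollectorService.Iface.class);

          @Override public void report(Span span) {
            // Pre-encode on the client; the server stores the bytes without re-processing.
            byte[] encoded = SpanBytesEncoder.THRIFT.encodeList(Collections.singletonList(span));
            try {
              collector.sendEncodedSpans(ByteBuffer.wrap(encoded));
            } catch (Exception e) {
              // Best effort: a dropped span is acceptable (see "Goals").
            }
          }
        }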

  • Data ingestion

    • Primary transport is plain HTTP/2, as used by armeria for RPC

    • A monitoring API server sits between all instrumented servers and the data store.

      • Implements zipkin-api as well.

      • Exposes zipkin-ui as well

      • Fully asynchronous, non-blocking

    • The API is a simple thrift service with a union of two fields:

      • binary encoded_spans - Spans that have already been serialized as a list of zipkin thrift spans. This would be the result of using zipkin-java’s ThriftCodec. This field is preferred for languages that already have good zipkin implementations (see the encode/decode sketch after this list).

      • list<zipkin.Span> spans - Spans that can be filled in via the generated language code. This should be useful for languages that don’t have zipkin implementations, as they can fill in the generated structs without duplicating models and business logic. Not used yet, but would probably be used from erlang.
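
    To make the encoded_spans path concrete, here is a small sketch using zipkin2's codec classes (the successor of the zipkin-java ThriftCodec mentioned above; whether SpanBytesEncoder.THRIFT is still present depends on the zipkin version). The span contents are example values only.

        import java.util.Collections;
        import java.util.List;
        import zipkin2.Endpoint;
        import zipkin2.Span;
        import zipkin2.codec.SpanBytesDecoder;
        import zipkin2.codec.SpanBytesEncoder;

        public class EncodedSpansExample {
          public static void main(String[] args) {
            Span span = Span.newBuilder()
                .traceId("463ac35c9f6413ad48485a3953bb6124") // example ids, not real data
                .id("48485a3953bb6124")
                .name("get /hello")
                .localEndpoint(Endpoint.newBuilder().serviceName("bot-frontend-service").build())
                .timestamp(1_000_000L)
                .duration(1_000L)
                .build();

            // What an instrumented client would place in the `encoded_spans` field:
            byte[] encodedSpans = SpanBytesEncoder.THRIFT.encodeList(Collections.singletonList(span));

            // What the monitoring API server can do with that field before writing to storage:
            List<Span> decoded = SpanBytesDecoder.THRIFT.decodeList(encodedSpans);
            System.out.println(decoded);
          }
        }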

  • Data store and aggregation

    • Elasticsearch cluster - 6 nodes on physical machines with Xeon CPUs, 32 GB RAM, and 6 SAS HDDs each

    • Use elasticsearch’s Curator tool with cron to clean up indexes that are more than a week old

    • The monitoring API server uses zipkin’s elasticsearch-storage to write spans into elasticsearch with no extra processing (a storage sketch follows this list).

    • Best effort - spans dropped along the way are simply lost (no buffering transport like kafka)

    • In practice, we don’t see many failures
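
    The write path is just zipkin's storage API. Below is a minimal standalone sketch assuming a zipkin-storage-elasticsearch release whose builder exposes hosts() / index() (builder options vary by version), with a made-up host name:

        import java.util.Collections;
        import java.util.List;
        import zipkin2.Span;
        import zipkin2.elasticsearch.ElasticsearchStorage;
        import zipkin2.storage.SpanConsumer;

        public class EsWriteExample {
          public static void main(String[] args) throws Exception {
            ElasticsearchStorage storage = ElasticsearchStorage.newBuilder()
                .hosts(Collections.singletonList("http://es-node1.example.com:9200")) // hypothetical host
                .index("zipkin")
                .build();

            SpanConsumer consumer = storage.spanConsumer();
            List<Span> spans = Collections.emptyList(); // placeholder: spans decoded from the collection API
            consumer.accept(spans).execute(); // best effort; failures just drop spans
            storage.close();
          }
        }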

  • UI: we created https://github.com/line/zipkin-lens to fit our usage and moved it to openzipkin.

Goals

  • Best effort - as long as latency investigation can happen, occasional broken traces aren’t a big deal.

  • Eventually we want to instrument all servers - currently we only instrument one team’s servers, comprising several services, each with dozens of serving machines.

  • Will need erlang instrumentation

Current Status (10-2018)

  • Ingest rate is around 8000 spans per sec

Service name

At LINE, service names are chosen freely by our users. Users typically name a service after its cluster's purpose, like "bot-frontend-service" or "shop-ownership-service".

Site-specific tags

The following are span tags we frequently use in indexing or aggregation

Tag              Description                                    Usage
instance-id      Our company-wide naming for a project
instance-phase   Our company-wide naming for an environment
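
As an example, such tags can be attached through Brave's SpanCustomizer. The sketch below reads the values from environment variables, which is only an assumption for illustration, not our actual wiring:

    import brave.SpanCustomizer;

    final class SiteTags {
      // Values come from environment variables here as an assumption; the real
      // instance-id / instance-phase values follow LINE's internal conventions.
      static void apply(SpanCustomizer span) {
        span.tag("instance-id", System.getenv().getOrDefault("INSTANCE_ID", "unknown"));
        span.tag("instance-phase", System.getenv().getOrDefault("INSTANCE_PHASE", "unknown"));
      }
    }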

1 Comment

  1. Maybe tapper can help if your erlang services are Elixir https://github.com/Financial-Times/tapper