...

Telemetry is data that a system emits about its own behavior. It comes in three forms: Traces, Metrics, and Logs.

  • Logs

    Discrete events that users record explicitly. The recorded information is generally unstructured text, which can provide detailed clues when users analyze and diagnose problems.

  • Metrics 

    Data collected with aggregatable attributes, designed to show the state of an indicator over a period of time so that users can review indicators and their trends.

  • Traces 

    Records of the entire life cycle of a request, including information such as service invocations and processing time.

    a. Trace refers to the chain of calls across all services that an external request passes through. It can be understood as a tree of service calls, and each trace is identified by a globally unique ID.

    b. Span refers to a single call within a service or between services, that is, a node in the Trace tree. Span nodes have parent-child relationships. A Span mainly carries the Span name, Span ID, parent Span ID, timestamp, duration, and other information.
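The relationship between a Trace and its Spans can be sketched as a small data structure. This is a minimal illustration in plain Python, not the OpenTelemetry SDK; the `Span` class, its field names, and the span names such as `fe.query` are hypothetical and chosen only to mirror the fields listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import uuid


@dataclass
class Span:
    """One call within a service or between services (a node in the trace tree)."""
    trace_id: str                   # shared by every span of the same request
    span_id: str
    name: str
    parent_span_id: Optional[str]   # None marks the root span
    start_ms: int                   # timestamp, in milliseconds
    duration_ms: int
    children: List["Span"] = field(default_factory=list)


def build_trace_tree(spans: List[Span]) -> Span:
    """Link spans through parent_span_id and return the root span."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_span_id is None:
            root = s
        else:
            by_id[s.parent_span_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in sorted(span.children, key=lambda c: c.start_ms):
        lines.extend(render(child, depth + 1))
    return lines


trace_id = uuid.uuid4().hex  # one globally unique ID for the whole trace
spans = [
    Span(trace_id, "a", "fe.query", None, 0, 120),
    Span(trace_id, "b", "fe.plan", "a", 5, 20),
    Span(trace_id, "c", "be.exec_fragment", "a", 30, 80),
    Span(trace_id, "d", "be.scan", "c", 35, 60),
]
root = build_trace_tree(spans)
tree_lines = render(root)
print("\n".join(tree_lines))
```

Every span carries the same `trace_id`, while `parent_span_id` is what turns the flat list into a tree.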

2. OpenTelemetry architecture

  • Application: General applications, such as the Doris FE and BE.
  • OTel Library: Also known as the SDK. It is responsible for collecting and exporting telemetry data inside the program.
  • OTel Collector: The OpenTelemetry Collector offers a vendor-agnostic implementation of how to receive, process and export telemetry data. It removes the need to run, operate, and maintain multiple agents/collectors. It offers improved scalability, supports open-source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit), and can send data to one or more open-source or commercial back-ends.
  • Backends: Responsible for persisting and presenting telemetry data and for providing the ability to analyze it, for example Zipkin and Prometheus.
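The flow between these four components can be sketched as follows. This is a toy model in plain Python meant only to show who hands data to whom; the classes `Library`, `Collector`, and `Backend`, and the `add_cluster_tag` processor, are hypothetical stand-ins and not the real OpenTelemetry API.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Backend:
    """Stands in for a store such as Zipkin or Prometheus: it only persists records."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stored: List[Record] = []

    def write(self, batch: List[Record]) -> None:
        self.stored.extend(batch)


class Collector:
    """Receives batches, runs each processor over them, then exports to every backend."""

    def __init__(self, processors: List[Callable[[Record], Record]],
                 backends: List[Backend]) -> None:
        self.processors = processors
        self.backends = backends

    def receive(self, batch: List[Record]) -> None:
        processed = []
        for record in batch:
            for process in self.processors:
                record = process(record)
            processed.append(record)
        # In this sketch every backend gets every record (simple fan-out).
        for backend in self.backends:
            backend.write(processed)


class Library:
    """Stands in for the in-process SDK: buffers telemetry and exports it."""

    def __init__(self, service: str, collector: Collector) -> None:
        self.service = service
        self.collector = collector
        self.buffer: List[Record] = []

    def record(self, kind: str, body: str) -> None:
        self.buffer.append({"service": self.service, "kind": kind, "body": body})

    def flush(self) -> None:
        self.collector.receive(self.buffer)
        self.buffer = []


def add_cluster_tag(record: Record) -> Record:
    """Example processor: attach a (hypothetical) cluster label to each record."""
    return {**record, "cluster": "demo"}


backend_a = Backend("backend-a")
backend_b = Backend("backend-b")
collector = Collector([add_cluster_tag], [backend_a, backend_b])

fe = Library("fe", collector)          # the application embeds the library
fe.record("span", "fe.query")
fe.record("metric", "query_count=1")
fe.flush()

print(len(backend_a.stored), backend_a.stored[0])
```

The application only ever talks to the library, so back-ends can be added or swapped at the collector without touching application code.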

3. What traces can do

  • Slow Query Location
    Traces and spans record how long each query takes, so traces can be used to find the longest-running queries over a period of time.
  • Performance bottleneck analysis
    Spans record the network time between FE and BE nodes and the time spent in each execution node on the BE. Aggregating the time of each span across a batch of queries shows where the performance bottlenecks are.
  • Quickly locate query failures
    Combine traces with logs and metrics: the trace_id and span_id make it possible to quickly locate the related log and metric information.
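The three uses above can be illustrated on a handful of span records. This is a sketch in plain Python over made-up data; the record layout, the span names (`fe.query`, `fe_to_be.rpc`, `be.scan`), and the log entries are all assumptions for illustration and do not reflect how Doris or any backend actually stores traces.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical span records: (trace_id, span_name, parent_name, duration_ms).
SpanRow = Tuple[str, str, Optional[str], int]

spans: List[SpanRow] = [
    ("t1", "fe.query", None, 120),
    ("t1", "fe_to_be.rpc", "fe.query", 15),
    ("t1", "be.scan", "fe.query", 90),
    ("t2", "fe.query", None, 900),
    ("t2", "fe_to_be.rpc", "fe.query", 20),
    ("t2", "be.scan", "fe.query", 850),
    ("t3", "fe.query", None, 40),
    ("t3", "fe_to_be.rpc", "fe.query", 10),
    ("t3", "be.scan", "fe.query", 25),
]

# Hypothetical log entries tagged with the trace_id and span_id that produced them.
logs: List[Dict[str, str]] = [
    {"trace_id": "t1", "span_id": "s2", "msg": "scan finished"},
    {"trace_id": "t2", "span_id": "s5", "msg": "scan timed out, retrying"},
    {"trace_id": "t2", "span_id": "s4", "msg": "rpc sent"},
]


def slowest_queries(rows: List[SpanRow], n: int) -> List[Tuple[str, int]]:
    """Slow query location: rank traces by the duration of their root span."""
    roots = [(trace_id, ms) for trace_id, _, parent, ms in rows if parent is None]
    return sorted(roots, key=lambda r: r[1], reverse=True)[:n]


def time_by_span_name(rows: List[SpanRow]) -> Dict[str, int]:
    """Bottleneck analysis: total time per non-root span name across all queries."""
    totals: Dict[str, int] = defaultdict(int)
    for _, name, parent, ms in rows:
        if parent is not None:
            totals[name] += ms
    return dict(totals)


def logs_for_trace(entries: List[Dict[str, str]], trace_id: str) -> List[str]:
    """Failure location: pull the log messages that share a trace_id."""
    return [e["msg"] for e in entries if e["trace_id"] == trace_id]


top = slowest_queries(spans, 2)
totals = time_by_span_name(spans)
suspect = top[0][0]  # trace_id of the slowest query

print(top)
print(totals)
print(logs_for_trace(logs, suspect))
```

In this made-up data the scan span dominates the total time, and the logs attached to the slowest trace point at a scan timeout, which is the kind of correlation the trace_id is meant to enable.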

...