Status

Current state[One of "Under Discussion", "Accepted", "Rejected"]

Discussion thread: https://lists.apache.org/thread/mvcfhj12hpk00ov1rhkw1k5d811jk8pj

JIRA or Github Issue: 

Released: <Doris Version>

Google Doc: <If the design in question is unclear or needs to be discussed and reviewed, a Google Doc can be used first to facilitate comments from others.>

Motivation

  1. Traces, metrics, and logs are often called the three pillars of observability. Doris currently does not collect trace data, which makes it difficult to locate slow queries and troubleshoot system bottlenecks. With OpenTelemetry, trace data can be collected to monitor how each request is executed, which greatly improves system observability.
  2. Doris does not currently implement a uniform open standard for telemetry data collection, which makes it hard to export telemetry data to third-party systems for analysis. OpenTelemetry defines open standard semantic conventions, provides vendor-neutral instrumentation libraries for multiple programming languages, and makes it easy to export telemetry data to different backends (including Zipkin, Jaeger, Prometheus, etc.).
  3. The telemetry data that Doris currently collects is not correlated, so it is impossible to move quickly from one kind of telemetry data to another. With OpenTelemetry, traces, metrics, and logs can be correlated. For example, we can attach the trace ID and span ID to metrics through exemplars to correlate traces with metrics, and inject the trace ID and span ID into logs to correlate traces with logs, so that all telemetry data related to a problem can be located quickly.

Related Research

1. Telemetry

Telemetry refers to data that a system emits about its own behavior. The data can come in the form of traces, metrics, and logs.

  • Logs

    Discrete events actively recorded by users. The recorded information is generally unstructured text, which can provide more detailed clues when analyzing and diagnosing problems.

  • Metrics 

    Collected data with aggregated attributes, designed to show the running status of an indicator over a period of time, so that users can view indicators and their trends.

  • Traces 

    Records of the entire life cycle of a request, including information such as service invocations and processing time.

    a. A Trace is the call chain of all services that an external request passes through. It can be understood as a tree of service calls, and each call chain is identified by a globally unique ID.

    b. A Span is a single call within a service or between services, that is, a node in the Trace tree. Span nodes have parent-child relationships. A Span mainly carries the span name, span ID, parent span ID, timestamp, duration, and other information.
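
The Trace/Span relationship described above can be sketched as a small data model. This is only an illustration, not the OpenTelemetry SDK or Doris code: the `Span` class, the helper functions, and the span names (`fe.plan`, `be.scan`, etc.) are all hypothetical, and the ID sizes (16-byte trace ID, 8-byte span ID) follow the common OpenTelemetry convention.

```python
import secrets
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One node of the trace tree (field names are illustrative)."""
    name: str
    trace_id: str                  # shared by every span in the same trace
    span_id: str                   # identifies this span within the trace
    parent_span_id: Optional[str]  # None marks the root span
    start_ns: int                  # start timestamp in nanoseconds
    duration_ns: int               # elapsed time in nanoseconds


def new_span(name: str, trace_id: str, parent: Optional[Span],
             start_ns: int, duration_ns: int) -> Span:
    """Create a span under `parent`, or a root span when parent is None."""
    return Span(
        name=name,
        trace_id=trace_id,
        span_id=secrets.token_hex(8),  # 8 random bytes -> 16 hex chars
        parent_span_id=parent.span_id if parent else None,
        start_ns=start_ns,
        duration_ns=duration_ns,
    )


def children_of(spans: List[Span], parent: Span) -> List[Span]:
    """Return the direct children of `parent`, in list order."""
    return [s for s in spans if s.parent_span_id == parent.span_id]


def render(spans: List[Span], node: Span, depth: int = 0) -> List[str]:
    """Rebuild the tree from the parent links and render it as indented lines."""
    lines = ["  " * depth + f"{node.name} ({node.duration_ns / 1e6:.1f} ms)"]
    for child in children_of(spans, node):
        lines.extend(render(spans, child, depth + 1))
    return lines


# A hypothetical query trace: one root span with FE and BE child spans.
trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
root = new_span("query", trace_id, None, 0, 50_000_000)
plan = new_span("fe.plan", trace_id, root, 1_000_000, 5_000_000)
execute = new_span("be.execute_fragment", trace_id, root, 7_000_000, 40_000_000)
scan = new_span("be.scan", trace_id, execute, 8_000_000, 30_000_000)
spans = [root, plan, execute, scan]

print("\n".join(render(spans, root)))
```

Because every span stores only its own ID and its parent's ID, spans can be exported independently by different processes and the backend can still reassemble the tree.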

2. OpenTelemetry architecture

  • Application: A general application, such as the Doris FE and BE.
  • OTel Library: Also known as the SDK. It is responsible for collecting and exporting telemetry data inside the program.
  • OTel Collector: The OpenTelemetry Collector offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. It removes the need to run, operate, and maintain multiple agents/collectors. It offers improved scalability and supports open-source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit, etc.), sending data to one or more open-source or commercial back-ends.
  • Backends: Responsible for persisting and presenting telemetry data and for providing the ability to analyze it. Examples include Zipkin, Prometheus, etc.

Detailed Design

Query trace collection and export:

1-creating trace


2-collecting span of fe


3-collecting span of be


4-propagating trace between fe and be
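
This page does not specify how the trace context is carried from FE to BE. As one possible illustration, the sketch below uses the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`), which OpenTelemetry uses by default. The `inject` and `extract` helpers and the dictionary carrier are hypothetical stand-ins for whatever request field the FE-to-BE RPC would actually use.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# W3C traceparent, version 00: 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id: str, span_id: str, sampled: bool,
           carrier: Dict[str, str]) -> None:
    """Sender side (e.g. FE): write the current span context into the carrier."""
    flags = "01" if sampled else "00"
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(carrier: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Receiver side (e.g. BE): return (trace_id, parent_span_id, sampled),
    or None when the header is missing or malformed."""
    match = TRACEPARENT_RE.match(carrier.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid according to the W3C specification.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


# Sender: the span that issues the remote call.
fe_trace_id = secrets.token_hex(16)
fe_span_id = secrets.token_hex(8)
request_metadata: Dict[str, str] = {}  # hypothetical RPC metadata field
inject(fe_trace_id, fe_span_id, True, request_metadata)

# Receiver: continue the same trace with a new child span.
context = extract(request_metadata)
if context is not None:
    trace_id, parent_span_id, sampled = context
    be_span_id = secrets.token_hex(8)  # new span ID for the receiver-side span
    print(f"trace={trace_id} parent={parent_span_id} span={be_span_id}")
```

The receiver reuses the trace ID and records the sender's span ID as its parent, so spans produced by both processes end up in the same trace tree.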


5-exporting span


Scheduling

specific implementation steps and approximate scheduling.
