OpenTracing: emerging industry standard for distributed tracing

Table of Contents

1 Introduction
2 OpenTracing basics
3 OpenTracing API
4 Context propagation
5 Distributed tracers
  5.1 Zipkin
    5.1.1 Span ingestion
    5.1.2 Storage
  5.2 Jaeger
    5.2.1 Span ingestion
    5.2.2 Storage
  5.3 Zipkin vs Jaeger
  5.4 Other tracers
6 Supported instrumentation

1 Introduction

As organizations embrace the cloud-native movement and migrate their applications from monolithic to microservice architectures, general visibility and observability into software behavior becomes an essential requirement. Because the monolithic code base is segregated into multiple independent services running in their own processes, each of which may scale to multiple instances, a task as trivial as diagnosing the latency of an HTTP request issued by a client can become a serious undertaking. To fulfill the request, it has to propagate through load balancers, routers and gateways, cross machine boundaries to communicate with other microservices, send asynchronous messages to message brokers, etc. Any of these components can hide a bottleneck, contention or communication issue. Debugging such a complex workflow would not be feasible without some kind of tracing/instrumentation mechanism. That's why distributed tracers like Zipkin, Jaeger and AppDash were born (most of them inspired by Dapper, Google's large-scale distributed tracing platform).

All of these tracers help engineers and operations teams understand and reason about system behavior as the complexity of the infrastructure grows. Tracers expose the source of truth for the interactions originating within the system. Every transaction (if properly instrumented) can reveal performance anomalies at an early stage, while new services are being introduced by (possibly) independent teams with polyglot software stacks and continuous deployments.

However, each tracer sticks to its own proprietary API and other peculiarities, which makes it costly for developers to switch between tracer implementations. Since implanting instrumentation points requires code modification, OSS services, application frameworks and other platforms would have a hard time if they tied themselves to a single tracer vendor. OpenTracing aims to offer a consistent, unified and tracer-agnostic instrumentation API for a wide range of frameworks, platforms and programming languages. It abstracts away the differences among tracer implementations, so shifting from an existing tracer to a new one only requires the configuration changes specific to that new tracer.

For what it's worth, we should mention the benefits of distributed tracing:
● out-of-the-box infrastructure overview: how services interact and what their dependencies are
● efficient and fast detection of latency issues
● intelligent error reporting: spans transport error messages and stack traces, insight we can use to identify root-cause factors or cascading failures
● trace data can be forwarded to log-processing platforms for query and analysis

2 OpenTracing basics

In a distributed system, a trace encapsulates the transaction's state as it propagates through the system. During the journey of the transaction, it can create one or multiple spans.
A span represents a single unit of work within a transaction, for example an RPC client/server call, sending a query to the database server, or publishing a message to the message bus. Speaking in terms of the OpenTracing data model, a trace can also be seen as a collection of spans structured as a directed acyclic graph (DAG). The edges indicate causal relationships (references) between spans. A span is identified by its unique ID and may optionally include a parent identifier; if the parent identifier is omitted, we call that span the root span. The span also comprises a human-readable operation name and start and end timestamps. All spans of a transaction are grouped under the same trace identifier.

The diagram above depicts the transit of a hypothetical RPC request. The client makes an HTTP request to the server, which results in generating one parent span. In order to satisfy the client's request, the server sends a query to the storage engine; that operation produces one more span. The response from the database engine to the server and from the server to the client creates two additional spans.

Spans may contain tags that represent contextual metadata relevant to a specific request. They consist of an unbounded sequence of key-value pairs, where keys are strings and values can be strings, numbers, booleans or date data types. Tags allow for context enrichment that may be useful for monitoring or debugging system behavior. While not mandatory, it's highly recommended to follow the OpenTracing semantic conventions when naming tags. For instance, we should assign the component tag to the framework, module or library which generates the span(s), use peer.hostname and peer.port to describe the target host, and so on. Another reason for tag standardization is making the tracer aware of well-known tags, so it can add intelligence around them or put special emphasis on them. As illustrated in Figure 2, the spans are annotated with tags that obey the OpenTracing semantic conventions. Furthermore, the spans are rendered as a waterfall-like chart; this type of visualization adds the dimension of time and thus makes it easier to spot the duration of each span.

Besides tags, OpenTracing has a notion of log events. They represent timestamped textual (although not limited to textual content) annotations that may be recorded over the duration of a span. Events can express any occurrence of interest to the active span, like a timer expiration, a cache miss, a build or deployment starting, etc.

Baggage items allow for cross-span propagation, i.e., they attach metadata that also propagates to future children of the root span. In other words, this local data is transported along the full path as the request travels downstream through the system. This powerful feature should be used carefully, though, because it can easily saturate network links if the propagated items end up injected into many descendant spans.

As of the time of writing, OpenTracing supports two types of relationships (both are illustrated in the sketch after this list):
● ChildOf – expresses a causal reference between two spans. Following our RPC scenario, the server-side span would be the ChildOf of the initiator (request) span.
● FollowsFrom – used when the parent span doesn't depend on the outcome of the child span. This relationship usually models asynchronous executions, like emitting messages to the message bus.
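To make the two reference types concrete, here is a minimal sketch against the io.opentracing Java API (note that the builder's start() method is named startManual() in some 0.3x releases, and the operation names here are invented for illustration):

```java
import io.opentracing.References;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class SpanRelationships {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();

        // Root span: no parent identifier.
        Span parent = tracer.buildSpan("http-request").start();

        // ChildOf: the parent depends on the child's outcome,
        // e.g. a DB call performed while serving the request.
        Span child = tracer.buildSpan("db-query")
                .asChildOf(parent)                 // records a CHILD_OF reference
                .withTag("db.type", "sql")
                .start();
        child.setBaggageItem("request-id", "42");  // propagates to descendants
        child.log("cache miss");                   // timestamped log event
        child.finish();

        // FollowsFrom: the parent does not wait for the result,
        // e.g. a fire-and-forget message published to the message bus.
        Span async = tracer.buildSpan("publish-message")
                .addReference(References.FOLLOWS_FROM, parent.context())
                .start();
        async.finish();

        parent.finish();
    }
}
```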
3 OpenTracing API

The OpenTracing API is modeled around three fundamental types:
● Tracer – knows how to create new spans as well as inject/extract span contexts across process boundaries. All OpenTracing-compatible tracers must provide a client with an implementation of the Tracer interface.
● Span – the tracer's build method yields a newly created span. We can invoke a number of operations after the span has been started, like aggregating tags, changing the span's operation name, binding references to other spans, adding baggage items, etc.
● SpanContext – consumers of the API only interact with this type when injecting/extracting the span context into/from the transport protocol.

Let's see some code. Although we'll focus on Java, the API semantics are identical (or at least should be) in every other supported programming language (OpenTracing has API specs for Go, Python, JavaScript, Java, C#, Objective-C, C++, Ruby and PHP). Figure 4 represents the role of OpenTracing API instrumentation within the tracing landscape. It's important for the tracer clients to be compatible with the OpenTracing specification. For instance, we could initially be biased toward the Zipkin tracing system: the instrumentation points in our applications are created via the OpenTracing API even though we're using Zipkin clients for span reporting. If, after evaluating other tracers, we figure out that Jaeger fits our needs better, switching from Zipkin to Jaeger becomes a matter of registering the corresponding tracer instance, while the instrumentation points remain the same, i.e., we don't have to adapt any code. And because the Jaeger tracer is compatible with Zipkin span formats, we could keep using the same Zipkin client to submit span requests to Jaeger.

Before being able to create a span, we have to register the tracer. This step is tied to the particular tracer implementation, but it basically consists of indicating the tracer's endpoint and the component which sends the instrumentation data to the tracer. In the case of the Jaeger tracer, we would have the following code snippet:

```java
import com.uber.jaeger.Configuration;
import io.opentracing.util.GlobalTracer;

Configuration config = new Configuration(
        component,
        new Configuration.SamplerConfiguration("const", 1),
        new Configuration.ReporterConfiguration(true, host, port, 1000, 10000));
GlobalTracer.register(config.getTracer());
```

To start a new span, use the buildSpan method within a try-with-resources block, which automatically finishes the span and handles any exceptions:

```java
io.opentracing.Tracer tracer = GlobalTracer.get();
try (io.opentracing.ActiveSpan span = tracer.buildSpan("create-octi")
        .withTag("http.url", "/api/octi")
        .withTag("http.method", "POST")
        .withTag("peer.hostname", "apps.sematext.com")
        .startActive()) {
    // HTTP request code here
}
```

Only a fragment of the Zipkin storage discussion (section 5.1) survives in this excerpt: the tail of a stored span document, showing tag annotations recorded by a JDBC instrumentation:

```json
    {
      "key" : "db.instance",
      "value" : "apps",
      "endpoint" : {
        "serviceName" : "opentracing-jdbc",
        "ipv4" : "192.168.1.23"
      }
    },
    {
      "key" : "db.statement",
      "value" : "INSERT INTO apps (name) VALUES (slack)",
      "endpoint" : {
        "serviceName" : "opentracing-jdbc",
        "ipv4" : "192.168.1.23"
      }
    },
    {
      "key" : "db.type",
      "value" : "sql",
      "endpoint" : {
        "serviceName" : "opentracing-jdbc",
        "ipv4" : "192.168.1.23"
      }
    }
  ]
}
```
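For context, spans like the one above reach Zipkin through its collector endpoints. A minimal, hedged sketch against Zipkin's v1 JSON API (the ID and timing values are made up for illustration; newer Zipkin versions also expose a v2 endpoint):

```bash
# Report a minimal span to a local Zipkin collector (v1 JSON API on port 9411).
# IDs are hex-encoded; timestamp and duration are expressed in microseconds.
curl -X POST http://localhost:9411/api/v1/spans \
  -H 'Content-Type: application/json' \
  -d '[{
        "traceId": "fd447889b1ad2e1e",
        "id": "9772c18c9d589627",
        "name": "create-octi",
        "timestamp": 1502716789425000,
        "duration": 76000
      }]'
```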
5.2 Jaeger

Despite not being as mature as Zipkin, Jaeger is another distributed tracing system that is undergoing massive adoption. The backend is implemented in Go and supports in-memory, Cassandra and Elasticsearch span stores. Jaeger's architecture is built with scalability and parallelism in mind. The client emits traces to the agent, which listens for inbound spans and routes them to the collector. The responsibility of the collector is to validate, transform and store the spans in the persistent storage. To access tracing data from the storage, the query service exposes REST API endpoints and a React-based UI.

Jaeger can be built and run from source using the Go toolchain (go 1.7 and the glide and yarn package managers are necessary to run the build process); the standalone build runs all components (agent, collector, query) together with the in-memory storage enabled. If that's too much pain, we can fetch the official Docker image and spawn a container. Additionally, you can run each of the components in a separate container by pulling the corresponding image, or build the images manually and orchestrate the containers with the project's docker-compose deployment descriptor. To explore the traces, navigate to http://localhost:16686.

5.2.1 Span ingestion

The agent receives span requests from the client over UDP (port 5775) on the local machine. The spans are batched, encoded as Thrift structures and submitted to the collector. The agent is also able to poll the tracing backend for a sampling strategy and propagate the sampling rate to all tracer clients. That's an important design decision, since it avoids fixed sampling rates, which is crucial for environments with dynamic network behavior. Abstracting the routing and discovery of the collectors away from the client library is the agent's responsibility as well. In addition, Jaeger can accept and handle Zipkin span requests transparently; to enable the Zipkin HTTP inbound adapter, the collector has to be started with the --collector.zipkin.http-port flag.

5.2.2 Storage

Jaeger creates two indices in Elasticsearch: one for storing the services and the other for the spans of the given services. Elasticsearch storage is enabled by running the collector with storage-specific flags, assuming an Elasticsearch node is running on the local machine; a hedged sketch of such an invocation follows below.
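The exact flags did not survive in this excerpt; the invocation below is a sketch based on the collector's early flag names (--span-storage.type and --es.server-urls are assumptions to verify against your Jaeger version's help output):

```bash
# Point the Jaeger collector at a local Elasticsearch node.
# Flag names may differ between Jaeger releases; check `jaeger-collector help`.
./jaeger-collector \
  --span-storage.type=elasticsearch \
  --es.server-urls=http://localhost:9200
```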
The mapping that describes the structure of a span document:

```json
{
  "mappings" : {
    "span" : {
      "_all" : { "enabled" : false },
      "properties" : {
        "duration" : { "type" : "long" },
        "flags" : { "type" : "integer" },
        "logs" : {
          "properties" : {
            "fields" : {
              "type" : "nested",
              "dynamic" : "false",
              "properties" : {
                "key" : { "type" : "keyword", "ignore_above" : 256 },
                "tagType" : { "type" : "keyword", "ignore_above" : 256 },
                "value" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "timestamp" : { "type" : "long" }
          }
        },
        "operationName" : { "type" : "keyword", "ignore_above" : 256 },
        "parentSpanID" : { "type" : "keyword", "ignore_above" : 256 },
        "process" : {
          "properties" : {
            "serviceName" : { "type" : "keyword", "ignore_above" : 256 },
            "tags" : {
              "type" : "nested",
              "dynamic" : "false",
              "properties" : {
                "key" : { "type" : "keyword", "ignore_above" : 256 },
                "tagType" : { "type" : "keyword", "ignore_above" : 256 },
                "value" : { "type" : "keyword", "ignore_above" : 256 }
              }
            }
          }
        },
        "processID" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        },
        "references" : {
          "type" : "nested",
          "dynamic" : "false",
          "properties" : {
            "refType" : { "type" : "keyword", "ignore_above" : 256 },
            "spanID" : { "type" : "keyword", "ignore_above" : 256 },
            "traceID" : { "type" : "keyword", "ignore_above" : 256 }
          }
        },
        "spanID" : { "type" : "keyword", "ignore_above" : 256 },
        "startTime" : { "type" : "long" },
        "tags" : {
          "type" : "nested",
          "dynamic" : "false",
          "properties" : {
            "key" : { "type" : "keyword", "ignore_above" : 256 },
            "tagType" : { "type" : "keyword", "ignore_above" : 256 },
            "value" : { "type" : "keyword", "ignore_above" : 256 }
          }
        },
        "traceID" : { "type" : "keyword", "ignore_above" : 256 }
      }
    }
  }
}
```

The fields that comprise the body of a span document:
● traceID – a unique identifier for the trace
● spanID – the span identifier
● parentSpanID – the identifier of the parent span
● operationName – the human-readable operation name linked to the span
● startTime – the span start time, expressed in microseconds since the UNIX epoch
● duration – the operation's duration in microseconds
● tags – the list of annotations attached to the span

Here's an example of a document indexed by the Jaeger collector:

```json
{
  "traceID" : "fd447889b1ad2e1e",
  "spanID" : "9772c18c9d589627",
  "parentSpanID" : "fd447889b1ad2e1e",
  "flags" : 1,
  "operationName" : "extract",
  "references" : [ ],
  "startTime" : 1502716789425000,
  "duration" : 76,
  "tags" : [
    {
      "key" : "http.status_code",
      "type" : "int64",
      "value" : "200"
    },
    {
      "key" : "http.url",
      "type" : "string",
      "value" : "http://localhost:8081/extract"
    }
  ],
  "logs" : [ ],
  "processID" : "",
  "process" : {
    "serviceName" : "opentracing-extractor",
    "tags" : [
      {
        "key" : "hostname",
        "type" : "string",
        "value" : "archrabbit"
      },
      {
        "key" : "jaeger.version",
        "type" : "string",
        "value" : "Java-0.20.6"
      },
      {
        "key" : "ip",
        "type" : "string",
        "value" : "127.0.0.1"
      }
    ]
  },
  "warnings" : null
}
```
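Because spans end up as ordinary Elasticsearch documents, they can also be queried directly. A hedged sketch (Jaeger typically date-suffixes its indices, e.g. jaeger-span-2017-08-14; the index pattern and duration threshold here are assumptions for illustration):

```bash
# Search for spans of the example service that took longer than 50 microseconds.
curl -s 'http://localhost:9200/jaeger-span-*/_search' \
  -H 'Content-Type: application/json' \
  -d '{
        "query": {
          "bool": {
            "filter": [
              { "term":  { "process.serviceName": "opentracing-extractor" } },
              { "range": { "duration": { "gte": 50 } } }
            ]
          }
        }
      }'
```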
5.3 Zipkin vs Jaeger

The following comparison matrix contrasts the two tracing systems. As the table shows, Jaeger has better OpenTracing support and a greater diversity of OT-compatible clients for different programming languages. This is due to Jaeger's decision to adhere to the OpenTracing initiative from its inception.

|                       | Jaeger                                                           | Zipkin                                                 |
|-----------------------|------------------------------------------------------------------|--------------------------------------------------------|
| OT compatibility      | Yes                                                              | Yes                                                    |
| OT-compatible clients | Python, Go, Node, Ruby*, Java, C++ (work in progress)            | Ruby*, Python (work in progress), Go, Java, C++*, PHP* |
| Storage support       | In-memory, Cassandra, Elasticsearch, ScyllaDB (work in progress) | In-memory, MySQL, Cassandra, Elasticsearch             |
| Sampling              | Dynamic sampling rate (supports rate limiting and probabilistic sampling strategies) | Fixed sampling rate (supports probabilistic sampling strategy) |
| Span transport        | UDP, HTTP                                                        | HTTP, Kafka, Scribe                                    |
| Docker ready          | Yes                                                              | Yes                                                    |

* non-official OT clients

5.4 Other tracers

● Tracer – designed after Dapper; not production-ready.
● Lightstep – cloud-based commercial tracing instrumentation platform.
● AppDash – based on Zipkin and Dapper; limited client availability (Go, Python and Ruby).
● Instana – commercial product focused on APM and distributed tracing.

6 Supported instrumentation

Many frameworks and libraries ship with native OpenTracing instrumentation support or have extension points that add tracing capabilities; a JDBC example follows the list below.
● https://github.com/rnburn/nginx-opentracing/tree/opentracing-api – instruments nginx requests via OpenTracing-compatible tracers.
● https://github.com/opentracing-contrib – a collection of libraries that add OpenTracing instrumentation for Spring, Spring Cloud, JDBC, JMS, Kafka and Mongo clients, Python, Ruby, and many other frameworks.
● https://github.com/uber-common/opentracing-python-instrumentation – instruments popular Python frameworks and clients.
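As an illustration of how little code such instrumentation requires, here is a minimal, hedged sketch of the opentracing-contrib JDBC integration (the driver class name and the jdbc:tracing: URL prefix are taken from the java-jdbc contrib module's documentation at the time of writing; verify against the current README):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TracedJdbcExample {
    public static void main(String[] args) throws Exception {
        // An OpenTracing tracer is assumed to be registered with GlobalTracer.

        // Load the tracing driver; it delegates to the real driver and reports
        // a span (db.statement, db.type, peer tags) for each executed statement.
        Class.forName("io.opentracing.contrib.jdbc.TracingDriver");

        // The only application-side change: the "tracing" infix in the JDBC URL.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:tracing:h2:mem:apps", "sa", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE apps (name VARCHAR(64))");
            stmt.execute("INSERT INTO apps (name) VALUES ('slack')");  // traced
        }
    }
}
```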