Welcome to Always Bee Tracing!
If you haven't already, please clone the repository of your choice:
▸ Golang (into your $GOPATH): git clone git@github.com:honeycombio/tracing-workshop-go.git
▸ Node: git clone git@github.com:honeycombio/tracing-workshop-node.git
Please also accept your invites to the "Always Bee Tracing" Honeycomb team and our Slack channel.

Always Bee Tracing
A Honeycomb Tracing workshop

A bit of history
▸ We used to have "one thing" (monolithic application)
▸ Then we started to have "more things" (splitting monoliths into services)
▸ Now we have "yet more things", or even "Death Star" architectures (microservices, containers, serverless)

A bit of history
▸ Now we have N problems (one slow service bogs down everything, etc.)
▸ 2010 - Google releases the Dapper paper describing how they improve on existing tracing systems
▸ Key innovations: use of sampling, common client libraries decoupling app code from tracing logic

Why should GOOG have all the fun?
▸ 2012 - Zipkin was developed at Twitter for use with Thrift RPC
▸ 2015 - Uber releases Jaeger (also OpenTracing)
▸ Better sampling story, better client libraries, no Scribe/Kafka
▸ Various proprietary systems abound
▸ 2019 - Honeycomb is the best available due to best-in-class queries ;)

A word on standards
▸ Standards for tracing exist: OpenTracing, OpenCensus, etc.
▸ Pros: collaboration, preventing vendor lock-in
▸ Cons: slower innovation, political battles/drama
▸ Honeycomb has integrations to bridge standard formats with the Honeycomb event model

How Honeycomb fits in
Understand how your production systems are behaving, right now.
[Product diagram: QUERY BUILDER, INTERACTIVE VISUALS, RAW DATA, TRACES, and BUBBLEUP + OUTLIERS, on top of a DATA STORE (high-cardinality data, high-dimensionality data, efficient storage), fed by BEELINES (automatic instrumentation + tracing APIs)]

Tracing is…
▸ For software engineers who need to understand their code
▸ Better when visualized (preferably first in aggregate)
▸ Best when layered on top of existing data streams (rather than adding another data silo to your toolkit)

Instrumentation (and tracing) should evolve alongside your code.

EXERCISE: Find Checkpoint (Go / Node) → let's see what we've got

Checkpoint Takeaways
▸ Tracing across services just requires serialization of tracing context over the wire (see the sketch below)
▸ Wrapping outbound HTTP requests is a simple form of tracing dependencies
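A minimal Go sketch of that first takeaway, using only the standard library. The X-Trace-Context header name and its encoding are invented here for illustration; they are not the Beelines' actual wire format.

package main

import (
	"fmt"
	"net/http"
	"strings"
)

// traceContext is a minimal, hypothetical picture of what has to cross the
// wire for a trace to continue in the next service: which trace this request
// belongs to, and which span in the caller is its parent.
type traceContext struct {
	TraceID      string
	ParentSpanID string
}

// inject serializes the trace context into a header on the outbound request.
// The header name and encoding are illustrative only.
func inject(req *http.Request, tc traceContext) {
	req.Header.Set("X-Trace-Context", tc.TraceID+";"+tc.ParentSpanID)
}

// extract reads the same header in the downstream service so the span it
// starts can point back at the caller's span.
func extract(req *http.Request) (traceContext, bool) {
	parts := strings.SplitN(req.Header.Get("X-Trace-Context"), ";", 2)
	if len(parts) != 2 || parts[0] == "" {
		return traceContext{}, false
	}
	return traceContext{TraceID: parts[0], ParentSpanID: parts[1]}, true
}

func main() {
	// Caller side: annotate the outbound request before sending it.
	req, _ := http.NewRequest("GET", "http://localhost:8080/messages", nil)
	inject(req, traceContext{TraceID: "trace-123", ParentSpanID: "span-456"})

	// Callee side (normally in the other service): recover the context and
	// start a child span from it.
	if tc, ok := extract(req); ok {
		fmt.Printf("continue trace %s under parent span %s\n", tc.TraceID, tc.ParentSpanID)
	}
}

The Beelines ship HTTP wrappers that handle this propagation for you; the sketch is only meant to show that cross-service tracing has no magic in it beyond agreeing on how the context is serialized.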
Stretch Break
Mosey back to seats, please :)

[Architecture diagram: ANALYSIS, USER, WALL (our second service), TWITTER.COM (a third-party dependency), LAMBDA FN: PERSIST (a black-box service)]

EXERCISE: Find Checkpoint (Go / Node)
‣ Try writing messages like these:
‣ "seems @twitteradmin isn't a valid username but @honeycombio is"
‣ "have you tried @honeycombio for @mysql #observability?"
→ let's see what we've got

Checkpoint Takeaways
▸ Working with a black box? Instrument from the perspective of the code you can control
▸ Similar to identifying test cases in TDD: capture fields to let you refine your understanding of the system

EXERCISE: Who's knocking over my black box?
▸ First: what does "knocking over" mean? We know that we talk to our black box via an HTTP call. What are our signals of health?
▸ What's the "usual worst" latency for this call out to AWS? (Explore different calculations: P95 = 95th percentile, MAX, HEATMAP)
▸ Hint: P95(duration_ms), filtered to request.host contains aws

Puzzle Time

Scenario #1
Symptoms: we pulled in that last POST in order to persist messages somewhere, but we're hearing from customer support that behavior has felt buggy lately, like it works sometimes but not always. What's going on?
Think about:
‣ Verify this claim: are we sure persist has been flaky? What does failure look like?
‣ Look through all of the metadata we have to try to find some correlation across those failing requests
Fields to try: response.status_code, request.content_length (HEATMAPs are great :))

Scenario #2
Symptoms: everything feels slowed down, but more importantly the persistence behavior seems completely broken. What gives?
Think about:
‣ What might failure mean in this case?
‣ Once you've figured out what these failures look like, can we do anything to stop the bleeding? What might we need to find out to answer that question?
Fields to try: response.status_code, app.username

Scenario #3
Symptoms: persistence seems fine, but all requests seem to have slowed down to a snail's pace. What could be impacting our overall latency so badly?
Prompts:
‣ Hint! Think about adding a num_hashtags or num_handles field to your events if you'd like to capture more about the characteristics of your payload (a sketch of this appears at the end of these notes)
‣ It may be helpful to zoom in (aka add a filter) to just requests talking to amazonaws.com
Fields to try: response.status_code, request.host contains aws

Thank you & Office Hours
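A sketch for the Scenario #3 hint referenced above: derive num_hashtags and num_handles from the incoming message and attach them to the current request's event. It assumes the Go Beeline (github.com/honeycombio/beeline-go) and its hnynethttp handler wrapper; the endpoint path, the "message" form parameter, the dataset name, and the app.* field names are made up for illustration rather than taken from the workshop app.

package main

import (
	"net/http"
	"strings"

	beeline "github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/wrappers/hnynethttp"
)

// countTokens counts whitespace-separated tokens in s that start with the
// given prefix, e.g. "#" for hashtags or "@" for handles.
func countTokens(s, prefix string) int {
	n := 0
	for _, tok := range strings.Fields(s) {
		if strings.HasPrefix(tok, prefix) {
			n++
		}
	}
	return n
}

// handleMessage pulls the message text out of the request (the "message"
// parameter is invented for this sketch), derives fields describing the
// payload, and adds them to the event the Beeline is building for this request.
func handleMessage(w http.ResponseWriter, r *http.Request) {
	msg := r.FormValue("message")

	ctx := r.Context()
	beeline.AddField(ctx, "app.num_hashtags", countTokens(msg, "#"))
	beeline.AddField(ctx, "app.num_handles", countTokens(msg, "@"))
	beeline.AddField(ctx, "app.message_length", len(msg))

	// ...the rest of the handler (post to the wall, call persist, etc.)...
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Assumes you have a Honeycomb write key; the dataset name is illustrative.
	beeline.Init(beeline.Config{WriteKey: "YOUR-WRITE-KEY", Dataset: "tracing-workshop"})
	defer beeline.Close()

	http.HandleFunc("/message", hnynethttp.WrapHandlerFunc(handleMessage))
	http.ListenAndServe(":8080", nil)
}

With fields like these in place, a HEATMAP of duration_ms grouped by app.num_handles or app.num_hashtags makes it easy to see whether payload shape correlates with the slowdown.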