The OpenTelemetry Bootcamp: Sampling and dealing with high volumes

My notes & takeaways (14)

Kind of worried, as I heard “three pillars” being mentioned… but still, it’s one of the common models used.
Traces are mostly generated automatically, via the SDK's instrumentation libraries.
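
For context, a rough sketch of what "mostly automatic" means in a Node.js service (a minimal setup; the packages are the standard OpenTelemetry ones):

    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

    // The auto-instrumentations patch common libraries (http, express, pg, ...)
    // so spans are emitted without touching application code.
    const sdk = new NodeSDK({
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();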

Traces are expensive: there's a cost in CPU/memory to produce them, a cost in data transfer to the cloud provider, and a cost in storage to keep them.

The cost of the tooling itself (Jaeger) is minor compared to the costs above.

The trace sampling percentage can/should be chosen based on the use case. A cost analysis can help define percentage ranges. Different percentages can be used within the same use case (example: keep 100% of errors but only 10% of the rest).

Head sampling: the keep/drop decision is made when the trace starts (at the root span). Normally the SDK makes this decision.
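
A minimal sketch of head sampling with the Node SDK; the 10% ratio is just an example value:

    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

    // The decision is made up front, when the root span starts:
    // ~10% of trace IDs are kept, the rest are never recorded.
    const sdk = new NodeSDK({
      sampler: new TraceIdRatioBasedSampler(0.1),
    });
    sdk.start();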

Tail sampling: the keep/drop decision is made when the trace completes. The collector can make this choice.

Tail sampling is implemented by the tail_sampling processor, which is part of opentelemetry-collector-contrib (not the core collector) and gets added to the collector's pipeline.

When using tail sampling in the collector, spans are buffered for a period of time (decision_wait) before the decision is made. That buffering means traces can be lost if the collector fails in the meantime (out of memory, crash, etc.).

Interesting conditions (policies) to decide on (see the config sketch after this list):

  • latency
  • numeric attribute
  • probability
  • status code
  • string attributes
  • rate limiting
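
As a sketch, this is roughly what those policies look like in the tail_sampling processor config (the policy names and thresholds here are made up for illustration):

    processors:
      tail_sampling:
        # buffer spans for this long after a trace's first span arrives
        decision_wait: 10s
        policies:
          # keep 100% of traces that contain an error
          - name: keep-all-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          # keep anything slower than 500ms
          - name: keep-slow-traces
            type: latency
            latency:
              threshold_ms: 500
          # keep ~10% of the rest (a trace is kept if any policy matches)
          - name: sample-the-rest
            type: probabilistic
            probabilistic:
              sampling_percentage: 10

The processor also has to be listed under service.pipelines.traces.processors to take effect.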

A load balancer helps shard the load, based on trace rules (think sticky sessions: the same traceId always goes to the same collector).

Then 2 layers of collectors:

  • the first layer load balances, routing each traceId to a fixed second-layer collector;
  • the second layer sees complete traces, makes the tail-sampling decision, and forwards kept traces to the final destination.
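
A sketch of the first layer's config using the loadbalancing exporter from opentelemetry-collector-contrib (the hostnames are placeholders):

    exporters:
      loadbalancing:
        routing_key: traceID      # sticky: the same traceId always goes to the same collector
        protocol:
          otlp:
            tls:
              insecure: true      # assumes plain-text OTLP between the two layers
        resolver:
          static:
            hostnames:
              - sampling-collector-1:4317
              - sampling-collector-2:4317

The second layer is then just a regular collector running the tail_sampling processor from above and exporting to the real backend.
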
Parent-based sampler: Service B respects Service A's decision to sample or not.
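
A minimal sketch with the Node SDK: honor the parent's decision when there is one, and apply a ratio only to traces this service starts itself (the 10% is an example):

    import { NodeSDK } from '@opentelemetry/sdk-node';
    import {
      ParentBasedSampler,
      TraceIdRatioBasedSampler,
    } from '@opentelemetry/sdk-trace-base';

    // Service B: if the incoming context already carries Service A's
    // sampling decision, respect it; only traces that start here (no
    // parent) go through the 10% ratio sampler.
    const sdk = new NodeSDK({
      sampler: new ParentBasedSampler({
        root: new TraceIdRatioBasedSampler(0.1),
      }),
    });
    sdk.start();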

Optimizations

Benchmarks

Tail-based sampling requires more memory, since the collector needs to buffer spans and decide late.
You may need an extra collector layer if your target can't handle the load.
Three metrics to calculate cost: CPU/memory, data transfer, and storage (the same three as above).

Cost Calculation

730 = approx. hours in a month
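
A back-of-the-envelope sketch of the math (every number here is hypothetical; plug in your own span rate and span size):

    // All inputs are hypothetical; substitute your own measurements.
    const spansPerSecond = 1_000;
    const bytesPerSpan = 500;                  // avg serialized span size
    const hoursPerMonth = 730;                 // ~(24 * 365) / 12
    const secondsPerMonth = hoursPerMonth * 3_600;

    const gbPerMonth =
      (spansPerSecond * bytesPerSpan * secondsPerMonth) / 1e9;
    console.log(`~${gbPerMonth.toFixed(0)} GB of trace data per month`); // ~1314

Multiply that by your provider's per-GB transfer and storage prices, and note that sampling scales it linearly: at 10% the same system produces ~131 GB.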

Storage is probably the most expensive bit on the list.

Prod tips