In simpler times, enterprise applications were developed, deployed and managed as a tightly coupled, precisely orchestrated, monolithic unit. Every application had a single instance, and monitoring, debugging, and troubleshooting were as simple as drilling down to the code level to understand and modify behavior and performance.
But complex times call for distributed, microservices-based cloud-native applications. Today’s enterprise applications comprise tens of thousands of loosely coupled polyglot microservices, as well as multiple versions of the same service, deployed across different locations, servers and ephemeral containers, and managed by different teams. Every request triggers a specific sequence of services that flows through this complex, distributed, and heterogeneous environment to deliver a specific functionality.
In this complex, diverse and dynamic topology, observability (the ability to understand how service interactions flow across a system) becomes a huge challenge. Without observability, debugging, troubleshooting and managing the performance of microservices applications become virtually impossible. And traditional monitoring and application management tools are incapable of handling the complexities of microservices applications.
In a microservices environment, observability tools play a key role in helping developers and administrators understand how a system behaves given the elasticity of the production environment, the unpredictability of inputs and the vagaries of upstream and downstream dependencies.
There are three distinct yet complementary approaches to achieving observability in a microservices-based application: metrics, logs and distributed tracing (also called distributed request tracing). Here’s the specific role that each of these approaches plays in the monitoring process; together, they provide a comprehensive perspective on the overall performance of the application.
1. METRICS
Metrics are perhaps the basic building blocks of observability in a microservices application. They are numerical measures recorded by an application over intervals of time that can be aggregated and analyzed to understand system behavior. It is therefore possible to instrument and monitor all services and feed the data into monitoring systems that can handle storage, aggregation, visualization, and even automated responses.
The key advantage of metrics is that they are frequently generated, fairly accurate and quite affordable to collect, store and analyze. For instance, the cost of aggregating, storing and analyzing metrics data does not increase in proportion to an increase in system traffic or activity. Metrics are also more amenable to mathematical, probabilistic, and statistical transformations, making them ideal for monitoring the overall health of a microservices environment. Over time, stored metrics data can also be aggregated into daily or weekly cycles to understand historical trends.
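To illustrate why the cost of metrics stays flat as traffic grows, here is a minimal sketch of a fixed-bucket latency histogram. The class name `LatencyHistogram` and the bucket boundaries are assumptions chosen for illustration; they are not taken from any particular monitoring library.

```python
import bisect


class LatencyHistogram:
    """Hypothetical fixed-bucket histogram; memory use is independent of traffic."""

    def __init__(self, bounds_ms=(10, 50, 100, 500, 1000)):
        self.bounds_ms = list(bounds_ms)
        # One counter per bucket, plus an overflow bucket for slower requests.
        self.counts = [0] * (len(self.bounds_ms) + 1)
        self.total = 0
        self.sum_ms = 0.0

    def observe(self, latency_ms):
        # bisect_left finds the first bucket whose upper bound is >= latency_ms.
        self.counts[bisect.bisect_left(self.bounds_ms, latency_ms)] += 1
        self.total += 1
        self.sum_ms += latency_ms

    def mean(self):
        return self.sum_ms / self.total if self.total else 0.0


hist = LatencyHistogram()
for latency in (5, 20, 20, 80, 1200):
    hist.observe(latency)
```

However many requests are observed, the histogram stores only six counters and two running totals. The same aggregation is also the reason the detail of any individual request is no longer available afterwards.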
However, a metrics-based monitoring system is purely diagnostic: while it can alert administrators to whether a resource is alive or dead, it cannot pinpoint the problem within the system. Most importantly, the aggregation of data means that the contextual value of individual transactions is lost.
2. LOGS
Logs provide more actionable information than metrics. While metrics can help flag a broken resource, logs go a step further and help understand the reason for the breakdown. Logs outperform metrics when it comes to monitoring microservices applications as they provide insight and context that aggregated data cannot. Logs, therefore, are capable of identifying emergent and unpredictable behaviors that might not show up in averages and percentiles.
Logs are easy to generate, are supported across languages and application frameworks, and provide more contextual and granular information than metrics. They do, however, have their own limitations and challenges.
For instance, logs struggle with microservices since each log stream only captures events that occurred in a single instance of a service. It takes an array of tools, including sophisticated log aggregation technologies, to retrace the path of a request from multiple log streams in different formats.
Perhaps one of the biggest challenges of this model is deciding what to log. With every resource in the microservices environment generating its own log, it is easy to become so overwhelmed with data that the entire process turns operationally and economically counterproductive.
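The retracing step can be sketched as follows, under a generous assumption: every service writes JSON lines that carry a shared `request_id` and a comparable `ts` timestamp. These field names, and the function `retrace_request`, are hypothetical.

```python
import json


def retrace_request(streams, request_id):
    """Collect one request's events from several per-instance log streams.

    Assumes every line is a JSON object with the hypothetical fields
    `ts`, `service`, `request_id` and `msg`.
    """
    events = []
    for stream in streams:
        for line in stream:
            record = json.loads(line)
            if record.get("request_id") == request_id:
                events.append(record)
    # Order by timestamp to approximate the request's path across services.
    return sorted(events, key=lambda record: record["ts"])


checkout_log = [
    '{"ts": 3, "service": "checkout", "request_id": "r1", "msg": "order placed"}',
    '{"ts": 4, "service": "checkout", "request_id": "r2", "msg": "order placed"}',
]
gateway_log = [
    '{"ts": 1, "service": "gateway", "request_id": "r1", "msg": "request received"}',
]
path = [event["service"] for event in retrace_request([checkout_log, gateway_log], "r1")]
```

In practice the streams rarely share a format, a request identifier or a synchronized clock, which is exactly where the difficulty lies.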
3. DISTRIBUTED TRACING
Request tracing is an established practice in software engineering by which developers use instrumentation code to collect a range of metrics about application performance. Where request tracing tracks a request’s path within a single application, distributed request tracing extends this functionality to track requests from end to end, across multiple systems, domains, services and instances.
The need for some degree of instrumentation is a shared requirement across both logging and distributed tracing. There are, however, several key differences between the two approaches, starting with the information that each captures. Logging captures only high-level information in a standardized format for aggregation and analytics. Distributed tracing, on the other hand, captures huge volumes of low-level information to create a broader view of application performance than logging. It also uses a specialized data structure that can help identify causality.
Distributed tracing uses traces and spans to map the entire journey, from origin to destination, of a request as it progresses through different services and components in a distributed system.
A trace represents the complete journey of a single request. Each trace is assigned a unique ID that can be used to search and sort specific traces. This trace ID, generated at the entry point of the transaction, is then passed along the call chain of that transaction to chart its data flow or execution path through the distributed system.
Each trace comprises multiple spans, each representing an incremental step or action performed to progress that specific request. Spans also have unique IDs and can create child spans that could be associated with multiple parents.
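The relationship between traces and spans can be sketched as a small data structure. The names `Span`, `start_trace` and `child` are hypothetical, and real tracers record considerably more (end times, tags, logs).

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record; real tracers store more fields."""
    trace_id: str                      # shared by every span of the same request
    name: str                          # the step this span represents
    parent_id: Optional[str] = None    # None marks the root span of the trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)

    def child(self, name):
        # A child inherits the trace ID and points back at its parent span.
        return Span(trace_id=self.trace_id, name=name, parent_id=self.span_id)


def start_trace(name):
    # The trace ID is generated once, at the entry point of the request.
    return Span(trace_id=uuid.uuid4().hex, name=name)


root = start_trace("GET /checkout")
db_call = root.child("SELECT orders")
```

This sketch records a single parent per span for simplicity; a span related to more than one parent would need a list of references instead of a single `parent_id`.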
Completed traces can be searched in a presentation layer to get a comprehensive picture of the end-to-end performance of each request. Developers and administrators can trace requests through each span, correlate them to specific service instances to identify latencies and problems, and even identify the host system where the span was executed.
Source: https://opensource.com/article/18/9/distributed-tracing-tools
Distributed tracing is the modern solution for debugging and monitoring complex microservices applications. However, it is still only one of the three pillars of observability. Most complex distributed environments will need a combination of logs, metrics and traces to maximize visibility into distributed systems.
There are challenges, primarily in adding instrumentation to the code base, customizing solutions and setting up the tools to visualize trace data. But emerging frameworks, like OpenTracing, are enabling a simpler and more standardized approach to instrumentation and distributed tracing.
Arguably, the earliest precursor to today’s distributed tracing systems was X-Trace from 2007. However, the arrival of Google’s Dapper in 2010, just as distributed services were gaining traction in the enterprise, has been particularly influential on modern tracing for distributed systems.
Today, the distributed tracing systems landscape comprises several popular open source frameworks, such as OpenTracing and OpenCensus, and open source tools, such as Zipkin, Jaeger and Appdash. Apart from these open source options, there are also several commercial solutions available, including Datadog, Instana, LightStep, AWS X-Ray and Google Stackdriver, among others.
Here’s a brief introduction to some of the most popular distributed tracing frameworks and tools available today.
OpenTracing: OpenTracing addresses one of the fundamental challenges of distributed tracing: adding instrumentation to application code. It comprises an API specification, frameworks and libraries to provide a vendor-agnostic, cross-platform solution to instrument applications for distributed tracing.
OpenTracing abstracts away the differences between multiple tracer implementations so that instrumentation does not need to be changed even if developers swap out tracer implementations. By standardizing instrumentation, OpenTracing simplifies the tracing process and allows developers to focus on instrumentation before diving into implementation.
OpenTracing did not originally include a metrics API, a capability gap that will be addressed by the recent merger between OpenTracing and OpenCensus into a unified OpenTelemetry project.
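The idea of writing instrumentation against an interface rather than a specific tracer can be sketched as follows. The `Tracer` protocol and both implementations are hypothetical stand-ins and are not the actual OpenTracing API.

```python
from contextlib import contextmanager
from typing import Protocol


class Tracer(Protocol):
    """Hypothetical tracer interface; not the actual OpenTracing API."""

    def start_span(self, name: str): ...


class NoopTracer:
    @contextmanager
    def start_span(self, name):
        yield  # records nothing


class RecordingTracer:
    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield
        finally:
            self.finished.append(name)  # recorded when the span ends


def handle_request(tracer: Tracer):
    # Instrumentation depends only on the interface, not on a tracer backend.
    with tracer.start_span("handle_request"):
        return "ok"


recorder = RecordingTracer()
results = (handle_request(NoopTracer()), handle_request(recorder))
```

The body of `handle_request` is identical for both tracers, so swapping the backend requires no change to the instrumented code.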
There are currently several popular implementations of the OpenTracing specification including open source solutions such as Zipkin, Jaeger and Appdash and commercial solutions like Instana, LightStep, Datadog, and New Relic.
OpenCensus: Unlike OpenTracing, OpenCensus goes beyond just establishing an open API and specifications. It offers a set of libraries for multiple language implementations, facilitates the collection of distributed traces as well as application metrics, and supports the transfer of this data to a range of popular backends. OpenCensus currently offers support for several languages such as Go, Java, C#, Node.js, C++, Ruby, Erlang/Elixir, Python, Scala and PHP, with supported backends including Azure Monitor, Datadog, Instana, Jaeger, SignalFX, Stackdriver, and Zipkin.
OpenCensus offers some unique features such as automatic context propagation. As mentioned earlier, a unique identifier helps create a unified context that correlates all the events involved in each request. Ideally, this context is then automatically propagated throughout the system. Getting context to propagate reliably is often one of the biggest obstacles to distributed tracing adoption. OpenCensus provides automatic context propagation as well as simple APIs for manually propagating or manipulating context.
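One way to picture the difference between automatic and manual propagation is with Python's standard `contextvars` module. The header name `x-trace-id` and the functions `inject` and `extract` are assumptions made for this sketch; they are not OpenCensus APIs.

```python
import contextvars
import uuid

# Holds the trace ID of whichever request the current code is serving.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def inject(headers):
    """Manual propagation: copy the active trace ID into outgoing headers."""
    trace_id = current_trace_id.get()
    if trace_id is not None:
        headers["x-trace-id"] = trace_id  # hypothetical header name
    return headers


def extract(headers):
    """Adopt the caller's trace ID, or start a new trace at the entry point."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id


def downstream_service(headers):
    extract(headers)
    # Code deeper in the call stack reads the ID without receiving it as an argument.
    return current_trace_id.get()


def upstream_service():
    trace_id = extract({})  # entry point: no incoming trace ID
    seen_downstream = downstream_service(inject({}))
    return trace_id, seen_downstream


sent, received = upstream_service()
```

Within a process, the context variable carries the ID implicitly; across process boundaries, something still has to write it into the outgoing request and read it back, which is the part a tracing library aims to automate.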
OpenTelemetry (OpenTracing + OpenCensus): It was recently announced that the OpenTracing and OpenCensus projects would be synthesized into a single, unified project called OpenTelemetry. OpenTelemetry will provide a new, unified set of libraries and specifications as a complete observability telemetry system designed for monitoring microservices and other types of modern, distributed systems. This new system is expected to be compatible with most major OSS and commercial backends.
Zipkin: Zipkin, one of the oldest and most mature distributed tracing systems, was developed by Twitter based on a paper about Dapper, Google’s internal tracing solution, and open sourced in 2012. It helps gather timing data needed to troubleshoot latency problems in microservice architectures and manages both the collection and lookup of this data.
The Zipkin system consists of reporters (clients), collectors, a storage service, a query service, and a web UI. Reporters, the components that instrument applications, send data to collectors, which validate the incoming data and pass it on to storage. Zipkin ensures safety in a production environment by transmitting only a trace ID in-band to inform receivers that a trace is in progress, and by asynchronously sending the actual data collected by each reporter to the collectors. Users can then use the query interface and web UI to search, retrieve and explore traces from the database.
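The asynchronous reporting pattern described above can be sketched with a bounded queue and a background thread. `AsyncReporter` is a hypothetical class written for this illustration and is not Zipkin's client API; the `collected` list stands in for a collector.

```python
import queue
import threading


class AsyncReporter:
    """Hypothetical reporter that ships spans off the request path."""

    def __init__(self, send):
        self._send = send  # callable that delivers one span to a collector
        self._pending = queue.Queue(maxsize=1000)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def report(self, span):
        # Never block the application: drop the span if the buffer is full.
        try:
            self._pending.put_nowait(span)
        except queue.Full:
            pass

    def _drain(self):
        while True:
            span = self._pending.get()
            if span is None:  # sentinel: stop the worker
                break
            self._send(span)

    def close(self):
        self._pending.put(None)
        self._worker.join()


collected = []  # stands in for a collector
reporter = AsyncReporter(collected.append)
reporter.report({"trace_id": "abc", "name": "GET /users", "duration_ms": 12})
reporter.close()
```

The design choice worth noting is that `report` prefers losing a span over delaying a request, so tracing overhead cannot stall the application it observes.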
Zipkin is compatible with the OpenTracing standard.
Jaeger: Jaeger is a more recent distributed tracing product that originated at Uber Technologies and has since been adopted by the CNCF as an incubated project.
Jaeger’s architecture, with clients (reporters), collectors, a query service, and a web UI, is very similar to Zipkin’s. In addition to these components, Jaeger also has an agent on each host that locally aggregates the data, batches it and then sends it to a collector. The collector validates, transforms and stores the data, which the query service can access and provide to the web UI.
But unlike Zipkin, Jaeger by default samples only 0.1% of all traces that pass through each client in order to avoid being overwhelmed by data. The system uses a probabilistic sampling algorithm that users can tune to suit their data requirements. Jaeger is also currently working on an adaptive sampling solution that would assign sampling probabilities per service and endpoint, rather than per service alone, and dynamically adjust sampling rates based on their impact.
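A probabilistic sampling decision at a 0.1% rate can be sketched as below. The function `should_sample` and its hashing scheme are assumptions made for illustration and are not Jaeger's implementation.

```python
import hashlib


def should_sample(trace_id, rate=0.001):
    """Hypothetical sampler: keep roughly `rate` of all traces.

    Hashing the trace ID (rather than drawing a random number) means every
    service reaches the same decision for the same trace.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


trace_ids = [f"trace-{n}" for n in range(100_000)]
kept = sum(should_sample(trace_id) for trace_id in trace_ids)
```

With a rate of 0.001, `kept` should land in the neighborhood of 100 out of 100,000 traces. Raising `rate` for a low-traffic service, or lowering it for a busy one, is the kind of refinement that per-service or adaptive sampling is meant to automate.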
Jaeger is also fully compatible with the OpenTracing standard.
Appdash: Appdash is an open source system created at Sourcegraph based on Dapper and Zipkin.
Appdash’s architecture consists of three main components: a client that instruments the code and collects the spans, a local collector that receives those spans, and a central server running its own remote collector, to which the local collectors on all other nodes in the system send their data.
The Appdash solution is still not as mature as the other solutions and lacks documentation.
Distributed tracing, together with the other two pillars of observability, metrics and logs, has helped streamline the approach to monitoring, debugging and troubleshooting microservices applications. Today, there are several open source and commercial solutions available for gaining valuable operational insights into complex distributed environments. There are, however, still several challenges to address, such as the need for tracing solutions that are at least as scalable as the systems they monitor, greater standardization and better interoperability.