Over the last two decades, a shift in technology has brought the power of distributed systems from specialized fields such as telecommunications into the day-to-day operations of many enterprises. Alongside this shift came the need to understand and observe distributed systems at scale. In a distributed architecture with many dependencies, it can be difficult to pinpoint where a particular error or increase in latency originated before it affected your users. Let’s explore how we can use New Relic to understand hard-to-track problems in a distributed world.
Indicators
Generally, a DevOps engineer cares about a few performance metrics, such as:
- Size of payloads (size)
- Time of a particular request (duration)
- Whether the request was successful or not (error)
It’s also common to enrich these metrics with additional attributes that make debugging easier:
- Upstream application: For tracking which upstream service generated the request.
- Trace ID: For tracking an individual request through many systems.
If you’re running a multi-tenant application, you may also want to include tenant-related attributes, such as “userId” and “tenantId”, for debugging individual requests from a particular customer.
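To make the indicators and attributes above concrete, here’s a minimal sketch of capturing them around a single request handler. The `RequestIndicators` class, its `Indicators` record, and the `measure` helper are hypothetical names for illustration only; they are not part of any New Relic API. The attribute names simply mirror the ones used in this post (records require Java 16 or later).

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class RequestIndicators {

    // Hypothetical record: the three core indicators plus two debugging attributes.
    public record Indicators(long sizeBytes, Duration duration, boolean error,
                             String upstreamApplication, String traceId) {

        // Flattens the record into attribute names like those used in this post.
        public Map<String, Object> toAttributes() {
            Map<String, Object> attributes = new LinkedHashMap<>();
            attributes.put("size", sizeBytes);
            attributes.put("duration", duration.toMillis());
            attributes.put("error", error);
            attributes.put("upstreamApplication", upstreamApplication);
            attributes.put("traceId", traceId);
            return attributes;
        }
    }

    // Runs a handler, timing it and recording the payload size and whether it failed.
    public static Indicators measure(Supplier<byte[]> handler,
                                     String upstreamApplication, String traceId) {
        long start = System.nanoTime();
        byte[] payload = null;
        boolean error = false;
        try {
            payload = handler.get();
        } catch (RuntimeException e) {
            error = true; // sketch only: records the failure instead of rethrowing it
        }
        Duration duration = Duration.ofNanos(System.nanoTime() - start);
        long size = payload == null ? 0 : payload.length;
        return new Indicators(size, duration, error, upstreamApplication, traceId);
    }

    public static void main(String[] args) {
        Indicators ok = measure(() -> "hello".getBytes(StandardCharsets.UTF_8),
                "web-dashboard", "abc123");
        Indicators failed = measure(() -> { throw new IllegalStateException("boom"); },
                "web-dashboard", "abc124");
        System.out.println(ok.toAttributes());
        System.out.println(failed.toAttributes());
    }
}
```

In a real service you would likely rethrow after recording the error so callers still see the failure; swallowing it here just keeps the example short.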
A common way of passing this information between distributed applications is to include it in headers when making requests. This lets downstream applications understand who their clients are and the semantics of each call. This pattern works well for both synchronous and event-driven architectures.
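As an illustration of the header pattern, here’s a minimal client-side sketch using the JDK’s built-in `java.net.http` API. It reuses the `x-application-context` and `x-trace-id` header names from the examples in this post; the `buildDownstreamRequest` helper, the target URL, and the choice to generate a random UUID when no trace ID arrives are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class HeaderPropagation {

    // Header names used in this post's examples.
    static final String APPLICATION_CONTEXT_HEADER = "x-application-context";
    static final String TRACE_ID_HEADER = "x-trace-id";

    // Builds a downstream request that carries the caller's name and the trace ID.
    // If no trace ID was received, a new one is generated (an assumption of this sketch).
    public static HttpRequest buildDownstreamRequest(URI target, String appName,
                                                     String incomingTraceId) {
        String traceId = (incomingTraceId == null || incomingTraceId.isBlank())
                ? UUID.randomUUID().toString()
                : incomingTraceId;
        return HttpRequest.newBuilder(target)
                .header(APPLICATION_CONTEXT_HEADER, appName)
                .header(TRACE_ID_HEADER, traceId)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildDownstreamRequest(
                URI.create("http://auth-service.internal/api/doAction"), // hypothetical URL
                "api-gateway", "abc123");
        // Show the headers that would be sent; the request is never actually sent here.
        System.out.println(request.headers().map());
    }
}
```

Forwarding the incoming trace ID unchanged is what lets every hop in the call chain report the same value.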
Instrumentation
New Relic offers many ways to instrument your applications with relevant data. While power users may rely on custom events, simpler use cases can often be covered by adding custom attributes to “transaction” events. Let’s take a look at a few simple examples.
Ruby on Rails
In a Ruby on Rails application, you can easily add custom attributes to your transactions with the New Relic agent, as in the following example:
class MyController < ApplicationController
  def doAction
    # Attach the caller and trace ID to the current New Relic transaction
    NewRelic::Agent.add_custom_attributes({
      upstreamApplication: request.headers['x-application-context'],
      traceId: request.headers['x-trace-id']
    })
    # ...
  end
end
Java Spring
For Java Spring developers, the process is just as straightforward. Below is an example of how to achieve similar instrumentation in a Java Spring application:
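import com.newrelic.api.agent.NewRelic;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class MyController {

    @RequestMapping(path = "/api/doAction", method = RequestMethod.POST)
    public void doAction(
            @RequestHeader("x-application-context") String applicationContext,
            @RequestHeader("x-trace-id") String traceId) {
        // Attach the caller and trace ID to the current New Relic transaction
        NewRelic.addCustomParameter("upstreamApplication", applicationContext);
        NewRelic.addCustomParameter("traceId", traceId);
        // ...
    }
}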
Using the New Relic agent APIs, you can quickly record information about upstream applications, which can be useful when debugging complex issues.
Distributed tracing
If you’ve enabled distributed tracing, New Relic agents add trace context headers automatically and propagate them between your applications. You can rely on this built-in functionality without having to instrument both your client and server with matching headers.
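For context on what gets propagated: New Relic distributed tracing supports the W3C Trace Context standard, in which a `traceparent` header carries the trace ID in the form `version-traceid-parentid-flags` (the agents also send their own `newrelic` header). The minimal sketch below parses a version-00 `traceparent` value following that spec. It is illustrative only; the agents handle this for you, and the `TraceParent` class and `Context` record are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TraceParent {

    // W3C "traceparent", version 00:
    // version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
    private static final Pattern TRACEPARENT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    public record Context(String traceId, String parentId, boolean sampled) {}

    // Returns the parsed context, or empty if the header is missing or malformed.
    public static Optional<Context> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = TRACEPARENT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        // The spec treats all-zero trace IDs and parent IDs as invalid.
        if (traceId.chars().allMatch(c -> c == '0') || parentId.chars().allMatch(c -> c == '0')) {
            return Optional.empty();
        }
        // The lowest bit of trace-flags is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return Optional.of(new Context(traceId, parentId, sampled));
    }

    public static void main(String[] args) {
        System.out.println(parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"));
        System.out.println(parse("not-a-traceparent"));
    }
}
```

Because every service extracts the same trace-id segment, the transactions they report can later be grouped by that single value.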
Observe
In New Relic, you can then issue a query such as the following:
FROM Transaction SELECT traceId, appName, upstreamApplication
Here, you can see the different transactions that occurred and where they originated:
To track a particular failing request through multiple services, you can filter the query on a traceId. This returns a table showing the request’s path from the client to the server:
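If you build that per-trace query programmatically, for example from a trace ID pulled out of a log line, a small helper like the hypothetical `TraceQuery.forTrace` below can assemble the NRQL string. It assumes the default `error` attribute on Transaction events and a one-day time window; the character check is a simple safeguard so an unexpected value can’t break the single-quoted literal, not a complete escaping scheme.

```java
import java.util.regex.Pattern;

public class TraceQuery {

    // Accept only characters that typically appear in trace IDs: letters, digits, hyphens.
    private static final Pattern SAFE_TRACE_ID = Pattern.compile("[A-Za-z0-9-]{1,64}");

    // Builds an NRQL query listing every transaction that belongs to one trace.
    public static String forTrace(String traceId) {
        if (traceId == null || !SAFE_TRACE_ID.matcher(traceId).matches()) {
            throw new IllegalArgumentException("unexpected traceId: " + traceId);
        }
        return "FROM Transaction SELECT traceId, appName, upstreamApplication, error "
                + "WHERE traceId = '" + traceId + "' SINCE 1 day ago";
    }

    public static void main(String[] args) {
        // prints: FROM Transaction SELECT traceId, appName, upstreamApplication, error WHERE traceId = 'abc123' SINCE 1 day ago
        System.out.println(forTrace("abc123"));
    }
}
```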
You can then select additional attributes already present on transaction events to identify that an error reported by an upstream application actually came from one of its dependencies. In this example, we can see that the error was actually caused by the “api-gateway” service, which sits between the “web-dashboard” and “auth-service” services.
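The reasoning in that example can be written down as a simple heuristic: among the services in one trace that reported an error, the likely origin is one whose own downstream calls did not also fail. Here’s a minimal sketch of that idea. The `RootCause` class and `Row` record are hypothetical, the rows are made-up data mirroring the example trace, and the heuristic assumes errors propagate upward to callers, which won’t hold for every system.

```java
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

public class RootCause {

    // One row of a per-trace query: the service, its caller, and whether it errored.
    public record Row(String appName, String upstreamApplication, boolean error) {}

    // A failing service is a likely origin if none of the services it called also failed.
    public static List<String> likelyOrigins(List<Row> rows) {
        // Callers of failing rows: these services received an error from a dependency.
        Set<String> callersOfFailures = rows.stream()
                .filter(Row::error)
                .map(Row::upstreamApplication)
                .filter(Objects::nonNull)
                .collect(Collectors.toSet());
        return rows.stream()
                .filter(Row::error)
                .map(Row::appName)
                .filter(app -> !callersOfFailures.contains(app))
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical rows mirroring the example trace in this post.
        List<Row> trace = List.of(
                new Row("web-dashboard", null, true),
                new Row("api-gateway", "web-dashboard", true),
                new Row("auth-service", "api-gateway", false));
        System.out.println(likelyOrigins(trace)); // prints [api-gateway]
    }
}
```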
With this in mind, we can now investigate the failing application in APM and see that the error occurred during a token refresh in the api-gateway service:
Using our pre-built Distributed Tracing UI, you can also see a map of application dependencies, as well as individual transactions and their dependencies, alongside key performance information such as the span’s duration:
Here, for example, you can see how New Relic uses distributed tracing to follow sign-up flows that call multiple APIs on our user service.
Conclusion
When you have a distributed infrastructure composed of many microservices, debugging even a simple error can be challenging. Adding debugging information such as “traceId” and “upstreamApplication” lets you quickly and efficiently track down the source of errors in a complicated mesh of system calls. Using New Relic distributed tracing makes that even easier.
Next steps
Want to improve your visibility into distributed systems? Sign up for New Relic today and start tracking down complex issues with ease.
For more detailed guidance on using New Relic with distributed systems, check out the following resources:
- A guide to distributed tracing
- Track requests across your microservices
Alternatively, you can take our self-paced course on distributed tracing.