Open Agent Management Protocol (OpAMP) is the emerging open standard to manage a fleet of telemetry agents at scale.

Take a look at the conversation between Nočnica Mellifera and Srikanth as we discuss recent updates to the standard and how you can remotely manage the OpenTelemetry collector with OpAMP.

Summary of the Talk

Find the conversation transcript below.👇

Nica: Folks, the transition between seasons in Oregon is pretty distinct. You go from this beautiful autumn weather, cold but kind of clear, to absolutely bucketing down rain. Not quite cold; it never freezes here in Oregon, not really. But just enough so you can't grow avocados. Oh boy, it was cold. We spent a little time at the beach this weekend, and that is work only for very committed people, I'll tell you that.

But that's not what we're here to talk about. We're going to talk about OpenTelemetry, an open open-agent management protocol. And we have Srikanth, a former contributor, putting it in the right direction with us today. Say hi to the people, and tell them what you do for SigNoz.

Srikanth: Hey, everyone. I'm Srikanth. I work for SigNoz. I mostly worked on the metrics and alerts, and I was a maintainer for the OpenTelemetry Python SDK. Currently, I work mostly on the Open Agent Management Protocol (OpAMP).

Nica: Very cool. This is something that a lot of people have been curious about, and I'm excited to talk more about it. We were just having this tailbase sampling conversation, and I was like, "Man, isn't it possible to do something like that with the agent management protocol?" We'll get to that.

But it seems as we start to ask these big, large-scale, and highly sophisticated feature questions, this could be part of the answer. I wrote up some questions. Some of them were a little silly, but I was just getting started on this topic area. Let me go ahead and drop the spec document that we're talking about. I think this is one thing I might contribute to the project at some point. I'd like to see a more high-level doc get released rather than just the spec, which is what's currently out there. Anyway, here's the link if anybody wants to follow those OpenTelemetry docs for the spec.

Agent vs Collector

I want to start with, when looking at the spec, one of the things is you hear the term "agent" used over and over again. Now, I believe from reading it that we're talking generally about a collector. But is that wrong, or are we talking about an instrumentation agent? Help me with this very basic piece of knowledge.

Srikanth: So the spec is rightly worded. It is not just the collector; the protocol, the whole specification of the protocol, while the current focus is around the collector, the wording, and the way that's designed is not specifically for the collector. It could be an instrumentation agent; it could be anything. Any other program as long as it speaks the protocol. Today, the focus is mostly on the collector. Eventually, what could happen is that even the SDK can have the OpAMP implementation. It's not just the collector configuration that you could dynamically change; it could be a sampling rate at the SDK level as well. The wording is there, but it's not just the collector. It's worded in such a way that it could be anything.

It's any Telemetry agent that we are talking about.

Nica: Yeah, so the idea is that this is supposed to be a standard way, again not collector-specific, to let us do configuration on the OpenTelemetry data transmission.

Srikanth: Yes,

Nica: And that remains the focus, right? Is the idea of, like, hey, we want to control how this data is being said. I do have that part right?

Srikanth: Yeah, correct.

Nica: Okay. So, again, back to the basic questions. This is right now marked as beta, and I'm curious how you feel about it getting used in production. What pieces do you think are working well, and what pieces do you think are still to be defined?

Srikanth: Yeah. So, for an end user, I don't think there is much that they can use today. It's like, you know, there still have a lot of the protocol is beta, the specification is parts of beta. Ultimately, for an end user, the implementation is what they are concerned about and parts of the implementation in terms of what end users get. Is there support for operator-like? How do they go about controlling today?

So, that implementation is still very much in development work. For an end user, there's still a lot of work to do for them to even try it out. But there's a lot of effort going on in getting those things out. But for a vendor, I would say, you know, I would hope to see that everybody does something around it. So, if you are an observability vendor, try it out, do some POCs, get feedback from the users, and then bring those use cases. What problems do your users want to solve so that the implementation can go in that direction?

Nica: Yeah. And that's been kind of my understanding too, that like, this promises quite a bit. And I think it's worth looking at. As an end user, what might you see working next year? Right? For live collector config, it's worth thinking about.

Tail-based sampling

But not something that we want to see kind of, "Hey, don't worry." For example, I'll start this now. I didn't even put in a banner for it. But the one that has been on my mind is the idea of tail-based sampling when you get into a conversation about how are traces selected and you look at, of course, you know, the vast majority of traces that you send are not useful. So, you'd like to be a little more focused. And rather than making decisions when a trace starts, wouldn't it be cool if you could make decisions after a trace has been captured, which are we going to transmit? And this raises some really basic Computer Science 202 problems, right? Like, how do you save resources if you have to process every single trace anyway?

The basic one is, “Hey if you decide at a particular point, let's say at a collector, about, "Oh, wow, this kind of trace is really interesting. I want to see a bunch more of those." How do you synchronize across multiple points?”

Presumably, you are not using one collector to say, "Hey, okay, go ahead and change the sample rate on this trace smartly." And if you don't have it distributed, then your tail-based sampling becomes a massive bottleneck. If you do have it distributed, then you have the synchronization problem. So, today, I would not say, "Oh, you can do this with OpAPM and the collector as it currently stands." But I would say, like, hey, if you know this is something that you're going to want to cover further down the road, I think you may see a solution for this, say, in 12 months or two years.

This is a defined system to say, "Hey, let's communicate about how we want to communicate these." And so, you may see this in a little while. Sorry, that was me talking a lot. But what do you want to say about tail-based sampling and the usability thereof?

Srikanth: Yeah, so tail-based, the agent management part, has several things, and config management is one of them. It enables, by dynamically giving you a chance to dynamically change the config, it enables a lot of these use cases. But the issue with that, specifically the tail-based sampling, is that it requires a particular kind of setup. When you do this tail-based sampling, you need to have the load-balancing export, like an exporter in place so that the span of the same trace goes to the same collector.

So, when the sampling decision is made, it's made on the full trace. That requires a particular kind of setup, and those implementations are not ready. There's not an easy way to do this today. Although you can change the configuration dynamically, a setup that requires tail-based sampling cannot be controlled dynamically.

Nica: And you very correctly say, "Even if the feature was there to do remote config, you would need a very particular architecture to support that correctly," because the act of saying, "Hey, I want to sample a bunch of these at this rate," even in the very rough version of, like, we'll take half of them or something, there's not an easy way to do that with a filtering system that is supposed to be stateless.

It's supposed to get the config and nothing else. But yeah, I saw some discussion at KubeCon about, well, you know, you could introduce a plug-in or a processor that is stateful that goes and checks some database file or something else. But even so, you still need to be handling your traces in a very particular way to have them be useful. I don't mean to pour water over.

Everybody's tailbase sampling dreams. I think maybe there are smarter ways to configure your head-based sampling and smarter ways to configure how you're storing and managing your data overall, such that you stop being quite so stuck on this like, "Hey, what if I only saved the interesting traces somehow?" But anyway, that was on our minds.

So let's talk about some of the things that are supposed to be handled with this in the future. Let's talk about auto-updates, where right now, maybe you're seeing some implementations with the collector that require these relatively manual processes like restarting the collector, obviously for new config, but also updating being a little bit effortful currently.

Srikanth: Yeah. So, the protocol has something called packages. Those packages are top-level packages or add-ons. Those can be downloadable. The way auto-updates could work is a server offers a new downloadable version. Say, I got a new version, something like v88, and the server offers a v88 version. Here is where you get the URL where you can go and fetch the package.

The message that the server offers, contains the downloadable URL, a checksum that you can verify against all those details. It's the job of the agent, whether it's a supervisor server or integrated, to download that, restart the collector process with the newly downloaded version, and it can check if the newly updated package has detected some issues. It will offer the downgraded version, and if it exists, you download it and just run it with the existing downgraded version.

Credential Management

Nica: I see. Predicting proactively that a big part of this is going to be applying a certain package to your system. Right? So there's also some mention that was news to me when I was getting read up on this for this week. I didn't think about, again, I don't think a lot about security, is this thing about credentials management? There's some stuff in there about connection credentials management for stuff like client-side TLS, and certificate replication, and that was not stuff that I realized was implemented as part of this protocol.

Srikanth: Yeah. Agent management, as I mentioned earlier, has several parts. One is config, one is credentials, and then packaging packages. If you look at the protocol specifications, you will find them. The way credential management is structured, there are two ways: client-initiated requests and server-initiated requests.

In server-initiated requests, as soon as the client connects to the OpAMP server, it can offer. Okay, so what's the OPAM? So there's something called OPAM connection settings. As soon as the client connects to the server, the server can offer.

Okay, here are the settings that you need to use from the next time that you call me, like any time. Okay, so now we are connected and it offers settings. These TLS certificates should be used. So there's also an endpoint. Okay, so you should connect to the server.

The server can let's say, you, the agent, like the collector, have to send data to some point that is secured. Okay, so here's the endpoint, here are the TLS credentials that you should use. You should connect when you send the data.

And the thing with the server-initiated flow is that any time the server thinks that it has to rotate the certificates, it just offers the new certificates. So when a client sees that it has received new connection settings offered, it will follow the new connection settings and then rotate the certificates.

In the client-initiated certificate request, the flow is the same. It generates the certificates and then it sends them. Here are my credentials. The server accepts. So that's how the certificate management is done.

Nica: I see. So definitely seeing some possible improvements for how we're able to manage secure certificates with those connections.

All right, and let's give a moment here if you want to add any questions to the chat. Looks like we have a few people watching, which is fantastic. So I see a few models implemented for how the communication is working. Here's how I had this written down. Maybe this is not great, but can you describe the relationship and communication model between the client and the agent and how this model supports different implementation scenarios like if there's a sidecar or have a plugin if this is integrated directly in our code?

Communication Model

Srikanth: So, when you want to control, dynamically change certain things, like change the configuration, one way to go about it is integrated. By integrated, I mean there's a server that can offer dynamic configuration, and then there's a client. You could make this client part of the same code base that the collector runs. The client implementation will be part of the same collector binary. It runs along with the collector, and there's no separate process involved. It communicates with the server.

Whenever it receives a new configuration from the server, it just reinitiates the whole pipeline. Let's say your collector is running. You started it without restarting the same process. It just reinitializes, stops whatever is happening, says, "Hey, I received a new configuration," flushes the pipelines, and then starts over all over again.

That is integrated, where the client is part of the same binary that the collector is running. That's one way to do it. The other way to do it is not to make this integrated. Keep the client as a separate process. The issue with having the client being part of the integrated, one major issue with that approach is if a process crashes, if the collector process has crashed, who is responsible for bringing it back up?

Because since the client is also part of the collector process, who is going to bring it back up? There are these details where the client and the collector will go away at the same time, so there is nobody to keep a check on the mandatory existence of the collector process. That's one issue.

To avoid that issue, you have another model where the client is running as a separate process. This is called the supervisor approach. You have the collector, you have the supervisor. This supervisor process is what communicates with the server. It gets the configuration from the server and runs alongside the collector process.

Anytime it receives a new configuration, it writes it to the configuration file and restarts the collector process. If the collector process, for some reason, has failed to start, it could be due to an invalid configuration or anything else, it tries to revert it, roll back to the existing earlier configuration that was working, and sends the status to the server, saying, "Hey, the config that you have sent, I tried to give that configuration to the collector, but it failed to come up, failed to start," and then it sends the error message.

In this model, even if the collector crashes for any reason, you still have the supervisor process, which monitors the collector process and will make sure that there is always a collector process, either with the new configuration or with the old configuration.

So, these are the two major models and how this is implemented in a Kubernetes-native environment. When you talk about the sidecar implementation, if you take the supervisor approach in the Kubernetes environment, you have the sidecar supervisor and the main collector as part of the same pod. There are two containers, the supervisor and the collector, and they are part of the same pod. The supervisor will be doing the same work that it does inside Kubernetes.

Nica: This is a concern we've talked about a lot recently, knowing that there's a process there to pick up your data. There are two reasons for that. One we don't want gaps in our durability. We don't want weird blackouts. But also, it may not be healthy for your process to be trying to report to a collector who's not there on the system.

Good to worry about. You don't want to sit there trying to write failover directly into your implementation, so just making sure the process is running is pretty solid. I was going to talk about WebSocket and HTTP transport. I wasn't sure that this was insightful. This is the transmission of config data. I think we're pretty hip to that. I don't want to turn this into a long discussion of WebSocket versus HTTP in general.

I've noticed there's just talk about the supervisor process. So what is the role of that supervisor process? This is kind of what we were talking about, keeping the agent up and seeing that the collector process is running at all times. Is there more to that that we want to make sure that people understand as they're exploring this?

Srikanth: Yeah, I think that's mostly it. It's going to ensure that the collector process is running, and this is the main process that's communicating with the server. It fetches the updates, handles the process of downloading the new version, and then restarts the whole thing. It also manages the connection work. Essentially, the supervisor does the whole management work.

What’s next?

Nica: Got it. So, my last question is just about what you see as the immediate future of the next couple of quarters. Are you looking to see maybe some implementations of a toolkit to do remote collector config with this, or do you think it's going to be a little longer for that? Or are we kind of waiting on some vendors to create some more complex demonstrations of using the tool?

Srikanth: What I believe is that there has been a lot of effort going on into this, both in the operator and in the general collector side. There is good progress that has been made. What I hope to see in the next couple of quarters is that as an end user, you will get to see some sort of implementation where you can dynamically change the collector configuration. You will get to see a UI where the current configuration of the collectors is given, and then you can modify and trigger the config update. It goes to the collector, and as an end user, you can dynamically control that. You will see it happen in the operator and the regular collector, in both. That's what I see going on.

Nica: Yeah, no, I think that's sort of what we're all hoping for, right? Within your telemetry tool, within your APM tool, you're able to say, "Oh, I want to see a little more of this, I want to filter out things like this," and you're able to just feed it down as config. So, we'll see exciting developments next year. Maybe by the next KubeCon, we'll be talking about practical demonstrations of it.

Well, you know, we have covered our half-hour. We've got a lot of good information here. I'll be doing a write-up on this, so hopefully, give me a little help when we do that, to talk about this.

I think especially I want to talk about this in terms of the tail-base sampling world and kind of where we're at and where we see the next year of development. I don't want to give the impression that either tailbase sampling or config that is automatically shared across a whole swarm of agents is going to happen or something that you need to have working to be able to do OpenTelemetry at scale. So, we will be continuing that conversation.

Did you have the final stuff? Did you want people to get involved? There is an SIG, I believe, for people could join.

Srikanth: Yeah, there's a SIG, and there's a meeting that happens bi-weekly. It will be today in Pacific time for you. It starts in some time. There's a lot of work to do. You can join the SIG, and there's also a Slack channel. You can come and say hi. If you're interested in contributing, any sort of contribution, it does not have to be code. Awesome.

Nica: Yeah, I encourage people to check out the SIG calendar for OpenTelemetry. There's some great stuff there. Oh yeah, it's in like two hours. If you're interested, check it out. I will share the link to that calendar system because everything needs help.

You do not have to be a master hacker to be able to contribute a lot. See me tomorrow about this time at the collector SIG for sure, where I'll be trying to get a few more filter functions working and a little more general-purpose.

All right, folks, that's been our time. We will be back next week with more stuff about OpenTelemetry. We are on the SigNoz team. Check out Signoz if you're interested in the dashboard for your OpenTelemetry data. But in general, do try out OpenTelemetry and the open world of monitoring. It's got a lot to offer you.

Thank you so much. We will be back next week.

Thank you for taking out the time to read this transcript :) If you have any feedback or want any changes to the format, please create an issue.

Feel free to join our Slack community and say hi! 👋

SigNoz Slack community