Workload mTLS with ACME & Go

December 5, 2022

DISCLAIMER: I currently work for Google. This post is published in my personal capacity, without using any knowledge I may have obtained through my employment. All the information provided here comes from purely personal time and effort and does not represent the opinions or practices of my employer.


ACME, standardized in RFC 8555, provides a way of automatically issuing certificates at scale, through an API, eliminating all manual work. Most of the TLS certificates issued in the WebPKI today are issued using ACME. It is an enormous milestone for a protocol first presented in 2016 and standardized in 2019!

The protocol is so popular that Go ships an ACME client in its quasi-standard x/crypto library, and even a helper package, autocert, which automatically takes care of all the certificate needs of an application.

I wanted to capitalize on that and make workload-level TLS as automated, reliable, and painless as Let’s Encrypt made website certificates.

The Design

The place I applied this is my own Kubernetes cluster. It is also used outside of that, for some applications that run separately, but it’s mostly for the services and jobs running there. My goal is to secure all service-to-service communication with End to End Encryption and remove any reliance on transport security at the network layer. Not only does this simplify the network setup, it also offloads all encryption and decryption to each application, so the routers don’t have to do a lot of work. E2EE is a very nice bonus. Even jobs on the same worker / VM use TLS between them. It’s so cheap at my scale…

For this to work, each service running on the cluster (or outside it), and each instance of it, has to present a TLS certificate to its clients. I could configure a certificate for each one, issued by hand, but that is very problematic: I’d have to either use very long-lived certs or set up a recurring calendar event to renew them.

I could also use something like cert-manager that automatically gets certificates from a public ACME CA, but this creates problems. In order to not modify TLS verification, I would need a way to map public hostnames to internal IP addresses, and I would have to abide by all the rules of the WebPKI. It could also pose rate limit concerns with the ACME CA I am using.

I opted for per-workload certificates, where each instance of the same service has its own TLS certificate, private key, etc. I also wanted short-lived certificates (Go doesn’t support requesting them with autocert, but I sent a CL to add this). This prevents the sharing of private keys across services, and each key can live in memory only, for the duration of the job. After that, it is discarded. No disk storage, nothing.

For this to scale, I needed a way for all services to automatically get a certificate. And the solution to that is clearly ACME. But let’s not stop there.

In TLS, you can do “mTLS”: authenticating not just the server but also the client. In order for a client to make a connection to such a server, it needs a certificate too. Control of the private key of a trusted certificate proves to the server that this client is allowed to connect. If a connection does not present one, it should be terminated.

There’s a lot out there about whether mTLS should be used for authentication or even authorization, and there are experts who disagree, but that debate is mostly about publicly trusted certificates. With a private CA it starts to make more sense. Or at least it did for me, for this one use case, at the time… :)

If we are to require all clients to use an X.509 certificate in order to talk to any server, we need to make sure that they are able to reliably get those certificates as well.

If a service is simply exposing something, it only needs a TLS server certificate. If a job makes requests to a service, it only needs a TLS client certificate. If it does both, then it needs two certificates, a client and a server one.

If the above works, and everything is renewed on time, each process / workload should be able to talk to any other, and the connection will be protected and encrypted End to End with TLS. Both the server and the client will be authenticated, and if someone hijacks the network, or gains access to it, they cannot read or modify any data, they cannot impersonate any service, and they can’t even establish a connection, because they don’t have a client certificate.

So how do we get them from ACME?

The ACME Server

We need an ACME server that we can run (e.g. on-premises) and that can issue TLS certificates for both servers and clients. There are many available out there, each built with a particular purpose in mind, but I decided to implement my own. It’s a nice opportunity to learn the ins and outs of ACME, which comes as a bonus.

As I don’t need an externally usable server, I took the liberty of implementing only the minimum set of endpoints needed from the API. This is fine, as to my knowledge there isn’t a single ACME implementation that supports all of RFC 8555.

With Go it wasn’t that difficult, and I even managed to keep all SubCA keys in HSMs. I was able to use some of the ACME code from x/crypto, despite it being a client-only implementation. But HSMs are expensive, right? How did I get away with that?

By “HSM” I refer to low-cost “HSMs”, such as the YubiKey 5 in PIV mode, which provides a very minimal API, and the NitroKey HSM 2, which surprisingly provides a very feature-rich PKCS#11 implementation, at least for the CHF 50 range :) It can even do encrypted key export, to load the same SubCA into multiple HSMs! Over a year ago, I also ordered some SoloKeys V2, but due to tremendous problems in their manufacturing in Italy, they have yet to arrive. When they do, maybe I’ll post a comparison.

As all my CAs use ECDSA, I never ran into the storage limits of any of these devices, especially after my PR to a Go library for interacting with YubiKeys went through! Signing performance, on the other hand, is a different story. More on that later, and some of it now:

Revocation

In any PKI, you need to plan for revoking certificates (invalidating them before they expire). As expected, I had some problems with that. The typical problem with revocation checking is that it costs you performance, or privacy, or both: there’s high latency between the clients and the CA servers, there’s a lot of data to download, and the CA can see your “browsing history”.

Luckily, none of that was a problem for me. The OCSP responders are almost always “DC-local”, which means they live within 2 ms of the client making the request, often running on the same hardware server too! Even if I ran them on Cloudflare Workers, it would mostly be the same datacenter, or one in the same city, and traffic would go over one of the multiple peering interconnections my network has with theirs.

Also, since these are high bandwidth servers, they have no problem downloading a multi-MB CRL a few times a day and checking against it. Finally, there are no privacy concerns, as I run and own everything in this environment, and there’s no need to protect my privacy from myself.

The problem was the “HSMs”. In order for a CA to serve OCSP responses, they have to be signed, either in advance or at query time. My HSMs can reliably sign ECDSA over P-256 at something in the range of 5 signatures per second. And this has to happen every few hours, e.g. 4 or 8, to reduce the window of vulnerability. At the peak, as I average 2.5 certificates per pod, I had tens of thousands of active certificates (drains can create many, in spikes!), and each of them required a signature. For a sense of scale: pre-signing responses for, say, 25,000 certificates at 5 signatures per second takes 5,000 seconds, well over an hour of uninterrupted HSM time. This led to hour-long periods where the HSMs were busy and couldn’t issue anything without some luck. But I dislike OCSP Delegated Responders more, so…

I tried online signing: every time there was an OCSP request, I would ask the HSM to sign the response and deliver it. Combined with removing OCSP checks in most services, this visibly reduced and spread out the load, as I was caching the responses in memory for their entire validity. I actually scaled down my OCSP responder pods to improve cache hits, and it became manageable, with no more HSM deadlocks due to revocation.
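For illustration, here is a minimal sketch of what that online-signing path can look like in Go, using golang.org/x/crypto/ocsp. The cache shape, the 8-hour validity window, and the HSM being exposed as a crypto.Signer are assumptions of the sketch, not necessarily how my responder is written:

import (
	"crypto"
	"crypto/x509"
	"sync"
	"time"

	"golang.org/x/crypto/ocsp"
)

// cachedResponse pairs a DER-encoded OCSP response with its expiry.
type cachedResponse struct {
	der []byte
	exp time.Time
}

var ocspCache sync.Map // certificate serial (string) -> cachedResponse

// signOCSP serves a cached response while it is still valid, and only asks
// the HSM (a crypto.Signer) for a fresh signature on a cache miss.
func signOCSP(issuer, responder *x509.Certificate, hsm crypto.Signer, req *ocsp.Request) ([]byte, error) {
	key := req.SerialNumber.String()
	if v, ok := ocspCache.Load(key); ok {
		if c := v.(cachedResponse); time.Now().Before(c.exp) {
			return c.der, nil
		}
	}
	tmpl := ocsp.Response{
		Status:       ocsp.Good, // a real responder would consult the CA database here
		SerialNumber: req.SerialNumber,
		ThisUpdate:   time.Now(),
		NextUpdate:   time.Now().Add(8 * time.Hour),
	}
	der, err := ocsp.CreateResponse(issuer, responder, tmpl, hsm)
	if err != nil {
		return nil, err
	}
	ocspCache.Store(key, cachedResponse{der: der, exp: tmpl.NextUpdate})
	return der, nil
}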

Since OCSP and CRLs work over HTTP, and not HTTPS, I had to use a domain name that is not on the HSTS Preload List. Since I add all of my domains there, I got a new one: plzcheck.it.

After running the system described here (in various iterations) for almost two years now, I finally decided to take the last step in its revocation pipeline:

I am now issuing 24-hour certificates, renewed every 16 hours. These very short-lived certificates don’t need OCSP or CRLs, as the window for abusing them is extremely limited, especially since their private keys are new every time, live in memory only, and never touch a persistent storage medium. As I grow more confident in my PKI, I may consider reducing this even further, but my HSMs are the bottleneck in the long run, especially during large pod startup events, whether due to drains or autoscalers. It also means that I have at most 8 hours to fix problems with my ACME CA, or all network traffic will stop. This is fine!

The application side

All the applications I have written are in Go, so it was extremely easy to integrate with my CA and ensure that fresh certificates are always available for both inbound and outbound connections. I am using the autocert package from x/crypto, which takes care of everything.

TLS Servers

Here’s an example of a server ACME client:

// srvcm manages the server certificates. RequestedCertificateValidity
// comes from my x/crypto/acme fork; it is not in upstream autocert.
srvcm := &autocert.Manager{
	Prompt:                       autocert.AcceptTOS,
	Email:                        "my.email.here@not.that.i.send.any.to.myself",
	HostPolicy:                   autocert.HostWhitelist(strings.Split(*hn, ",")...),
	RenewBefore:                  8 * time.Hour,
	Client:                       loadAcmeClient(),
	RequestedCertificateValidity: 24 * time.Hour,
}

The variable hn contains a comma-separated list of all the hostnames this server is reachable at. Each service may be reachable either at a common endpoint shared by all instances, or at a separate one per instance, for direct communication. You can have the server obtain a certificate for whatever hostname appears in the SNI of a TLS connection, but I didn’t want that.

The function loadAcmeClient() returns a new ACME client every time it is called. In order to not rely on persistent accounts, I create a throwaway account for every pod, and I have a job that deletes accounts that haven’t issued a certificate in more than 7 days from the server database. The function sets the ACME directory and the user agent (to binary name + version), and generates a new ECDSA P-256 account key. In case you’re wondering why 7 days: I have jobs that only get client certificates, act on external triggers, and don’t expose Prometheus metrics (which is what they’d need server certs for). Perhaps I’ll find the time to implement the ability to use a different ACME client for every renewal attempt, if it scales.
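For reference, a minimal sketch of what loadAcmeClient() amounts to with x/crypto’s acme package; the directory URL and the version variable are placeholders, not my real values:

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"log"
	"os"
	"path/filepath"

	"golang.org/x/crypto/acme"
)

var version = "dev" // placeholder; stamped at build time in reality

func loadAcmeClient() *acme.Client {
	// A fresh ECDSA P-256 key on every call means a throwaway account per pod.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	return &acme.Client{
		Key:          key,
		DirectoryURL: "https://acme.internal.example/directory", // placeholder
		UserAgent:    filepath.Base(os.Args[0]) + "/" + version,
	}
}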

As you can see, RequestedCertificateValidity exists; this is because I run a private fork of Go’s x/crypto/acme that incorporates my patch.

For the TLS listener, I am using something like this:

srvTLS := &tls.Config{
	SessionTicketsDisabled: true,
	MinVersion:             tls.VersionTLS13,
	ClientAuth:             tls.RequireAndVerifyClientCert,
	ClientCAs:              cca, // CAs allowed to sign client certificates
	RootCAs:                rca, // CAs allowed to sign server certificates
	GetCertificate:         srvcm.GetCertificate,
	NextProtos:             []string{"h2", acme.ALPNProto}, // allows solving TLS-ALPN-01
}

This mandates TLSv1.3: as I am in direct control of all my jobs, I can be confident that all clients support it, so there’s no need to compromise security by supporting TLSv1.2. I really hope that by TLSv1.4 the upgrade will be easy and painless cluster-wide!

I am asking Go to request client certificates, and also to verify their validity, using cca as the set of valid client-certificate CAs. In practice, since I don’t want every service to be able to talk to every other service (as would be the case here), I also implement a VerifyPeerCertificate function that only allows specific TLS client certs to connect, based on an ACL.
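As a sketch, such a verifier can look like the following; keying the ACL off the leaf certificate’s Common Name is an assumption I am making here for brevity:

import (
	"crypto/x509"
	"errors"
)

// aclVerifier returns a VerifyPeerCertificate callback that only lets
// allowlisted client certificates through.
func aclVerifier(allowed map[string]bool) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, verifiedChains [][]*x509.Certificate) error {
		// verifiedChains is non-empty and already validated against
		// ClientCAs, because ClientAuth is RequireAndVerifyClientCert.
		for _, chain := range verifiedChains {
			if allowed[chain[0].Subject.CommonName] { // chain[0] is the leaf
				return nil
			}
		}
		return errors.New("client certificate not on the ACL")
	}
}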

This tls.Config can be passed to a raw TLS listener, an HTTP server, or a gRPC transport (credentials.NewTLS()) to wrap all communications in TLS! I have my own package that lets me very easily do all of that and get a final tls.Config for use in the listener.
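For example, assuming the config above is stored in srvTLS (and mux stands in for whatever handler the service uses), wiring it up looks roughly like this:

// net/http: empty cert/key paths, since certs come from GetCertificate.
httpSrv := &http.Server{Addr: ":8443", TLSConfig: srvTLS, Handler: mux}
go func() { log.Fatal(httpSrv.ListenAndServeTLS("", "")) }()

// gRPC: wrap the same tls.Config in transport credentials
// (google.golang.org/grpc/credentials).
grpcSrv := grpc.NewServer(grpc.Creds(credentials.NewTLS(srvTLS)))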

TLS Clients

This is where it gets interesting. This is not a common use case, as ACME is mostly used for server certificates today, and the autocert library was built with that in mind. And since there is no standardized way to tell an ACME CA which Extended Key Usage a certificate should get (the attribute that makes it a server or a client cert), library support is also lacking.

First of all, I wanted a way to specify whether an ACME certificate is for a client or a server. I ended up modifying the library to pass this information up to the CA in the CSR.
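My change lives inside the CSR-building code of the fork, but as a standalone sketch, requesting the clientAuth Extended Key Usage through a CSR extension looks roughly like this:

import (
	"crypto"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
)

// id-ce-extKeyUsage (2.5.29.37) and id-kp-clientAuth (1.3.6.1.5.5.7.3.2).
var (
	oidExtKeyUsage = asn1.ObjectIdentifier{2, 5, 29, 37}
	oidClientAuth  = asn1.ObjectIdentifier{1, 3, 6, 1, 5, 5, 7, 3, 2}
)

// clientAuthCSR builds a CSR that asks the CA for a client certificate by
// embedding an Extended Key Usage extension in the request.
func clientAuthCSR(key crypto.Signer, cn string) ([]byte, error) {
	eku, err := asn1.Marshal([]asn1.ObjectIdentifier{oidClientAuth})
	if err != nil {
		return nil, err
	}
	tmpl := &x509.CertificateRequest{
		Subject: pkix.Name{CommonName: cn},
		ExtraExtensions: []pkix.Extension{{
			Id:    oidExtKeyUsage,
			Value: eku,
		}},
	}
	return x509.CreateCertificateRequest(rand.Reader, tmpl, key)
}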

Then, I wanted to use the autocert package as much as possible, despite it lacking any support for this. I ended up modifying the Manager to support server and client certificates separately. Some of the configuration options are reused, while others are different. In my current implementation, you cannot use one Manager for both: you have to create two.
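In code, that means two Managers living side by side. The knob that puts a Manager into client-certificate mode is specific to my fork, so I am eliding it here:

// srvcm is the server-certificate Manager shown earlier; clicm is its
// client-certificate sibling.
clicm := &autocert.Manager{
	Prompt:      autocert.AcceptTOS,
	RenewBefore: 8 * time.Hour,
	Client:      loadAcmeClient(),
	// ... fork-specific client-mode configuration ...
}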

The tls.Config for each client (gRPC, HTTP, etc.) looks something like this:

cliTLS := &tls.Config{
	MinVersion:           tls.VersionTLS13,
	RootCAs:              rca, // CAs allowed to sign server certificates
	GetClientCertificate: clicm.GetClientCertificate, // from the client-cert Manager
}
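Assuming the config above is stored in cliTLS, plugging it into an HTTP or gRPC client is a one-liner (the target address is a placeholder):

// net/http: every request presents the workload's client certificate.
httpc := &http.Client{Transport: &http.Transport{TLSClientConfig: cliTLS}}

// gRPC: same config, wrapped in transport credentials.
conn, err := grpc.Dial("service.internal:443",
	grpc.WithTransportCredentials(credentials.NewTLS(cliTLS)))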

Another problem I had was the ACME challenges. They are all designed for servers: you need to serve a file here, a random value there, or you need a valid DNS zone, in order to get certificates. This wasn’t flexible enough for workload client certificates.

One option was to host an “RA”, or Registration Authority (the part of a CA that verifies all the data before issuing a certificate, and checks whether the challenges are solved), within Kubernetes. Then, use the TLS-ALPN-01 challenge, for example, or the HTTP-01 challenge, to solve something on ports 443 or 80. But those were already used by the server above, and modifying autocert to share a single solver between two Managers was too complicated. Plus, I’d need to create (at least internal) hostnames for the RA to connect to.

For these reasons (and then some), I decided to leverage the flexibility of running my own ACME CA and introduce a new challenge: KUBERNETES-01. It is a private challenge, not intended for public use, e.g. in the WebPKI, and for now I have no reason to pursue its standardization. After adding support for it to my fork of the acme library, the client solves this challenge by doing nothing: it immediately asks the CA to check whether it is solved.
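With x/crypto’s acme types, the client side of this reduces to roughly the following; the challenge type string is mine and is not recognized anywhere else:

import (
	"context"
	"errors"

	"golang.org/x/crypto/acme"
)

// solveKubernetes01 accepts the private KUBERNETES-01 challenge without
// provisioning anything, then waits for the CA/RA to verify the workload.
func solveKubernetes01(ctx context.Context, cl *acme.Client, authz *acme.Authorization) error {
	for _, ch := range authz.Challenges {
		if ch.Type != "kubernetes-01" { // private, non-standard type
			continue
		}
		if _, err := cl.Accept(ctx, ch); err != nil {
			return err
		}
		// The RA now checks the Kubernetes API; we just poll the result.
		_, err := cl.WaitAuthorization(ctx, authz.URI)
		return err
	}
	return errors.New("no kubernetes-01 challenge offered")
}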

Upon a request to verify it, the RA talks to the Kubernetes API and verifies (to my satisfaction) that the workload is indeed valid, is the one it claims to be, etc., and then marks the challenge as solved.
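On the RA side, the check can be as simple as confirming that the pod exists and is running; how the workload identity maps to a pod name, and the exact policy, are assumptions of this client-go sketch:

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// verifyWorkload checks, via the Kubernetes API, that the pod named in the
// order actually exists and is running before marking the challenge solved.
func verifyWorkload(ctx context.Context, cs kubernetes.Interface, ns, pod string) error {
	p, err := cs.CoreV1().Pods(ns).Get(ctx, pod, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if p.Status.Phase != corev1.PodRunning {
		return fmt.Errorf("pod %s/%s is not running", ns, pod)
	}
	return nil
}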

A final modification I made for this project, necessary for my setup (but valuable for others too), was adding support in autocert for multiple ACME servers. Instead of a single Client in the Manager, I accept a slice (array) of them, and I pick one at random for every call. There’s great value in this, so I may go through the process of also sending it upstream, when I find some time. As each acme.Client has a different DirectoryURL, this effectively allows any number of CAs, with seamless round-robin or failover between them.
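The selection itself is trivial. A sketch of the behavior, assuming the Manager hands its full client list to something like this on every ACME operation:

import (
	"math/rand"

	"golang.org/x/crypto/acme"
)

// pickClient chooses one of several independent ACME CAs uniformly at
// random; a CA that fails is simply retried against another later.
func pickClient(clients []*acme.Client) *acme.Client {
	return clients[rand.Intn(len(clients))]
}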

The reason I needed this was my HSMs. At 2-4 certificates per second per HSM, the global capacity of my CA would be very limited. For this reason, I run multiple ACME CAs, completely independent of each other, and each CA has its own YubiKey / NitroKey HSM. In rack servers colocated in commercial facilities, I include as many YubiKey Nanos as I can, either internally to the server or, depending on the facility, externally as well. They’re too difficult to remove anyway, and PIN-protected :) In privately owned facilities (my house), I employ a USB hub. For each “HSM”, I run a new copy of the CA software. cp is faster than making your software multi-tenant ;)

If I needed Certificate Transparency, it could drop me down to less than half this issuance rate, due to the extra signatures required for precertificates. Luckily, a few months ago, Cloudflare unknowingly subsidized around 30 extra certificates per second for my CA!

In order to use YubiKeys securely (their Nano model is much more convenient than the NitroKey HSM 2), without having to generate CA private keys on a laptop and then import them, but instead generating them on the hardware, I use a different SubCA per ACME endpoint. As a result, servers and clients can end up with any of several certificate chains, but this is fine, as they all chain up to the same Roots. Most of my problems would go away if I used a real HSM, a Cloud HSM, or just accepted CA key material in a file, but one of the benefits of this setup is that it’s less likely to fail. Even during a network split on my backbone, workloads can keep trying until they find an ACME CA that works, possibly one colocated with them. I also run one Out of Band (not within my AS), as a last resort, for when there’s still Internet access. It doesn’t have the capacity to carry everything on its own, but it’s better than nothing.

Non-homemade apps

I need to run some applications that I haven’t written, which therefore can’t use my Go packages, and they need a way to participate in this mTLS universe too. For the ones running on Kubernetes, I used the cert-manager integration, pointing it to my ACME servers. There are some issues there with the client certificates, but luckily those aren’t really needed that often. The biggest problem I’ve observed is the lack of easy certificate reloading (some apps need a restart), so having to deal with that every 16 hours can be a pain. I may decide to keep OCSP & CRLs and use e.g. monthly certificates for these apps.

It is similar with non-homemade apps running outside Kubernetes: a regular ACME client, e.g. acme.sh, can easily get those certs.

My personal devices

Finally, the last use case I needed to cover was my personal devices, e.g. my laptop. They need a way to participate in this mTLS universe too. For external services, I (mostly) use publicly trusted certificates on the server side, but client certificates and communication with internal workloads still live within this PKI.

Since there’s a lot of work done there, I am leaving this for a separate blog post. You can read it here.

In conclusion

With the setup described above, I am running an end-to-end encrypted infrastructure where all services are authenticated and speak securely to each other. I don’t have to manage certificates, renew them, share them across services, store them, etc., and all private keys are unique and rotated every 16 or so hours. Only TLSv1.3 is used, with Perfect Forward Secrecy, and I can sleep well knowing that it’s unlikely any pcaps will be useful any time soon.

All of the above is very easily added to any Go app I write and can secure gRPC, HTTP, or even raw TCP connections. There’s no configuration necessary on the client side (however, defaults can be overridden). If I worked with other languages too, I could perhaps create similar packages, but I think it would probably not be as easy without the use of third party tooling.

My fork of the Go crypto library is private, as it’s tailored to my specific needs and is something I have to maintain, but I will consider submitting CLs with the interesting bits, after heavy refactoring and coordination with the Go team (on the interface design) to make them generically applicable.

In terms of performance, I am patiently awaiting Go 1.20 and its TLS connection pooling, but I haven’t noticed any significant throughput issues, only the added latency in the time to first byte due to the additional round trip required for the TLS handshake. The only problem I found was in applications using sendfile(2), which makes sense, but hopefully kTLS will address this for the apps that implement it, especially on modern hardware. It’s already above my requirements, so I don’t really mind.

ACME is a very nice and flexible protocol, it wasn’t that difficult to implement in Go (for a specific use case, outside the WebPKI), and it really proved useful and scalable. I would say that most of the limitations I found above were due to ecosystem immaturity and not the ACME protocol itself. This is something that will get better with time, and I fully intend to help with that however I can. So many things are tied to how Let’s Encrypt does it, instead of following the generic ACME way. Given that it is by far the largest CA, as well as the first, this is understandable. I am just glad I didn’t have to implement SCEP ;)