A scalable OCSP Responder on Cloudflare Workers

November 16, 2022

DISCLAIMER: I am currently working for Google. This post is published in my personal capacity, without using any knowledge I may have obtained from my employment with them. All the information provided here is coming from purely personal time and effort and does not represent the opinions or practices of my employer.

We use X.509 certificates everywhere today, many times a day, without knowing it. They are used to secure TLS, e-mail via S/MIME, VPN servers, etc. They even help protect your connection right now, as you are reading this blog post.

However, for any number of reasons, these certificates may be revoked. Currently we have two primary mechanisms that are standardized and help clients know when this happens: CRL and OCSP.

As the background here is long, you can read more in my published work on CCSP to understand these mechanisms better.

What you need to know is that a CA that is issuing X.509 certificates may want to publish this revocation information, and make it accessible to clients, so they can make better decisions when verifying an e-mail message or connecting to a website.

Some CAs are even required (at least now) to operate OCSP responders, the most notable of which are WebPKI (aka Publicly Trusted) ones. This is a compliance requirement and they cannot skip it.

These endpoints may receive a lot of traffic, in terms of requests per second, and the CA must be able to respond successfully. Therefore, the issue of scalability is important, including during cold starts or traffic spikes. Requests may skyrocket unexpectedly if a client decides to start querying for this information.

A known situation where this could happen involves Apple’s operating systems, which use an internal system called “Valid”. It uses Bloom filters to probabilistically check whether a certificate is revoked, and only sends an OCSP request if there’s a chance that it is. Once a certificate matches the filter, most Apple devices around the globe suddenly send a request for every connection they make using this certificate.
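To make the "probabilistic check, then OCSP" idea concrete, here is a minimal, generic Bloom filter sketch. It is not Apple's actual implementation: the `BloomFilter` class, the FNV-1a based hashing, and the `needsOcspCheck` helper are all assumptions made for illustration. A Bloom filter never produces false negatives, but it can produce false positives, which is why a match only triggers an OCSP request instead of being treated as a revocation.

```typescript
// Generic Bloom filter sketch. NOT Apple's "Valid" implementation;
// the hashing scheme and all names here are illustrative assumptions.
class BloomFilter {
  private readonly bits: Uint8Array;

  constructor(private readonly size: number, private readonly hashes: number) {
    this.bits = new Uint8Array(Math.ceil(size / 8));
  }

  // FNV-1a style hash, varied per hash function via a seed.
  private index(item: string, seed: number): number {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.size;
  }

  add(item: string): void {
    for (let s = 0; s < this.hashes; s++) {
      const i = this.index(item, s);
      this.bits[i >> 3] |= 1 << (i & 7);
    }
  }

  // false => definitely not in the set; true => possibly in the set.
  mightContain(item: string): boolean {
    for (let s = 0; s < this.hashes; s++) {
      const i = this.index(item, s);
      if ((this.bits[i >> 3] & (1 << (i & 7))) === 0) return false;
    }
    return true;
  }
}

// A client would only go to the network when the filter says "maybe".
function needsOcspCheck(filter: BloomFilter, serialHex: string): boolean {
  return filter.mightContain(serialHex);
}
```

From the responder's point of view, the interesting part is the jump: a certificate that never generated OCSP traffic can start generating it from a very large number of devices at once.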

But what do CAs have to do to serve this information to clients that rely on their certs?

CRLs

CRLs, or Certificate Revocation Lists, are, as the name implies, lists of all the revoked certificates. Every time a certificate is revoked, a CA can add it to the list, sign the list with its private key, and publish it somewhere. Then, clients can retrieve this information and check if the certificate they care about is in there.

As CRLs are basically static files, it is very easy for a CA to serve them at a large scale. Just PUT the file to Google Cloud Storage, Amazon S3, or your favorite CDN, and keep paying the bill at the end of the month…

Of course, there are still problems there and tradeoffs to be made (such as when you generate the file, if you do it after each revocation or periodically, etc.) but after you have your file, it’s relatively straightforward.

Just make sure that you’re familiar with your provider’s caching policy, as it could put you out of compliance.

Even in cases where a Cloud provider or CDN must/should be avoided, serving a single static file is doable, you just have to solve more problems such as cache invalidation, global propagation of your file, etc.

OCSP

OCSP, on the other hand, is a request-response protocol. The client sends a specially crafted request and asks about a certificate it wants to know more about, and then the server checks its database, retrieves the status, and provides an appropriate response.

We can’t have static file hosting here anymore… :(

We need a server that can parse the request, and then make the right decisions and return a valid response.

To make matters worse, the protocol is over plaintext HTTP (to avoid chicken and egg problems), and a security feature was added, where the client can send a random value (nonce) inside the request, and the server should provide a signed response containing this nonce.

Online Signing

That means CAs need to sign their responses for every single request. This is very difficult to achieve at very high rates, as the number of RSA or ECDSA signatures is astronomical. Moreover, you now have to move private keys around and ensure they are available on all your machines, which must be pre-provisioned with adequate capacity to handle spikes. And you have to protect these keys very well, as leaking them will give full access to someone on the CA. You can’t outsource these keys to a CDN or a Cloud…

On top of that, most CAs store their private keys in Hardware Security Modules (HSMs), which are extremely expensive devices, have very limited abilities in terms of copies, exporting and importing, and are far slower than a server in signatures per second. For some CAs they even need to be physically audited at least once a year, so a global deployment isn’t practical.

A solution to that is something called OCSP Delegated Responders, but they’re bad, and come with their own sets of problems, so they won’t be covered here. Most of the limitations though still apply, and they have led to incidents for WebPKI CAs (especially when mixing TLS & S/MIME ;).

How can we solve this problem then? The same way we solve many issues on the Internet: deviate from the standard, break compatibility, and force people to adapt… It turns out that this nonce wasn’t a great idea, so let’s collectively agree to ignore it!

Offline Signing

In offline signing, the CA periodically signs one (or more) OCSP responses for every single certificate, and then stores the resulting bytes in a database. When a client makes a request, it finds the correct entry, and sends a properly formatted response back, using this data. The nonce field is just empty. Although this would normally break, clients now are used to this practice and will ignore (but may warn about) the lack of it.

An immediate scalability issue here is that every 8, 12, or however many hours, you need to re-sign a response for every active certificate. Let’s Encrypt, the #1 WebPKI CA in terms of active certificates, claims to have ~300M of them. This is about the same as my colleague Ryan Hurst reports. They need hundreds of millions of signatures per day, and probably 99% of these files will never be requested… In any case, that’s the OCSP life, and unless there is a hybrid system that e.g. uses a cache, where certs are signed on the first request, and then that version is served for e.g. 24 hours, there’s not much else that can be done. These hybrid systems are complex to get right, especially for CAs with strict compliance requirements.
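As a back-of-the-envelope check on those numbers, the sketch below converts a certificate count and a refresh interval into a sustained signing rate. The function name `signaturesPerSecond` and the 12 and 24 hour intervals are assumptions picked for illustration; the 300M figure is the one quoted above.

```typescript
// Rough signing load for a purely offline-signed responder.
// Inputs are illustrative assumptions, not measured values.
function signaturesPerSecond(activeCerts: number, refreshHours: number): number {
  const signaturesPerDay = activeCerts * (24 / refreshHours);
  return signaturesPerDay / 86400; // seconds in a day
}

const perSecond12h = signaturesPerSecond(300e6, 12); // re-sign twice a day
const perSecond24h = signaturesPerSecond(300e6, 24); // re-sign once a day

console.log(Math.round(perSecond12h), Math.round(perSecond24h));
```

Even at one refresh per day this is a few thousand signatures per second, sustained around the clock, which helps explain why HSM throughput becomes a concern.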

In a purely offline-signed world however, we still need a server that will receive a request, parse it, fetch the appropriate response, and send it back to the client. Ideally, this needs to happen extremely quickly, to make sure that the latency is minimal.

There have been some production deployments, and there is RFC 5019, but it’s far from a CRL-like problem.

The managed solutions here are clearly lacking. I don’t really know much, but I think they are extremely expensive, from a couple of CDNs only, and not easy to work with. I guess this makes sense, as the market is in the low tens of customers and the compliance requirements are constantly changing and can cause serious damage to the CA if not met.

Cloudflare Workers

Cloudflare Workers is a product that allows users to write small pieces of code, and then run them at Cloudflare’s edge, in over 275 cities. I am not sponsored by Cloudflare, but I saw it launch many years ago, in 2017, and I always wanted to experiment with it. Finally, after 5 years, I found the time to do that! It wasn’t that I was busy, it was that I wanted it to mature first… :P This code was written in early 2022, and it took me about 6 more months to begin writing this post…

So that was the plan, write an OCSP Responder in Cloudflare Workers, and then deploy it globally. If everything goes to plan, a CA can address its OCSP needs with just $0.15 / million requests. As the code runs within low tens of ms of each user, it should also provide very good latency. Moreover, it can scale to very high numbers of requests per second.

The Language

Unfortunately Workers only support JavaScript / TypeScript as a first class citizen, and not Go. As the last time I wrote JS was in high school, using jQuery, and I don’t really know modern JS & tools, this turned a 5-10 minute job into a multi-hour effort. I could probably try WASM, but it could end up as a multi-day effort, so I went for straight TypeScript and scope reduction.

One of the problems I first ran into was the lack of an amazing standard library. Unfortunately only Go has this privilege… As OCSP needs ASN.1 and DER, and it couldn’t use JSON like a normal HTTP service, I noticed the lack of crypto immediately.

Luckily, Peculiar Ventures had a great package that provided the OCSP ASN.1 Schema in TypeScript, and combined with their ASN.1 Parser it wasn’t that difficult to get past understanding the request.

Architecture Design

For the design I decided to go with a single Worker that matches all paths and handles all routes. Obviously it’s trivial to run multiple copies under the same domain, so that doesn’t make any difference.

For the database I went with Workers KV, as a Key Value store was all that was needed. The data is not at the “Edge” but in some key central datacenters (whatever that may mean – that’s the problem with managed solutions), however you can specify a “cache TTL” where reads will be cached at the “Edge” for e.g. 3600 seconds. I wasn’t planning on using this in production, so I didn’t spend too much time reading and investigating. In the best case scenario I guess you’re going to get the value from KV in low single-digit ms, in the worst case I can’t tell, as Cloudflare doesn’t have an (extensive) backbone and most regions would have to talk to the “storage” nodes over the Internet probably. And then it’s cached if that’s desired. This adds $0.50 / million reads plus $5 / million writes, plus $0.50 / GB-month but it is probably still a few orders of magnitude cheaper than a commercial service.
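A minimal sketch of what such a cached read could look like. The `type: "arrayBuffer"` and `cacheTtl` options are part of the Workers KV `get()` API; everything else (the `KVLike` interface, the `lookupResponse` helper, and the 3600 second value) is an assumption for illustration. Reading as an `ArrayBuffer` matters if the stored value is raw DER, since the default text read would interpret the bytes as UTF-8.

```typescript
// Minimal shape of the KV binding used in this sketch (an assumption;
// a real project would take the type from @cloudflare/workers-types).
interface KVLike {
  get(
    key: string,
    options: { type: "arrayBuffer"; cacheTtl?: number }
  ): Promise<ArrayBuffer | null>;
}

// Illustrative value, matching the "e.g. 3600 seconds" mentioned above.
const EDGE_CACHE_TTL_SECONDS = 3600;

// Returns the stored response bytes, or null if the key does not exist.
async function lookupResponse(kv: KVLike, key: string): Promise<ArrayBuffer | null> {
  // "arrayBuffer" keeps binary data intact; cacheTtl asks the edge
  // location to keep serving this read from its local cache.
  return kv.get(key, { type: "arrayBuffer", cacheTtl: EDGE_CACHE_TTL_SECONDS });
}
```

The tradeoff is freshness: with a long `cacheTtl`, a newly pushed response (for example after a revocation) may not be visible at a given edge location until the cached copy expires.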

A request should arrive at Cloudflare, the Workers code should run, fetch the data from KV (either locally or remotely), and then respond. No backend necessary.

A cron job (or equivalent) runs on the CA systems to periodically sign and push to Workers KV the signed responses, so they are always there, for all certificates, and are never expired.
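For the CA side, one option is Cloudflare's REST API for Workers KV. The sketch below only builds the request for a single key and does not send anything. The helper `buildKvWrite`, its parameters, and the `KvWriteRequest` type are hypothetical, and the endpoint path and the `expiration_ttl` query parameter reflect my understanding of the API, so check them against the current documentation before relying on them.

```typescript
// Hypothetical helper: describes the HTTP request a CA-side job could send
// to store one pre-signed OCSP response in Workers KV. Nothing is sent here.
interface KvWriteRequest {
  url: string;
  method: "PUT";
  headers: Record<string, string>;
  body: Uint8Array;
}

function buildKvWrite(
  accountId: string,
  namespaceId: string,
  apiToken: string,
  key: string, // e.g. the Base64 of the serial number
  derResponse: Uint8Array, // the signed OCSP response bytes
  ttlSeconds?: number // optional: let KV drop the entry after this long
): KvWriteRequest {
  const base = "https://api.cloudflare.com/client/v4";
  // Base64 keys can contain "/", "+" and "=", so they must be percent-encoded.
  const path =
    `/accounts/${accountId}/storage/kv/namespaces/${namespaceId}` +
    `/values/${encodeURIComponent(key)}`;
  const query = ttlSeconds === undefined ? "" : `?expiration_ttl=${ttlSeconds}`;
  return {
    url: base + path + query,
    method: "PUT",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/octet-stream",
    },
    body: derResponse,
  };
}
```

Setting an expiry close to the response's Next Update would keep stale responses from being served if the job stops running, at the cost of the responder answering Unauthorized instead. Whether that is the better failure mode is a policy decision.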

Inspiration

Generally OCSP servers that openssl can talk to aren’t too complicated with the right libraries like the ones above, but there are several compliance requirements most CAs have to follow on top of that. Just because a client can talk to it, it doesn’t mean it’s fine. And then there’s also the long tail of misbehaving, outdated, or poorly written clients that won’t comply with any standard anyway.

As I was just testing Workers, I decided to make extensive support out of scope, and I also decided to base my implementation on Let’s Encrypt’s. It is part of their Boulder software, and it can be found here. It’s only a few lines if you ignore all the tracing, logging, metrics, etc. so a port to TypeScript shouldn’t be difficult.

Coding

I set up a nodejs / npm development environment and after getting more than 300 Debian packages installed I began. I used the wrangler CLI tool to create a new TypeScript app that was empty. Thankfully Visual Studio Code autocompletion worked and TypeScript errors were catching many things / providing searchable messages.

Dependencies

I ended up needing mostly Peculiar Ventures’ libraries:

import { AsnParser } from "@peculiar/asn1-schema";
import { OCSPRequest } from "@peculiar/asn1-ocsp";
import { OCSPResponse } from "@peculiar/asn1-ocsp";
import { BufferSource, Convert } from "pvtsutils";
import { decode } from "base64-arraybuffer";

The decode one was added so I don’t have to spend any more time than I had to dealing with *Script but I would probably avoid it in a production setup. It’s my left-pad to feel like a true {J,T}S developer ;)

Getting the request

The first thing that needed to be done was to get the OCSP request. It’s not using common / familiar HTTP schemas, but instead it’s a binary ASN.1 blob. For extra fun, the request can come either via GET or POST, and if it comes via GET it is added to the path as a Base64-encoded blob. Clients today may send either, so both had to be supported.

I quickly added a switch statement, and began to populate the various cases:

// Handle both GET & POST
var reqb:ArrayBuffer;
switch (request.method) {
	case "GET":
		let req:string;
		try {
			// decodeURI() leaves %2B, %2F and %3D encoded; decodeURIComponent() decodes them
			req = decodeURIComponent(new URL(request.url).pathname);
		} catch (e) {
			return new Response(null, {status: 400});
		}

		// Remove / from path
		req = req.substring(1);

		// https://github.com/letsencrypt/boulder/blob/659d21cc871ab1e53e3c26349017499dd611db64/ocsp/responder/responder.go#L233
		// Strings are immutable in JS, so build a new string instead of assigning by index
		req = req.replace(/ /g, "+");

		// https://github.com/letsencrypt/boulder/blob/659d21cc871ab1e53e3c26349017499dd611db64/ocsp/responder/responder.go#L243
		if (req.length > 0 && req[0] == '/') {
			req = req.substring(1);
		}

		// Base64 decode
		try {
			reqb = decode(req);
		} catch (e) {
			return new Response(null, {status: 400});
		}

	break;

	case "POST":
		reqb = await request.arrayBuffer();
	break;

	default:
		return new Response(null, {status: 405});
}

For GET requests, I URL-decode the path, remove the first /, and then replace spaces with +. Let’s Encrypt explains why this is done in a comment, I didn’t know if it applied to TypeScript too, but I did it anyways as it can’t hurt.

I then remove the first / if it’s there. Let’s Encrypt explains that there are clients that naively add a second / with code probably looking like OCSP_SERVER + "/" + OCSP_REQUEST where OCSP_SERVER ends in a /. Note that / is a valid Base64 character, and also a valid URL character. I don’t know why they don’t do this before the URL decode, but their code has worked in production for years, so I’m not going to spend too much time on that.
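The path handling described above can be pulled into a small, testable function. This is a sketch with a hypothetical name (`base64FromOcspPath`). It uses `decodeURIComponent`, which also decodes `%2B`, `%2F` and `%3D`, the escapes a client produces when it URL-encodes Base64.

```typescript
// Hypothetical helper: turns a URL pathname such as "/MEIw...%3D" into the
// Base64 text of the OCSP request, or null if the percent-encoding is invalid.
function base64FromOcspPath(pathname: string): string | null {
  let req: string;
  try {
    req = decodeURIComponent(pathname);
  } catch {
    return null; // malformed percent-encoding
  }

  // Drop the leading "/" that every pathname starts with.
  req = req.substring(1);

  // Some stacks turn "+" into a space while decoding; undo that.
  req = req.replace(/ /g, "+");

  // Tolerate clients that join the responder URL and the request with an extra "/".
  if (req.startsWith("/")) {
    req = req.substring(1);
  }
  return req;
}
```

One known ambiguity remains: since / is a valid Base64 character, a request that legitimately starts with / and arrives without a doubled slash would lose its first character. The sketch keeps the same tradeoff as the steps described above.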

I then Base64-decode the request and store it in an array. The whole ArrayBuffer / Uint8Array / string / atob / btoa thing probably took me a good hour to figure out and get correct results and not misinterpret bytes as UTF-8 characters by accident…
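For reference, a dependency-free way to move between Base64 text and raw bytes with `atob` / `btoa` is sketched below. The helper names are hypothetical. The key point is that `atob` returns a "binary string" in which every character code is one byte, so it has to be copied out with `charCodeAt` and never passed through a UTF-8 decoder.

```typescript
// Base64 text -> raw bytes. Throws if the input is not valid Base64.
function base64ToBytes(b64: string): Uint8Array {
  const binary = atob(b64); // each character code is one byte (0..255)
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Raw bytes -> Base64 text.
function bytesToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```

Building the binary string in a loop, instead of spreading the whole array into `String.fromCharCode(...)`, also avoids hitting the argument count limit on large inputs.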

Parsing the request

After I have the OCSP request bytes, I need to parse the data to make sense of it. Let’s Encrypt is lucky to have Go, but with TypeScript it looks like this:

// Parse OCSP Request
try {
	var oreq = AsnParser.parse(reqb, OCSPRequest);
} catch (e) {
	console.log("can't parse ocsp request");
	// https://cs.opensource.google/go/x/crypto/+/refs/tags/v0.2.0:ocsp/ocsp.go;l=396
	return new Response("\x30\x03\x0A\x01\x01", {status: 400, headers: {"Content-Type": "application/ocsp-response"}});
}

OCSP has its own Content Types for requests and responses, but I guess that Let’s Encrypt prefers to parse anything. Either they wanted a broad ability, or, most likely, there are many clients out there that don’t set this correctly. I didn’t check for that either.

From now on, there is reasonable evidence that this client is an OCSP-aware device, so every response must follow the OCSP protocol. Much like the request, the response is also raw ASN.1 bytes. I looked at Go’s code and it seems that they just have constants for all OCSP responses that don’t have any specific fields, so I just copied them over to my code and I send them as raw bytes. From now on, the Content-Type header must be set.
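Those constant error responses are tiny: a DER SEQUENCE containing a single ENUMERATED status, with no response bytes. A sketch of generating them from the status code, using the values RFC 6960 defines for OCSPResponseStatus (the names `OcspErrorStatus` and `ocspErrorBody` are my own):

```typescript
// OCSPResponseStatus values (RFC 6960) for responses without responseBytes.
const OcspErrorStatus = {
  malformedRequest: 1,
  internalError: 2,
  tryLater: 3,
  sigRequired: 5,
  unauthorized: 6,
} as const;

type OcspErrorCode = (typeof OcspErrorStatus)[keyof typeof OcspErrorStatus];

// Builds the 5-byte DER encoding of an error-only OCSPResponse.
function ocspErrorBody(status: OcspErrorCode): Uint8Array {
  // 0x30 0x03 = SEQUENCE of length 3
  // 0x0A 0x01 = ENUMERATED of length 1, followed by the status value
  return new Uint8Array([0x30, 0x03, 0x0a, 0x01, status]);
}
```

In a Worker, the result could be returned with something like `new Response(ocspErrorBody(OcspErrorStatus.malformedRequest), { headers: { "Content-Type": "application/ocsp-response" } })`.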

Extracting the certificate information

The OCSP protocol defines an array of certificates as a potential input, and the OCSP responder must provide an answer for all of them. However, this is probably not used by anything in practice. I looked at Go’s code, and there’s a comment that only single-cert requests are parsed, and anything else returns an error. Since Let’s Encrypt, a publicly trusted CA, didn’t have problems with that all this time, it’s good enough for me. Normally I’d have to check if the issuer is valid and use all the fields, but for this Proof of Concept just getting the (clearly unique, right?) Serial Number of the certificate is fine. Interestingly, I think Let’s Encrypt doesn’t check either.

// Get the Serial Number of the first cert
try {
	var sn = oreq.tbsRequest.requestList[0].reqCert.serialNumber;
} catch (e) {
	// https://cs.opensource.google/go/x/crypto/+/refs/tags/v0.2.0:ocsp/ocsp.go;l=397
	return new Response("\x30\x03\x0A\x01\x02", {status: 500, headers: {"Content-Type": "application/ocsp-response"}});
}

The serial number is used as the primary key in Workers KV, to perform all lookups.

Database lookup

As I didn’t want to have to troubleshoot this too, I went with the Base64 of the serial number, and not the raw bytes (or hex encoding).

// Base64-Encode the Serial Number to use as key lookup
let esn = btoa(String.fromCharCode(...new Uint8Array(sn)));

// Check for OCSP Response in KV
var rsp = await env.OCSP.get(esn);

The code inside btoa() is clearly mine, and not taken from anywhere… Thanks JS!

This will perform a lookup for e.g. AMTVDRH0Mv7PEa0iPrEUSeI= (which translates to 00c4 d50d 11f4 32fe cf11 ad22 3eb1 1449 e2).

Something that could potentially go wrong here is serial numbers that start with one or more 0 bits. You need to make sure that they are always converted properly. I think that since the ASN.1 parser returns a byte array, and not a number, I am probably fine.
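The same care is needed on the side that writes the keys. Below is a sketch of deriving the key from a hex serial number, assuming the key is the Base64 of the DER INTEGER content bytes. That assumption matches the example above, where a serial starting with C4 gets a leading 00 byte because its top bit is set. The function name `serialHexToKvKey` is hypothetical.

```typescript
// Hypothetical helper for the CA-side job: converts a hex serial number
// (as printed by e.g. openssl) into the Base64 key used for lookups.
// Assumes keys are the Base64 of the DER INTEGER content bytes.
function serialHexToKvKey(serialHex: string): string {
  // Drop separators such as ":" or spaces, then make the length even.
  let hex = serialHex.replace(/[^0-9a-fA-F]/g, "");
  if (hex.length === 0) {
    throw new Error("empty serial number");
  }
  if (hex.length % 2 === 1) {
    hex = "0" + hex;
  }

  const bytes: number[] = [];
  for (let i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.substring(i, i + 2), 16));
  }

  // DER uses the minimal encoding: strip redundant leading zero bytes...
  while (bytes.length > 1 && bytes[0] === 0) {
    bytes.shift();
  }
  // ...then add exactly one 0x00 if the top bit is set, so the INTEGER stays positive.
  if ((bytes[0] & 0x80) !== 0) {
    bytes.unshift(0);
  }

  return btoa(String.fromCharCode(...bytes));
}

console.log(serialHexToKvKey("C4D50D11F432FECF11AD223EB11449E2"));
// prints "AMTVDRH0Mv7PEa0iPrEUSeI="
```

Normalizing on both sides means that a serial written with or without its leading 00 still maps to the same key.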

Sending a response

Now Let’s Encrypt here does many things depending on the response it got from its database, but they can afford to do this due to their superpower, Go stdlib. Since my ASN.1 parser simply populates values of a struct, and doesn’t really understand them very well, nor can validate them, I decided to skip all of that. If there are bytes in the database, I serve them. If not, I will respond with Unauthorized. OCSP Status Codes are really messed up and their meaning has changed massively over time, but in a way, this seems like the best fit.

// Respond with Unauthorized if not found
if (rsp === null) {
	// https://cs.opensource.google/go/x/crypto/+/refs/tags/v0.2.0:ocsp/ocsp.go;l=400
	return new Response("\x30\x03\x0A\x01\x06", {headers: {"Content-Type": "application/ocsp-response"}});
}

Moreover, they are very careful with their HTTP headers, especially for caching, and it looks like they use reverse proxies in front and have special headers for them too. I won’t bother with that :)
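If you did want to bother, RFC 5019 describes HTTP caching headers derived from the response's This Update and Next Update times. A minimal sketch follows. The `cacheHeaders` function is hypothetical, and it assumes the two timestamps are available to the Worker (for example stored next to the response bytes), because the code in this post never parses the response itself.

```typescript
// Hypothetical helper: derives caching headers from the validity window of
// a pre-signed OCSP response, loosely following RFC 5019's guidance.
function cacheHeaders(thisUpdate: Date, nextUpdate: Date, now: Date): Record<string, string> {
  // Never tell caches to keep a response past its Next Update time.
  const maxAge = Math.max(0, Math.floor((nextUpdate.getTime() - now.getTime()) / 1000));
  return {
    "Content-Type": "application/ocsp-response",
    "Last-Modified": thisUpdate.toUTCString(),
    "Expires": nextUpdate.toUTCString(),
    "Cache-Control": `max-age=${maxAge}, public, no-transform, must-revalidate`,
  };
}
```

A CA that refreshes responses well before Next Update would likely want a shorter max-age than this, so that caches pick up the newer response sooner.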

Testing

That’s basically it. The code is not production-grade, I did my best with error checking, and it would probably require a few hundred extra lines to be trusted for a production workload, but the MVP is ready to be deployed.

After successfully deploying to a dev URL, I tried various OCSP clients against it, e.g. openssl, and even sent manual requests using curl, and all the tools were able to properly understand what was happening. I did not encounter any errors.

In order to test using openssl, you can use the following command:

openssl ocsp -issuer chain.pem -cert certificate.pem -text -url "http://193.5.16.90:8787/"

Obviously, the appropriate .pem files must exist, and you need to replace the URL with the one you are serving from. If all goes well, you’ll see something like this:

OCSP Request Data:
    Version: 1 (0x0)
    Requestor List:
        Certificate ID:
          Hash Algorithm: sha1
          Issuer Name Hash: EC4A2797F8915935139678B3E8C8A21D097B312E
          Issuer Key Hash: D5FC9E0DDF1ECADD0897976E2BC55FC52BF5ECB8
          Serial Number: C4D50D11F432FECF11AD223EB11449E2
    Request Extensions:
        OCSP Nonce:
            0410AA3265E7B68DF5B8BF1887FB96ED8F41
OCSP Response Data:
    OCSP Response Status: successful (0x0)
    Response Type: Basic OCSP Response
    Version: 1 (0x0)
    Responder Id: D5FC9E0DDF1ECADD0897976E2BC55FC52BF5ECB8
    Produced At: Nov 15 05:31:18 2022 GMT
    Responses:
    Certificate ID:
      Hash Algorithm: sha1
      Issuer Name Hash: EC4A2797F8915935139678B3E8C8A21D097B312E
      Issuer Key Hash: D5FC9E0DDF1ECADD0897976E2BC55FC52BF5ECB8
      Serial Number: C4D50D11F432FECF11AD223EB11449E2
    Cert Status: good
    This Update: Nov 15 05:31:18 2022 GMT
    Next Update: Nov 22 04:31:17 2022 GMT

    Signature Algorithm: sha256WithRSAEncryption
         99:0b:96:e3:96:39:2c:20:2a:4a:13:83:47:8e:dc:35:c1:98:
         ba:e1:1e:e0:e1:78:94:6a:b6:6f:76:68:71:68:9f:97:1d:23:
         10:30:d4:14:70:00:af:91:8f:4c:cf:4f:c9:3d:e0:09:35:e2:
         d5:5f:96:6b:8d:3e:b9:15:da:51:e1:5d:bd:d2:b7:a7:e3:90:
         fa:8a:fe:2d:22:81:c3:88:c5:89:6f:dd:6d:61:8f:c1:e2:69:
         ac:18:dd:90:87:0c:b0:58:df:1b:66:f0:2f:24:34:13:2f:73:
         06:27:94:ad:0f:58:53:8d:91:bb:4f:67:c0:c8:16:20:4d:11:
         1f:b7:fd:cd:88:4b:3d:7e:42:e5:5c:ec:b5:e8:f0:b1:3b:59:
         9b:ca:f9:4c:93:d7:46:67:d7:20:1f:44:92:ff:30:33:8e:b3:
         85:9d:ba:04:82:99:7b:14:75:2e:d2:d2:07:9d:d1:47:36:4c:
         15:58:7d:d0:3c:78:46:72:21:9d:f6:18:c2:8c:20:6c:ca:e3:
         27:b5:5f:8f:a7:e2:0e:8d:9c:fc:ba:f5:45:7d:09:13:5b:a2:
         c6:5b:8c:59:ba:0a:7c:bd:e9:82:59:cb:a1:92:a0:33:d7:91:
         70:53:9d:ea:ac:67:b8:30:c9:54:be:f1:c3:d2:13:d0:a4:c4:
         5c:e5:78:cb
WARNING: no nonce in response
Response verify OK
certificate.pem: good
	This Update: Nov 15 05:31:18 2022 GMT
	Next Update: Nov 22 04:31:17 2022 GMT

For manual checks, e.g. with curl, openssl supports -reqout req.dat and -respout resp.dat and it will save the raw request and response bytes to these two files. Then, they can be sent to the server:

curl -X POST -H "Content-Type: application/ocsp-request" --data-binary @req.dat "http://193.5.16.90:8787/"

You can even easily dissect them online with this great tool.

Benchmarks

There’s no point in running any benchmarks I think, as basically a solution like that relies on Cloudflare 100% for the scale. It’s a wrapper around their KV product, and probably most of the execution is spent waiting for the data to be fetched. It’s also difficult to benchmark “edge” products as if you run a load test from a single location you may see far lower throughput than if there are requests balanced globally.

Pricing

Finally, let’s see whether or not such a setup makes sense. Going with a commercial service helps, as others run it for you, with experience in operating the software as well as knowing that it works for possibly some of the largest CAs.

If we exclude any R&D costs, or delays, or risk, does it make financial sense to run an OCSP Responder on Cloudflare Workers?

I’m going to use the example of Let’s Encrypt. I don’t know about their order of magnitude for many of these parameters, and trying to figure it out is unlikely to change the final result by a lot. They may sound like a worst case scenario, due to them issuing most of the TLS certificates in the world, but I think it makes more sense to use them, as any fixed costs regardless of size are hopefully going to be negligible.

Moreover, I will use Cloudflare’s List Prices. They have been very generous with fully paid sponsorships for many projects, e.g. Have I Been Pwned? so probably someone like Let’s Encrypt would get at least a discount, but let’s look at a worst case scenario here. The usage is also significant and would probably qualify for “Contact us” pricing tiers.

Requests

Let’s assume a monthly average rate of 10,000 OCSP requests per second. This is about 26 billion requests per month. At $0.15 / 1M it can be around $4k per month. These are requests that reach a Worker; GET requests that may be cached for free wouldn’t really be counted here.

Reads

We then perform (in theory) also 26 billion reads from Workers KV. This is another $13k per month. I don’t know if “edge cache” requests (within e.g. 3600 seconds) also count towards reads, or if this is just from the storage nodes, but there is potential here for significant savings. Perhaps it’s worth a look into Workers and their lifetime (they have 128 MB of RAM that can cache a lot if they survive over multiple requests). An LRU / LFU cache could probably drop the reads by a factor of 10…
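A minimal sketch of such an in-memory LRU cache is shown below. It relies on `Map` preserving insertion order. The `LruCache` class is hypothetical, and the factor of 10 above is a guess that this code does nothing to verify.

```typescript
// Minimal LRU cache built on Map's insertion-order guarantee.
// The least recently used entry is always the first key in the map.
class LruCache<V> {
  private readonly entries = new Map<string, V>();

  constructor(private readonly capacity: number) {}

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // Evict the least recently used entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }
}
```

In a Worker this would live in module scope, checked before the KV read and filled after it. Two caveats: the runtime can discard that memory at any time, and a real version would also need to expire entries so that a response is never served past its Next Update.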

Storage

For stored data, let’s assume 300M OCSP entries, at, say, 1 KB each. That’s 300 GB / month, or $150.

Updates

Writes are a bit trickier (as some updates will do writes, and some deletes) but since they’re priced the same it doesn’t matter. Let’s assume 300M updates per day, or 9B updates per month. That’s the highest number yet, at $45k.

Worth it?

That’s a total of over $60k a month… With Cloudflare’s S3 service (R2), the number could go down to $53k. It’s significant, but that’s for over half the world’s certificates. Probably the costs for a normal CA (e.g. 1M certs) would be in the low hundreds of dollars per month. As you can see, most of the cost comes from the number of updates per day, which is directly correlated to the number of active certs, and requests per second come second…
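The arithmetic behind that total, as a small sketch so the assumptions are easy to change. The prices are the list prices quoted in this post and the load figures are the guesses from the sections above. The `monthlyCost` function and its field names are hypothetical.

```typescript
// List prices quoted in this post (USD).
const PRICE = {
  requestPerMillion: 0.15,
  readPerMillion: 0.5,
  writePerMillion: 5,
  storagePerGbMonth: 0.5,
};

interface Load {
  requestsPerSecond: number; // monthly average
  activeCerts: number;
  responseKb: number; // size of one stored response
  updatesPerCertPerDay: number;
}

const DAYS_PER_MONTH = 30;

function monthlyCost(load: Load) {
  const requestsM = (load.requestsPerSecond * 86400 * DAYS_PER_MONTH) / 1e6;
  const writesM = (load.activeCerts * load.updatesPerCertPerDay * DAYS_PER_MONTH) / 1e6;
  const storageGb = (load.activeCerts * load.responseKb) / 1e6;

  const requests = requestsM * PRICE.requestPerMillion;
  const reads = requestsM * PRICE.readPerMillion; // one KV read per request
  const writes = writesM * PRICE.writePerMillion;
  const storage = storageGb * PRICE.storagePerGbMonth;
  return { requests, reads, writes, storage, total: requests + reads + writes + storage };
}

const estimate = monthlyCost({
  requestsPerSecond: 10000,
  activeCerts: 300e6,
  responseKb: 1,
  updatesPerCertPerDay: 1,
});
console.log(Math.round(estimate.total)); // prints 61998
```

Plugging in a smaller certificate count and request rate shows how quickly the write cost shrinks, since it scales with the number of active certificates.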

My main concern is that all of that is for OCSP, a protocol that has problems, is rarely used, is difficult to maintain, and has troubled many engineers over the past years. With windows of vulnerability frequently exceeding a week, privacy leaks, and a bad design that we’ve abused for modern day use, I just don’t see it as good value for money… Short-lived certificates will hopefully make it less and less relevant, and hopefully, one day, we’ll see it removed from the requirements for publicly trusted CAs.

Until then, happy cache invalidation! ;)

PS

One could easily adapt the code above to build an OCSP-aware reverse caching proxy. It could use Workers KV for the global cache, and then fetch everything missing from a normal OCSP backend, by sitting in front. However, if this could be done in-memory, and even if we assume 1 KB per response, it would be easy to write a responder with e.g. 4 GB of RAM and the ability to cache the most common 4 million certs. Workers, like many other FaaS products, is designed to not be running constantly however, and this makes it more difficult to be efficient without KV. Also, the global nature of it would be working against it, as the cache would need to warm up in N locations x M servers.

I must also say that writing PKI / X.509 / Cryptographic code in TypeScript was not a great feeling. Maybe it’s that I’m not used to this, but the lack of Go’s library, type system, interfaces, etc. was clearly visible. I purposefully avoided dealing with that as much as I could, because it just didn’t feel okay. A properly written OCSP server however is likely not going to suffer as much from this.