shred / acme4j

Java client for ACME (Let's Encrypt)

Home Page: https://acme4j.shredzone.org

Handle /directory cache errors gracefully

m-vojvodic opened this issue

Hello,

I am using this library to interface with Let's Encrypt. Let's Encrypt recently had a maintenance event that caused a short planned outage. For an hour following the maintenance, I noticed the following NullPointerExceptions in my application:

java.lang.NullPointerException: null
  at org.shredzone.acme4j.Session.resourceUrl(Session.java:234)
  at org.shredzone.acme4j.OrderBuilder.create(OrderBuilder.java:316)

The errors stopped 1 hour after the initial occurrence. Looking at the suspect code from the stack trace, it looks like the client may cache the /directory endpoint (even if it errors with 500) for about 1 hour:

/**
 * Reads the provider's directory, then rebuild the resource map. The response is
 * cached.
 */
private void readDirectory() throws AcmeException {
    synchronized (this) {
        Instant now = Instant.now();
        if (directoryCacheExpiry != null && directoryCacheExpiry.isAfter(now)) {
            return;
        }
        directoryCacheExpiry = now.plus(Duration.ofHours(1));
    }
    JSON directoryJson = provider().directory(this, getServerUri());
    Value meta = directoryJson.get("meta");
    if (meta.isPresent()) {
        metadata.set(new Metadata(meta.asObject()));
    } else {
        metadata.set(new Metadata(JSON.empty()));
    }
    Map<Resource, URL> map = new EnumMap<>(Resource.class);
    for (Resource res : Resource.values()) {
        directoryJson.get(res.path())
                .map(Value::asURL)
                .ifPresent(url -> map.put(res, url));
    }
    resourceMap.set(map);
}

In my particular case, this extended the brief outage from my service provider to an hour-long outage. The client should more gracefully handle outages or errors to ensure that a short maintenance is not prolonged unnecessarily.
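To illustrate the point: in the quoted method, directoryCacheExpiry is set before the directory is fetched, so a failed fetch still counts as "cached" for the next hour. Below is a minimal sketch of one way to avoid that, by recording the expiry only after a successful fetch. The class name DirectoryCache and the fetcher parameter are hypothetical stand-ins, not part of acme4j, and this is not necessarily how the library ends up fixing it.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

/** Hypothetical cache that records an expiry only after a successful fetch. */
public class DirectoryCache<T> {
    private final Callable<T> fetcher; // stands in for provider().directory(...)
    private final Duration ttl;
    private T cached;
    private Instant expiry;

    public DirectoryCache(Callable<T> fetcher, Duration ttl) {
        this.fetcher = fetcher;
        this.ttl = ttl;
    }

    public synchronized T get() throws Exception {
        Instant now = Instant.now();
        if (cached != null && expiry != null && expiry.isAfter(now)) {
            return cached; // still fresh, no network access needed
        }
        // May throw. If it does, cached and expiry stay untouched,
        // so the next call tries the fetch again instead of waiting out the TTL.
        T fresh = fetcher.call();
        cached = fresh;
        expiry = now.plus(ttl);
        return fresh;
    }
}
```

With this ordering, a call that fails during an outage is simply retried on the next use. The sketch holds the lock during the fetch for simplicity; the quoted code releases it first, which is a trade-off between blocking concurrent callers and allowing duplicate fetches.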

Thanks!

Yes, an error state should not be cached. I will change that. Thank you for the report!

Thank you @shred ! The library has been great to work with.

I have hopefully solved the issue with commit 6dec97d. acme4j now evaluates HTTP caching headers instead of just caching the directory for 1 hour. I had been planning to make this change for a while, and now was a good moment to actually implement it.

There is a reason why I used a hardcoded caching time. Let's Encrypt explicitly forbids caching the directory via their HTTP headers. The topic has been discussed in an RFC 8555 erratum before, but the erratum was rejected with the consensus that it should be solved at the HTTP level. With the new, RFC-conformant implementation, the directory will now actually be fetched over the network every time it is used, as it is supposed to be. To reduce network traffic, I have opened ticket letsencrypt/boulder#4814 and asked the Boulder team to set headers that permit caching.
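As a rough illustration of what evaluating HTTP caching headers can mean, here is a simplified sketch that derives a freshness lifetime from a Cache-Control header value. The class CacheControl and the method freshness are hypothetical and are not taken from commit 6dec97d. The sketch only looks at no-store, no-cache and max-age, and ignores other inputs a full implementation would consider, such as Expires and Age.

```java
import java.time.Duration;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: how long may a response be reused? */
public class CacheControl {

    /** Returns the freshness lifetime, or empty if the response must not be reused. */
    public static Optional<Duration> freshness(String header) {
        if (header == null) {
            return Optional.empty(); // no header: do not assume cacheability
        }
        Optional<Duration> result = Optional.empty();
        for (String part : header.split(",")) {
            String directive = part.trim().toLowerCase(Locale.ROOT);
            if (directive.equals("no-store") || directive.equals("no-cache")) {
                return Optional.empty(); // caching forbidden, overrides any max-age
            }
            if (directive.startsWith("max-age=")) {
                try {
                    long seconds = Long.parseLong(directive.substring("max-age=".length()));
                    if (seconds > 0) {
                        result = Optional.of(Duration.ofSeconds(seconds));
                    }
                } catch (NumberFormatException e) {
                    // malformed max-age: ignore it and keep the current result
                }
            }
        }
        return result;
    }
}
```

Under these assumptions, a header such as "public, max-age=3600" would allow reuse for one hour, while an illustrative value like "max-age=0, no-cache, no-store" yields an empty result, which matches the behaviour described above: the directory is fetched again on every use.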

The change will be published in the next release v2.10.

I'm closing this bug. Feel free to reopen it if the new implementation does not resolve your issue.

It took much too long, but v2.10 has been released now.