certbot / certbot

Certbot is EFF's tool to obtain certs from Let's Encrypt and (optionally) auto-enable HTTPS on your server. It can also act as a client for any other CA that uses the ACME protocol.

Purge old private key material

schoen opened this issue

Related to #4634.

When we designed the lineage format with /etc/letsencrypt/archive, we thought

  • sysadmins will manually inspect their new certificates before "deploying" them

  • sysadmins might find something to dislike about new certificates and choose not to use them

  • sysadmins might notice something to dislike about new certificates after they're deployed and reverse the deployment by reactivating an older certificate

We've since almost entirely abandoned the first two concepts (for example, the default gap between obtaining and deploying a cert, which I had originally conceived of as one week, has been reduced to zero, and the code paths that obtain certs now normally attempt to install them immediately).

I'm not sure I've heard of any cases where sysadmins have deliberately used the third option, since if they dislike the new cert (which is rather rare), they will probably just go ahead and get an even newer cert.

It's not necessarily inappropriate to have, or have the option of having, some history just in case the third situation actually comes up in an unanticipated circumstance, but there is a security cost to preserving lots of key material indefinitely, specifically in the case where clients negotiate non-PFS ciphersuites with servers, such as TLS_RSA_WITH_AES_128_CBC_SHA256 or TLS_RSA_WITH_AES_128_GCM_SHA256. In the last few years these ciphersuites have been broadly deprecated in favor of (mostly) DHE and ECDHE ciphersuites.

The difference is pretty significant; in DHE and ECDHE, the client and server exchange ephemeral Diffie-Hellman parameters which are discarded after the session negotiation is complete. These parameters are used to derive the session key. It's not thought to be feasible to derive the session key from the ephemeral parameters without solving the discrete logarithm problem (or elliptic curve discrete logarithm problem, in the case of ECDHE) and it's not clear that even powerful attackers will ever succeed when the parameters are chosen appropriately. (Unfortunately they will succeed when the parameters are chosen inappropriately. ☹️)

By contrast, in the non-DH RSA ciphersuites, RSA is used directly for key exchange: the client picks the session key and encrypts it under the server's RSA public key, then transmits this encrypted key over the wire. That means that the server private key that was in use in the course of that key exchange can be used directly to recover the session key; indeed, Wireshark has a feature that does exactly this if you give it a server private key. (The way this breaks with PFS ciphersuites is something that has really annoyed people who are trying to do legitimate consensual network debugging of TLS-wrapped protocols, and I assume also really annoyed wiretappers.)
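As an aside, one way to check whether a particular server still negotiates one of these plain-RSA key-exchange suites is to force it with openssl s_client. A rough sketch (example.com is a placeholder; in OpenSSL's naming, TLS_RSA_WITH_AES_128_GCM_SHA256 is AES128-GCM-SHA256):

# If this handshake succeeds, the server still accepts a non-PFS RSA key exchange.
openssl s_client -connect example.com:443 -tls1_2 -cipher AES128-GCM-SHA256 < /dev/null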

Our /etc/letsencrypt/archive directory is a useful target for anyone who wants to read old non-PFS TLS traffic with a particular server known to be running Certbot. If they can get ahold of its contents, even years later, they can just go back to the appropriate 60-day period, plug the key into Wireshark, and read any non-PFS HTTPS traffic they've intercepted.

The best remedy for this is to get more use of PFS ciphersuites, but I'm concerned that they only represent the majority of TLS traffic today, not the overwhelming majority, and hence there would still be potentially significant value for an attacker to try to compromise a machine to steal its old key material, that we helpfully maintain (in duplicate) in /etc/letsencrypt/archive and /etc/letsencrypt/keys.

I believe that old private keys should be purged, as best we can, on some schedule, in order to reduce the risk to old non-PFS TLS session contents; maybe not immediately after renewal, but after a second successful renewal, for example. It's not perfect, but we can shred -u the key file or something, reducing the chance that its complete contents will be recoverable later from an effective certainty to a low probability.

This could also be a configurable option, which probably should be enabled by default (maybe purge_old_privkeys = True or something).

We don't even need to unlink the files if we don't want to invalidate lineages from the broken symlink point of view. We can use shred instead of shred -u, and then the file still actually exists, but its contents have been overwritten a couple of times.
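For illustration only (this is a sketch of the idea, not something Certbot does today, and the path is a placeholder):

# Overwrite the old key in place but keep the file, so existing symlinks stay valid:
shred /etc/letsencrypt/archive/example.com/privkey1.pem

# Or overwrite and unlink it entirely:
shred -u /etc/letsencrypt/archive/example.com/privkey1.pem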

We have to be very careful about how this interacts with people's cert pinning strategies.

In particular, if people achieve their pinning with --key-path /etc/letsencrypt/archive/example.com/privkey1.pem, their renewals would later fail if Certbot goes and shreds that file. (It would notice the key_path in the renewal configuration file and then try to re-import a copy of the private key that had already been shredded!)

By contrast, @pde mentions that --key-path /etc/letsencrypt/live/example.com/privkey.pem might be safe because it will always point, at a given time, to a copy that still exists and can safely be used.

A possible remedy for this could be to have the --key-path parser give a warning or error if the specified path contains /etc/letsencrypt/archive (or resolves to a path inside it, so that cd /etc/letsencrypt/archive/example.com; certbot --key-path ./privkey1.pem would also be caught!).
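A minimal sketch of such a check in shell terms, rather than Certbot's actual argument handling ($key_path stands for whatever was passed to --key-path; resolving with readlink -f is what catches the relative-path case):

# Hypothetical check: reject a --key-path that resolves into the archive directory.
resolved="$(readlink -f -- "$key_path")"
case "$resolved" in
  /etc/letsencrypt/archive/*)
    echo "error: --key-path must not point into /etc/letsencrypt/archive" >&2
    exit 1
    ;;
esac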

As an update, a colleague gave me an estimate which indicated that over 10% of TLS key exchanges in a particular context are non-PFS and hence would benefit from the eventual destruction of their server private key material. We can look for other views of this which might show higher or lower numbers, but that was based on a pretty big sample so it should be relatively typical of what data sources might show. (Of course it's different from country to country, and from site to site depending on the fraction of mobile users, etc.)

I think this is high enough to justify moving forward with this.

It would be nice if this was at least better documented, so admins who don't need to pin can know to remove these files manually. I just discovered the letsencrypt "archive" of old keying material, which I had assumed was deleted months ago, and came here to file a bug about it. It never occurred to me that after a successful renewal certbot wasn't shredding the old keys.

Another option: on initial enrollment/setup, ask the user if they want to keep an archive of old keys or not.

PFS or not, it is bad form to leave old private keys lying around on disk. I would expect to see the old keys removed after a short period of time (i.e. 3-5 days at most).

One use case for deploy_before_expiry is TLSA rollover, e.g. the following config:

renew_before_expiry = "30 days"
deploy_before_expiry = "20 days"

... and a renew hook which adds the new TLSA record and a deploy hook which removes the old TLSA record.
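As a concrete illustration (not from the original comment, and assuming example.com with the default live path), such a renew hook could compute the new certificate's TLSA 3 1 1 digest roughly like this:

# Digest of the new certificate's public key, for a "3 1 1" TLSA record:
openssl x509 -in /etc/letsencrypt/live/example.com/cert.pem -noout -pubkey \
  | openssl pkey -pubin -outform DER \
  | openssl dgst -sha256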

Regardless of forward secrecy concerns, a program shouldn't fill a directory with old unused files. Sooner or later a limit will be reached – disk sectors, inodes, directory entries or something else. Surely you want Certbot to be useful even on small servers with limited storage?

Expired certificates, and their associated keys and CSRs, are useless as far as I can see, and should be removed.

Is it safe to let a Cron job delete all files under /etc/letsencrypt/{csr,keys,archive} that are older than, say, six months, or would that upset Certbot somehow?


We've made a lot of changes to Certbot since this issue was opened. If you still have this issue with an up-to-date version of Certbot, can you please add a comment letting us know? This helps us to better see what issues are still affecting our users. If there is no activity in the next 30 days, this issue will be automatically closed.

Yes, the old keys are still kept on disk.


As a side note, just because a bug has been left unfixed for four years doesn't mean that users should be asked to re-confirm it every month (or whatever the configuration says) to prevent the bot from closing it. That's sweeping dirt under the rug.

I know having hundreds of open issues feels overwhelming, but I've seen other projects mass-close them, and that was a pretty terrible thing to do considering how much user effort was put into reporting them, compared to the effort needed to close (not to say fix) the issues.

It is certainly not ideal, but the automated closing feature is a compromise between having thousands of issues, many of them duplicates or irrelevant, that would completely overwhelm a dev team with limited capacity, and mass-closing issues without giving users a chance to assert that some issues should ultimately be fixed.

Here the inactivity window for staleness is one year, so you are good for the next 365 days to have the issue live and give it a chance to be handled once we have the capacity to do so :)

Now, to build on the feature itself: I also have an interest in changing the behavior of Certbot regarding old keys and old certificates. It is not directly related to the fact that old material is stored, but to how the associated rotation is implemented, so maybe these thoughts could be integrated into a design proposal.

The problem I have is that Certbot relies on symlinks to connect the up-to-date material in the archive folder to the live folder. These symlinks create problems for the Docker and Windows runtimes. For Docker, they prevent mounting only what matters (the keys and certificates); the entire certbot directory has to be mounted instead for the symlinks to work. On Windows, even though symlinks exist, they are not first-class citizens and generally require administrative access or complex local-policy configuration to make Certbot run as a standard user.

So if some modifications are made to the archive folder, I think it would be a great opportunity to drop the symlinks as well.

This solution from @localhorst seems to do the job (source link).

Always nice to take 20k files out of the filesystem and the backup scan process :)

Note: the original link has broken.

I thought this could be a good place to share a usable solution.

Perhaps a good place for it is /etc/cron.monthly/certbot:

#!/bin/sh

# Delete all CSR files not modified during the past 91 days:
find /etc/letsencrypt/csr/ -type f -name '*.pem' -mtime +91 -exec rm -f {} \;

Remember to make it executable: chmod +x /etc/cron.monthly/certbot
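The same approach could presumably be extended to the standalone key copies under /etc/letsencrypt/keys; my understanding (an assumption worth verifying, and worth backing up first) is that the live/ symlinks only point into archive/, never into keys/:

# Assumption: nothing in live/ references keys/, so old copies there can go too.
find /etc/letsencrypt/keys/ -type f -name '*.pem' -mtime +91 -exec rm -f {} \;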

etckeeper

For any of you using etckeeper (on my list of 'must have' utilities), a nice thing to do is to add these lines to /etc/.gitignore and let backups handle this.

# Ignore certbot keys
letsencrypt/keys
letsencrypt/archive
letsencrypt/live
letsencrypt/csr

Note: on Debian, install git before etckeeper or it could default to another version manager: apt install git etckeeper

Question:

Would it be safe to do the same thing on /etc/letsencrypt/archive?

find /etc/letsencrypt/archive/ -type f -name '*.pem' -mtime +91 -exec rm -f {} \;

Update:

One way to find out is to try. So far it seems to work, but you have to watch out for expired certs still referenced in the nginx configuration, for example (nginx -t will tell you).


I'd not trust mtime too much. It's safer to check for active symlinks in /etc/letsencrypt/live; you never know where files have been moved from, restored, etc. Also, even expired certs and keys should not be removed automatically, because that may prevent a service (e.g. nginx) from starting up.
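A sketch of what a symlink-aware cleanup could look like, as opposed to trusting mtime alone (assumes GNU readlink and find; back up /etc/letsencrypt before trying anything like this):

#!/bin/sh
# Delete old archive files that no symlink under /etc/letsencrypt/live still
# resolves to; anything currently referenced from live/ is kept regardless of age.
live_targets=$(find /etc/letsencrypt/live/ -type l -exec readlink -f {} \;)

find /etc/letsencrypt/archive/ -type f -name '*.pem' -mtime +91 | while read -r f; do
    resolved=$(readlink -f "$f")
    if printf '%s\n' "$live_targets" | grep -qxF -- "$resolved"; then
        :    # still pointed to from live/, keep it
    else
        rm -f -- "$f"
    fi
done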

You have a point; that's why I always run nginx -t before reloading nginx, but in case of a reboot this could still prevent nginx from starting.

So perhaps I'll just delete them manually like once a year.

Thanks

Is there any progress on this front from the original design? I understand this might not be a priority from a security perspective, but from a data-management perspective, Certbot is kind of a nightmare because you eventually end up with a LOT of certificates, keys, chains, CSRs and all that jazz in there, especially if you host a large number of domains.

Here I have configurations for about 20 domains (and not all of those are active, mind you), and I ended up with thousands of files in /etc/letsencrypt, using up 22MB of disk space:

root@marcos:/etc# ls letsencrypt/archive/| wc -l
19
root@marcos:/etc# find letsencrypt/ | wc -l
5343
root@marcos:/etc# du -sh letsencrypt/
22M     letsencrypt/

... I don't mind the disk space use so much, actually, but this is the kind of thing that trips other tools up. I had to implement hacks in etckeeper, for example, to keep it from falling apart on those folders...

We're also told not to touch the archive directory at all, which makes me wonder whether even Certbot itself would be in a position to clean stuff up in there... It looks like purging everything older than 6 months might be safe, but I'm not 100% sure, so it would be nice to at least have a documented workaround for this... For now I'm going to try this and see what breaks:

root@marcos:/etc# find /etc/letsencrypt/archive/ -type f -name '*.pem' -mtime +180 | wc -l
1548
root@marcos:/etc# find /etc/letsencrypt/archive/ -type f -name '*.pem' | wc -l
1720
root@marcos:/etc# find /etc/letsencrypt/archive/ -type f -name '*.pem' -mtime +180 -delete

Just chiming in to say that I had a production incident today because certbot ate up all my inodes by making more than 1.5 million files in the /etc/letsencrypt/{keys,csr} directories. Please fix this — it's frustrating that this issue has been known since 2017 and nothing has been done about it.

It looks like purging everything older than 6 months might be safe, but I'm not 100% sure, so it would be nice to at least have a documented workaround for this...

I think this would be a great start to fixing this. @schoen, what's a proper expiry schedule here? Is it something we could do in (say) the Debian packaging, as a daily cleanup cron job, or do we need to actually parse the certificates?

I am thinking this is something we could do downstream in the Debian packages...

Maybe also: if others have stumbled upon this issue and deployed a workaround, could you document here what worked and what didn't?

Maybe also: if others have stumbled upon this issue and deployed a workaround, could you document here what worked and what didn't?

I recently created cert-prune for this exact purpose. It's not overly fancy, and may not suit everyone, but it scans the letsencrypt folder, checking symlinks and certificate modification times, and purges (by default) certificates older than 60 days which are no longer symlinked from the live/* folders. It works well for my purpose (managing well over 100 certificates) and has been used in production for the last three months with no issues. I always suggest backing up your entire letsencrypt folder before testing.
