MD5-based naming scheme breaks due to "/" in base64-encoded content

Question

madelson opened this issue 2 years ago · comments

Thanks for creating this library! I'm trying to use it to generate API documentation for my projects. I'm trying to use the NameAndMd5Mix mode to avoid the long path issues I'm seeing with the default naming scheme.

The problem is that the MD5 hashes are encoded with base 64, which can contain the / character. This causes files to end up in nested folders (e.g. see this file). This in turn breaks all relative links in the nested files (e.g. see the namespace link here).
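To make the failure mode concrete, here is a minimal sketch showing that standard base64 output can contain "/". The class and method names are hypothetical and are not part of the library.

```csharp
using System;

static class Base64SlashDemo
{
    // Returns true when the standard base64 form of the bytes contains '/',
    // which a file system would treat as a directory separator.
    public static bool ContainsSlash(byte[] bytes)
    {
        return Convert.ToBase64String(bytes).Contains("/");
    }
}
```

For example, the three bytes 0xFF 0xFF 0xFF encode to "////", so Base64SlashDemo.ContainsSlash(new byte[] { 0xFF, 0xFF, 0xFF }) returns true.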

I think an easy fix would be to use hex encoding rather than base 64. This has the added advantage of being case-insensitive which tends to be better for URLs.
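A rough sketch of what the hex variant could look like, assuming the hash is taken over the UTF-8 bytes of the member's full name (that input, and the HashNaming and HexMd5 names, are assumptions for illustration, not the library's actual code):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class HashNaming
{
    // Hypothetical helper: hex-encode the MD5 hash of a name.
    public static string HexMd5(string fullName)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(fullName));
            // BitConverter.ToString yields "AB-CD-..."; strip the dashes
            // so only the characters 0-9 and A-F remain.
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

The result is always 32 characters drawn from 0-9 and A-F, so it cannot introduce path separators or characters with special meaning in URLs.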

If you're interested, I'd be happy to submit a PR.

Paillat Laszlo · Answer 1 · Mon Aug 08 2022 15:57:29 GMT+0800 (China Standard Time)

oh that's a dumb oversight on my part, I correctly handled it for Md5 but forgot to do the same for NameAndMd5Mix >_> but it might be simpler and safer to use hex encoding like you said. Wouldn't that produce a longer hash in the file name though?

madelson · Answer 2 · Mon Aug 08 2022 20:18:00 GMT+0800 (China Standard Time)

@Doraku yes the solution you linked there should work and fixes the nesting problem. However, doesn't ? have special meaning in URLs (starts the query string?). I could see this causing issues depending on where the docs are hosted.

There is also still the case-sensitivity problem, although I suspect that the risk of an actual collision there is pretty low, comparable to knocking a couple of bytes off the hash. Unifying forward and back slashes as ? similarly increases the collision odds.

Hex will lead to hashes that are a bit longer (32 chars vs. 24), so maybe that's a concern. For my use-case it would not be.
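The 32 vs. 24 figures follow from the encodings themselves: hex uses 2 characters per byte, while padded base64 uses 4 characters per group of 3 bytes. A small sketch of that arithmetic (the EncodedLengths class is hypothetical, just for illustration):

```csharp
using System;

static class EncodedLengths
{
    // Padded base64 emits 4 characters for every 3 input bytes, rounded up.
    public static int Base64Length(int byteCount)
    {
        return 4 * ((byteCount + 2) / 3);
    }

    // Hex emits 2 characters for every input byte.
    public static int HexLength(int byteCount)
    {
        return 2 * byteCount;
    }
}
```

For a 16-byte MD5 hash this gives Base64Length(16) == 24 and HexLength(16) == 32.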

madelson · Answer 3 · Wed Aug 10 2022 20:02:27 GMT+0800 (China Standard Time)

Another option would be to use a custom alphabet for the encoding, for example all upper-case letters and digits (36 chars):

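using System;
using System.Linq;
using System.Numerics;
using System.Text;

// md5 is the 16-byte MD5 hash (byte[]) computed elsewhere
Encode(md5, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ");

static string Encode(byte[] hash, ReadOnlySpan<char> alphabet)
{
	// BigInteger reads the bytes as little-endian two's complement;
	// the appended zero byte keeps the value non-negative
	var bi = new BigInteger(hash.Concat(new byte[] { 0 }).ToArray());
	var result = new StringBuilder();
	// Emit digits in base alphabet.Length, least significant first
	while (bi != 0)
	{
		bi = BigInteger.DivRem(bi, alphabet.Length, out var remainder);
		result.Append(alphabet[(int)remainder]);
	}
	// An all-zero hash would otherwise produce an empty string
	if (result.Length == 0) { result.Append(alphabet[0]); }
	return result.ToString();
}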

This gives hash strings of 25 chars, or occasionally fewer if the hash has enough trailing zero bits (a padding solution could be added to guarantee constant length if desired). The nice thing about these hashes is that they only use very "safe" characters and are case-insensitive.
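One possible shape for the padding idea, as a sketch only: compute the smallest digit count that can hold any hash of the given byte length, then fill the remaining positions with the alphabet's zero digit. The FixedLengthEncoding class and EncodePadded method are hypothetical names, and the alphabet is taken as a string here for simplicity.

```csharp
using System;
using System.Linq;
using System.Numerics;
using System.Text;

static class FixedLengthEncoding
{
    public static string EncodePadded(byte[] hash, string alphabet)
    {
        // Smallest digit count whose capacity covers every value
        // representable in hash.Length bytes.
        int width = 1;
        BigInteger capacity = alphabet.Length;
        BigInteger valueCount = BigInteger.Pow(2, 8 * hash.Length);
        while (capacity < valueCount)
        {
            capacity *= alphabet.Length;
            width++;
        }

        // Same conversion as above: the extra zero byte keeps the value non-negative.
        var bi = new BigInteger(hash.Concat(new byte[] { 0 }).ToArray());
        var result = new StringBuilder();
        while (bi != 0)
        {
            bi = BigInteger.DivRem(bi, alphabet.Length, out var remainder);
            result.Append(alphabet[(int)remainder]);
        }

        // Digits are least significant first, so padding with the zero digit
        // on the right does not change the encoded value.
        result.Append(alphabet[0], width - result.Length);
        return result.ToString();
    }
}
```

With the 36-character alphabet and a 16-byte hash, the computed width is 25, so every result has exactly 25 characters.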

Paillat Laszlo · Answer 4 · Thu Aug 11 2022 16:06:10 GMT+0800 (China Standard Time)

that would actually be pretty cool (and safe) :)