We found that hostnames of mixed case can be created through EPP, both in and out of zone. For example, we would not expect to be able to create both ns1.UPPER.foo and ns1.upper.foo, based on our reading of RFC 952, which states:

A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters... No distinction is made between upper and lower case. ....

By contrast, attempting to create a mixed-case domain returns: Domain names can only contain a-z, 0-9, '.' and '-'

On Wed, Oct 19, 2016 at 3:46 PM, Nick Felt nickfelt@google.com wrote:
You're right, this is a bug. Thanks for pointing it out.

The real issue (IMO) is that we rely too much on Guava's InternetDomainName.from() method to do validation for us. It's too permissive for my tastes, and for example, when constructing a domain name, it normalizes uppercase to lowercase. If you try to use this as a validity check, then you might do subsequent checks on the InternetDomainName object, expecting that it represents the same literal string (if converted back to a string) as the original, but that is not in fact the case due to the silent normalization.

This is exactly where we went wrong here: we call validateHostName(), which uses InternetDomainName.from() [1], and then in HostCreateFlow pass the result into lookupSuperordinateDomain() [2], but elsewhere in that flow we use the original "targetId" (aka the FQHN) directly, including for constructing the new HostResource.

I think we should just reject hostnames that aren't already normalized, i.e. we should check that InternetDomainName.from(hostname).toString().equals(hostname) and reject anything that fails. If we felt like being friendly, we could try to do some proactive checks (e.g. for uppercase characters) so we can return a better error message, before returning a generic "host name %s doesn't match normalized form %s" error.
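The round-trip check proposed above could look roughly like the following. This is a minimal sketch, not the actual Nomulus fix: the class and method names are hypothetical, and a simplified stdlib-only `normalize()` stands in for `InternetDomainName.from(hostname).toString()` so that the example has no Guava dependency.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the Nomulus codebase. */
final class HostNameRoundTripCheck {

  /**
   * Simplified stand-in for InternetDomainName.from(name).toString():
   * lowercases the name and drops a single trailing dot. (Assumption: the
   * real Guava method also validates the name, which this sketch skips.)
   */
  static String normalize(String name) {
    String lower = name.toLowerCase(Locale.ROOT);
    return lower.endsWith(".") ? lower.substring(0, lower.length() - 1) : lower;
  }

  /** Throws IllegalArgumentException unless the name is already in normalized form. */
  static void requireNormalized(String name) {
    // Proactive check, so the most common mistake gets a specific message.
    boolean hasAsciiUppercase = name.chars().anyMatch(c -> c >= 'A' && c <= 'Z');
    if (hasAsciiUppercase) {
      throw new IllegalArgumentException(
          String.format("host name %s contains uppercase characters", name));
    }
    // Generic round-trip check for any other kind of silent normalization.
    String normalized = normalize(name);
    if (!normalized.equals(name)) {
      throw new IllegalArgumentException(
          String.format("host name %s doesn't match normalized form %s", name, normalized));
    }
  }
}
```

In this sketch, `NS1.example.com` is rejected by the uppercase check, while `ns1.example.com.` falls through to the generic round-trip error.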

[1] nomulus/java/google/registry/flows/host/HostFlowUtils.java, line 38 in afa4d66:

static InternetDomainName validateHostName(String name) throws EppException {

[2] nomulus/java/google/registry/flows/host/HostCreateFlow.java, line 94 in afa4d66:

lookupSuperordinateDomain(validateHostName(targetId), now));

On Wed, Oct 19, 2016 at 7:41 PM Steve Brown steve@sbrowns.com wrote:
Proactively checking for uppercase characters, as you suggest, would make the hostname and domain behavior more consistent.

I did some analysis on our existing crop of HostResources for our live TLDs, of which .how, .soy, and .xn--q9jyb4c see substantial numbers of registrar-created hosts.

The result is that 0.07% of all hostnames contain an uppercase character, i.e. fewer than 1 in 1,000. There are 146 such hostnames out of nearly 200k total. Of these, 142 were created by a single registrar (representing 8% of their hosts), and the other 4 were created by another (representing 1% of their hosts).

Of the 146 hostnames with uppercase:
100 are all uppercase (e.g. NS1.EXAMPLE.COM)
9 have just the first label uppercase (e.g. NS1.example.com)
6 have just the first letter uppercase (e.g. Ns1.example.com)
31 have some letters in the SLD uppercase (e.g. ns1.FGRYIXZCUK635489-services.com)

It turns out that in addition to the lowercase/uppercase issue, we also were not checking for hostnames with a superfluous trailing dot. There are 27 such names, an even smaller number.

I've been working on the fix for the past day and it should be good to go soon. Our approach is to be a lot more strict about validating incoming host names. Unfortunately it's going to be kind of complicated on our end because we have a data migration to go through first (which involves renaming the existing bad hostnames), but that won't be a problem for anyone who isn't already running a real production registry.

And some fun reading material related to a debate we had internally on how to handle this: https://tools.ietf.org/html/draft-thomson-postel-was-wrong-00

Just to stir the pot... 😄

Host and domain names have been case-insensitive since the earliest days of the internet, and are still expected to be so by most software that accepts or processes such names (e.g. browsers and mail clients/servers). I haven't done exhaustive research, but as an example, RFC 608 "HOST NAMES ON-LINE" from 1974, defining the original HOSTS.TXT file format, says "no distinction between upper and lower case letters" when defining constraints for host names. A more recent example from 2006, and more relevant to DNS, is RFC 4343 "Domain Name System (DNS) Case Insensitivity Clarification", which says "According to the original DNS design decision, comparisons on name lookup for DNS queries should be case insensitive." and goes even further, defining and requiring case-preservation (oh dear!). Lastly, a brief reading of almost any DNS-related RFC, starting with, say, the classic RFC 1034 and all those that updated it, indicates a strong bias toward case-insensitivity of names. For good reasons? Dunno, but lots of bits were spilled in the name of it. Of course one could argue that EPP and a registry are not DNS, and that might be an interesting discussion.

Additionally, I'm not sure it's correct to apply Postel reasoning here. It's one thing to design a protocol where upper and lower case are distinct for the protocol (e.g. HTTP method names). It may be another to say that the data (the names) you transport or represent with that protocol should have different semantics than has been widely accepted for, let's say, four decades.

The above references and rambling notwithstanding, my main concern about "simply" (heh) solving this by restricting the allowed characters in host or domain names to lower case is that it pushes the problem of normalizing case onto every client of the registry, instead of keeping it in one place that understands the semantics of those names well, or at least better. We're not really "solving" it in the "Postel was wrong" sense by making the registry restrictive; we're only pushing it elsewhere, and making everyone duplicate the solution. Sure, it's not hard to call toLowerCase(), as long as you get it in all the needed places, and not too many of the wrong ones.

Thanks for reading this far. 🙇

I definitely don't see us going the route of case preservation. That would require maintaining a separate field to preserve what is effectively the display hostname as originally cased when created. Also, we are still fixing the issues that we identified with non-normalized and non-punycoded hostnames.

As for case insensitivity, we only had two registrars send us non-lowercased hostnames. It may not be as big of an issue as it appears; it seems that most registrars have the same inclination as we do, to force-lowercase everything. I do see your point though, but I also see how it might be confusing that a registrar thinks that they've created NS1.BLAH.COM but they've really created ns1.blah.com, which they'll see in subsequent responses. I could go either way on it. I'll wait for my teammates to chime in.

Just to be clear, I wasn't advocating case preservation. As I said, my main concern is pushing the problem onto others. And as you point out, most existing registrars have already taken that on. We just need to do it in our UI too.

[Edit] But now that I think about it more, I now see how case-insensitivity, coupled with your observation about confusion over NS1.BLAH.COM turning into ns1.blah.com would lead the DNS Deities to additionally require case preservation so as to eliminate that confusion.

I recognize the point about how DNS names are generally case-insensitive, but as Hans noted above, it seems fair to argue that DNS names specified in EPP don't necessarily need to abide by the exact same rules that actual DNS lookups are expected to follow.

Overall, I think the Postel-was-wrong line of argument still makes sense in this case, and being maximally strict and only accepting normalized input is the right approach. Normalized data is easier to reason about and analyze, and code that only accepts normalized data is easier to reason about as well. I think this benefit extends to the clients, too - so in that sense, pushing the normalization requirement upstream to clients is a feature, not a bug.

As Hans notes, there is the counterargument that it places a certain amount of extra burden on the clients to do that normalization, but I think that has to be weighed against how costly that burden actually is. Sometimes it might take precedence, e.g. if the clients are actually humans, or if there are lots of legacy clients that can't be updated, or if the normalization is very difficult to do correctly.

But I don't think any of that applies here*. This is an XML wire protocol that is almost always coming from a machine, not a human (and if it was a human, presumably they're technical enough to be able to normalize a domain name on their own). The client audience of this protocol is a small, limited number of other contracting parties (versus the DNS server case, where it could be anyone on the internet with a DNS client). And the basic normalization we're talking about is really pretty trivial - you lowercase the string and remove a trailing dot if present. We are also intending to require punycoding normalization, but we already require this for domain creates (along with the lowercase and no-trailing dot requirements), and nobody has ever complained.
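To illustrate how small the client-side burden described above is, here is a minimal sketch of the three steps (lowercase, drop a trailing dot, punycode). The class name is hypothetical, and using `java.net.IDN.toASCII` for the punycode step is an assumption: it implements IDNA2003, which may not match the registry's exact IDN rules.

```java
import java.net.IDN;
import java.util.Locale;

/** Hypothetical registrar-side helper; an illustration, not a Nomulus API. */
final class ClientHostNameNormalizer {

  /** Returns the host name lowercased, without a trailing dot, in ASCII (punycode) form. */
  static String normalize(String hostName) {
    // Step 1: lowercase, using a fixed locale so results don't vary by machine.
    String result = hostName.toLowerCase(Locale.ROOT);
    // Step 2: remove a single superfluous trailing dot, if present.
    if (result.endsWith(".")) {
      result = result.substring(0, result.length() - 1);
    }
    // Step 3: convert any non-ASCII labels to their "xn--" punycode form.
    // Throws IllegalArgumentException if a label cannot be converted.
    return IDN.toASCII(result);
  }
}
```

For example, `ClientHostNameNormalizer.normalize("NS1.EXAMPLE.COM.")` returns `ns1.example.com`, and a name that is already normalized passes through unchanged.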

I just don't see evidence that this will be a huge additional burden for registrars. Empirically from the analysis I did above, it's clear that the vast majority of registrars are already providing data in this normalized form. And I'd argue that any responsible registrar is already doing normalization/validation on their own side anyway, so that e.g. if a user types in an invalid hostname in the registrar's nameserver configuration UI, they can reject it right away (perhaps even in javascript) rather than attempting to send it to the registry via EPP and then tunneling back the error. I just tried this in the Google Domains console, and they indeed validate hostnames up front and auto-normalize an uppercase hostname to lowercase on save. So by making this change, we're encouraging that registrars follow best practices and normalize their data right away.

I do appreciate your concerns, and I think it might be worth contacting our registrar partners first to see if any of them object strenuously. But unless we get significant pushback from them, I think strictness is preferable.

* The one exception might be domain checks, which I think are more frequently passed through directly from the registrar to the registry (without any intermediate storage by the registrar). In our work internally we were leaving those unvalidated, so you can supply non-normalized names but they'll fail to return any results. But I just looked at it now, and out of 3,763,658 domain checks during October, only 182 were not in the normalized form, which is about 0.005% of checks, across only 3 registrars.

hostname case sensitivity