whatwg / url

URL Standard

Home Page: https://url.spec.whatwg.org/

Add a neato informative table of various URL pieces

domenic opened this issue · comments

Basically copy the bottom half of this: https://nodejs.org/api/url.html#url_url_strings_and_url_objects

(We could presumably SVG-ize it so it's a little prettier.)

Via the thread at https://twitter.com/wa7son/status/886982643463708673

Quick WIP with Inkscape:

drawing svg

Would something like this work?

I think so, although maybe a table is better as that would be more accessible I suspect. Note also that ? is not part of the pathname getter. The other thing that might be interesting is to illustrate a couple different URLs. In particular different schemes. You also omitted the origin field although that's rather hard given that it needs to skip user/pass somehow.
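
To make the points about the pathname getter and origin concrete, here is a minimal sketch. It assumes a runtime with the global `URL` class (browsers, Node.js); the example URL and the `pieces` helper are made up for illustration and are not from the spec.

```typescript
// Hypothetical helper: collect the API getters discussed in this thread.
function pieces(input: string): Record<string, string> {
  const u = new URL(input);
  return {
    protocol: u.protocol,
    username: u.username,
    password: u.password,
    host: u.host,
    hostname: u.hostname,
    port: u.port,
    pathname: u.pathname,
    search: u.search,
    hash: u.hash,
    origin: u.origin,
  };
}

// Made-up example URL containing every piece.
const p = pieces("https://user:pass@sub.example.com:8080/p/a/t/h?query=string#hash");
console.log(p.pathname); // "/p/a/t/h" (the "?" belongs to search, not pathname)
console.log(p.search);   // "?query=string"
console.log(p.origin);   // "https://sub.example.com:8080" (credentials are skipped)
```

The origin value is what makes it awkward to draw: it covers the scheme, host, and port but skips the username and password that sit between them.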

Note also that ? is not part of the pathname getter.

Oops, an off-by-one.

In particular different schemes

That does indeed sound like a nice idea.

You also omitted the origin field

I did so intentionally, as it's not a concept intrinsically related to URL parsing, but rather more about Web apps/security. And because it's hard.


I did a table version with a few variants. The first is a straight translation of the SVG graph. The second is closer to the version in the Node.js doc. The third is the same as the second but has origin, mainly there to show how ugly it is. The fourth is a URN for fun. I do like the fact that you can link to the spec for that exact attribute, but with some coloring I still think the SVG one is a bit prettier.

screenshot from 2017-09-05 16-07-49

Alright, I'm game. We should be able to get the links to work with SVG too. I'm not really sure if we can make all of it equally accessible though.

Personally I like the first table one, possibly with additional text-align center.

I also think it might be interesting to have a counterpart table that is about the URL record terms, instead of the API? (E.g. scheme instead of protocol, query instead of search, fragment instead of hash.) Maybe that wouldn't be that helpful though.

It's probably useful as there are some interesting differences between the two. Bit unclear where the table should be located at that point, but maybe we could put it in an Appendix?

What is the status of this issue? The issue that I submitted today, "Documentation on URL syntax", has been closed and deferred to this issue, which has been in play for over a year. In the meantime, the only documentation I've found that lays out URL syntax is the series of steps in section 4.5, which requires some figuring out to understand. So are we going to add something to fix that?

@EnnexMB it basically needs someone to work on it and resolve the open questions above.

Ok, can I help with this? It seems that the two open questions are:

  • What should be used to represent the syntax (graphic, table, or formula). I suggest the second or third of the four tables posted by TimothyGu, but:
    • It should be in text, not a graphic, with links on the terms. The code Timothy used to generate the graphic would be helpful.
    • The terms should match those used in the URL spec, although it would be good to point out the parallel API terms and link to their document, and it would be best if the analogous table is posted there as well.
    • It would be helpful to have an additional row at the top of the table with a symbol to distinguish required from optional elements. In the formula proposed in my issue yesterday, this was done with square brackets. Timothy's table is much easier to read, but the optionality is important information to include.
    • I guess the third table is best if "origin" means something, which I don't know. Linking to where it's explained in the spec would fix that. It appears twice, so some clarification is needed.
    • I don't know about URNs, but if Timothy's fourth table is related to URLs, then it would be good to show that relationship as well.
  • Where to put it. I suggest section 4.5 unless there is a more appropriate place. If it's put in an appendix, it should be prominently linked to, and maybe should be linked to in a few places anyway, because this is going to be something that will be useful to people.

What have I left out? If Timothy will post the code behind his graphic, I'll work on editing it to implement the suggestions above.

Sounds good, thanks. What do you think @TimothyGu?

As for placement, the top of section 4 might also work, given that it illustrates the relationship between various subsections.

@EnnexMB Thanks for your interest in this.

The WIP are unfortunately on my laptop that has seen some physical damage since the time I created them. I'll try to recover the files tonight.

The code Timothy used to generate the graphic would be helpful.

For the first (https://user-images.githubusercontent.com/1538624/30042227-a0f3af64-9222-11e7-96a4-39c0cf11d279.png) it was just a manually created SVG. For the second it was a pretty standard HTML table with the spec's default styling.

It would be helpful to have an additional row at the top of the table with a symbol to distinguish required from optional elements.

NB: what's optional is quite different for different URL schemes. The URN at the end is a good indicator of that. In fact, for non-special URLs only the scheme is required and nothing else – tim: is a valid URL! It's important to be mindful of that.
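
A quick way to see this, as a sketch assuming the global `URL` class; the ISBN-style URN is only an illustrative input:

```typescript
// "tim:" has a scheme and nothing else, yet it parses.
const tim = new URL("tim:");
console.log(tim.protocol); // "tim:"
console.log(tim.host);     // ""
console.log(tim.pathname); // ""

// A URN parses the same way: everything after "urn:" lands in the path.
const urn = new URL("urn:isbn:0451450523");
console.log(urn.protocol); // "urn:"
console.log(urn.pathname); // "isbn:0451450523"
```

Neither input has a host, port, query, or fragment, which is why a single "required/optional" row is hard to get right across schemes.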

I guess the third table is best if "origin" means something, which I don't know.

I'd be okay removing that. It's not really a component of the URL but rather a byproduct, so may not fit in that table.

Hi @TimothyGu, were you able to recover the file? I don't think we need the first one, since it seems to be superseded by the second, the table version. Standard HTML is fine, and it could help to start from the structure you've already created, rather than starting from scratch. Of course, if you want to move forward with the changes yourself, that would be great too. But I'd be happy to do it if that would help.

I understand that optionality is complex. I had in mind to devise some compact way to represent it in the first row of the table. You (and others) might want to take a look at the formula in my original post and see if you agree with the optionality as represented there by square brackets. (I just now edited with a correction.) It does have everything optional except scheme:. I wrote that formula entirely based on the serializing instructions in section 4.5.

@TimothyGu, any luck getting that file? I really think that one way or another we should get this done.

@EnnexMB Sorry about the delay, but yes! Here's the diff for the table version:

https://gist.github.com/5eb111b5021b338d516e97225a65bed4

Here's the SVG if you're interested. Note the search coverage is still wrong.

https://gist.github.com/bf539f420463bab1eb7426cff267a5b4

(drawing2.svg has the fonts embedded)

Please go ahead and work on it. I won't be able to do so myself and I really appreciate your stepping up.

Thank you @TimothyGu. I need some help with the format of the file in the first link. Can someone send me a link to documentation on the diff format used there? I Googled "diff file" and don't see anything relevant.

I found https://www.thegeekstuff.com/2014/12/patch-command-examples/. The document being patched is the source file for the URL Standard by the way, url.bs.

@EnnexMB Oops, I’m sorry to have missed your comment on the gist itself. What @annevk gave should work, though I would personally do this:

  1. Put the gist file in a file, let’s call it tmp.diff
  2. Apply it using git apply tmp.diff.

git apply has several advantages over patch and is usually much easier to use, so I’d recommend that for diffs with Git headers like the one I provided.

Okay, I'm sorry, but I still need a bit more help here.

I think the problem is that this all started when I was reading the URL standard and posted an issue about it, which landed me here in GitHub, but I have no experience in GitHub. So when I'm told to use git apply tmp.diff, I don't know what environment I'm supposed to be in to do that.

I Googled git apply and found what appears to be documentation of that command, and from there, of git itself, which appears to be software that I need to install on my computer in order to proceed with this. Is that correct, or is there a way to work with that diff file online without installing software?

Sorry to distract from the thread topic by needing some guidance.

It's for the command line, e.g., the Terminal application on macOS. And yeah, you'd need to have such tooling installed (for macOS you'll get prompted to install it). To help you, I applied the diff to url.bs and copied the result to https://html5.org/temp/url.bs.

Edit, Sept. 15: Disregard this post, and see my next one below.


Okay, thank you. That gave me a helpful starting point.

I don't know how to include HTML in this post, so I've inserted two images of what I've done and then after those images, I provide a link to the HTML file that generated both of them.

Here is @TimothyGu's third table with the changes I suggested and some additional changes:
url syntax representation- original table proposal modified
The complete list of changes from his original table is documented in the HTML file linked below. Also, in that HTML file, the red, underlined text is working links.

In addition, I've done some further work to present an alternative proposal, which has three parts.

  • Formulaic representation: I think this is worth including because it uses the standard system of square brackets to represent optional elements and curly brackets with a vertical line to represent a set of elements to select from. Also, referring to it can assist in understanding the meaning of the graphical representation below it.
  • Graphical representation: This uses different colors to represent optional elements, with a gradation of lighter colors to represent elements that are optional within other optional elements, and adjacent elements in the same color to represent mutually exclusive choices. The information content is the same as in the formulaic representation, but it is easier for a human to read.
  • Table of element conditions: This summarizes the rules in section 4.5. "URL serializing" of the standard. Again, referring to this table can make it easier to understand both the formulaic and graphical representations.

In the following image, the underlined text is working links in the HTML file linked further below.

url syntax representation- new proposal

The two images above were generated in an HTML file using the same CSS as the URL standard. However, that didn't handle conversion of the double-brace wrappers used in @TimothyGu's code, so I converted those to <code> tags. (I'd be very interested in knowing how to use those double-brace wrappers if someone could direct me to information on that.)

The HTML file is posted at Gist, and I don't see a way to link to it so it can be read directly by your browser. So to see it as intended, you will have to copy it into your own htm file and view it in your browser from there. If someone will tell me a better way to do this in the future, I will do that.

Alright, hold on a second. Disregard my previous post from a few days ago. I was just reading up on CSS syntax and in sections 4.1 and 5.1 came upon railroad diagrams. It's a far better way to represent syntax than my home-spun graphical representation above. I found a website for generating them, and here is the result for URLs:

url syntax railroad diagram

Along with that graphic, there is an htm file that shows that diagram with links on the element names to the relevant sections of the URL Standard, along with another representation of the syntax in EBNF notation, which is the code used to generate the diagram.

As above, the htm file is saved as a Gist, and I wish I knew a way to post it so it would load directly in your browser, but I don't.

From my previous post, the table of element conditions might still be useful. I'd say disregard all the rest.

See #24 on some previous work done on creating a formal grammar for URLs, perhaps displayed through railroad diagrams (see http://intertwingly.net/stories/2014/10/20/Url.xhtml). In my opinion, RR diagrams and formal grammar solve a different problem, and a version of what I had should be enough just for a simple overview of URLs, which is what this bug is all about.

The RR diagrams you linked to are very complex and, as you say, solve a different problem than we are discussing here. The RR diagram I posted is very simple and contains the same information as in your table plus information on optionality of elements. Do you have an idea of how to convey that optionality information in your table? That was what I was getting at with the graphical representation, but I think the RR diagram does it much better.

Whether we use the RR diagram or a version of that table or something else, I would like to suggest that this issue be brought to a conclusion by posting something in the standard to give readers an easy way to understand the syntax of URLs.

a version of what I had should be enough just for a simple overview of URLs

It seems like the simple RR diagram in #337 (comment) hits the sweet spot pretty well. To me at least, it’s more user/reader-friendly than either the table approach or the https://user-images.githubusercontent.com/1538624/30042227-a0f3af64-9222-11e7-96a4-39c0cf11d279.png approach

As @annevk mentioned, there is a complete railroad diagram for the URL specification as it existed four years ago. It even was testable, produced a reference implementation, and passed all of the (valid) tests at the time. This was even merged into the spec (the stylesheet is now gone, but you can get the idea). If this is an idea whose time has come, I can help.

Hey, those railroad diagrams in that old version of the spec (linked under "merged into the spec" above) are great. Why were they taken out? If that's a long story, we don't need to go into it, but the important question is, can we get them put back in?

If the problem is that changes in the spec made the diagrams invalid and it was too much work to update all those detailed diagrams, I understand that. But if that's the case, can we, instead of leaving the diagrams out entirely, put in a summary diagram like the one I posted above, so the reader at least has something to help them understand the syntax?

The summary diagram I posted above is roughly equivalent to the combination of diagrams at the following locations in the old spec:

One problem with my summary diagram that I see by looking at the old spec is that my diagram does not cover the case of relative URLs. The reason for this is that I built that diagram based on the rules in section 4.5. URL serializing. I suppose that may have been an error on my part and I should have used the slightly more complex rules in section 4.3. URL writing. Is there a reason the serialization rules don't include relative URLs?

If there is a decision to use a summary railroad diagram like the one I posted above, then I will correct it to cover relative URLs by recasting it from the rules in section 4.3 or any other set of rules that are the right rules to base the diagram on. On the other hand, it would be even better to take up @rubys's offer to update the more detailed diagrams (if that's what he's offering to do). Best solution would be to include both the detailed diagrams and a summary one.

If the problem is that changes in the spec made the diagrams invalid and it was too much work to update all those detailed diagrams, I understand that.

I hit that same problem while working on PR #416. I was unable to find a complete list of what has changed since RFC 3986. Is there an official log of the changes between versions/dates (other than having to dig through the git commits)? Without a change log, it's hard for implementors to find out if their implementation is still up-to-date with the spec.

RFC 3986 wasn't exactly used as a base, so there's no detailed changelog relative to that.

What was used as a base then?

Reverse engineering implementations through ad hoc testing that became increasingly rigorous.

@annevk, any comment on the use of railroad diagrams in the standard, why they were taken out, and whether they are a suitable solution to this issue?

@EnnexMB they were never taken out - I built a separate spec as a proposal; it didn't get much interest at the time, so they never went in. Perhaps now is a good time to revisit the idea.

The thread at #24 (comment) @rubys linked earlier has some additional details from my perspective. But yeah, if someone were to resolve the outstanding issues I think there's still interest in having something like that in the standard.

Can someone please answer my questions above?

  • Is there a reason section 4.5. URL serializing doesn't cover the case of relative URLs?
  • Since it doesn't, what is the correct section of the standard that we need to talk here about illustrating? Is it 4.3. URL writing or something else?

There's no object representation of a relative URL. It's only a potential input to the parser, not an output.
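
In API terms this looks roughly as follows. This is a sketch assuming the global `URL` class; `tryParse` and the example inputs are hypothetical.

```typescript
// Hypothetical helper: returns the serialized URL, or null on parser failure.
function tryParse(input: string, base?: string): string | null {
  try {
    return new URL(input, base).href;
  } catch {
    return null;
  }
}

// A relative-URL string on its own cannot be parsed into a URL...
console.log(tryParse("../c?x=1")); // null

// ...but with a base URL it resolves to an absolute one.
console.log(tryParse("../c?x=1", "https://example.com/a/b")); // "https://example.com/c?x=1"
```

The parser's output always has a scheme, so a serializer never has to deal with a relative URL.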

Thank you. How about the second question?

I suggested the top of section 4 earlier on, but we could also add a new section somewhere. Depends a bit on what the contents end up being.

Sorry, let me clarify the question. What section of the standard contains the algorithm that we need to represent? I ask this because I made the mistake before of representing the algorithm in section 4.5, which is incomplete for this purpose.

Incomplete in what way? What OP asks for is some kind of graphical overview of a URL (defined in 4.1), which is typically done by using its day-to-day syntax separators (defined in 4.3; though that also defines other things, not relevant for this). And perhaps how that relates to the API (defined in 6.1).

Section 4.5 is not incomplete in the sense that it's wrong (as you explained), but is incomplete in the sense that its algorithm cannot be used to infer the complete URL syntax. I made the mistake of using section 4.5 for that. Now it seems that there are at least two other sections that might be used for this purpose: 4.3 or 4.4, and I'm asking if one of those or another section is the right one to use for this. (You mentioned section 4.1, but it only defines the parts that go into the URL, not how they are combined to compose it.)

It sounds like your goals have diverged from what OP requested. All valid URL input is described by 4.3. Everything that parses into something is described by 4.4.

You moved my issue here, which seemed to make sense because the purpose was to get a human-friendly representation of the syntax into the documentation. The OP asked for an informative table of URL pieces. I've suggested including information about optionality, which requires following an algorithm, not just a list of parts. Is that an unwelcome divergence from the goal?

Now, I apologize again for my lack of expertise in URL technology, but I don't understand enough to know if what you said about sections 4.3 and 4.4 answers my question. If my lack of experience makes my participation a drag, tell me and I'll go away. But if I may be helpful in developing the desired representation, could you please say which section contains the right information to base the representation on?

I guess I don't quite understand what your goals are. E.g., your issue doesn't list the scheme as optional, but elsewhere in this issue you're talking about relative URLs (relative-URL strings), which can omit the scheme. If you're interested in all the strings browsers accept, that's 4.4. If you're only interested in strings web developers should produce, that's 4.3.

The goal of this issue as I understand it is mostly to show the typical URL syntax in relation to the various components described by the standard, which looked similar enough to your issue...

My personal goal is to fully understand the URL syntax, so I can code it properly in my websites. That's why I came to check the URL Standard in the first place. But when I found that no simple representation of the syntax was given there, rather than just complain about it, I offered to pitch in and help make one. So that's the goal of my participation in this issue, to help develop a human-friendly representation of the URL syntax for inclusion in the Standard.

My first attempt at that showed the scheme as required because I made the mistake of basing it on section 4.5.

Since section 4.3 describes what URLs developers should write and 4.4 describes URLs browsers should accept, I presume the logic of both of those should be the same. So if I were to go through each of those sections and lay out the syntax they represent, I should get the same result from each section. Is that correct? (I ask that question because the same was not true for section 4.5, for a reason I would have understood if I were more familiar with this technology.)

That's incorrect. The parser will accept more. Whatever it accepts more, is always flagged as a "validation error". What's not accepted by either is flagged as "failure" in the parser. Any further mismatches would be errors with the URL standard itself.
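
A small sketch of the difference between the two outcomes, assuming the global `URL` class (`parseOrNull` is a made-up helper). Note that the `URL` API does not report validation errors; it only reports failure, by throwing.

```typescript
// Hypothetical helper: returns the serialized URL, or null when parsing fails.
function parseOrNull(input: string): string | null {
  try {
    return new URL(input).href;
  } catch {
    return null;
  }
}

// Backslashes in a special-scheme URL are a validation error: not valid to
// write, but the parser still accepts them and treats them as "/".
console.log(parseOrNull("https://example.com\\a\\b")); // "https://example.com/a/b"

// A special-scheme URL with no host is a failure: no URL is produced.
console.log(parseOrNull("https://")); // null
```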

Okay, that's interesting. I'm glad I asked.

You said above that we can move forward with this "if someone were to resolve the outstanding issues". You cited another post of yours, which talked about discussing grammar and saying that @rubys's attempt at railroad diagrams was not successful. Could you say what outstanding issues need to be resolved besides developing a correct and helpful representation of the syntax?

The railroad diagrams were intended to replace section 4.3 (writing) and 4.4 (parser). Some problems were:

  1. They were not strictly identical.
  2. They were not necessarily easier to understand.
  3. Railroad diagrams as a concept were not formally defined.

If you don't plan to replace those sections and only offer them as non-normative guidance, then 2 and 3 go away.

@rubys, would you like to work with me (or without me) on this? You obviously have far more experience and expertise on it and have already done a lot of the work. I wouldn't want to reinvent your wheel. It sounds like there's interest in using something now if it's either perfect or non-normative.

@EnnexMB, I'm willing to help, but there seems to be some confusion. For example, I don't believe that the railroad diagrams were ever meant to replace any existing sections, and if I ever gave that impression, I apologize. Nor do I believe that they were meant to be normative (my memory is fuzzy on this point, perhaps they were initially proposed as such, but if so, we quickly determined that they were best non-normative).

Beyond that, there is an even bigger disconnect. To illustrate, look at the original table and note that it uses the word protocol. Now look at recent work, and see the word scheme. I think the biggest problem here is determining who the target audience is for this change. I gather that the original request was focused on users of the API.

I guess what I am getting at is that there may be multiple issues here, and they aren't mutually exclusive. It may be worthwhile adding multiple graphics to different sections.

Finally, yes, I'm willing to help. If you have something you would like to see in the document and can show it displaying in a web page, I can review it and do the command line magic to make pull request for you. If what you produce addresses this issue, that's great. But if not, that's not a problem either.

Hi @rubys, I'm glad you're willing to put those misunderstandings behind us and move forward with this.

We do need to figure out the matter of terminology you mentioned. Let me ask this question. I see two possibilities:

  • The two sets of terminology are equivalent, i.e., they form a set of synonym pairs used in two different technology domains.
  • There is a meaningful difference between the two sets, so that, for example, there is a (perhaps subtle) difference between the meanings of protocol and scheme, between query and search, and between fragment and hash.

If the first case is true, then perhaps each box of the diagrams could include both terms, i.e., it would be bilingual.
If the second case is true, perhaps this brings up your suggestion of different graphics in different sections. But would it also be possible to have diagrams that express the relationship between the two terminologies, so that people could understand that relationship instead of looking at them in isolation from each other?

Regarding a web page that displays a candidate of something to go in the Standard, I'd like to suggest that we're talking about something on the spectrum between my diagram and your diagrams. One problem with my diagram is that it doesn't include the case of relative URLs. But what I like about my diagram is that it summarizes the whole sequence of absolute URLs in one diagram (albeit with a line break). Your diagrams go into much more detail and therefore cover the content of my diagram in at least four different diagrams, as listed above. It seems that both approaches are worthwhile for seeing both the forest and the trees.

In addition to @rubys, it would be helpful if @annevk and anyone else chimes in if you ever feel that we're going off in a direction that's not going to work. It would be unpleasant for us to develop something, only to be told later it's not suitable.

I have added syntax diagrams to the Wikipedia pages on URNs and URIs. Those diagrams are generated directly from the syntax code posted on those pages (which was there before). The portion of the URI article that includes that drawing is transcluded (automatically copied) to the page on URLs. That means that other people at Wikipedia have decided that the syntax of URIs and URLs is the same. I don't know if that's correct or not.

There is a contradiction between the URI/URL diagram and the one I originally proposed above. The diagram above shows the path as optional, but the syntax in Wikipedia (based on RFC 3986) shows it as required. So I suppose this is another error in that original diagram.

If either of those diagrams posted in Wikipedia is incorrect, or if it is incorrect that URI and URL syntax are the same thing, then either feel free to edit the Wikipedia pages or let me know what the problems are and I'll get them corrected.

The URI/URL syntax drawing does not have the level of detail in my original diagram or in @rubys's diagrams. I won't enhance the diagrams in Wikipedia until a new diagram or diagrams have been vetted and approved here.

There's been no response on either of my last two posts for two weeks. I don't know if this is because my questions were deemed too dumb to comment on or too difficult to answer.

I do think @rubys was right when he said that the conflict in terminology is an important place to start. But whereas he suggested choosing one form of terminology based on who the audience is, I'm suggesting sorting out and resolving the conflict so that all audiences can talk with each other and be understood. Can we do that? If we can, then we can proceed to make up a useful and correct (albeit nonnormative) illustration of the syntax.

Regarding URIs and URLs, there is some disagreement in the world about whether they are synonymous or not. It would seem that the folks who set the standard for URLs would be a good authority for establishing the correct answer to that. And when we have that answer, we'll know whether the illustration of the URL syntax also applies to URIs synonymously or needs to be adjusted to apply to URIs.

Are we going to move forward to get an illustration of the syntax done?

It's not really clear to me what questions you have, I only count one question mark in the preceding two posts. Here's my view on the terminology:

  1. For the URL model we continue to use scheme et al. as these are more appropriate and less misleading than protocol et al. For the URL API we continue to use the latter as changing the API breaks compatibility and introducing new APIs solely to fix the terminology is not worth it. We could align the model with the API, but given the many non-API consumers of the model I don't think that would be fair.
  2. The point of view of the URL Standard is that URIs and IRIs no longer exist (subsumed by URLs). And that URNs are URLs with the urn: scheme.

Thanks @annevk.

In your answer 1, it sounds like you're saying that the terminologies are synonymous. Therefore, the diagram I first posted above can be made bilingual by inserting the corresponding API terms in brackets below the non-API terms where they are different, as follows:
url syntax with api terms
This diagram has the problems already discussed above and is only used here to show the presentation of API terminology alongside the non-API terminology where they are different. I have taken the API terms from @TimothyGu's original postings (first and second) in this issue.

@rubys, does this resolve your concern about the terminology? Can we move forward with developing the correct syntax diagrams in this way?

@annevk, in your answer 2, is the view of the URL Standard authoritative, or is there some competing body that could disagree with you? If this view is authoritative, then I could propose that Wikipedia state that the term "URI" is deprecated and when we finish the syntax diagram here, that should be posted on Wikipedia in the URL article with reference to the new diagram in the URL Standard.

The IETF would likely disagree.

Okay, thank you.
What do you think of the bilingual syntax diagram?

Sorry, I think it would be nicer to always list the second term, even if it's identical, and link it to its definition.

Yeah, listing both names in all boxes could make it clearer, even if repetitive.
And yes, the intention would be to link the boxes to the definitions. Is it sufficient to link to the definitions in section 4.1, for example for scheme? I don't see analogous definitions in the Standard for the API terms.

Note that they don't really match. E.g. if the scheme is "https", then protocol is "https:". Similarly query/search and fragment/hash.
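
For illustration, here is a rough sketch of the mismatch, assuming the global `URL` class. `recordTerms` is a hypothetical helper that approximates the record terms by stripping the delimiters from the API getters; it is not how the spec defines them.

```typescript
// Hypothetical helper: approximate the URL record terms from the API getters.
function recordTerms(input: string): { scheme: string; query: string; fragment: string } {
  const url = new URL(input);
  return {
    scheme: url.protocol.slice(0, -1), // protocol "https:" -> scheme "https"
    query: url.search.slice(1),        // search "?q=1"     -> query "q=1"
    fragment: url.hash.slice(1),       // hash "#top"       -> fragment "top"
  };
}

const terms = recordTerms("https://example.com/path?q=1#top");
console.log(terms); // { scheme: "https", query: "q=1", fragment: "top" }

// Caveat: search is "" both when there is no query and when the query is
// empty, so this stripping cannot tell those two cases apart.
console.log(new URL("https://example.com/path?").search); // ""
```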

Okay, thank you @domenic. This is the question I was originally asking. If they don't match exactly, then they are not synonymous and the bilingual diagram above is not appropriate. In that case, either:

  • The syntax diagram can cover one terminology or the other (non-API or API), or
  • It could incorporate the differences, or
  • There could be two separate diagrams for the different terminologies.

I think the second one (incorporate the differences) would be best if it can be done reasonably well, as it would help people understand the relationship between the terminologies.

So, is there a place that lays out the exact relationship between the terminologies, i.e., the differences that you are referring to?

Okay, could you help by providing a translation of those algorithms to a set of correspondence rules, like the one you stated above, that scheme https = protocol https:.

I wonder how that rule applies. The scheme is always followed by ":", so from that rule, it looks like the definition of protocol just includes the ":" instead of appending it. If that's correct, that's fine; we do need to represent such relationships correctly. So can you provide the set of those rules to work with?

The set of rules is described by the algorithms, no?

It would be good to name and identify both the domain-names in the host-name (separated by dots) and the path-components (aka 'folder names') in the path (separated by slashes).

Thanks @dwsinger. We already have #435 for formalizing domain labels. And we should probably formalize "path segment" for the latter, which we already use in URL writing, but not in URL representation (there it's just an ASCII string without a formal name).
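
As a rough illustration of those two sub-pieces, the sketch below splits the serialized getters. It assumes the global `URL` class and a made-up example URL, and it only makes sense for a domain host and a slash-separated path; it is not the spec's definition of either term.

```typescript
const loc = new URL("https://www.example.com/docs/api/url.html");

// Domain labels: the dot-separated parts of the host.
const labels = loc.hostname.split(".");
console.log(labels); // ["www", "example", "com"]

// Path segments: the slash-separated parts of the path
// (slice(1) drops the empty string before the leading "/").
const segments = loc.pathname.split("/").slice(1);
console.log(segments); // ["docs", "api", "url.html"]
```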