w3c / websub

WebSub Spec in Social Web Working Group

Home Page: https://w3c.github.io/websub/

Replace rel=self with rel=canonical

cweiske opened this issue

WebSub requires each resource to be delivered with a rel=self link:

Link Headers [RFC5988]: the publisher SHOULD include at least one Link Header [RFC5988] with rel=hub (a hub link header) as well as exactly one Link Header [RFC5988] with rel=self (the self link header)

HTML pages on the web already link to the resource itself: the rel=canonical link, which major search engines have supported since 2009.

I do not see a reason to add a second link that has the same meaning. Please drop rel=self and replace it with rel=canonical.

See https://chat.indieweb.org/2015-05-22#t1432330810734000 for a discussion in the #indieweb channel about this:

tantek: however, this "even if the same resource is served at http, https, /, and /index.html, only one of those URLs actually works as the push topic" - is an already solved problem
search engines have the same problem
and previously solved it with rel=canonical
thus if that really is a problem for PuSH as well, the PuSH should build upon the pre-existing rel=canonical for the page, rather than require rel=self

rel=self was defined by the Atom RFC (https://tools.ietf.org/html/rfc4287) in 2005. But rel=canonical is more widely used for HTML.

It is also mentioned in the Web Linking spec: https://tools.ietf.org/html/rfc5988#section-6.2.2

Are the two really semantically identical?

self seems to be defined as:

Conveys an identifier for the link's context.

More elaborately from the Atom spec:

The value "self" signifies that the IRI in the value of the href attribute identifies a resource equivalent to the containing element.

While canonical seems to be defined as:

Designates the preferred version of a resource (the IRI and its contents).

The spec for canonical can be found here: https://tools.ietf.org/html/rfc6596

Perhaps an alternative to dropping rel=self would be to also accept rel=canonical: rel=self for Atom and rel=canonical for HTML?

Yes, it makes no sense to use rel=canonical on Atom feeds.

So the rules could be:

  • In the link header, look for either rel=self or rel=canonical
  • In the Atom content, look for rel=self
  • In HTML content, look for rel=canonical or rel=self

Meh. Not sure this actually brings anything while it breaks a lot of existing implementations.

I would rather not add a new tag to HTML pages that already have rel=canonical, as many do. But I see that we should not look for it in the HTTP Link headers.

New suggestion:

  • In the link header, look for rel=self
  • In the Atom content, look for rel=self
  • In HTML content, look for rel=canonical or rel=self
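
To make the three rules above concrete, here is a rough Python sketch of how a subscriber might apply them. It only illustrates the suggestion as written in this comment. The function names, the simplified Link header parsing, and the choice to prefer rel=self over rel=canonical when an HTML page has both are all assumptions of this sketch, not anything the spec defines.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def rels_from_link_header(header_value):
    """Return {rel: [url, ...]} from a Link header value.

    Simplified parser: assumes target URLs contain no '<' or '>'.
    """
    found = {}
    for url, params in re.findall(r"<([^>]*)>([^<]*)", header_value):
        match = re.search(r'\brel\s*=\s*"?([^";,]+)"?', params)
        if match:
            for rel in match.group(1).split():
                found.setdefault(rel.lower(), []).append(url)
    return found


class LinkCollector(HTMLParser):
    """Collects href values of <link> elements, keyed by rel."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        for rel in (attributes.get("rel") or "").lower().split():
            self.links.setdefault(rel, []).append(href)


def atom_self(xml_text):
    """Return the href of the feed-level rel=self link, or None."""
    root = ET.fromstring(xml_text)
    for link in root.findall(ATOM_NS + "link"):
        if link.get("rel") == "self":
            return link.get("href")
    return None


def discover_topic(link_header, content_type, body):
    """Apply the three suggested rules in order; return a URL or None."""
    # Rule 1: in the Link header, only rel=self counts.
    header_rels = rels_from_link_header(link_header or "")
    if "self" in header_rels:
        return header_rels["self"][0]
    # Rule 2: in Atom content, look for rel=self.
    if "atom" in content_type:
        return atom_self(body)
    # Rule 3: in HTML content, accept rel=self or rel=canonical.
    # Preferring rel=self here is an assumption of this sketch.
    if "html" in content_type:
        collector = LinkCollector()
        collector.feed(body)
        for rel in ("self", "canonical"):
            if rel in collector.links:
                return collector.links[rel][0]
    return None
```

With these rules, a rel=canonical that appears only in the Link header is ignored, while the same relation inside an HTML document is accepted as the topic URL.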

I think there is some confusion here: self does not mean canonical. This is indeed confusing, but I'll add a note to the spec to clarify it.

Not having a "self-pointing" link exposes us to silent failure: the subscriber subscribes to a URL that is never actually pinged to the hub. This is frequent with "silent" query strings.

Another example of why this is important is for URLs with redirects such as the 'today' URL for the IRC log in this very group. (@aaronpk can say it better!)

An example of when rel=canonical wouldn't work is the IRC logs for #indieweb and #social. The URL we tell people to bookmark is https://chat.indieweb.org/today, however that URL will always redirect to the current day's permalink, such as https://chat.indieweb.org/2016-12-06. That day's page would have a rel=canonical of itself, https://chat.indieweb.org/2016-12-06, but a subscriber would need to use a topic URL of https://chat.indieweb.org/today in order to receive updates.

The rel=self provides a way to advertise the topic URL to use, which may be different from the canonical URL. It probably would have been better to call it rel=topic, but I believe the term came from Atom's use of rel=self.
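
A small Python sketch of the difference, using a hard-coded table in place of real HTTP requests. The link values in the table, including the idea that the day's page advertises rel=self pointing at /today and the `hub.example` hub URL, are made up to match the scenario described above. How the real site advertises its links is not shown here.

```python
# Hypothetical stand-in for HTTP responses; no network access is performed.
PAGES = {
    "https://chat.indieweb.org/today": {
        "redirect": "https://chat.indieweb.org/2016-12-06",
    },
    "https://chat.indieweb.org/2016-12-06": {
        "links": {
            "hub": "https://hub.example/",  # assumed hub URL
            "self": "https://chat.indieweb.org/today",  # assumed topic URL
            "canonical": "https://chat.indieweb.org/2016-12-06",
        },
    },
}


def fetch(url, max_redirects=5):
    """Follow redirects in the stub table; return (final_url, links)."""
    for _ in range(max_redirects):
        page = PAGES[url]
        if "redirect" not in page:
            return url, page["links"]
        url = page["redirect"]
    raise RuntimeError("too many redirects")


def topic_from_rel(url, rel):
    """Return the topic URL a subscriber would pick if it trusted `rel`."""
    _, links = fetch(url)
    return links[rel]


bookmarked = "https://chat.indieweb.org/today"
topic_via_self = topic_from_rel(bookmarked, "self")
topic_via_canonical = topic_from_rel(bookmarked, "canonical")
```

Under these assumptions, `topic_via_self` is the /today URL, which keeps receiving updates, while `topic_via_canonical` is the 2016-12-06 permalink, which would stop changing once that day is over.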

👍 to rel=topic. If aggregate resources share a hub, then rel=self would not follow the established semantics. For example, imagine subscribing to all of Wikipedia rather than to each individual page: a rel=self from https://en.wikipedia.org/wiki/PubSubHubbub to https://en.wikipedia.org/websub/all would not be correct (I believe).

Just to be clear, I wasn't actually suggesting changing it to rel=topic. This is not a new spec and we would much rather not break every existing implementation just for aesthetic reasons.

The example you provided sounds completely fabricated to me. I don't think anyone on the https://en.wikipedia.org/wiki/PubSubHubbub page would expect to be able to subscribe to updates for all Wikipedia articles by just clicking a button on that page. Instead, they would navigate to the home page (or some other feed page) and subscribe there, and that page would have the appropriate rel=self link.

That example is fabricated of course, but the situation is one that we're facing today in several environments. The usage is machine to machine, rather than a human clicking a button.

As an example, the Getty Museum has a collection of some 100k objects. Each description changes VERY rarely, but the changes are typically also VERY important to propagate as they reflect significant changes in state. It would be ridiculous to require systems to subscribe to each object individually. So from the description of the object, we would want to have systems subscribe to the general hub for all objects' changes. If the required pattern is to have an intermediary resource to which each description refers (the "navigate to the home page and subscribe" approach), then what is the link rel for that interaction so that machines can perform it?

The same occurs in the IIIF community. Changes to a particular image are also very rare, but there are millions of them at each organization. Or in scholarly communication -- aggregating preprint journal articles at a subject level.

Existing proposed uses in those two environments:

If you're saying that we shouldn't use websub for those use cases, that would indeed be good to know!

Thanks for clarifying, that use case makes sense now. It does seem to be something different from the scope of WebSub which is "subscribe to updates of this resource". It sounds kind of like the "RecentChanges" feed in MediaWiki, which is linked from every page. I think a standard way of finding that master feed would be a useful thing, and then WebSub would be used to subscribe to changes of that feed.

I don't think we made any new progress on this issue so I suggest closing it.

I agree. I believe the example in use I mentioned in #68 (comment) illustrates the need for the separate rel value.