w3c / websub

WebSub Spec in Social Web Working Group

Home Page:https://w3c.github.io/websub/

Why is rel=self mandatory?

sandhawke opened this issue · comments

I was thinking it'd be nice to provide WebSub service for w3.org, so anyone can get notified when any resource changes (e.g. a new WD). For the static parts of the site (not the wikis, etc.), this seems pretty straightforward to do en masse. Except for rel=self.

Why not have the spec say: if there's no rel=self link, use the Content-Location header, and if there isn't one of those either, just use the URL you fetched from (after any redirects)? I think with that, I could just deploy a site-meta file, connect the content-mirroring system to a hub, and we'd have it.
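
The proposed fallback order can be sketched as a small helper (the function and parameter names here are hypothetical illustrations, not spec terminology):

```python
# Sketch of the proposed fallback order for picking a topic URL.
# All names are hypothetical; this is not part of the WebSub spec.

def resolve_topic(rel_self=None, content_location=None, fetched_url=None):
    """Pick the topic URL a subscriber would use under the proposal:
    rel=self if present, else the Content-Location header, else the
    URL actually fetched (after following any redirects)."""
    if rel_self:
        return rel_self          # explicit rel=self still wins, as today
    if content_location:
        return content_location  # proposed first fallback
    return fetched_url           # final fallback: post-redirect request URL
```

Under this scheme, a publisher serving plain static files with no special headers would still have a well-defined topic URL: the one the subscriber fetched.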

I realize this isn't one of the original use cases, and I don't actually know why it'd be useful, but ... somehow I think it would turn out to be, so I'm a little bummed. Anyone know what problem it solves to have rel=self mandatory? (And what kind of mandatory it is. It says publishers MUST provide it. It doesn't say subscribers MUST stop if they don't find it. That should perhaps be clarified.)

I have a vague recollection of similarly requesting explicit fallback in the case of no rel=self, and a vague recollection of @julien51 explaining a good reason (avoiding longterm delayed silent failure?) to require rel=self of publishers. I don't remember if I opened an issue or not, or if it was just discussed in a f2f. I do remember Julien convinced me at the time. Hoping Julien remembers.

The need comes from the fact that the subscriber may never know that it is not subscribed to the URL the publisher uses to ping.
In practice, this happened over and over with the TechCrunch RSS feed.
For the longest time, TechCrunch advertised this URL http://feeds.feedburner.com/TechCrunch/
as their feed (it's still the one linked on https://techcrunch.com/rssfeeds/).

Now if you just "blindly" subscribed to that very URL, you would never receive any notification, because the publisher pings http://feeds.feedburner.com/Techcrunch/ to the Google Hub. (Notice how the second c is lowercase.) The subscriber would have no way of knowing that it's not using the right URL, and this is a silent failure.

This also happens with URLs with or without a trailing slash... etc.

Sandro, what type of resource are you looking at for your w3.org example? An RSS/Atom feed? an HTML page?

Why is the failure silent? Why can't the hub say it doesn't handle a resource by that name?

For w3.org, I'm talking about simply every static resource: html, css, js, turtle, xml, json, jpg, png, txt, etc.

I was on mobile last night. The hub does not know, either, what URL it will receive pings for until it actually receives one.
As for your specific use case Sandro, I see really no way around adding the header...

As issue#100 points out, communication between the publisher and the hub is out-of-scope for this spec. So the hub certainly COULD know what URLs should be subscribable, if that protocol tells it. (For instance, it could look in a database table for an entry manually set by a user, or it could dereference the URL and see there is a correct hub link and no redirect or disagreeing self link. But that's a matter for the hub and publisher to work out.)

In other words, the techcrunch problem could have been solved privately, without making life harder for everyone else.

TechCrunch is just one of many examples. The trailing slash is another, as are URLs with query params used for tracking purposes (utm tags...), etc.

For reference: rel=self has already been discussed to some degree in #69 (which pointed to pubsubhubbub/PubSubHubbub#36) and in #68 – lots of examples and discussion in those issues are still valid, and they ended with the conclusion of keeping rel=self.

Also: Removing the requirement of rel=self would be a breaking change for WebSub/Pubsubhubbub clients, as they right now should be expecting nothing but a rel=self link, and I thought we were trying to avoid breaking changes for now?

@voxpelli Thanks for the links. Too much to remember. Re-reading those, I remain unconvinced. At the time, I was convinced by @julien51's silent-failure concern. But now, having thought about #100, I don't see how silent-failure is really a thing. (as below)

@julien51 It seems to me all of those cases can be addressed by having the hub reject subscriptions to unknown topics, giving an error that perhaps the publisher needs to use rel=self. I'm fine with rel=self being used by sites that need it. My complaint is with making folks who don't play these URL games have to provide trivial, obvious rel=self on every page. That feels wrong, and (in my use case) might well be the barrier that stops adoption (as I think through the political hurdles in my organization).

Re breaking changes, yeah, that's why I phrased the issue title as a motivation question. My issue, as phrased, could be addressed by summarizing this argument into a sentence or two in the spec. Then, the next time someone brings it up, we'll know they didn't even read the spec :-)

Slightly more seriously, often in cases like this, there's a clever way to retain backward compatibility while still allowing forward progress, so it can be worth some brainstorming. I haven't been able to think of one here, yet, though. Also, it'll be interesting to see how many existing subscriber implementations actually do everything required by the test suite (eg host-meta, although that's one we might just take out). We should probably make an effort to find every PuSH impl we can, and test them or get the developers to. I guess that would be a separate issue....

I can see how a hub may be able to more or less infer the rel=self from the source URL, but I'm not totally convinced. Requiring the rel=self, if nothing else, makes it very simple for a client to know what it should be subscribing to, with little room for error.


Also – one thing that hasn't been brought up here yet is section 7. on Content Distribution which says:

[...] It MUST also include one Link Header [RFC5988] with rel=self set to the canonical URL of the topic being updated.

With no rel=self (and not using the rel=canonical either, as eg. #69 suggested) – how should the hub identify what topic an update belongs to?


Also – in regards to the use case of making it possible to subscribe to real-time updates for pretty much anything on a site – CSS, JS, images, etc. Aren't static assets like CSS, JS, and images often immutable, given new URLs whenever their content changes, to ensure that they can be cached for a very long time (even forever nowadays)? So in regards to enabling WebSub subscriptions to such resources: subscribing to updates of something immutable doesn't make much sense, and whenever the content isn't immutable, maybe adding some headers is pretty simple? (As it will likely be hosted by something more flexible than a CDN then.)


Lastly – if one uses something like host-meta to add one's rel=hub (not really sure if that's the site-meta that you refer to, or if you're referring to something else), then it has already solved adding rel=self for the individual resources through its use of link templates. One simply does:

<Link rel='self'  template='http://example.com/self?topic={uri}' />
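
For illustration, expanding such a template means substituting the percent-encoded URI of the resource for the {uri} placeholder (as I read RFC 6415's link-template rules); a minimal sketch, with a hypothetical helper name:

```python
from urllib.parse import quote

def expand_link_template(template, resource_uri):
    """Expand a host-meta link template (RFC 6415 style) by replacing the
    {uri} placeholder with the percent-encoded URI of the resource.
    Hypothetical helper for illustration only."""
    return template.replace("{uri}", quote(resource_uri, safe=""))
```

So for the template above, a resource at http://example.com/page would get the self link http://example.com/self?topic=http%3A%2F%2Fexample.com%2Fpage.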

(point by point, same sections)

Agreed, requiring rel=self is likely to reduce some errors, because it adds redundancy. The question (in the abstract, if we didn't have to worry about compatibility) is whether it's worth the cost.


My proposal would be that rel=self could be omitted by publishers if it's just going to be the same as the Content-Location, or the page's only URL. So there always is a topic URL defined, and it would still be used in Content Distribution; it just wouldn't need to be stated explicitly by the publisher when it's obvious.


Often vs Always. There are lots of pages that probably don't make sense to subscribe to. But it seems to me it's likely to enable some interesting unimagined apps if we deploy this technology more broadly than just the places we're sure it's useful.

(Of course, in this arena, I'd really want wildcard subscriptions. That would allow mirroring an entire website. But that's a straightforward extension, I think.)


On "site-meta", yeah, I meant "host-meta". Specifically, I was referring to this part of the spec (Section 4):

Finally, publishers MAY also use the Host-Meta Well-Known URI [RFC6415] /.well-known/host-meta to include the <Link> element with rel="hub". However, please note that this mechanism is currently At Risk and may be deprecated.

It'd be awesome if the mechanism you suggest worked, it'd solve my problem nicely, but since this is done client-side, and the WebSub spec only says to use host-meta for rel=hub, not rel=self, I think I'm out of luck.

@sandhawke wrote:

often in cases like this, there's a clever way to retain backward compatibility while still allowing forward progress

In this specific case, we're talking about backwards compatibility for subscribers attempting to arrange notifications about some topic using a pre-WebSub PSHB implementation, is that right? I'll proceed on this assumption.

Assume also that we do change WebSub to allow omission of rel=self.

In that case, a PSHB subscriber is not going to be able to subscribe to certain resources that a WebSub subscriber will be able to subscribe to. But is this different from the situation today? It seems to me that in this case, no functionality for PSHB subscribers is lost. If they wish to subscribe to resources lacking rel=self, they can switch to a WebSub implementation, or petition the resource owner to add rel=self.

Perhaps all that's needed, after downgrading the MUST to a SHOULD for rel=self, is a compatibility note.

@julien51 It seems to me all of those cases can be addressed by having the hub reject subscriptions to unknown topics, giving an error that perhaps the publisher needs to use rel=self. I'm fine with rel=self being used by sites that need it. My complaint is with making folks who don't play these URL games have to provide trivial, obvious rel=self on every page. That feels wrong, and (in my use case) might well be the barrier that stops adoption (as I think through the political hurdles in my organization).

You don't answer my main question: how does the hub know what an "unknown topic" is? Don't forget that a hub often does not know of the publishers who may be using it...

Perhaps this is irrelevant, but what's the business model for a Hub providing service for unknown publishers? For what traffic levels is that sustainable? I'd think at higher traffic levels, the publisher would need to be paying the hub.

Is there a reason a Hub wouldn't at least ask the publisher to sign up for a free account, so they can contact the publisher about problems, give them access to subscriber stats, etc?

Anyway, answering your question, for the case where the Hub has never heard of this publisher but still wants to act as a Hub for it....

Right now, the spec kind of glosses over this, saying in 5.2:

Subscriptions MAY be validated by the Hubs who may require more details to accept or refuse a subscription. The Hub MAY also check with the publisher whether the subscription should be accepted.

My proposal is a way that Hubs can do that:

  1. http GET topic-url
  2. if http response code != 200 (eg 301, 404), reject
  3. if there isn't at least one rel=hub pointing at you, look in host-meta, too. If you still don't find one, reject
  4. if there's a rel=self with a value that's NOT topic-url, reject
  5. accept
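
The five steps above could look like this in code (a sketch with hypothetical names; it takes the already-fetched response details as arguments rather than performing the HTTP GET itself):

```python
# Sketch of the proposed hub-side validation. All names are hypothetical.
# hub_links should already include any rel=hub values found via host-meta.

def validate_subscription(topic_url, status_code, hub_links, rel_self, own_hub_url):
    """Decide whether a hub should accept a subscription to topic_url,
    following the five proposed steps. Returns (accepted, reason)."""
    if status_code != 200:                              # step 2: redirect or error
        return False, "topic URL did not return 200"
    if own_hub_url not in hub_links:                    # step 3: no rel=hub to us
        return False, "no rel=hub pointing at this hub"
    if rel_self is not None and rel_self != topic_url:  # step 4: disagreeing self
        return False, "rel=self disagrees with the topic URL"
    return True, "accepted"                             # step 5
```

The point of returning a reason string is exactly the anti-silent-failure argument: a rejection like "rel=self disagrees with the topic URL" tells the subscriber what went wrong at subscription time instead of months later.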

If hubs do this, and the rejection messages are clear enough, I think the silent failure problem goes away.

Of course, I'd do proper caching on the HTTP GET, and logically cache the result of this algorithm as well, remembering topic-url as a valid topic for as long as I can cache the HTTP result.

This is probably a good algorithm to go through, even if the publisher is a registered customer of the Hub.

Okay, what did I miss? :-)

@tonyg my hope is that every major PuSH implementation (subscriber, publisher, and hub) is already a conformant WebSub implementation (which means I was misusing the term backward-compatible, I think). We're probably not quite there. If it turns out we're nowhere near there, there's more room for this kind of stuff. In practice, though, given our extremely limited timeline, we don't really have a chance to add something normative to WebSub.

Still, we could add a non-normative suggestion like I gave above, and if it turns out hubs widely implement it, then maybe subscriber tools will decide things are fine without rel=self and they can subscribe to more pages. After a while of that, sites might feel they can drop the rel=self.

@sandhawke What I was trying to get at is that it seems harmless for WebSub to permit additional behaviours beyond those of PSHB; for WebSub to be a superset of PSHB; for a WebSub subscriber to be able to subscribe to a superset of the resources that a PSHB subscriber can.

@tonyg The problem is that even if we did that with WebSub subscribers, WebSub publishers would still be required to provide rel=self, or else PuSH subscribers wouldn't work with WebSub. Right?

Perhaps this is irrelevant, but what's the business model for a Hub providing service for unknown publishers? For what traffic levels is that sustainable? I'd think at higher traffic levels, the publisher would need to be paying the hub.

Google and Superfeedr provided community hubs. We had different motivations. Google, for example, did this because they are interested in learning about what sites changed; this is a good way to achieve that.
For Superfeedr, this was an effort to raise awareness around our product and services...

Is there a reason a Hub wouldn't at least ask the publisher to sign up for a free account, so they can contact the publisher about problems, give them access to subscriber stats, etc?

Yes: simplicity and convenience. You don't want to ask every random blogger to create an account. Superfeedr's and Google's hubs are integrated by default in the PubSubHubbub WP plugin, for example.

Your proposal does not solve the problem at all for utm tags, for example. You can't ask all publishers willing to support the protocol to redirect to a "canonical" version of their URL when a random query string has been added to it. As for the trailing-slash problem, or the www prefix, or https vs. http URLs: if these were solvable problems, they would have been solved a long time ago by search engines and crawlers... Their solution is to use rel=canonical, which serves a very similar purpose (and, if we were starting from scratch, I'd have picked canonical instead of self... but we have to deal with a very large existing set of implementations).

Also, how does the hub perform the step "if there isn't at least one rel=hub pointing at you, look in host-meta, too; if you still don't find one, reject"?

Well, this thread blew up fast.

Based on how I'm reading https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.14, you should only be providing Content-Location when the content is available from a non-conneg URL.
Indeed, the line "it is only a statement of the location of the resource corresponding to this particular entity at the time of the request" makes it sound like, as the contents of the page change, that Content-Location URL can change.

Also, Content-Location can be a relative URL which would be problematic if you return the same content with and without www.

@sandhawke As a maintainer of a hub very similar to a WebSub hub, I'm against the mentioned verification part of a subscription. Not because a subscription verification doesn't make sense at some point, but because it has to be a synchronous verification to have an impact.

Doing requests synchronously is challenging. My hub is very strict with rate limiting the amount of outgoing requests to any specific host and thus moves the validation or fetching of any incoming requests to a queue system where it will be eventually handled somehow somewhere.

Employing rate limiting, a queue system, and asynchronous fetching of incoming requests should be best practice for any service that's instructed to fetch a third-party resource (to mitigate it being used for DoS attacks, etc.), and that makes it impossible to require such services to respond to a subscribe request with a 4xx code that depends on any fetched content.

(In practice, apparently, some people sometimes decide to ping all URLs of their site; any queue then becomes very, very large, and the response times of any synchronous call become hours long, which of course is totally unfeasible.)

@tonyg My impression is that WebSub in its current form is meant to strictly replace the current version of PubSubHubbub, not be an evolution of that version. So the two should not be living side by side, but rather be considered to be the same thing. I think that is critical to get adoption, that people should not have to move, but rather automatically become valid WebSub.

@julien51 we can probably let this thread go -- I'm still not seeing any way to make this change in practice, given our compatibility requirement, so, oh well. Closing it, commenter satisfied.

That said, fyi, my reply:

Good points about account creation, esp w/ WP plugin, thanks.

For utm tags, just use rel=self. Again, I'm not trying to take that away. I'm just looking for a way that folks can skip it if they're not doing anything too complicated with URLs, motivated by my own trying to deploy it.

For checking rel=hub, I'm not seeing what's problematic. I'm suggesting the Hub basically do Discovery on the topic-url and look to see if one of the hubs it's found is itself. Presumably it knows the URL people are using for it as a Hub. It might want to allow for some variants like http(s)?://(www)?.example.org(/)? but that's a local issue, based on how it's been telling people to link to it as a hub. This step isn't really needed, but seems like another source for possible errors that's easy to catch at this point.

On / and www and http/s, I don't think the problem is the same as with search engines. At least on w3.org, all three of those are handled with redirects, so they'll already be 'canonicalized' before the subscriber is getting the URL to send to the Hub. And the non-canonical versions will all be rejected by the Hub, if the subscriber doesn't follow redirects, if it uses my algorithm. (The subscriber has to follow redirects to find the rel=hub, so it would be a simple bug -- not an arguable simplification -- if they used the pre-redirect URL instead of the post-redirect URL as the topic.) (Search engines, on the other hand, are looking at numerous inbound links and trying to figure out if the person creating those links was intending to refer to the same thing.)

@dissolve I'm not seeing the disconnect. Indeed, Content-Location is only used during conneg, when the content is available at another URL. In WebSub as it is, if you want to do conneg, you have to make the content available at another URL and give that URL as the value of rel=self. And that's how every static page on w3.org works (the Content-Location way).

@voxpelli Yeah, I wondered about that. I'm infatuated with fast dereferencing, and really liked longpoll before websockets came along, so my thinking was the subscriber can wait for a while during validation, and if the validation is taking too long, then they probably shouldn't be subscribing. But I know that shows my async coding style and is very hard to do on some platforms.

Still, this only comes up if you're writing a specialized standalone hub, not a hub that's embedded in the server with its content, so I'd think you'd be using a fast async platform, expecting to handle >10K open TCP connections, etc.

@sandhawke that sounds like a valid use of Content-Location, but not with the same meaning. As I read it, if you have a Content-Location URL, it may or may not be available at a later date, and it may or may not change when the requested resource changes.
rel=self means "this is the current URL, and it will be updated when the content is updated"; the hub will also know that this is the canonical URL for this content.

The other bit that is a pretty big distinction: based on that part of the RFC, it sounds like if you request from the URL returned by the Content-Location header, it should NOT be returning another Content-Location header, whereas if you request the content from the rel=self URL, you MUST still include a rel=self.

@dissolve Agreed, the Content-Location spec is a little problematic about that. I'd be surprised if that were a problem in practice, the way people actually use it, but if it is on your site, you can just use rel=self.

On Content-Location being sent only during conneg, and not when you fetch the Content-Location address: that's fine for the algorithm I proposed for leaving out rel=self, because if there's no rel=self or Content-Location, it just falls through to the URL you're fetching.

@sandhawke Falling through to the URL you're fetching, you are almost guaranteed to have issues. Especially if you are coming from a search engine which can add utm_source=google or something like that. You also then burden clients or hubs with having to worry about stripping out fragments.

@dissolve Of course you have to strip the fragment, but that's a natural part of HTTP, not a burden, and it will strip anything someone might add there. If search engines put utm codes in the non-fragment part of a URL, they'd often wind up with 404s.
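
Stripping the fragment really is routine; for example, with Python's standard library:

```python
from urllib.parse import urldefrag

# The fragment is client-side only and is never sent in an HTTP request,
# so dropping it before subscribing is ordinary URL handling, not a burden.
url, fragment = urldefrag("https://example.org/page#section-2")
# url is now "https://example.org/page", fragment is "section-2"
```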

That's how all of Google Analytics works, and Google AdWords, as well as other advertising platforms. Search doesn't do it, but just ignoring that is pretty silly. I see WebSub being especially useful for sites that publish news, which are all pretty well used to advertising. I'd say the default on most sites is to ignore unknown query params, so, no, I don't think they would hit 404s too often.

But basically you are advocating replacing a fixed value that will always work with:
check for rel=self; if there isn't one, check for Content-Location, which is probably okay in practice; and then just use the URL, under the assumption that any query values were probably not added by something else.
And hopefully they don't start using any sort of analytics at a later date, or they will break any new subscribers coming in from those channels.

It's replacing certainty with a lot of "probably okay in most situations". Seems like a terrible idea for a spec to me.

I don't agree, @dissolve, but at this point it's probably not worth more time from either of us.