Twitter: aborts if media download yields "403 Forbidden", e.g. removed by copyright claim
joonas-fi opened this issue
Here's the Tweet: https://twitter.com/janl/status/1113015555064201216
Error message:
2019/11/30 18:04:02 [ERROR][twitter/joonas_fi] Getting latest: getting items from service: processing tweet from API: processing tweet 1113180316510957568: making item from tweet that this tweet (1113180316510957568) is in reply to (1113015555064201216): making item from tweet that this tweet (1113015555064201216) embeds (1112473455650172929): media resource returned HTTP status 403 Forbidden: https://pbs.twimg.com/ext_tw_video_thumb/1112471832232259585/pu/img/ywWGTl09hsnLnMOY.jpg
When opened in a browser, that image URL redirects to a DMCA warning (the behavior may be different for API requests).
Timeliner cannot cope with this: every re-run hits the same error and aborts, so it can never get past this tweet.
Ah, oops. Not something I anticipated or encountered. How do you think we should handle this?
I dunno, this is a pickle. The obvious error is not being able to continue after 403. My data retrieval process just aborts.
But what should we do about it? Sure, continue after the error. But personally, I am not a fan of losing any information. In this case the information is:
there once was an attachment, but we didn't manage to fetch it in time because it was later taken down because of a DMCA complaint
I'd prefer this to be stored in the data model. I haven't researched Timeliner's data model, but maybe something like attachment: {id: '987654321', permanentFetchFailureReason: '403 Forbidden - Twitter or the author removed it?'}?
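A minimal Go sketch of what such a record could look like. The `Attachment` type, its field names, and the `describe` helper are hypothetical and are not taken from Timeliner's actual data model; the point is only that the failure reason is stored alongside the attachment ID instead of being dropped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Attachment is a hypothetical record, NOT Timeliner's real data model.
type Attachment struct {
	ID string `json:"id"`
	// PermanentFetchFailureReason stays empty when the media was fetched.
	// When set, it records why we believe the fetch will never succeed.
	PermanentFetchFailureReason string `json:"permanentFetchFailureReason,omitempty"`
}

// describe serializes an attachment to JSON, as it might be stored.
func describe(a Attachment) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := describe(Attachment{
		ID:                          "987654321",
		PermanentFetchFailureReason: "403 Forbidden",
	})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(s)
}
```

With `omitempty`, attachments that were fetched normally carry no extra field, so existing records would not need to change under this assumed layout.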
Things to think about:
- there's a distinction between transient errors and "likely fetch will never work again" (like in this case)
- I guess Timeliner already supports transient errors (since re-running my retrieval always ended up trying to fetch this errored attachment again)
- Does Timeliner mind if, during a "force full refresh", an attachment that was fetched successfully before (so we already have a copy stored) is now unavailable? Obviously it should just shrug and carry on instead of aborting.
- if we're doing a "full refresh" and we have a permanentFetchFailureReason, should we still re-try fetching it? I guess chances are kinda slim that the likes of Twitter restore content. But then again, re-trying probably isn't expensive (if we re-try, maybe we shouldn't pollute the log with errors that we already expect to recur on every re-try attempt).
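The transient vs. "likely will never work again" distinction from the list above could be sketched roughly as follows. The `FetchOutcome` type and `classifyStatus` function are hypothetical names, and the choice of which status codes count as permanent is an assumption for illustration, not something Timeliner or Twitter defines.

```go
package main

import (
	"fmt"
	"net/http"
)

// FetchOutcome is a hypothetical classification of a media fetch result.
type FetchOutcome int

const (
	Fetched         FetchOutcome = iota // media downloaded successfully
	Transient                           // worth retrying on a later run
	LikelyPermanent                     // probably gone for good
)

// classifyStatus maps an HTTP status code to a FetchOutcome.
// Assumption: 403, 404, 410 and 451 mean the content was removed;
// every other non-2xx status is treated as transient so it gets retried.
func classifyStatus(code int) FetchOutcome {
	switch {
	case code >= 200 && code < 300:
		return Fetched
	case code == http.StatusForbidden,
		code == http.StatusNotFound,
		code == http.StatusGone,
		code == http.StatusUnavailableForLegalReasons:
		return LikelyPermanent
	default:
		return Transient
	}
}

func main() {
	for _, code := range []int{200, 403, 503} {
		fmt.Println(code, classifyStatus(code) == LikelyPermanent)
	}
}
```

Defaulting unknown codes to `Transient` errs on the side of retrying, which matches the point that a re-try is cheap; a caller could still lower the log level for items already marked as likely permanent.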
...if we're doing "full refresh" and we have a permanentFetchFailureReason, should we still re-try fetching it?
I concur with your conclusion that rechecking is inexpensive, and so worth trying.
A 403 is usually permanent: it only goes away if something changes on the server.
Perhaps Timeliner should simply continue to the next item after seeing a 403. Log the 403, but continue on, since there's nothing we can do about it. This should probably be the behavior no matter what mode it's running in.
But I also agree that simply trying once or twice more before continuing on wouldn't be a bad idea, in case it was a fluke.
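A rough sketch of that "try a couple of times, log it, move on" behavior. `statusFetcher` and `fetchAll` are hypothetical names, the fetch function is a stand-in for a real download, and none of this is Timeliner's actual code.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// statusFetcher returns the HTTP status code for a media URL.
// It is a hypothetical stand-in for a real download call.
type statusFetcher func(url string) (int, error)

// fetchAll tries each URL up to maxAttempts times. A URL that never
// returns 200 OK is logged and recorded in skipped, and the loop then
// continues with the next item instead of aborting the whole run.
func fetchAll(urls []string, fetch statusFetcher, maxAttempts int) (fetched []string, skipped map[string]string) {
	skipped = make(map[string]string)
	for _, u := range urls {
		ok := false
		reason := ""
		for attempt := 1; attempt <= maxAttempts; attempt++ {
			code, err := fetch(u)
			if err != nil {
				reason = err.Error()
				continue
			}
			if code == http.StatusOK {
				ok = true
				break
			}
			reason = fmt.Sprintf("HTTP %d %s", code, http.StatusText(code))
		}
		if ok {
			fetched = append(fetched, u)
			continue
		}
		log.Printf("[WARN] skipping %s after %d attempt(s): %s", u, maxAttempts, reason)
		skipped[u] = reason
	}
	return fetched, skipped
}

func main() {
	// Fake fetcher: one URL is always forbidden, everything else succeeds.
	fake := func(url string) (int, error) {
		if url == "https://example.com/removed.jpg" {
			return http.StatusForbidden, nil
		}
		return http.StatusOK, nil
	}
	urls := []string{"https://example.com/removed.jpg", "https://example.com/ok.jpg"}
	fetched, skipped := fetchAll(urls, fake, 3)
	fmt.Println("fetched:", fetched)
	fmt.Println("skipped:", skipped)
}
```

The recorded reason in `skipped` is what could later be persisted, so the fact that an attachment once existed is not lost even though the run continues.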