timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways

[FAIL. Media couldn't be retrieved] with some mp4 files

tobozo opened this issue

hey thanks for this great script! 👍

could be a false negative, but I'm getting FAIL error messages on some mp4 files:

 47/2991 media/1336316106835779593-3-ICmLbbI3-lB9nw.mp4: FAIL. Media couldn't be retrieved from 
https://video.twimg.com/ext_tw_video/1336303046565892096/pu/vid/896x720/3-ICmLbbI3-lB9nw.mp4?tag=10 
because of exception: 'content-length'

exception thrown at this line:

byte_size_after = int(res.headers['content-length'])

the content-length header appears to have a valid value though (screenshot from Firefox):

[screenshot of the response headers in Firefox]

be well and thanks for the awesomeness!

Did you run the script completely? On the first run of the downloading part it queues up failed downloads to retry them with a longer delay, and in my experience all the "content-length" errored videos do get downloaded on the second run.
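
Roughly, the retry behavior being described works like the sketch below. This is not the actual code from the repository; the function names, sleep values and number of tries are illustrative.

    import time

    def download_all_with_retries(urls, download_one, max_tries=5, sleep_seconds=2):
        """Try every download once, then re-queue the failures for further
        passes with a longer sleep between requests."""
        remaining = list(urls)
        for tries_left in reversed(range(max_tries)):
            failed = []
            for url in remaining:
                try:
                    download_one(url)
                except Exception as err:
                    print(f"FAIL. Media couldn't be retrieved from {url} because of exception: {err}")
                    failed.append(url)
                time.sleep(sleep_seconds)
            remaining = failed
            if not remaining or tries_left == 0:
                break
            sleep_seconds *= 2  # back off before the next pass
            print(f"Retrying the ones that failed, with a longer sleep. {tries_left} tries remaining.")
        return remaining  # anything left here was never downloaded successfully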

The script ran completely, the fails were evenly spread across the logs, and all the initially failed mp4s ended up with the SKIPPED status on the second pass. I'm not sure what caused this though; maybe Twitter doing some throttling, or a glitchy load-balancing unit?

Retrying the ones that failed, with a longer sleep. 4 tries remaining.

(...)

103 of 103 tested media files are known to be the best-quality available.

Total downloaded: 206.7MB = 0.20GB
Time taken: 3182s

Closing this as it's more feedback than an issue.

It's certainly odd that we get an exception on that line. Worth trying to understand. @press-rouch any ideas?

Twitter's cache servers appear to send a SPDY header to the browser; could that explain the behavioral difference with the Python script?

Huh, weird. I replicated the bug locally, but only the first time I ran it (so it did the retry, matched the content size, and didn't do the download). On subsequent runs it successfully got content-length in the first pass.

I guess we could change that line to a try and then print out the whole set of headers on failure (see the sketch after this list). I hadn't looked at this chunk of code before; I think we could make some improvements:

  • use requests.head to get the size before actually attempting to download it - this would be much faster for the case where it's already correct.
  • not sure what imagesize will do for MP4 files; I think it's always going to skip them. My two videos had the same byte size so it didn't hit that code path though.
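
For the try suggestion above, a minimal sketch of what that could look like, assuming res and url are the response object and media URL from the surrounding download code (the log message is illustrative):

    try:
        byte_size_after = int(res.headers['content-length'])
    except KeyError:
        # content-length missing from the response (e.g. chunked transfer);
        # dump all the headers we did get to help diagnose it
        print(f"No content-length in response for {url}; headers: {dict(res.headers)}")
        byte_size_after = None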

D'oh, scratch the requests.head bit - didn't notice it's using stream=True. I've found what happens for an MP4: it does identify that it hasn't parsed it, but it'll download it regardless of whether the local version is bigger, which seems a bit odd.

It seems that content-length could be missing if it's using Transfer-Encoding: chunked (see this answer). That's deprecated in HTTP/2, but it looks like Python requests still uses HTTP/1.1. My completely unsubstantiated theory is that if a video hasn't been served by Twitter in a while, then it might serve it in chunks, but once it has a warm cache then it can serve the whole thing.
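
If that theory is right, one possible workaround (a sketch only, assuming res is the streamed requests response from the line quoted earlier) is to treat the header as optional and measure the body when it is absent:

    content_length = res.headers.get('content-length')
    if content_length is not None:
        byte_size_after = int(content_length)
    else:
        # Chunked responses carry no content-length, so measure the payload by
        # streaming it. This consumes the response, so in the real script the
        # chunks would need to be written to the output file in this same loop.
        byte_size_after = 0
        for chunk in res.iter_content(chunk_size=8192):
            byte_size_after += len(chunk)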

Interestingly, all 4 of my mp4s failed with a different error message:

401/406 media/[...].mp4: FAIL. Media couldn't be retrieved from https://video.twimg.com/ext_tw_video/[...].mp4?tag=10 because of exception: HTTPSConnectionPool(host='video.twimg.com', port=443): Read timed out. (read timeout=2)

but then they were successfully SKIPPED:

4/  4 media/[...].mp4: SKIPPED. Online version is same byte size, assuming same content. Not downloaded.u/vid/720x1280/[...].mp4?tag=10...