timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways

[FAIL. Media couldn't be retrieved] with some mp4 files

tobozo opened this issue

hey thanks for this great script! 👍

could be a false negative, but I'm getting FAIL error messages on some mp4 files:

 47/2991 media/1336316106835779593-3-ICmLbbI3-lB9nw.mp4: FAIL. Media couldn't be retrieved from 
https://video.twimg.com/ext_tw_video/1336303046565892096/pu/vid/896x720/3-ICmLbbI3-lB9nw.mp4?tag=10 
because of exception: 'content-length'

exception thrown at this line:

byte_size_after = int(res.headers['content-length'])

the content-length header appears to have a valid value though (screenshot from Firefox):

[screenshot of the response headers in Firefox]

be well and thanks for the awesomeness!

Did you run the script completely? On the first run of the downloading part it queues up failed downloads to retry them with a longer delay, and in my experience all the "content-length" errored videos do get downloaded on the second run.
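
Roughly, the retry behavior being described works like the sketch below. This is not the actual code from the repository; the function names, sleep values and number of tries are illustrative.

    import time

    def download_all_with_retries(urls, download_one, max_tries=5, sleep_seconds=2):
        """Try every download once, then re-queue the failures for further
        passes with a longer sleep between requests."""
        remaining = list(urls)
        for tries_left in reversed(range(max_tries)):
            failed = []
            for url in remaining:
                try:
                    download_one(url)
                except Exception as err:
                    print(f"FAIL. Media couldn't be retrieved from {url} because of exception: {err}")
                    failed.append(url)
                time.sleep(sleep_seconds)
            remaining = failed
            if not remaining or tries_left == 0:
                break
            sleep_seconds *= 2  # back off before the next pass
            print(f"Retrying the ones that failed, with a longer sleep. {tries_left} tries remaining.")
        return remaining  # anything left here was never downloaded successfully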

The script ran completely, the fails were evenly spread across the logs, and all the initially failed mp4s ended up with the SKIPPED status on the second pass. I'm not sure what caused this though; maybe Twitter doing some throttling, or a glitchy load-balancing unit?

Retrying the ones that failed, with a longer sleep. 4 tries remaining.

(...)

103 of 103 tested media files are known to be the best-quality available.

Total downloaded: 206.7MB = 0.20GB
Time taken: 3182s

Closing this as it's more feedback than an issue.

It's certainly odd that we get an exception on that line. Worth trying to understand. @press-rouch any ideas?

Twitter's cache servers appear to send a SPDY header to the browser; could that explain the behavioral difference with the Python script?

Huh, weird. I replicated the bug locally, but only the first time I ran it (so it did the retry, matched the content size, and didn't do the download). On subsequent runs it successfully got content-length in the first pass.

I guess we could change that line to a try and then print out the whole set of headers on failure (see the sketch after this list). I hadn't looked at this chunk of code before; I think we could make some improvements:

  • use requests.head to get the size before actually attempting to download it - this would be much faster for the case where it's already correct.
  • not sure what imagesize will do for MP4 files; I think it's always going to skip them. My two videos had the same byte size so it didn't hit that code path though.
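
For the try suggestion above, a minimal sketch of what that could look like, assuming res and url are the response object and media URL from the surrounding download code (the log message is illustrative):

    try:
        byte_size_after = int(res.headers['content-length'])
    except KeyError:
        # content-length missing from the response (e.g. chunked transfer);
        # dump all the headers we did get to help diagnose it
        print(f"No content-length in response for {url}; headers: {dict(res.headers)}")
        byte_size_after = None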

D'oh, scratch the requests.head bit - didn't notice it's using stream=True. I've found what happens for an MP4: it does identify that it hasn't parsed it, but it'll download it regardless of whether the local version is bigger, which seems a bit odd.

It seems that content-length could be missing if it's using Transfer-Encoding: chunked (see this answer). That's deprecated in HTTP/2, but it looks like Python requests still uses HTTP/1.1. My completely unsubstantiated theory is that if a video hasn't been served by Twitter in a while, then it might serve it in chunks, but once it has a warm cache then it can serve the whole thing.
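
If that theory is right, one possible workaround (a sketch only, assuming res is the streamed requests response from the line quoted earlier) is to treat the header as optional and measure the body when it is absent:

    content_length = res.headers.get('content-length')
    if content_length is not None:
        byte_size_after = int(content_length)
    else:
        # Chunked responses carry no content-length, so measure the payload by
        # streaming it. This consumes the response, so in the real script the
        # chunks would need to be written to the output file in this same loop.
        byte_size_after = 0
        for chunk in res.iter_content(chunk_size=8192):
            byte_size_after += len(chunk)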

Interestingly, all 4 of my mp4s failed with a different error message:

401/406 media/[...].mp4: FAIL. Media couldn't be retrieved from https://video.twimg.com/ext_tw_video/[...].mp4?tag=10 because of exception: HTTPSConnectionPool(host='video.twimg.com', port=443): Read timed out. (read timeout=2)

but then they were successfully SKIPPED:

4/  4 media/[...].mp4: SKIPPED. Online version is same byte size, assuming same content. Not downloaded.u/vid/720x1280/[...].mp4?tag=10...