Support scraping Twitter posts via the Wayback Machine

Question

upintheairsheep opened this issue 10 months ago · comments

Describe the feature

Add a new scraper, twitter-user-archive or x-user-archive, that scrapes tweets from the Wayback Machine instead of from Twitter itself. It would start from the Wayback Machine's capture listings, like https://web.archive.org/web/*/https://twitter.com/ElonMusk/*, and attempt to collect all of the user's posts.
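
For illustration only, here is a minimal sketch of how such a scraper might enumerate a user's archived tweet URLs. It assumes the Wayback Machine CDX server endpoint at https://web.archive.org/cdx/search/cdx and its url, output, fl, filter, collapse and limit query parameters. The function names are hypothetical and not part of snscrape. The sketch only builds the query URL and parses a JSON response; it leaves out the HTTP request itself and the parsing of the archived pages.

```python
import json
import re
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

# Matches status URLs such as https://twitter.com/jack/status/20
STATUS_RE = re.compile(r"twitter\.com/([^/?#]+)/status(?:es)?/(\d+)", re.IGNORECASE)


def build_cdx_query(username, limit=None):
    """Build a CDX query URL listing captures of a user's status pages."""
    params = {
        "url": f"twitter.com/{username}/status/*",  # trailing * = prefix match
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",  # keep successful captures only
        "collapse": "urlkey",        # one row per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(text):
    """Turn a CDX JSON response (header row, then data rows) into capture dicts."""
    rows = json.loads(text) if text.strip() else []
    if not rows:
        return []
    header, *records = rows
    captures = []
    for record in records:
        row = dict(zip(header, record))
        match = STATUS_RE.search(row["original"])
        if match is None:
            continue  # not a single-tweet URL
        captures.append({
            "tweet_id": int(match.group(2)),
            "timestamp": row["timestamp"],
            "snapshot_url": f"https://web.archive.org/web/{row['timestamp']}/{row['original']}",
        })
    return captures
```

A caller would fetch build_cdx_query("ElonMusk") with any HTTP client and pass the response body to parse_cdx_json. Each snapshot_url would then still need a parser for whichever page layout that capture used.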

Would this fix a problem you're experiencing? If so, specify.

This would bypass the API autocracy that Elon has made, by using archived tweets, and also be a way to recover deleted tweets and banned accounts for further viewing.

God save Twitter for None else can.

Did you consider other alternatives?

No response

Additional context

No response

Demetris Paschalides · Answer 1 · Wed Aug 02 2023 22:26:22 GMT+0800 (China Standard Time)

The issues with this solution are:

1: It is going to be extremely slow.
2: The vast majority of Twitter profiles have never been captured by the Wayback Machine (WBM), so we will not be able to get any data for them.

However, it is a possible solution to the situation.

upintheairsheep · Answer 2 · Wed Aug 02 2023 23:15:56 GMT+0800 (China Standard Time)

> The issues with this solution are:
>
> 1: It is going to be extremely slow.
> 2: The vast majority of Twitter profiles have never been captured by the Wayback Machine (WBM), so we will not be able to get any data for them.
>
> However, it is a possible solution to the situation.

1: Better than nothing.
2: Yes, I know this is the case; however, it can scrape whatever Twitter profile is used. This should not be a replacement for the regular Twitter scraper, but rather a way to get the posts of deleted or banned accounts, as well as deleted tweets.

JustAnotherArchivist · Answer 3 · Thu Aug 03 2023 04:02:09 GMT+0800 (China Standard Time)

The additional complexity of supporting every past version of Twitter's web layout (rather than just the single current one) is not something I consider an adequate use of developer time, especially given the spotty coverage.

upintheairsheep · Answer 4 · Fri Aug 04 2023 10:56:30 GMT+0800 (China Standard Time)

> The additional complexity of supporting every past version of Twitter's web layout (rather than just the single current one) is not something I consider an adequate use of developer time, especially given the spotty coverage.

I'd say to just support the two or three most recent versions, as the desire to archive Twitter only really gained momentum after Elon took over. Luckily for us, Twitter's web layout remained largely unchanged from about 2016 to 2022, and some captures use the mobile layout, which has not changed either. See http://web.archive.org/web/2/https://www.twitter.com/jack/status/20 as an example.
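
As a side note on the example link above, Wayback Machine replay URLs appear to follow the pattern web.archive.org/web/TIMESTAMP/ORIGINAL_URL, where a partial timestamp such as "2" is resolved to a nearby capture. A minimal sketch of building and splitting such URLs follows. The helper names are hypothetical, and the id_ modifier (which, as far as I know, requests the capture without the Wayback Machine's rewriting) is an assumption to verify before relying on it.

```python
import re

WAYBACK_PREFIX = "https://web.archive.org/web"

# Timestamp (1-14 digits), optional two-letter modifier such as "id_", then the original URL.
REPLAY_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{1,14})([a-z]{2}_)?/(.+)$")


def replay_url(original_url, timestamp="2", raw=False):
    """Build a replay URL for a capture of original_url near the given timestamp."""
    if not (timestamp.isdigit() and 1 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 1 to 14 digits (YYYYMMDDhhmmss or a prefix)")
    modifier = "id_" if raw else ""  # assumed: id_ asks for the unmodified capture
    return f"{WAYBACK_PREFIX}/{timestamp}{modifier}/{original_url}"


def split_replay_url(url):
    """Return (timestamp, original_url) for a replay URL, or None if it does not match."""
    match = REPLAY_RE.match(url)
    if match is None:
        return None
    return match.group(1), match.group(3)
```

Here replay_url("https://www.twitter.com/jack/status/20") reproduces the link in the comment above, with https in place of http for the archive host.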

JustAnotherArchivist · Answer 5 · Mon Aug 07 2023 01:44:37 GMT+0800 (China Standard Time)

'The site looks the same' doesn't mean there were no changes relevant for a scraper's code. The WBM also contains snapshots using at least four completely different Twitter website designs in just the last few years (the old design, the old simple/mobile design, the current simple design, and the current usual site which generally doesn't work in the WBM).

And you misunderstood me: I don't think supporting even a single additional version is worth the effort. I certainly won't be doing it. I might consider a well-written PR. Otherwise, this should be done outside of snscrape.