JustAnotherArchivist / snscrape

A social networking service scraper in Python

Support scraping Twitter posts via the Wayback Machine

upintheairsheep opened this issue

Describe the feature

Add a new scraper, twitter-user-archive or x-user-archive, that scrapes tweets from the Wayback Machine, based on capture listings like https://web.archive.org/web/*/https://twitter.com/ElonMusk/*, and attempts to collect all of a user's archived posts.
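As a rough illustration of what such a listing step could look like, here is a minimal sketch that queries the Wayback Machine's CDX server rather than the HTML listing page linked above. The function names (`build_cdx_url`, `tweet_ids_from_rows`) are hypothetical and not part of snscrape, and the choice of the CDX endpoint is an assumption about how one might implement this, not something proposed in this issue. It only lists captured status URLs; it does not parse any tweet content.

```python
import re
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine (assumed here as the data source).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

# Matches .../<user>/status/<numeric id> in a captured URL.
STATUS_RE = re.compile(r"twitter\.com/[^/]+/status/(\d+)", re.IGNORECASE)


def build_cdx_url(username, limit=None):
    """Build a CDX query listing captures under twitter.com/<username>/status/."""
    params = {
        "url": f"twitter.com/{username}/status/",
        "matchType": "prefix",        # everything below that path
        "output": "json",             # JSON rows; the first row is a header
        "fl": "timestamp,original",   # only the fields used below
        "filter": "statuscode:200",   # skip redirects and error captures
        "collapse": "urlkey",         # collapse adjacent captures of one URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def tweet_ids_from_rows(rows):
    """Map tweet ID -> earliest capture timestamp from parsed CDX JSON rows."""
    if not rows:
        return {}  # the CDX server returns an empty list when nothing matches
    header, *records = rows
    ts_idx = header.index("timestamp")
    url_idx = header.index("original")
    earliest = {}
    for record in records:
        match = STATUS_RE.search(record[url_idx])
        if match is None:
            continue  # not a status URL
        tweet_id = int(match.group(1))
        timestamp = record[ts_idx]  # 14-digit strings compare chronologically
        if tweet_id not in earliest or timestamp < earliest[tweet_id]:
            earliest[tweet_id] = timestamp
    return earliest
```

The rows would come from fetching the built URL and decoding the JSON response, for example with `json.load(urllib.request.urlopen(build_cdx_url("jack", limit=100)))`. An empty result for a profile simply means no status pages were captured for it.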

Would this fix a problem you're experiencing? If so, specify.

This would bypass the API autocracy that Elon has created by using archived tweets instead, and it would also be a way to recover deleted tweets and the posts of banned accounts for further viewing.

God save Twitter for None else can.

Did you consider other alternatives?

No response

Additional context

No response

The issues with this solution are:

  1. It is going to be extremely slow;
  2. A vast majority of Twitter profiles have never been captured by the Wayback Machine (WBM), so we will not be able to get any data for them.

However, it is a possible solution to the situation.

1: Better than nothing.
2: Yes, I know this is the case; however, it can scrape whatever Twitter profile it is pointed at. This should not be a replacement for the regular Twitter scraper, but rather a way to get the posts of deleted or banned accounts and deleted tweets.

The additional complexity of supporting every past version of Twitter's web layout (rather than just the single current one) is not something I consider an adequate use of developer time, especially given the spotty coverage.

I'd say to just support the two or three most recent versions, as the desire to archive Twitter only really gained momentum after Elon took over. Luckily for us, Twitter's web layout remained largely unchanged from about 2016 to 2022, and some captures use the mobile layout, which has not changed either. See http://web.archive.org/web/2/https://www.twitter.com/jack/status/20 as an example.
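For reference, the example link above follows the WBM's replay URL pattern of `/web/<timestamp>/<original URL>`. A minimal sketch of building such URLs follows; `replay_url` is a hypothetical helper and not part of snscrape. The optional `id_` modifier asks the WBM for the captured response without its own link rewriting and toolbar, which may be easier to parse, though whether that helps for any given Twitter layout is an assumption.

```python
WAYBACK_REPLAY = "https://web.archive.org/web"


def replay_url(timestamp, original_url, raw=False):
    """Build a Wayback Machine replay URL for one capture.

    timestamp: capture time as digits (YYYYMMDDhhmmss); the WBM redirects
               to the closest available capture if there is no exact match.
    raw:       append the "id_" modifier to request the unmodified capture.
    """
    modifier = "id_" if raw else ""
    return f"{WAYBACK_REPLAY}/{timestamp}{modifier}/{original_url}"
```

For instance, `replay_url("20100101000000", "https://twitter.com/jack/status/20", raw=True)` yields a URL containing `20100101000000id_/`. Which capture actually gets served, and which layout it uses, still depends on what the WBM holds for that tweet.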

'The site looks the same' doesn't mean there were no changes relevant for a scraper's code. The WBM also contains snapshots using at least four completely different Twitter website designs in just the last few years (the old design, the old simple/mobile design, the current simple design, and the current usual site which generally doesn't work in the WBM).

And you misunderstood me: I don't think supporting even a single additional version is worth the effort. I certainly won't be doing it. I might consider a well-written PR. Otherwise, this should be done outside of snscrape.