ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Home Page: https://archivebox.io

Feature Request: Twitter Thread Archiver

shimizurei opened this issue

Can something like the Thread Reader App be incorporated into ArchiveBox?

Type

  • Propose a brand new feature

What is the problem that your feature request solves

We can save Twitter threads (NOT individual Twitter posts) as functionally complete articles.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

A nice article pdf like the Thread Reader app.

What hacks or alternative solutions have you tried to solve the problem?

ThreadReader App

How badly do you want this new feature?

  • It's an urgent deal-breaker, I can't live without it
  • It's important to add it in the near-mid term future
  • It would be nice to have eventually

  • I'm willing to contribute dev time / money to fix this issue
  • I like ArchiveBox so far / would recommend it to a friend
  • I've had a lot of difficulty getting ArchiveBox set up

Yeah, I've wanted this for a long time too. The way it's been implemented in other projects is as a content script that unrolls threads before snapshotting inside headless Chrome.

Would it be possible for the archiver to trigger the ThreadReader app to unroll it then archive the ThreadReader result?

Then it ends up depending on ThreadReader. What if ThreadReader becomes defunct tomorrow?

@shimizurei You'd have an archive of the ThreadReader page in your ArchiveBox.

If it's part of ArchiveBox's code, then its life depends on the maintainers of ArchiveBox. ThreadReader isn't open source, so if it goes down tomorrow, that's it. Everyone will be scrambling to find a replacement because the code is not easily available. Yes, you'll have your already created archives, but you wouldn't be able to create any more.

I'd rather do this via a Python library, CLI tool, or Puppeteer scripts (once our async Playwright worker system is out).

Follow here for updates on puppeteer script support progress: #51

I would really like this feature, and I'm willing to contribute code to make it happen, if that's welcome.

There are still a lot of structural blockers in ArchiveBox's design to running content scripts directly during archiving.

The most helpful approach might be to write a dedicated extractor in Python that dumps the unrolled thread to a nicer HTML file. Look for existing tools structured like youtube-dl but for Reddit and Twitter (does a thread-dl exist?), and then clone the YOUTUBEDL extractor code to get started.

I've been looking for a box with this functionality for a long while now, with no luck. The closest thing I've found to what I have in mind is https://github.com/weskerfoot/TweetLog, but it requires access to the developer API, which I don't have.

A regular thread, a sequence of tweets making a mini article (my god, what happened to good ol' blogs?), can otherwise be archived quite easily with Thread Reader App by calling https://threadreaderapp.com/thread/$TWIDENT.html, where $TWIDENT is the ID of any tweet that's part of the thread, and then downloading it a few minutes later. However, I am looking for something that would be able to archive a tweet OR a thread, including all of the replies to one or more of the tweets included in said thread.
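As a rough illustration of the URL scheme described above, here is a minimal Python sketch that turns a tweet ID or tweet URL into the corresponding Thread Reader App URL. The helper name `threadreader_url` is hypothetical, and the sketch assumes the URL pattern quoted in the comment is still served; it only builds the URL and does not fetch anything.

```python
import re

# URL pattern as quoted in the comment above; whether the service still
# responds at this address is an assumption, not something verified here.
THREADREADER_TEMPLATE = "https://threadreaderapp.com/thread/{tweet_id}.html"


def threadreader_url(tweet: str) -> str:
    """Build the Thread Reader App URL for a tweet ID or a tweet status URL."""
    tweet = tweet.strip()
    if tweet.isdigit():
        # Input is already a bare numeric tweet ID.
        tweet_id = tweet
    else:
        # Otherwise expect a URL of the form .../<user>/status/<id>.
        match = re.search(r"/status/(\d+)", tweet)
        if match is None:
            raise ValueError(f"no tweet ID found in {tweet!r}")
        tweet_id = match.group(1)
    return THREADREADER_TEMPLATE.format(tweet_id=tweet_id)
```

The resulting URL could then be handed to ArchiveBox like any other link, ideally after the few minutes' delay mentioned above so the unrolled page has time to be generated.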

ThreadReaderApp has been acquired by Twitter and shut down. I think a feasible approach would be to add a config option where a Twitter developer token can be entered, and then just download the thread and put it into a simple HTML file with one `<p>` (paragraph) tag per tweet, maybe `<br>` for newlines.
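The HTML-rendering half of that idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `render_thread_html` is hypothetical, the tweets are assumed to have been fetched already (the API call and token handling are deliberately left out), and each tweet is given as plain text.

```python
from html import escape
from typing import Iterable


def render_thread_html(tweets: Iterable[str], title: str = "Twitter thread") -> str:
    """Render already-downloaded tweet texts as a minimal HTML document.

    One <p> per tweet, with <br> for newlines inside a tweet.
    """
    paragraphs = []
    for text in tweets:
        # Escape first so tweet text cannot inject markup, then add <br>.
        body = escape(text).replace("\n", "<br>\n")
        paragraphs.append(f"<p>{body}</p>")
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        "<body>\n" + "\n".join(paragraphs) + "\n</body></html>\n"
    )
```

Escaping before inserting `<br>` keeps the output safe to open in a browser even when a tweet contains angle brackets or ampersands.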

I myself would do it quick and dirty and just pretend the HTML was made by readability, but I can understand if that’s too much of a hack for you 😃

I also think that this feature is now more important than before because of the acquisition. Until now, I just archived ThreadReaderApp links.

How about Nitter?

https://twitter.com/ArchiveBoxApp -> https://nitter.net/ArchiveBoxApp
https://twitter.com/mitchellh/status/1615797167607939072 -> https://nitter.net/mitchellh/status/1615797167607939072
... etc
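The rewrite shown in the examples above amounts to swapping the hostname. A minimal Python sketch, assuming a hypothetical helper `to_nitter` and assuming that `nitter.net` (or whichever instance is passed in) mirrors Twitter's path layout, as the two examples suggest:

```python
from urllib.parse import urlsplit, urlunsplit

# Hostnames treated as Twitter; this set is an assumption for illustration.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def to_nitter(url: str, instance: str = "nitter.net") -> str:
    """Rewrite a Twitter URL to the same path on a Nitter instance.

    URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() not in TWITTER_HOSTS:
        return url
    # Keep scheme, path, query and fragment; only replace the host.
    return urlunsplit(parts._replace(netloc=instance))
```

Keeping the instance as a parameter means a self-hosted Nitter could be substituted without touching the rewrite logic.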

FYI, we already use Mercury (recently renamed Postlight) as an extractor, and they're rapidly adding extractors on their side for many different kinds of sites, so we should get these improvements with no effort required on the ArchiveBox side: