ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Home Page: https://archivebox.io

Question: what's the current status of the REST API?

zblesk opened this issue · comments

I'd like to add new pages by sending an HTTP request to an endpoint. I saw it mentioned in issues such as #339 and items linked in that thread.

There seemed to be commit names that mentioned adding a REST API, but I haven't been able to find whether those are already implemented and released.

Are they? If so, how do I call them?

I've tried just capturing a request to the "Add" method when I click it in the browser, but it looks like there is some csrf protection, so I can't just copy-paste some bearer token and re-issue requests. I'm asking here before I spend time reverse-engineering something just because I missed an already existing API. :)

The current status of the API is "unstable", I'd say. Reverse engineering the UI is the way to go for now, but we have plans to stabilize it more in future versions and split out a proper API with django-rest-framework or something, so that external tools don't have to shoehorn their needs into the requests used by the UI.
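
In the meantime, something along these lines usually works against the UI (a rough sketch only, not a supported API: the login path and the form field names are assumptions, so capture a real request in your browser's dev tools and adjust to match what your version actually sends):

```python
# Rough sketch only -- not an official API. This drives the same form the
# browser submits. The login path and form field names ("url", "depth") are
# assumptions; capture a real request in your browser's dev tools and adjust.
import requests

BASE = "http://127.0.0.1:8000"
session = requests.Session()

# 1. Load the login page so Django sets a csrftoken cookie.
session.get(f"{BASE}/admin/login/")

# 2. Log in, echoing the CSRF token back as a form field.
session.post(
    f"{BASE}/admin/login/",
    data={
        "username": "your-username",   # placeholder credentials
        "password": "your-password",
        "csrfmiddlewaretoken": session.cookies.get("csrftoken"),
    },
    headers={"Referer": f"{BASE}/admin/login/"},
)

# 3. Submit the add form the same way the /add page does.
session.get(f"{BASE}/add/")
session.post(
    f"{BASE}/add/",
    data={
        "url": "https://example.com",
        "depth": "0",
        "csrfmiddlewaretoken": session.cookies.get("csrftoken"),
    },
    headers={"Referer": f"{BASE}/add/"},
)
```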


✨ Edit as of v0.8.0 (2024-05): The new REST API is now available! ⬇️

@pirate I would be interested in working on this. I shot you an email a week or so ago cuz I think the underlying data model needs to be solidified and would love to help move this along. Let me know how I can help.

@mAAdhaTTah that is great! Currently, on master, we have the sqlite database working. We can now start working with django-rest-framework to enable a proper API (Like @pirate mentioned).
What are the issues that you are finding with the data model? Something that needs to be improved? We can start the discussion here, so we can all have the proper context, and find a way to get started soon.

@cdvv7788 Generally, I think the split/transformation between Link <-> Snapshot is a bit weird. Snapshot seems to be db-only (it's transformed into Links as it's fetched out of the db for most of the operations I was looking at). I also think the double duty of timestamp being "the time it was bookmarked" as well as "the path in the archive" is a bit of an issue. From my email:

I believe you're currently looking to move from timestamp -> sha for the Snapshots and their relationship to the on-disk archive. If we want to eventually allow multiple snapshots per link (to avoid the hash hack), reifying the Link model into the database and making the Snapshot a single download of a Link seems like a good way to do it. Part of the benefit, for me, of moving away from timestamps is I want to track when an article was read so I can group them by read day, and manipulating the timestamp for this seems a bit fragile if it can break the relationship to the archive. Having added, updated, etc. properties for that purpose seems a lot clearer.

(For context, I'd like to use ArchiveBox as a reading list, which I would then pull into my website, hence needing a REST API to pull that from. That's the reference to the "benefit for me" line.)

@mAAdhaTTah We have discussed those topics before. I think that @pirate has some progress on the timestamp issue, and it will be changed once we come up with a good solution.
The Link <-> Snapshot stuff is a leftover of the recent migration. In the latest release (v0.4.x), Link was generated from the index.json, and Snapshot was updated on a best-effort basis. After the refactor, this has changed, and we definitely want to get rid of this relationship, leaving everything directly in Snapshot if possible. Supporting multiple snapshots for the same URL is not possible at this moment, but after we remove the dependency on the Link schema, it should not be hard to add if we decide to go that way.
The main blocker at this moment is that Snapshot requires django, so it cannot be used on its own. We need to find a way to circumvent that (@pirate do you know if this is possible?) or we need to get more creative when initializing django. Some research on this specific topic would be of great help (this is one of our short-term objectives).

Supporting multiple snapshots for the same URL is not possible at this moment, but after we remove the dependency on the Link schema, it should not be hard to add if we decide to go that way.

So my thinking/proposal is to actually remove the Link schema, migrate what is currently considered a Snapshot to be a Link instead (mostly as a naming convention change), then add a Snapshot that represents a single download of a website. Based on your explanation, I think we'd need to include a migration in v0.5 that migrates the index.json into the db, then, once we're solely dependent on the db, perform the above migrations, splitting the existing Snapshot into two models: Snapshot & Link, with a one-to-many relationship (plus whatever UI updates are needed to account for this).

Does that make sense? Happy to elaborate and/or provide some code to explain.
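
To make the idea concrete, here's a rough sketch of one possible shape for that split (field names and options are purely illustrative, not a final schema):

```python
# Illustrative sketch of the proposed Link <-> Snapshot split; field names
# and options are placeholders, not a concrete schema proposal.
import uuid

from django.db import models


class Link(models.Model):
    """A bookmarked URL, independent of any particular download."""
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    url = models.URLField(unique=True)
    title = models.CharField(max_length=512, blank=True)
    added = models.DateTimeField(auto_now_add=True)   # when it was bookmarked
    updated = models.DateTimeField(auto_now=True)     # last metadata change


class Snapshot(models.Model):
    """One download of a Link at a point in time (many per Link)."""
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    link = models.ForeignKey(Link, on_delete=models.CASCADE, related_name="snapshots")
    started = models.DateTimeField(auto_now_add=True)  # when this download ran
    output_dir = models.CharField(max_length=1024)     # on-disk path, decoupled from the bookmark time
```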

The main blocker at this moment is that Snapshot requires django, so it cannot be used on its own.

Not sure I understand this. Could you provide some background here?

So my thinking/proposal is to actually remove the Link schema, migrate what is currently considered a Snapshot to be a Link instead (mostly as a naming convention change), then add a Snapshot that represents a single download of a website. Based on your explanation, I think we'd need to include a migration in v0.5 that migrates the index.json into the db, then, once we're solely dependent on the db, perform the above migrations, splitting the existing Snapshot into two models: Snapshot & Link, with a one-to-many relationship (plus whatever UI updates are needed to account for this).

At this moment we only have the means to represent a single download per website. I understand what you propose, and that does make sense. At this point we already migrated the index.json into the sqlite database. In fact, if you check #502, we are already removing the automatic generation of those indexes completely. This, however, cannot be done without first solving the other issue, which takes me to:

The main blocker at this moment is that Snapshot requires django, so it cannot be used on its own.

Snapshot is a django model. We cannot use that model in a place where django has not been initialized yet. If you try to do that, it will complain because the module will try to use some django internal stuff. This is the only reason we have not gotten rid of Link as we know it. I am going to spend some time figuring out alternatives to make Snapshot usable in the whole application. You are welcome to help us pursue this. As I mentioned earlier, this is a blocker, and the other stuff cannot be worked on until it is resolved. (The REST API could actually be implemented, but once we fix this, we would need to refactor it in a big way... I think it is better to solve this layer first.)

We cannot use that model in a place where django has not been initialized yet.

All of this makes sense so far. I can do some investigating and see what I can come up with. Just to clarify, when you say "use that model", is that "interacting with it" or is importing it enough to make it fail?

Importing it is enough to make it fail. There is a helper you will find around, named django_setup, which initializes what is required.
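
For reference, the general pattern any such helper has to follow (generic Django; the settings module path below is an assumption) looks roughly like this:

```python
# Generic Django bootstrapping sketch. The settings module path here is an
# assumption -- substitute whatever ArchiveBox actually uses.
import os

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "core.settings")
django.setup()  # must run before any model is imported

# Only after setup() is it safe to import and touch ORM models:
from core.models import Snapshot  # noqa: E402

print(Snapshot.objects.count())
```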

I don't believe we need Link or Snapshot anywhere that Django is not initialized, so that is a non-issue. If you're worried about oneshot I have an idea to fix that (we can discuss more in Zulip).

@pirate Does that change if the idea is to turn Link & Snapshot into db models?

Hello!
I see there's been some progress here.
What's the current status? Is the API available yet?

One of the linked tasks seems to mention it's available in 'dev' - is that an available docker tag?
Is it safe to use?
To be more specific: I understand the API is still in alpha, and I can accept that. However, I don't understand what else can be unstable in dev - I don't want to risk my instance and my data.

Thank you!

I have not made any additional progress since opening my PR here: #529. I don't think we will be continuing down that path, as we were also considering using Django Ninja instead of DRF. Eventually, I'd like to pick this back up again, but I haven't had the time.

Copying over my earlier message here from the API discussion related to the ArchiveBox browser extension #577:

I think a minimal API can be worked on before the Huey refactor, as the user-facing API is going to be relatively stable even with the change to the internals. These endpoints are already partially available through the Django Admin:

  • /add GET,POST (CSRF exempt, usable as an API from external origins, and used by the browser extension)
  • /api/core/snapshot/ GET, POST, PUT
  • /api/core/snapshot/<id> GET, PATCH, DELETE
  • /api/core/archiveresult/ GET, POST
  • /api/core/archiveresult/<id> GET, PATCH, DELETE
  • /api/core/tag/ GET, POST, PUT
  • /api/core/tag/<id> GET, PATCH, DELETE

and this bonus escape hatch endpoint is planned to be added to do everything else not possible with the above ^:

  • /api/cli/<command> POST (simulate running any archivebox CLI command with a given dict of args and kwargs to populate the CLI flags and args)
    e.g. /api/cli/add POST {urls: 'https://example.com', depth: 1, extractors: ['wget', 'media', 'screenshot'], ...}
    or /api/cli/schedule POST {urls: 'https://example.com', depth: 1, every: 'day', ...}

I'm leaning towards using FastAPI for the API instead of DRF. I like the pydantic type-based API definitions better than DRF's serializers but I could be convinced either way.
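
For anyone unfamiliar with the difference, this is roughly the pydantic style in question (a made-up sketch, not ArchiveBox code; the route and fields are invented for illustration):

```python
# Sketch of the pydantic/FastAPI style under discussion. The route and
# fields are invented for illustration, not part of any ArchiveBox plan.
from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel, HttpUrl

app = FastAPI()


class SnapshotIn(BaseModel):
    url: HttpUrl
    tags: List[str] = []
    depth: int = 0


class SnapshotOut(BaseModel):
    id: str
    url: HttpUrl
    title: Optional[str] = None


@app.post("/api/core/snapshot/", response_model=SnapshotOut)
def create_snapshot(payload: SnapshotIn) -> SnapshotOut:
    # The request body is validated against SnapshotIn before this runs;
    # a real implementation would create the record and queue the archive job.
    return SnapshotOut(id="example-id", url=payload.url)
```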

Thanks for the update. Looking forward to this.

Though I'm not sure I read those correctly. For instance, what is the difference between a GET and a POST to /add?
Will it support adding many links at once, as well?

And which endpoint should be used for 'return the archive URL for this input URL, if it exists'?


@pirate hey there, are you still working on this / need help? I'm thinking this is possibly something I could put together with FastAPI and the CLI, hopefully next weekend. Let me know! Cheers

Copying over my earlier message here from the API discussion related to the ArchiveBox browser extension #577:

I think a minimal API can be worked on before the Huey refactor, as the user-facing API is going to be relatively stable even with the change to the internals. These endpoints are already partially available through the Django Admin:

  • /add GET,POST (CSRF exempt, usable as an API from external origins, and used by the browser extension)
  • /api/core/snapshot/ GET, POST, PUT
  • /api/core/snapshot/<id> GET, PATCH, DELETE
  • /api/core/archiveresult/ GET, POST
  • /api/core/archiveresult/<id> GET, PATCH, DELETE
  • /api/core/tag/ GET, POST, PUT
  • /api/core/tag/<id> GET, PATCH, DELETE

and this bonus escape hatch endpoint is planned to be added to do everything else not possible with the above ^:

  • /api/cli/<command> POST (simulate running any archivebox CLI command with a given dict of args and kwargs to populate the CLI flags and args)
    e.g. /api/cli/add POST {urls: 'https://example.com', depth: 1, extractors: ['wget', 'media', 'screenshot'], ...}
    or /api/cli/schedule POST {urls: 'https://example.com', depth: 1, every: 'day', ...}

I'm leaning towards using FastAPI for the API instead of DRF. I like the pydantic type-based API definitions better than DRF's serializers but I could be convinced either way.

Definitely open to contribution on the API front! I'm more focused on internals refactoring at the moment but as mentioned in that quoted comment I believe my changes can be kept insulated from anything external facing.

If you want to share gists or a fork with your work, I can leave feedback on your mock-up as you go, to save time on PR review later.


I would use an API like this.


Hi, if anyone is following this issue and could give me some guidance, please see this issue: #1030

I think @zblesk brought up an important point. A route like /add/ feels like it violates REST principles by implying an action. Ideally, if the API is to be RESTful, the routes should be resources and the action should be determined by the HTTP method (GET, PUT, etc.).
So I feel it would make more sense to make a GET to /archive to fetch archived items and a POST to /archive to store a new link, etc.


Sure, let's start with a POST to /archive in addition to the current command-line input method.

Let's keep the REST API URLs in line with the model names and use /api/snapshot GET/POST and /api/archiveresult GET/POST.

@pirate good point. My comment was less about the specific endpoint names and more about REST conformity: using proper HTTP methods and resource endpoints. Depending on the application design, it might not make sense to map the models to endpoints 1-to-1, because some data is simply always a composition of different data models. I'm not familiar with the ArchiveBox software project, so I can't tell.

I think keeping endpoints the same as model names is better than the alternative because more layers of indirection/leaky abstraction make it harder to grep through the source code and understand.

Hi everyone, can I ask what the status of the REST API is? Definitely +1 for FastAPI instead of DRF.

Is this something you need help with, or is there a list of active tasks for the current implementation?

It's still on the list but slow going; I haven't had a lot of big blocks of coding time to work on ArchiveBox over the last year, so I've mostly been devoting my time to support and docs.

On the plus side, I have interest from a big multinational org in using ArchiveBox, and I may be able to turn that into a consulting contract to fund some work towards the API. They are a slow-moving org so it may take 6~12 months, but it's exciting news nonetheless.

Hope this will be implemented. In my case, I want to scrape and store websites on my local network, then process them with AI and put them into my personal knowledge management system. The AI and PKM stuff is on my side; I just need the API 🙏

Hello! What's the current state of this? It's kinda confusing, since it says it's in Alpha, but reading the comments I can't tell whether it's possible to use it on Docker. I'm interested in building an alternative front end for this application, and the REST API would help me a lot.

Alpha = There are a few POST/GET etc. endpoints exposed by the admin UI and the /add page that allow quick things to be hacked together, but it's not a proper REST API by any means. I'm working on a django-huey-monitor refactor to add an event-driven queue system in the backend, and the new REST API I'm planning will insert messages into this queue to manage extractor jobs and snapshots.
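
To illustrate the shape of that (all names below are hypothetical, not the planned implementation): the API view would only create a record and enqueue a message, and a worker process would run the extractors.

```python
# Hypothetical sketch of the queue-backed direction described above, using
# huey's Django integration. Task and view names are invented; this is not
# the planned ArchiveBox implementation.
from django.http import JsonResponse
from huey.contrib.djhuey import db_task


@db_task()
def archive_snapshot(snapshot_id: str) -> None:
    """Worker-side job: look up the snapshot and run the extractors on it.
    (Body elided -- the models and extractor calls are ArchiveBox internals.)"""
    ...


def add_url_view(request):
    """Hypothetical API view: record the URL, enqueue the job, return fast."""
    snapshot_id = "example-id"      # a real view would create a Snapshot row here
    archive_snapshot(snapshot_id)   # calling a huey task enqueues it instead of running it inline
    return JsonResponse({"queued": snapshot_id})
```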

Can I ask why you're going in the direction of an alternative frontend vs contributing changes to AB directly? I'd definitely be open to PRs improving our existing frontend!

See the discussion here too: #1126

@pirate my main issue with contributing to the existing frontend is that the current version is far from what I think would be useful for me, so my changes would probably be too disruptive to include via a PR without prior discussion. If you still think this project could benefit from a total rework of the frontend (which I do), I can put together some proposals so we can reach an agreement.

I'm down to add a new frontend to the existing app as long as we keep the Django admin one available as well in parallel. I was considering using htmx to do this myself (it plays well with Django templates) but haven't gotten around to it.

One of the core principles is that we should rely on JS as little as possible because I want ArchiveBox views to be extremely durable long term and viewable across many different types of devices.

I'm ok with some of the UI requiring JS but ideally the most critical parts should fall back to working with old school plain html.

If that design direction sounds compatible with your ideas then I'm down to work together to add your UI changes to AB directly, otherwise maybe an independent app/mod may be better.

@pirate sure, that sounds nice. I don't want to include a JavaScript framework either. Regarding htmx, we can give it a try if we need it; I already did some work with it on a side project and it's great. About the CSS: I saw the current implementation uses Bootstrap, and I wonder if we could move to Tailwind, which I think fits better for an open source project these days; that way we don't need to implement custom classes and it's easier for external contributors.

Nice! I also prefer tailwind to bootstrap, happy to move to that.

If you want to open a new issue for your UI ideas as they come up I think we should move frontend discussion away from the REST API thread so we don't spam everyone.

If you do create a new thread for that, can you please @ me? Thanks.

Hey everyone, check out the new REST API on dev! Big thanks to @Brandl for the first PR that kickstarted it!

For users who want to try it out, get v0.8.0-rc (unstable) or later, start the archivebox server, then visit http://127.0.0.1:8000/api (and /api/v1/docs) to get started with the interactive Swagger API docs/test page ➡️
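
For example, a call would look something like this (the exact route, auth scheme, and header name below are assumptions; check the interactive docs at /api/v1/docs on your own instance for the real ones):

```python
# Example shape of a call against the new API (v0.8.0+). The route and auth
# header below are assumptions -- confirm against /api/v1/docs on your instance.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/api/v1/cli/add",           # assumed route
    headers={"X-ArchiveBox-API-Key": "YOUR_TOKEN"},   # assumed auth header
    json={"urls": ["https://example.com"], "tag": "api-test"},
)
print(resp.status_code, resp.json())
```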


It also supports sending webhooks to external servers whenever archiving events happen.


I currently can't make a backup of my archive, so I can't switch to dev, but I'm really looking forward to trying this. Thanks.

I can't wait for this to make it to stable.