medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.

Home Page: http://medialab.github.io/sandcrawler/

Status?

brandondrew opened this issue

Has this project been abandoned?

It looks very promising, other than the (apparent) lack of progress recently.

Hello @brandondrew. The project is not abandoned. A lot has changed since I started, and I plan to reboot both artoo & sandcrawler, but not before October at the earliest, and only if my projects need some crawling/scraping. But I am confident they will.

Very good news—thanks for the update!

Though I've only played around with artoo so far, the combination of artoo and sandcrawler appears to be the best option for crawling and scraping data off of the web. It's absolutely brilliant to have an in-browser option that complements a server-side option, sort of giving a REPL for scraping.

Do you expect the reboot to make significant changes to the API? (I'm toying around with an idea for a project that could rely heavily on sandcrawler.)

The reboot will probably make significant changes to the API indeed, but it should not stray too far from the existing concepts.

I also aim to try solutions other than PhantomJS by clearly separating the engines of sandcrawler (static & phantom so far), so I can experiment with a headless Electron and Chromium, because PhantomJS' quirks are really bugging me lately (notably unavoidable crashes & memory leaks).

@Yomguithereal If it can help, manet does a good job at proposing a service relying both on PhantomJS and SlimerJS https://github.com/vbauer/manet

This won't address the memory leak problems, however, and a system that restarts both PhantomJS and SlimerJS is needed anyway to free up memory :(

I was only pointing it out as an example of an abstraction over both engines :)

Any update on this?

Hello @Schaemelhout. Things will probably evolve by the end of the year. I'm sorry but I cannot be more precise. If you need specific bug fixes however, I can probably work it out.

Hi @Yomguithereal, I was just wondering how mature and solid this project is. I'm looking for a library to help me in my scraping adventures.

The ones I have in mind are the following:

  • sandcrawler
  • node-simplecrawler
  • node-crawler
  • node-osmosis

sandcrawler looks very promising, but it also looks like a fairly abandoned project, and I was wondering whether it is worth the effort of using it right now if it's going to get a complete overhaul by the end of the year?

Thanks anyway! The project looks great.

Of the list you present, sandcrawler is probably the best choice if you need to perform complex tasks and need to customize very precise things in order to achieve what you need. If what you need is fairly simple and you won't need to handle the dark insanities of the whole web, maybe this tool is a bit overkill.

Can you explain to me what you intend to do so I can help you better (if you can disclose it, of course)?

I can't go into too much detail, but the main thing I need is just a decent queueing system and preferably an IP/proxy rotation system.
The content discovery and scraping I need to do is pretty straightforward.
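
To make that requirement concrete, here is a minimal sketch in plain JavaScript of a job queue with bounded concurrency and round-robin proxy rotation. It is only an illustration under stated assumptions: `ProxyRotator`, `CrawlQueue` and the `fetchPage(url, proxy)` callback are hypothetical names made up for this example and are not part of sandcrawler's API or of any library listed above.

```javascript
// Hypothetical sketch: a FIFO job queue with bounded concurrency and
// round-robin proxy rotation. None of these names come from sandcrawler.

class ProxyRotator {
  constructor(proxies) {
    if (!proxies.length) throw new Error('at least one proxy is required');
    this.proxies = proxies;
    this.index = 0;
  }

  // Return the next proxy, wrapping around at the end of the list.
  next() {
    const proxy = this.proxies[this.index];
    this.index = (this.index + 1) % this.proxies.length;
    return proxy;
  }
}

class CrawlQueue {
  constructor({ concurrency, rotator, fetchPage }) {
    this.concurrency = concurrency;
    this.rotator = rotator;
    // Assumed signature: async (url, proxy) => data, supplied by the caller.
    this.fetchPage = fetchPage;
    this.pending = [];
    this.results = [];
  }

  add(url) {
    this.pending.push(url);
  }

  // Each worker keeps pulling URLs until the queue is empty.
  async worker() {
    while (this.pending.length > 0) {
      const url = this.pending.shift();
      const proxy = this.rotator.next();
      try {
        const data = await this.fetchPage(url, proxy);
        this.results.push({ url, proxy, data });
      } catch (error) {
        // Record the failure instead of stopping the whole crawl.
        this.results.push({ url, proxy, error: error.message });
      }
    }
  }

  // Start `concurrency` workers and resolve once all jobs are done.
  async run() {
    const workers = [];
    for (let i = 0; i < this.concurrency; i++) workers.push(this.worker());
    await Promise.all(workers);
    return this.results;
  }
}
```

The caller would pass in its own `fetchPage` that actually performs the request through the given proxy. Retries, rate limiting and dropping dead proxies are deliberately left out of this sketch.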

Hi,

I just checked the GitHub pages and the project looks really promising. I waited for a month, but I think you still haven't had time. I gave it a try but was not able to get proper results. Could you give an expected date for the new version?

Thanks

I can't give you a date. But I can try to help you fix what fails for you.