Status?

brandondrew opened this issue 8 years ago

Has this project been abandoned?

It looks very promising, other than the (apparent) lack of progress recently.

Guillaume Plique · Answer 1 · Thu Jul 28 2016 20:31:56 GMT+0800 (China Standard Time)

Hello @brandondrew. The project is not abandoned. A lot has changed since I started, and I plan to reboot both artoo & sandcrawler soon, but not before October at the earliest, and only if my projects need some crawling/scraping. I am confident they will.

Brandon Zylstra · Answer 2 · Fri Jul 29 2016 02:10:06 GMT+0800 (China Standard Time)

Very good news—thanks for the update!

Though I've only played around with artoo so far, the combination of artoo and sandcrawler appears to be the best option for crawling and scraping data off of the web. It's absolutely brilliant to have an in-browser option that complements a server-side option, sort of giving a REPL for scraping.

Do you expect the reboot to make significant changes to the API? (I'm toying around with an idea for a project that could rely heavily on sandcrawler.)

Guillaume Plique · Answer 3 · Fri Jul 29 2016 02:31:51 GMT+0800 (China Standard Time)

The reboot will probably make significant changes to the API indeed, but it should not stray too far from the existing concepts.

Guillaume Plique · Answer 4 · Fri Jul 29 2016 02:33:15 GMT+0800 (China Standard Time)

I also aim to try solutions other than PhantomJS by clearly separating the engines of sandcrawler (static & phantom so far), so I can experiment with headless Electron and Chromium. PhantomJS' quirks are really bugging me lately (notably unavoidable crashes & memory leaks).

Benjamin Ooghe-Tabanou · Answer 5 · Fri Jul 29 2016 18:00:44 GMT+0800 (China Standard Time)

@Yomguithereal If it can help, manet does a good job at proposing a service relying both on PhantomJS and SlimerJS https://github.com/vbauer/manet

Guillaume Plique · Answer 6 · Fri Jul 29 2016 19:47:39 GMT+0800 (China Standard Time)

This won't address the memory leak problems, however, and a system to restart both PhantomJS and SlimerJS is needed anyway to free memory :(

Benjamin Ooghe-Tabanou · Answer 7 · Fri Jul 29 2016 20:19:15 GMT+0800 (China Standard Time)

I was only pointing it out as an example of an abstraction over both engines :)

Mathias Schaemelhout · Answer 8 · Mon Oct 17 2016 04:28:28 GMT+0800 (China Standard Time)

Any update on this?

Guillaume Plique · Answer 9 · Tue Oct 18 2016 08:00:12 GMT+0800 (China Standard Time)

Hello @Schaemelhout. Things will probably evolve by the end of the year. I'm sorry but I cannot be more precise. If you need specific bug fixes however, I can probably work it out.

Mathias Schaemelhout · Answer 10 · Wed Oct 19 2016 03:56:52 GMT+0800 (China Standard Time)

Hi @Yomguithereal, I was just wondering how mature and solid this project is. I'm looking for a library to help me in my scraping adventures.

The ones I have in mind are the following:

sandcrawler
node-simplecrawler
node-crawler
node-osmosis

sandcrawler looks very promising, but it also looks like a fairly abandoned project, and I was wondering whether it is worth the effort of using it right now if it's going to get a complete overhaul by the end of the year?

Thanks anyway! The project looks great.

Guillaume Plique · Answer 11 · Wed Oct 19 2016 06:18:06 GMT+0800 (China Standard Time)

Of the list you present, sandcrawler is probably the best choice if you need to perform complex tasks and need to customize very precise things in order to achieve what you need. If what you need is fairly simple and you won't need to handle the dark insanities of the whole web, maybe this tool is a bit overkill.

Can you explain to me what you intend to do so I can help you better (if you can disclose it, of course)?

Mathias Schaemelhout · Answer 12 · Wed Oct 19 2016 15:49:40 GMT+0800 (China Standard Time)

I can't go into too much detail, but the main thing I need is just a decent queueing system and preferably an IP/proxy rotating system.
The content discovery and scraping I need to do is pretty straightforward.
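
For illustration only, here is a rough sketch of what "a queue plus proxy rotation" could look like in plain Node.js, without relying on any of the libraries listed in this thread. Everything in it is an assumption made for the example: `createProxyRotator`, `createQueue`, the `worker` callback and the proxy URLs are hypothetical names, not part of sandcrawler's API, and the actual HTTP request is left to the caller.

```javascript
// Hypothetical sketch: none of these names come from sandcrawler or the
// other libraries mentioned in this thread.

// Round-robin rotation over a fixed list of proxies.
function createProxyRotator(proxies) {
  if (!proxies.length) throw new Error('at least one proxy is required');
  let index = 0;
  return {
    next() {
      const proxy = proxies[index];
      index = (index + 1) % proxies.length;
      return proxy;
    }
  };
}

// Minimal in-memory queue running at most `concurrency` tasks at once.
// `worker(url, proxy)` should return a promise; how it fetches is up to you.
function createQueue({concurrency, rotator, worker}) {
  const pending = [];
  let active = 0;

  function drain() {
    while (active < concurrency && pending.length) {
      const {url, resolve, reject} = pending.shift();
      active++;
      Promise.resolve()
        .then(() => worker(url, rotator.next())) // pick the next proxy per task
        .then(resolve, reject)
        .finally(() => {
          active--;
          drain(); // start the next queued task, if any
        });
    }
  }

  return {
    // Enqueue a URL; the returned promise settles with the worker's result.
    push(url) {
      return new Promise((resolve, reject) => {
        pending.push({url, resolve, reject});
        drain();
      });
    }
  };
}

// Usage with a stand-in worker (a real one would perform the request).
const rotator = createProxyRotator([
  'http://proxy-a.example:8080',
  'http://proxy-b.example:8080'
]);
const queue = createQueue({
  concurrency: 2,
  rotator,
  worker: async (url, proxy) => ({url, proxy})
});
queue.push('http://example.com/page-1').then((result) => console.log(result));
```

This sketch keeps everything in memory and does not retry, throttle, or drop failing proxies, so it only shows the shape of the requirement rather than a production setup.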

abbasharoon · Answer 13 · Sat Dec 03 2016 22:38:36 GMT+0800 (China Standard Time)

Hi,

I just checked the GitHub pages and the project looks really promising. I waited for a month, but I think you still don't have time. I gave it a try but was not able to get proper results. Could you give an expected date for the new version?

Thanks

Guillaume Plique · Answer 14 · Mon Dec 05 2016 18:40:55 GMT+0800 (China Standard Time)

I can't give you a date. But I can try to help you fix what fails for you.