ruippeixotog / scala-scraper

A Scala library for scraping content from HTML pages

scala-scraper's implementation of HtmlUnit doesn't have .close()

piercelamb opened this issue

I have an Akka actor that starts up every 3 hours to scrape some content using HtmlUnitBrowser (I have to use it because of JS execution). Everything works fine, except that memory usage jumps every time it starts and then stays at that new level, so eventually I run out of memory. I'm not 100% sure HtmlUnit is the issue, but they do have a FAQ entry about it specifically:

http://htmlunit.sourceforge.net/faq.html#MemoryLeak

As such, I'd like to test closing the browser every time after it's used in the Akka actor. However, I don't see a .close() method on HtmlUnitBrowser.

Please advise.

Hi @piercelamb! Sure, this can be easily done. However, calling something like close() on the browser will most probably close all pages created with that browser up to that moment, not a specific one. Would that work for you?

A question about your problem: are you reusing the same HtmlUnitBrowser instance for the scheduled operation, or are you creating a new HtmlUnitBrowser each time the job runs?

@ruippeixotog I'm actually not too sure. I'm using Play Framework 2.5.9, so everything is dependency injected. In order to inject HtmlUnitBrowser I had to make this:

class HtmlUnitBrowserFactory extends HtmlUnitBrowser { new HtmlUnitBrowser }

Because it doesn't have a parameterless constructor. That then gets injected like this:

@Singleton
class YelpService @Inject() (browser: HtmlUnitBrowserFactory) ....

And passed to an Actor like this:

actorSystem.scheduler.schedule(0.seconds, 3.hour, yelpActor, Start(browser, mailActor, mailer, reviewsTable))

YelpService gets injected into one of my main controllers so it fires on startup. That is where I believe HtmlUnitBrowser would be created and I assume only once.

How would I access that .close() method?

Hmm, I'm not sure I understand one thing in your code: why do you add { new HtmlUnitBrowser } as the body of your class? You seem to be creating an extra instance inside the constructor of HtmlUnitBrowserFactory that is immediately discarded. Wouldn't class HtmlUnitBrowserFactory extends HtmlUnitBrowser work for your purpose of having a no-arg constructor?
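To illustrate the discarded instance, here is a minimal sketch. It uses a hypothetical stand-in class, `FakeBrowser`, which is not part of scala-scraper and exists only to count constructions. Statements in a Scala class body run as part of the primary constructor, so the extra `new` creates a second object every time the subclass is instantiated:

```scala
// FakeBrowser is a hypothetical stand-in for HtmlUnitBrowser; it only counts instances.
object Counter { var created = 0 }

class FakeBrowser { Counter.created += 1 }

// Mirrors the factory above: the class body creates and discards a second browser.
class WastefulFactory extends FakeBrowser { new FakeBrowser }

// A plain subclass already has a no-arg constructor and creates no extra instance.
class PlainFactory extends FakeBrowser

object Demo {
  // Resets the counter, runs the constructor call, and reports how many
  // FakeBrowser instances were created as a result.
  def countFor(make: () => Any): Int = {
    Counter.created = 0
    make()
    Counter.created
  }
}
```

With these definitions, `Demo.countFor(() => new WastefulFactory)` counts two constructions, while `Demo.countFor(() => new PlainFactory)` counts one.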

Either way, if you are using browser: HtmlUnitBrowserFactory as your browser, you would be able to call browser.close() directly inside your task after you finish scraping the page, which would close all windows open at that moment. In your case that doesn't seem to be a problem, as you will surely be able to do everything you want with the scraped page before the next job runs (after 3 hours). If you really don't want to risk that, you can always change your factory to be:

class HtmlUnitBrowserFactory {
  def newBrowser() = new HtmlUnitBrowser
}

And create a new browser instance each time you run a job.
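A rough sketch of how the per-job approach could look once a close method is available. Everything here is an assumption for illustration: `ClosableBrowser` is a hypothetical stub rather than the real HtmlUnitBrowser, and `BrowserFactory`, `ScrapeJob` and `withBrowser` are made-up names. The idea is only that each job gets a fresh browser and releases it in a `finally` block, so a failed scrape does not leave it open:

```scala
// Hypothetical stub for a browser that holds resources until it is closed.
// It is NOT the scala-scraper API; it only models "usable until close()".
class ClosableBrowser {
  var closed = false

  def get(url: String): String = {
    require(!closed, "browser already closed")
    s"<html>content of $url</html>"
  }

  def close(): Unit = closed = true
}

// Same shape as the factory above, but returning the stub type.
class BrowserFactory {
  def newBrowser(): ClosableBrowser = new ClosableBrowser
}

object ScrapeJob {
  // Creates a fresh browser for one job and always closes it afterwards,
  // even if the job throws.
  def withBrowser[A](factory: BrowserFactory)(job: ClosableBrowser => A): A = {
    val browser = factory.newBrowser()
    try job(browser)
    finally browser.close()
  }
}
```

Inside the scheduled task, the scraping code would then run as something like `ScrapeJob.withBrowser(factory)(browser => browser.get(url))`, with no browser instance surviving between runs.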

@ruippeixotog Great point on the constructor. Major oversight.

My issue is that browser.close() does not compile, e.g.

value close is not a member of net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser
[error]               browser.close()
[error]                       ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

Oh, I guess I wasn't clear. The method doesn't exist yet; I'm suggesting implementing it as a possible solution to your use case :) I'll get to it this weekend.

@ruippeixotog awesome! I'll test it out as soon as you have it ready.

Thank you

@piercelamb Not really on topic, but I am curious to know why and how you are using Akka actors for screen scraping. Can you say a bit about the use case for Akka actors here? Thanks.