ruippeixotog / scala-scraper

A Scala library for scraping content from HTML pages

Why is my output rendered in the wrong encoding?

skanel opened this issue · comments

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

object Scraper {
  val browser = JsoupBrowser()

  val doc = browser.get("http://camhr.com")

  def main(args: Array[String]): Unit = {
    // Extract the #footer element (None if the page has no such element)
    val items = doc >?> element("#footer")
    print(items)
  }
}

The website appears in English in my browser, but when I run this code the content comes back in Chinese.

Hi @skanel, it seems that the site you mentioned sends the content in Chinese when the HTTP client doesn't specify an Accept-Language header (which most, if not all, browsers send automatically).

Try creating your browser like this:

import org.jsoup.Connection

val browser = new JsoupBrowser() {
  override def requestSettings(conn: Connection) =
    conn.header("Accept-Language", "en-US,en;q=0.8,pt;q=0.6")
}

You should be able to get all visible parts of the page in English.
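
To illustrate why the header matters, here is a rough sketch of how a server might pick a response language from Accept-Language. This is only a hypothetical model: `LanguageNegotiation`, `parseAcceptLanguage`, `chooseLanguage`, the supported set and the `"zh"` default are all assumptions made for illustration, not the actual logic of camhr.com.

```scala
import scala.util.Try

// Hypothetical model of server-side language negotiation (illustration only).
object LanguageNegotiation {

  // Parse a header such as "en-US,en;q=0.8,pt;q=0.6" into (tag, quality) pairs,
  // ordered from most to least preferred. A missing q value counts as 1.0.
  def parseAcceptLanguage(header: String): Seq[(String, Double)] =
    header.split(",").toSeq
      .map(_.trim)
      .flatMap { entry =>
        val parts = entry.split(";").map(_.trim)
        parts.headOption.filter(_.nonEmpty).map { tag =>
          val quality = parts.drop(1).collectFirst {
            case p if p.startsWith("q=") => Try(p.drop(2).toDouble).getOrElse(0.0)
          }.getOrElse(1.0)
          (tag.toLowerCase, quality)
        }
      }
      .sortBy { case (_, quality) => -quality }

  // Return the first preferred language the server supports, comparing only the
  // primary subtag ("en-us" -> "en"). Fall back to the default when the header
  // is absent or nothing matches.
  def chooseLanguage(header: Option[String], supported: Set[String], default: String): String =
    header.toSeq
      .flatMap(parseAcceptLanguage)
      .map { case (tag, _) => tag.takeWhile(_ != '-') }
      .find(supported.contains)
      .getOrElse(default)
}
```

Under this model, a request without the header (`chooseLanguage(None, Set("zh", "en"), "zh")`) falls back to the default language, while the header from the snippet above selects English.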