vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and lets you scrape and interact with JavaScript-rendered websites


How to set encoding?

dccmmtop opened this issue · comments

commented

When a website is encoded in GB2312, its content cannot be read correctly.

I think it would be better to change

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body)
      when :json
        JSON.parse(body)
      end
    end

to

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body, nil, @config[:encoding])
      when :json
        JSON.parse(body)
      end
    end

or

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body.force_encoding(@config[:encoding]))
      when :json
        JSON.parse(body)
      end
    end
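
The two variants differ in mechanism: the first tells the parser which encoding to decode the bytes with, while force_encoding only relabels the string and leaves its bytes untouched. A minimal stdlib-only sketch of that distinction, assuming the response body arrives as raw GB2312 bytes (the sample text is made up for illustration):

```ruby
# Simulate the raw body of a GB2312 page: encode a UTF-8 literal,
# then drop the encoding label so only the bytes remain.
raw = "你好".encode("GB2312").b

# force_encoding only changes the label; the bytes are untouched.
relabeled = raw.dup.force_encoding("GB2312")
puts relabeled.bytes == raw.bytes  # => true

# encode actually transcodes the bytes into another encoding.
utf8 = relabeled.encode("UTF-8")
puts utf8.encoding                 # => UTF-8
puts utf8 == "你好"                # => true
```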

Hello @dccmmtop! You are right, there should be a config option such as encoding.

Your examples are good, but setting a custom encoding should be optional, because in most cases pages are parsed correctly without one.

I would also like to add an "auto" mode, in which Kimurai tries to detect the correct encoding automatically. The encoding is usually declared in a meta tag such as <meta http-equiv="Content-Type"> or <meta charset> (https://www.w3schools.com/html/html_charset.asp).

I have a working regex (from one of my recent projects) that correctly extracts the encoding in both cases above:

    resp_string = response.body
    # dup first: force_encoding would otherwise relabel resp_string itself
    charset = resp_string.dup.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
    Nokogiri::HTML(resp_string, nil, charset)
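
The detour through ISO-8859-1 is there because every byte is a valid ISO-8859-1 character, so the transcoding to UTF-8 cannot fail, and the ASCII meta tag comes through unchanged. A small stdlib-only sketch of that step, assuming a made-up GB2312 page as the raw body:

```ruby
CHARSET_RE = /<meta.*?charset=["]?([\w+\d+\-]*)/i # the regex from the snippet above

# Simulated raw response: a GB2312 page as unlabeled bytes (sample markup is made up).
raw = %(<meta charset="GB2312"><title>你好</title>).encode("GB2312").b

# Labeled as UTF-8, the body is not a valid string...
puts raw.dup.force_encoding("UTF-8").valid_encoding? # => false

# ...but any byte sequence is valid ISO-8859-1, so this transcoding always succeeds.
scannable = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
charset = scannable[CHARSET_RE, 1]
puts charset # => GB2312

# With the charset known, the original bytes can be decoded properly.
decoded = raw.dup.force_encoding(charset).encode("UTF-8")
puts decoded[/<title>(.*?)<\/title>/, 1] # => 你好
```

The transcoded copy is only good for locating the ASCII charset declaration; the non-ASCII text in it is mojibake, which is why the original bytes are decoded again once the charset is known.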

So the method current_response can be modified to:

# Supported values (set one of them):

@config = {
  encoding: nil        # do not handle encoding at all (current behavior)
  # encoding: :auto    # try to detect the encoding automatically
  # encoding: "GB2312" # set the required encoding manually
}

###

def current_response(response_type = :html)
  case response_type
  when :html
    if encoding = @config[:encoding]
      if encoding == :auto
        charset = body.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
        Nokogiri::HTML(body, nil, charset)
      else
        Nokogiri::HTML(body, nil, encoding)
      end
    else
      Nokogiri::HTML(body)
    end
  when :json
    JSON.parse(body)
  end
end

I'll try to add this feature today and release a new version. Thanks!

@dccmmtop, I added the encoding config option. It's in master now: 96fe695.

Can you please check both cases, :auto and custom encoding?

@config = {
  encoding: nil        # do not handle encoding at all (current behavior)
  # encoding: :auto    # try to detect the encoding automatically
  # encoding: "GB2312" # set the required encoding manually
}

To use the Kimurai version from master, add it to your Gemfile like this:

gem 'kimurai', git: 'https://github.com/vifreefly/kimuraframework'

commented

I've tested the :auto and custom encoding and found no errors.

:auto is a good approach, and it works in most cases. But some pages are actually encoded differently from what they declare in the head.

For example, in the following case I can get the right content only by using GBK. I think this is the fault of the website's developers.

image

@dccmmtop
Can you please clarify where the problem with the :auto method is? As I said, it can handle both cases. Here is an example:

def fetch_encoding(html_doc_string)
  html_doc_string.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
end

###

example_1 = '
  <!DOCTYPE html>
  <html>
    <head>
      <title>Hello World!</title>
      <meta http-equiv="content-type" content="text/html; charset=GB2312">
    </head>
    <body>
      <h1>Hello World!</h1>
    </body>
  </html>
'

puts fetch_encoding(example_1)
# => GB2312

###

example_2 = '
  <!DOCTYPE html>
  <html>
    <head>
      <title>Hello World!</title>
      <meta charset="GB2312">
    </head>
    <body>
      <h1>Hello World!</h1>
    </body>
  </html>
'

puts fetch_encoding(example_2)
# => GB2312

Or do you mean something different?

commented

@vifreefly Sorry to have misunderstood you.

:auto has no errors and works normally.
What I meant is that the actual encoding of a web page can differ from the one it declares in its head.
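
One way to cope with such pages, sketched here under assumptions rather than as anything Kimurai provides, is to try the declared encoding first and fall back to a superset when the bytes do not fit. decode_with_fallback is a hypothetical helper, and the sample text simulates a page that declares GB2312 but contains a GBK-only character:

```ruby
# Hypothetical helper (not part of Kimurai): try each candidate encoding in
# order and return [utf8_text, encoding_name] for the first one that decodes
# the bytes without error, or nil if none does.
def decode_with_fallback(raw, candidates)
  candidates.each do |name|
    begin
      return [raw.dup.force_encoding(name).encode("UTF-8"), name]
    rescue EncodingError, ArgumentError
      next # bytes do not fit this encoding, or the encoding name is unknown
    end
  end
  nil
end

# Simulated body: GBK bytes containing 镕, which GB2312 cannot represent.
raw = "朱镕基".encode("GBK").b

text, used = decode_with_fallback(raw, ["GB2312", "GBK"])
puts used # => GBK
puts text # => 朱镕基
```

A clean decode does not prove the guess is right (some encodings accept almost any byte sequence), so the order of the candidates matters.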

@dccmmtop
Thanks, now I see what you meant :)

commented

Different pages of the same website can use different encodings, but @config is global.

Would it be possible to specify a separate encoding per URL?
For example:

request_to(:parse_content, url: link, encoding: 'GBK')

For now, this is how I work around it:

    @config = {
      before_request: { delay: 1..3 },
      encoding: 'utf-8'
    }

    def parse(response, url:, data: {})
      topics = JSON.parse(response.xpath("//p").text[/(\[.+\])/, 1])
      topics.each do |topic|
        link = topic["url"].strip
        # Temporarily switch the global encoding for the GBK content page...
        self.class.config[:encoding] = "GBK"
        request_to(:parse_content, url: link)
        # ...then switch it back for the UTF-8 listing pages.
        self.class.config[:encoding] = "utf-8"
      end
    end

This approach is not ideal.
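
Until a per-request option exists, the manual switch could at least be made safer by restoring the previous value in an ensure block, so an exception inside the request does not leave the global setting on GBK. A minimal sketch; with_encoding is a hypothetical helper, not part of Kimurai's API, and a plain hash stands in for the spider's config:

```ruby
# Hypothetical helper (not part of Kimurai): override config[:encoding] for
# the duration of the block and always restore the previous value afterwards,
# even if the block raises.
def with_encoding(config, encoding)
  previous = config[:encoding]
  config[:encoding] = encoding
  yield
ensure
  config[:encoding] = previous
end

config = { encoding: "utf-8" } # stand-in for the spider's config hash

with_encoding(config, "GBK") do
  puts config[:encoding] # => GBK
end
puts config[:encoding]   # => utf-8
```

In the spider, the block would wrap the request_to(:parse_content, url: link) call, assuming the config is read at request time as in the workaround above. The setting is still global state, so this remains a stopgap.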

Thanks, I'll add this feature to the to-do list for a new version.