How to set encoding?
dccmmtop opened this issue
When a website is encoded in GB2312, its content cannot be retrieved correctly.
I think it would be better to change
def current_response(response_type = :html)
  case response_type
  when :html
    Nokogiri::HTML(body)
  when :json
    JSON.parse(body)
  end
end
TO
def current_response(response_type = :html)
  case response_type
  when :html
    Nokogiri::HTML(body, nil, @config[:encoding])
  when :json
    JSON.parse(body)
  end
end
OR
def current_response(response_type = :html)
  case response_type
  when :html
    # The literal string "encoding" is not a valid encoding name; use the configured value
    Nokogiri::HTML(body.force_encoding(@config[:encoding]))
  when :json
    JSON.parse(body)
  end
end
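To illustrate why an explicit encoding matters here, below is a minimal sketch in plain Ruby (no Kimurai or Nokogiri involved). The byte string is an assumed example: it is the GB2312 encoding of a two-character Chinese greeting, standing in for what such a site might send.

```ruby
# Assumed sample input: the GB2312 bytes for a two-character Chinese greeting.
raw = "\xC4\xE3\xBA\xC3".b

# Tagging the bytes as UTF-8 does not change them, and they are not valid UTF-8.
as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?

# Tagging them with their real encoding and transcoding gives readable text.
decoded = raw.dup.force_encoding("GB2312").encode("UTF-8")
puts decoded
```

Note that force_encoding only relabels the bytes; it is encode that actually converts them.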
Hello @dccmmtop! You are right, a config option like @encoding should be added.
Your examples are good, but setting a custom encoding should be optional, because in most cases pages are parsed correctly without an explicit encoding.
I would also like to add an "auto" mode, where Kimurai tries to recognize the correct encoding automatically. The encoding is usually declared in a meta tag such as <meta http-equiv="Content-Type"> or <meta charset> (https://www.w3schools.com/html/html_charset.asp).
I have a working regex (from one of my recent projects) that correctly parses the encoding in both cases above:
resp_string = response.body
charset = resp_string.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
Nokogiri::HTML(resp_string, nil, charset)
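One detail worth keeping in mind with the snippet above: String#force_encoding retags the receiver in place, so resp_string is left labeled ISO-8859-1 afterwards. A small sketch of a variant that works on a copy instead; detect_charset is a hypothetical helper name, not part of Kimurai, and the sample markup is made up.

```ruby
# Assumed sample input: raw response bytes, tagged as binary.
raw = %(<meta charset="GB2312">).b

# force_encoding changes the string it is called on and returns that same string.
mutated = raw.dup
mutated.force_encoding("ISO-8859-1")
puts mutated.encoding

# Hypothetical helper: same regex as above, but applied to a copy of the input.
def detect_charset(html)
  html.dup.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
end

charset = detect_charset(raw)
puts charset
puts raw.encoding # the original string keeps its binary tag
```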
So the method current_response can be modified to:
# Works with any one of:
@config = { encoding: nil }      # do not handle encoding at all (current behavior)
@config = { encoding: :auto }    # try to detect the encoding automatically
@config = { encoding: "GB2312" } # set the required encoding manually
###
def current_response(response_type = :html)
  case response_type
  when :html
    if encoding = @config[:encoding]
      if encoding == :auto
        charset = body.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
        Nokogiri::HTML(body, nil, charset)
      else
        Nokogiri::HTML(body, nil, encoding)
      end
    else
      Nokogiri::HTML(body)
    end
  when :json
    JSON.parse(body)
  end
end
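The three modes in the method above can also be checked without a browser or Nokogiri by pulling the charset decision into its own function. This is only a sketch: resolve_charset and META_CHARSET are hypothetical names, and the sample page is made up. The return value is what would be passed as the third argument to Nokogiri::HTML.

```ruby
# Same regex as in the method above, extracted into a constant (hypothetical name).
META_CHARSET = /<meta.*?charset=["]?([\w+\d+\-]*)/i

# Hypothetical helper: decide which charset to hand to the HTML parser.
#   nil    -> nil (leave encoding handling to the parser)
#   :auto  -> charset declared in a meta tag, or nil if none is found
#   other  -> the configured value, as a string
def resolve_charset(setting, body)
  case setting
  when nil   then nil
  when :auto then body.dup.force_encoding("ISO-8859-1").encode("UTF-8")[META_CHARSET, 1]
  else setting.to_s
  end
end

# Assumed sample page for illustration.
page = %(<head><meta http-equiv="content-type" content="text/html; charset=GB2312"></head>).b

p resolve_charset(nil, page)
p resolve_charset(:auto, page)
p resolve_charset("GBK", page)
```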
I'll try to add this feature today and release a new version. Thanks!
@dccmmtop, I added the config option encoding. It's in master now: 96fe695.
Can you please check both cases, :auto and a custom encoding?
# Any one of:
@config = { encoding: nil }      # do not handle encoding at all (current behavior)
@config = { encoding: :auto }    # try to detect the encoding automatically
@config = { encoding: "GB2312" } # set the required encoding manually
To use the Kimurai version from master, add it to your Gemfile this way:
gem 'kimurai', git: 'https://github.com/vifreefly/kimuraframework'
I've tested :auto and custom encoding and found no errors.
:auto is a good method, and it works in most cases. But some pages are actually encoded differently from what they declare in the head.
For example, in the following situation I can only get the right content by using GBK. I think this is the fault of the website's developers.
@dccmmtop
Can you please clarify what the problem with the :auto method is? As I said, it can handle both cases. Here is an example:
def fetch_encoding(html_doc_string)
  html_doc_string.force_encoding("ISO-8859-1").encode("UTF-8")[/<meta.*?charset=["]?([\w+\d+\-]*)/i, 1]
end
###
example_1 = '
<!DOCTYPE html>
<html>
<head>
<title>Hello World!</title>
<meta http-equiv="content-type" content="text/html; charset=GB2312">
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
'
puts fetch_encoding(example_1)
# => GB2312
###
example_2 = '
<!DOCTYPE html>
<html>
<head>
<title>Hello World!</title>
<meta charset="GB2312">
</head>
<body>
<h1>Hello World!</h1>
</body>
</html>
'
puts fetch_encoding(example_2)
# => GB2312
Or do you mean something different?
@vifreefly Sorry for the misunderstanding.
:auto has no errors and works normally.
What I mean is that the actual encoding of a web page can differ from what it declares in its head.
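For pages whose declared encoding does not match the actual bytes, one possible approach is to try the declared encoding first and fall back to other candidates if transcoding fails. The sketch below is only an illustration of that idea in plain Ruby; decode_with_fallback is a hypothetical helper, and the sample bytes are an assumed case of GBK content on a page that declares GB2312.

```ruby
# Hypothetical helper: return [encoding_name, utf8_text] for the first candidate
# encoding that can transcode the raw bytes to UTF-8, or nil if none can.
def decode_with_fallback(bytes, candidates)
  candidates.each do |name|
    begin
      return [name, bytes.dup.force_encoding(name).encode("UTF-8")]
    rescue EncodingError
      # Invalid bytes or unmapped characters for this candidate; try the next one.
      next
    end
  end
  nil
end

# Assumed sample: the first character exists in GBK but not in GB2312.
raw = "\x81\x40\xC4\xE3".b

result = decode_with_fallback(raw, ["GB2312", "GBK"])
p result && result.first
```

This is a heuristic, not a guarantee: many byte sequences are valid in more than one encoding, so the first candidate that succeeds is not necessarily the right one.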
@dccmmtop
Thanks, now I see what you meant :)
Pages on the same website can use different encodings, but @config is global.
Would it be possible to specify a separate encoding per URL?
Example:
request_to(:parse_content, url: link, encoding: 'GBK')
For now, this is how I work around it:
@config = {
before_request: { delay: 1..3 },
encoding: 'utf-8'
}
def parse(response, url:, data: {})
  topics = JSON.parse(response.xpath("//p").text[/(\[.+\])/, 1])
  topics.each do |topic|
    link = topic["url"].strip
    self.class.config[:encoding] = "GBK"
    request_to(:parse_content, url: link)
    self.class.config[:encoding] = "utf-8"
  end
end
This workaround is not ideal.
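Until a per-request option exists, the set-and-reset pattern above could at least be wrapped so the original value is restored even if the request raises. A minimal sketch on a plain hash; with_encoding is a hypothetical helper, not part of Kimurai, and it assumes (as the workaround does) that the request finishes before the block returns.

```ruby
# Hypothetical helper: temporarily override config[:encoding] for the duration
# of the block, restoring the previous value afterwards, even on error.
def with_encoding(config, encoding)
  previous = config[:encoding]
  config[:encoding] = encoding
  yield
ensure
  config[:encoding] = previous
end

# Stand-in for the spider's config hash.
config = { encoding: "utf-8" }

seen = nil
with_encoding(config, "GBK") do
  # In the spider this is where request_to(:parse_content, url: link) would go.
  seen = config[:encoding]
end

puts seen
puts config[:encoding]
```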
Thanks, I'll add this feature to the to-do list for a new version.