[bug] - Garbled output when some links are in Chinese

zxhycxq opened this issue 2 years ago · comments

你好世界 commented 2 years ago

When the URL is something like http://media.people.com.cn/n1/2020/0617/c40606-31749210.html, the extracted result is garbled.

However, this link is also to a Chinese page, and its result is correct:

https://juejin.cn/post/6931597891182002183

How can I solve this problem? Should I change the encoding to Unicode? Thank you!

SettingDust · Answer 1 · Sun Nov 27 2022 21:56:02 GMT+0800 (China Standard Time)

People's Daily Online (people.com.cn) uses a GBK-family encoding, as its meta tag declares: <meta http-equiv="content-type" content="text/html;charset=GB2312">. This library needs to read the content type of the document. @ndaidong
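As a rough illustration of "reading the content type of the document", the declared charset can be sniffed from the first bytes of the page before any decoding happens. This is only a sketch: `sniffDeclaredCharset` is a hypothetical helper name, not part of article-parser, and it only looks at the meta tag (it ignores the HTTP `Content-Type` header and byte-order marks).

```javascript
// Hypothetical helper (not an article-parser API): return the charset
// declared in a <meta> tag near the top of an HTML document, or null.
function sniffDeclaredCharset(bytes) {
  // latin1 maps every byte to exactly one character, so this step cannot
  // fail and leaves the ASCII markup readable whatever the real encoding is.
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  // Covers both <meta charset="..."> and the http-equiv/content form.
  const match = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

const sample = Buffer.from(
  '<html><head><meta http-equiv="content-type" content="text/html;charset=GB2312"></head></html>'
);
console.log(sniffDeclaredCharset(sample));
```

The returned label is only what the page claims about itself; whether the bytes really match it is a separate question.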

Dong Nguyen · Answer 2 · Mon Nov 28 2022 15:36:05 GMT+0800 (China Standard Time)

@zxhycxq Yes, it is related to the charset of that website, as @SettingDust pointed out.

@SettingDust Then what should we do next? Does it require converting the whole content to UTF-8?

SettingDust · Answer 3 · Mon Nov 28 2022 16:10:23 GMT+0800 (China Standard Time)

@ndaidong Yup, we would need a library like iconv-lite to convert the encoding.
But we can't know whether the HTML string actually matches the meta tag (the input string may already be UTF-8 while the meta tag declares another encoding).
So I think the encoding should be specified through options, or users should convert the input themselves.
Otherwise we would have to add encoding to the rules, which is too complex.
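One way to cope with the mismatch described above (input already UTF-8 while the meta tag says otherwise) is to try a strict UTF-8 decode first and fall back to the declared charset only if that fails. This is a sketch, not something the library does: `decodeWithFallback` is a hypothetical name, it uses Node's built-in `TextDecoder` rather than iconv-lite, and it assumes a Node build with full ICU so that the `gbk` label is available.

```javascript
// Hypothetical helper: prefer UTF-8 when the bytes are valid UTF-8,
// otherwise trust the charset the page declares.
function decodeWithFallback(bytes, declaredCharset) {
  try {
    // fatal: true makes the decoder throw on invalid sequences instead of
    // silently inserting U+FFFD replacement characters.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    // Assumption: the declared label is one TextDecoder understands.
    return new TextDecoder(declaredCharset).decode(bytes);
  }
}

const gbkBytes = Buffer.from([0xc4, 0xe3, 0xba, 0xc3]); // "你好" encoded as GBK
const utf8Bytes = Buffer.from('你好', 'utf8');           // the same text as UTF-8

console.log(decodeWithFallback(gbkBytes, 'gbk'));  // falls back to GBK
console.log(decodeWithFallback(utf8Bytes, 'gbk')); // meta is wrong, UTF-8 wins
```

The check is a heuristic: a short GBK byte sequence can occasionally happen to be valid UTF-8, so an explicit option supplied by the caller remains the more reliable route.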

你好世界 · Answer 4 · Mon Nov 28 2022 16:20:09 GMT+0800 (China Standard Time)

We could provide an option for users to choose, if possible.

SettingDust · Answer 5 · Mon Nov 28 2022 16:21:52 GMT+0800 (China Standard Time)

> We can provide an option for users to choose,if possible.

I'd prefer that users convert the input themselves.

Dong Nguyen · Answer 6 · Mon Nov 28 2022 23:27:34 GMT+0800 (China Standard Time)

Yes, as far as I can see this is just an exceptional case. Users should find a way to handle it themselves.
In this case, the process could be: load HTML --> convert to UTF-8 --> pass converted UTF-8 string to article-parser
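The three steps above could look roughly like the sketch below. Assumptions: Node 18+ (global `fetch`) with full ICU, the caller already knows the page's charset, and the function names `loadBytes`, `decodeHtml` and `extractFromLegacyPage` are invented for illustration. The parser is passed in as a callback so the sketch does not depend on any particular export name of article-parser.

```javascript
// Step 1: load the page as raw bytes. Asking for text here would decode
// the body before we get a chance to pick the right charset.
async function loadBytes(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return new Uint8Array(await res.arrayBuffer());
}

// Step 2: "convert to UTF-8". In JavaScript this means decoding the bytes
// into an ordinary string using the page's real charset.
function decodeHtml(bytes, charset) {
  return new TextDecoder(charset).decode(bytes);
}

// Step 3: hand the decoded string to the parser. `parse` is whatever
// HTML-accepting function the library version in use provides.
async function extractFromLegacyPage(url, charset, parse) {
  const html = decodeHtml(await loadBytes(url), charset);
  return parse(html, url);
}

// Offline check of step 2 with the GBK bytes for "你好".
console.log(decodeHtml(Buffer.from([0xc4, 0xe3, 0xba, 0xc3]), 'gbk'));
```

A call would then look like `await extractFromLegacyPage(url, 'gbk', parse)`, with `parse` supplied by the caller.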

nick008a · Answer 7 · Tue Nov 29 2022 17:25:05 GMT+0800 (China Standard Time)

I still encounter this problem after converting to UTF-8 with iconv. The extractor seems to fail silently with null output.

You can try reproducing this by dumping the HTML source from the browser and reading it in Node.js.

SettingDust · Answer 8 · Tue Nov 29 2022 18:19:12 GMT+0800 (China Standard Time)

> I still encounter this problem after iconv to UTF-8. The extractor seems to fail silently with null output.
>
> You can try reproducing this by dumping the HTML source from the browser and read it in nodejs.

You have to confirm that the content you pass in is actually GBK.

Edit: article-parser can't extract usable links from this page. We need to provide a method that accepts a URL, though that is already being discussed.
The page can be extracted with await parseFromHtml(html, 'http://media.people.com.cn/n1/2020/0617/c40606-31749210.html')

Dong Nguyen · Answer 9 · Fri Apr 26 2024 16:53:35 GMT+0800 (China Standard Time)

Resolved with v8.0.8