extractus / article-extractor

Extract the main article from a given URL with Node.js

Home Page: https://extractor-demos.pages.dev/article-extractor

[bug] - Garbled output when the linked page is in Chinese

zxhycxq opened this issue

When the URL is something like http://media.people.com.cn/n1/2020/0617/c40606-31749210.html, the result is garbled:

(two screenshots of the result)

However, this link is also a Chinese page, and its result is correct:

https://juejin.cn/post/6931597891182002183

(screenshot of the result)

How can I solve this problem? Should I change the Unicode encoding? Thank you!

People's Daily Online (人民网) uses GBK; this library needs to read the encoding.

people.com.cn uses a GBK-family encoding: <meta http-equiv="content-type" content="text/html;charset=GB2312">. We need to read the content type of the document. @ndaidong
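
As an illustration of reading the declared encoding, here is a minimal sketch that sniffs the charset from a document's meta tag. The function name `sniffMetaCharset` and the 2048-byte scan window are assumptions made for this example, not part of the library, and the declared charset may not match the actual bytes.

```javascript
// Hypothetical helper (not part of article-extractor): read the charset
// declared in a <meta> tag near the top of an HTML document.
function sniffMetaCharset(bytes, fallback = 'utf-8') {
  // Meta tags are plain ASCII, so decoding the first bytes as latin1
  // is enough to find the declaration without knowing the real encoding.
  const head = Buffer.from(bytes).subarray(0, 2048).toString('latin1');
  // Matches both <meta charset="..."> and
  // <meta http-equiv="content-type" content="text/html;charset=...">.
  const match = head.match(/<meta[^>]+charset\s*=\s*["']?\s*([\w-]+)/i);
  return match ? match[1].toLowerCase() : fallback;
}
```

A caller would treat the returned label only as a hint, since a page can declare one encoding and be served in another.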

@zxhycxq yes, it relates to the charset of that website, as @SettingDust pointed out.

@SettingDust then what should we do next? Does it require converting the whole content to UTF-8?

@ndaidong Yup, we need a library like iconv-lite to convert the encoding.
But we can't know whether the HTML string actually matches the meta tag (the input string may be UTF-8 while the meta tag declares another encoding).
So I think the encoding should be specified through options, or users should convert the input themselves.
Otherwise we would have to add encoding to the rules, which is too complex.

We can provide an option for users to choose, if possible.

I prefer that users convert the input themselves.

Yes, as far as I can see this is just an exception. Users should find a way to handle it themselves.
In this case, the process could be: load HTML --> convert to UTF-8 --> pass the converted UTF-8 string to article-parser
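
The load and convert steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in `TextDecoder` and global `fetch` (Node.js 18+ with full ICU) rather than iconv-lite, and the names `decodeHtml` and `loadHtml` and the default `'gbk'` encoding are hypothetical choices for this example.

```javascript
// Hypothetical helper: decode raw bytes in a known source encoding
// into a JavaScript string (strings in JS are already Unicode).
function decodeHtml(bytes, encoding) {
  return new TextDecoder(encoding).decode(bytes);
}

// Hypothetical helper: fetch a page as raw bytes and decode it with the
// encoding the caller believes the page uses (assumed 'gbk' by default).
async function loadHtml(url, encoding = 'gbk') {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  // Read bytes, not text: res.text() would decode as UTF-8 and garble GBK.
  const bytes = new Uint8Array(await res.arrayBuffer());
  return decodeHtml(bytes, encoding);
}
```

The returned string is what would then be handed to the extractor. With iconv-lite, `iconv.decode(buffer, 'gbk')` would play the same role as `decodeHtml` here.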

I still encounter this problem after converting to UTF-8 with iconv. The extractor seems to fail silently with null output.

You can try reproducing this by dumping the HTML source from the browser and reading it in Node.js.

You have to confirm that the content you input is GBK.

Edit: article-parser can't extract usable links from this page. We need to provide a method that accepts a URL, but that has already been discussed.
The article can be extracted with await parseFromHtml(html, 'http://media.people.com.cn/n1/2020/0617/c40606-31749210.html')

Resolved with v8.0.8