Possible candidates for removal from the index

Question

fgiasson opened this issue a year ago

Hi,

I spent some time creating a small-web dataset command-line tool to start analyzing the index and to help others create a dataset from it.

The first task was to analyze the feeds to check whether they are all English sites, as the guidelines require. After an initial pass, the tool determined that some feeds are not in English.

To identify the language of a feed, it proceeds as follows:

run langdetect on the title and description of the feed
run langdetect on the title and content of each of the feed's articles
reassign the feed to a different language if the majority of its articles are in that other language

The result of this processing is attached to this issue. Feeds tagged with a specific language are most likely in that language and are candidates for removal. Feeds with an empty language are those where there is not enough data to detect the language (concat(title + description) < 64 characters) for most of their articles, so the current heuristic can't determine the appropriate language for the feed.
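For readers who want to reproduce the idea, the steps above can be sketched roughly as follows. This is not the tool's actual code: the function names, the strict-majority rule, and the choice to return an empty result when most articles are too short are assumptions made for illustration. The detector is passed in as a plain callable so the sketch runs without any third-party package; langdetect's `detect` function is one candidate.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

MIN_CHARS = 64  # threshold quoted in the issue; shorter texts are not classified


def detect_text(text: str, detect: Callable[[str], str]) -> Optional[str]:
    """Return a language code, or None when the text is too short or detection fails."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None
    try:
        return detect(text)
    except Exception:  # detectors may raise on text with no usable features
        return None


def feed_language(
    feed_title: str,
    feed_description: str,
    articles: Iterable[Tuple[str, str]],  # (title, content) pairs
    detect: Callable[[str], str],
) -> Optional[str]:
    """Hypothetical helper: feed-level guess, overridden by an article majority."""
    # Step 1: language of the feed's own title and description.
    feed_lang = detect_text(f"{feed_title} {feed_description}", detect)

    # Step 2: language of each article; None marks "not enough data".
    votes = Counter(detect_text(f"{t} {c}", detect) for t, c in articles)
    total = sum(votes.values())
    if total == 0:
        return feed_lang

    # Step 3 (assumed rule): a strict majority among the articles wins.
    top, count = votes.most_common(1)[0]
    if count * 2 > total:
        return top  # may be None when most articles were too short
    return feed_lang
```

With langdetect installed, the call would look like `feed_language(title, description, articles, detect)` after `from langdetect import detect`. Note that the sketch leaves the language empty whenever most articles fall under the 64-character threshold, which is one way the empty entries described above could arise.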

Attached: small_wed_non_en_feeds.csv

Vladimir Prelovac · Answer 1 · Wed Sep 20 2023 04:18:39 GMT+0800 (China Standard Time)

Thanks, this is excellent work, and the accuracy seems very high (I just randomly checked a few).

Can you submit a PR to remove these?

Joel Dueck · Answer 2 · Wed Sep 20 2023 04:25:43 GMT+0800 (China Standard Time)

I found my own website’s feed on this list (https://joeldueck.com/feed.atom), and was surprised to see it had no language listed in your CSV file.

My feed is in Atom format, and pursuant to the Atom 1.0 spec it includes the optional xml:lang attribute with the value "en", the ISO 639-1 code for English.
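As an illustration of how a declared language like this could be read, here is a minimal sketch using Python's standard library. It is not part of the tool discussed in this issue; the function name and the sample feed are made up for the example, and it only looks at the root element of the feed.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def declared_feed_language(feed_xml: str) -> Optional[str]:
    """Return the xml:lang value on the feed's root element, if any (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    lang = root.get(XML_LANG)
    return lang.strip().lower() if lang else None


# Made-up sample feed, for illustration only.
sample = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Example feed</title>
</feed>"""

print(declared_feed_language(sample))
```

A declared language could serve as a fallback when the text is too short for detection, though a feed's declaration may not always match its content.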

Cadence Ember · Answer 3 · Wed Sep 20 2023 06:08:43 GMT+0800 (China Standard Time)

Some other entries in the CSV with no language indicator are http://michaelhoney.com/writing (English), https://elly.town/d/blog/ (English), and https://isohedral.ca/ (English). You might want to double-check the code behind your processor.

Vladimir Prelovac · Answer 4 · Wed Sep 20 2023 06:10:10 GMT+0800 (China Standard Time)

The idea is to indicate only which sites are non-English (which the CSV does) so that we can remove them from the index.
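A rough sketch of that selection, keeping only rows tagged with a language and skipping the empty ones, might look like this. The column names `url` and `lang` are assumptions; the attached CSV's actual layout is not shown in this thread.

```python
import csv
import io
from typing import List


def removal_candidates(csv_text: str) -> List[str]:
    """Return URLs whose language tag is set and is not English.

    Assumes hypothetical columns named 'url' and 'lang'.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    candidates = []
    for row in reader:
        lang = (row.get("lang") or "").strip().lower()
        # Empty means "undetermined", so those feeds are left alone.
        if lang and lang != "en":
            candidates.append(row["url"])
    return candidates


# Made-up rows, for illustration only.
example = "url,lang\nhttps://a.example/feed,fr\nhttps://b.example/feed,\n"
print(removal_candidates(example))
```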

Frederick Giasson · Answer 5 · Wed Sep 20 2023 06:23:48 GMT+0800 (China Standard Time)

For the feeds with the empty lang field, here is the relevant information from the issue's description:

For the ones with empty languages, this happens when there is not enough data to detect the language (concat(title + description) < 64 characters) for most of its articles, so the current heuristic can't determine the appropriate language for the feed.

So, this is expected at this stage. Further work will be required down the line to improve the accuracy and coverage for those feeds. It shouldn't be a big deal to get this number down by a lot; I am just short on time at the moment.

@vprelovac yes, I will open a PR tomorrow morning for the ones that are tagged

Frederick Giasson · Answer 6 · Wed Sep 20 2023 21:21:50 GMT+0800 (China Standard Time)

see linked PR @vprelovac