kagisearch / smallweb

Kagi Small Web

Home Page: https://kagi.com/smallweb

Possible candidates for removal from the index

fgiasson opened this issue · comments

Hi,

I spent some time creating a small web dataset command-line tool to start analyzing the index and to help others create a dataset from it.

The first task was to analyze the feeds to check whether they are all English-language sites, as the guidelines require. After an initial processing pass, the tool determined that some feeds are not in English.

To identify the language of a feed, the tool proceeds as follows:

  1. use langdetect on the title and description of the feed
  2. use langdetect on the title and content of the articles of each feed
  3. reassign a different language to the feed if the majority of its articles are in another language

The result of this processing is attached to this issue. Feeds tagged with a specific language are most likely in that language and could be candidates for removal. For the ones with empty languages, this happens when there is not enough data to detect the language (concat(title + description) < 64 characters) for most of its articles, so the current heuristic can't determine the appropriate language for the feed.
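As a rough illustration of the heuristic described above, here is a minimal Python sketch. It is not the tool's actual code: the function names `detect_or_none` and `feed_language`, the strict-majority rule, and the choice to pass the detector in as a parameter are all assumptions made for the example. Only the 64-character threshold and the use of langdetect come from the description.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

MIN_CHARS = 64  # threshold quoted in the issue description


def detect_or_none(text: str, detect: Callable[[str], str]) -> Optional[str]:
    """Return a language code, or None when the text is too short to trust."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None
    try:
        return detect(text)
    except Exception:
        # Assumption: treat any detector failure as "unknown".
        return None


def feed_language(
    feed_text: str,
    article_texts: Iterable[str],
    detect: Callable[[str], str],
) -> str:
    """Hypothetical sketch: feed-level guess, overridden by the article majority.

    Returns "" when most articles are too short to classify.
    """
    feed_lang = detect_or_none(feed_text, detect)
    votes = [detect_or_none(text, detect) for text in article_texts]
    if not votes:
        return feed_lang or ""
    top, count = Counter(votes).most_common(1)[0]
    if count * 2 > len(votes):  # strict majority of the articles
        return top or ""        # a majority of "unknown" yields an empty language
    return feed_lang or ""
```

With langdetect installed, the detector would be `from langdetect import detect`, and a call might look like `feed_language(title + " " + description, article_texts, detect)`, where each article text is its title joined with its content.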

Attached: small_wed_non_en_feeds.csv

Thanks, this is excellent work, and the accuracy seems very high (I just randomly checked a few).

Can you submit a PR to remove these?

I found my own website’s feed on this list (https://joeldueck.com/feed.atom), and was surprised to see it had no language listed in your CSV file.

My feed is in Atom format, and pursuant to the Atom 1.0 spec it includes the optional xml:lang attribute with the value "en", the ISO 639-1 code for English.

Some other entries in the CSV with no language indicator are http://michaelhoney.com/writing (English), https://elly.town/d/blog/ (English), and https://isohedral.ca/ (English). You might want to double-check the code behind your processor.
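For feeds that declare their language explicitly, the declared value could serve as a fallback when the text is too short to classify. The sketch below reads `xml:lang` from the root element of a feed using Python's standard library. The function name `declared_language` is hypothetical, and the sketch assumes the attribute sits on the root element; it does not reflect how the tool currently works.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Union

# ElementTree exposes the predefined "xml" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def declared_language(feed_xml: Union[str, bytes]) -> Optional[str]:
    """Return the xml:lang value on the feed's root element, if present."""
    root = ET.fromstring(feed_xml)
    lang = root.get(XML_LANG)
    if lang is None or not lang.strip():
        return None
    return lang.strip().lower()
```

A feed whose root element is `<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">` would yield `"en"`, while a feed without the attribute yields `None`.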

The idea is to indicate only which sites are non-English (which the CSV does) so that we can remove them from the index.

For the feeds with the empty lang field, here is the relevant information from the issue's description:

For the ones with empty languages, this happens when there is not enough data to detect the language (concat(title + description) < 64 characters) for most of its articles, so the current heuristic can't determine the appropriate language for the feed.

So, this is expected at this stage. Further work will be required down the line to improve the accuracy and coverage for those feeds. It shouldn't be a big deal to get this number down by a lot; I am just short on time at the moment.

@vprelovac yes, I will open a PR tomorrow morning for the ones that are tagged

see linked PR @vprelovac