na9da / haskell-jusText

Tool for removing boilerplate from HTML pages


haskell-jusText

This is a Haskell clone of the Python jusText project. It removes boilerplate content from HTML pages, leaving just the main content. jusText applies a set of heuristics to identify the main content of a page. You can read more about them in the thesis by Jan Pomikálek.
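To give a feel for the kind of heuristic involved, here is a minimal sketch of a per-paragraph classifier based on length, link density, and stopword density. It is loosely modelled on the context-free classification step described for jusText. The `Paragraph` type, its field names, and the threshold values are assumptions made for illustration and are not taken from this repository's source.

```haskell
-- Hypothetical summary of one paragraph; the field names are illustrative.
data Paragraph = Paragraph
  { paraLength      :: Int     -- number of characters in the paragraph text
  , linkDensity     :: Double  -- fraction of characters inside links
  , stopwordDensity :: Double  -- fraction of words that are stopwords
  } deriving (Show)

data Class = Good | NearGood | Short | Bad
  deriving (Show, Eq)

-- Assumed threshold values, chosen for illustration only.
maxLinkDensity, stopwordsLow, stopwordsHigh :: Double
maxLinkDensity = 0.2
stopwordsLow   = 0.30
stopwordsHigh  = 0.32

lengthLow, lengthHigh :: Int
lengthLow  = 70
lengthHigh = 200

-- Classify a paragraph on its own, without looking at its neighbours.
classify :: Paragraph -> Class
classify p
  | linkDensity p > maxLinkDensity     = Bad       -- mostly links: navigation
  | paraLength p < lengthLow           = Short     -- too short to judge alone
  | stopwordDensity p >= stopwordsHigh =
      if paraLength p > lengthHigh then Good else NearGood
  | stopwordDensity p >= stopwordsLow  = NearGood
  | otherwise                          = Bad       -- few stopwords: not prose
```

Loaded in GHCi, `classify (Paragraph 300 0.0 0.4)` evaluates to `Good`, while the same paragraph with a link density of `0.5` evaluates to `Bad`. A full implementation would also revise `Short` and `NearGood` paragraphs based on their neighbours, which this sketch leaves out.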

Building

  stack install
  haskell-jusText <htmlFile> <stopwordsFile>

Stopword files for different languages are available in the original repo.
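As a rough illustration of how a stopword file might be used, the sketch below parses a list of stopwords and computes the fraction of words in a text that appear in it. It assumes the file holds whitespace-separated words (for example, one per line). The function names `parseStopwords` and `stopwordRatio` are hypothetical and do not come from this repository.

```haskell
import           Data.Char (toLower)
import qualified Data.Set  as Set

-- Parse the contents of a stopword file into a lower-cased set.
-- Assumes words are separated by whitespace, e.g. one word per line.
parseStopwords :: String -> Set.Set String
parseStopwords = Set.fromList . words . map toLower

-- Fraction of words in the text that are stopwords.
-- Tokenisation is naive: it splits on whitespace and keeps punctuation.
stopwordRatio :: Set.Set String -> String -> Double
stopwordRatio stops text
  | null ws   = 0
  | otherwise = fromIntegral hits / fromIntegral (length ws)
  where
    ws   = words (map toLower text)
    hits = length (filter (`Set.member` stops) ws)
```

With the file contents read via `readFile`, a call such as `stopwordRatio (parseStopwords contents) paragraphText` yields a value between 0 and 1. Running text in the target language tends to score higher than menus or link lists.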


License: BSD 3-Clause "New" or "Revised" License


Languages: Haskell 98.6%, Nix 1.4%