ppke-nlpg / CleanPortalEval

boilerplate removal test set for portals (more sites from the same domain)

CleanPortalEval
It is a boilerplate removal test set for portals.

It is similar to the CleanEval test set, but it contains more pages from each domain. Motivation: some boilerplate removal algorithms (e.g. GoldMiner) need several sample pages from the same domain.
Its input and gold standard have the same format as CleanEval's, so the CleanEval evaluation script can be used on them as well.
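As a rough illustration of comparing a cleaned page against its gold standard, here is a minimal Python sketch. It is not the CleanEval evaluation script and does not reproduce its scoring; it computes a simplified bag-of-words precision, recall and F1. The function name token_prf and the assumption that both texts are already loaded as plain strings are hypothetical.

```python
from collections import Counter


def token_prf(extracted: str, gold: str) -> tuple[float, float, float]:
    """Bag-of-words precision, recall and F1 of extracted text vs. gold text.

    Simplified stand-in metric (an assumption for illustration), not the
    scoring used by the CleanEval evaluation script.
    """
    ext = Counter(extracted.split())   # token counts in the algorithm output
    ref = Counter(gold.split())        # token counts in the gold standard
    overlap = sum((ext & ref).values())  # tokens shared by both, with multiplicity

    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy strings, not taken from the test set.
    print(token_prf("Menu Home Article text here", "Article text here"))
```

Low precision with high recall would suggest that boilerplate is still present in the output; the reverse would suggest that real content was removed.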

It contains 70 pages from 4 domains.

Reference
If you use this test set, please cite the following paper:

@article{endredy_more_2013,
        title = {More {Effective} {Boilerplate} {Removal} - the {GoldMiner} {Algorithm}},
        issn = {1870-9044},
        url = {http://polibits.gelbukh.com/2013_48},
        language = {eng},
        number = {48},
        journal = {Polibits - Research journal on Computer science and computer engineering},
        author = {Endr{\'e}dy, Istv{\'a}n and Nov{\'a}k, Attila},
        year = {2013},
        keywords = {boilerplate removal, Corpus building, the web as corpus},
        pages = {79--83}
}

paper:
http://www.gelbukh.com/polibits/2013_48/More%20Effective%20Boilerplate%20Removal%20-%20the%20GoldMiner%20Algorithm.pdf
