thorkill / dbce

Diff Based Content Extraction is part of my bachelor thesis, "Joint Approach to Boilerplate Detection in Web Archives".

Diff Based Content Extraction

This is a Python framework that I developed for my bachelor thesis. Its main purpose was to research methods for extracting content from large collections of HTML documents stored in web archives.
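As a rough illustration of the general idea that the name suggests, and not the framework's actual algorithm or API, the sketch below compares two snapshots of a page and keeps only the lines that differ. It assumes that lines shared by both snapshots are boilerplate (navigation, footer) and that changed lines are content. The function name `extract_changed_lines` and the sample pages are hypothetical, and the diffing is done with the standard-library `difflib` module.

```python
import difflib


def extract_changed_lines(page_a: str, page_b: str) -> list[str]:
    """Return the lines of page_b that differ from page_a.

    Hypothetical helper: treats lines common to both snapshots as
    boilerplate and lines that were replaced or inserted as content.
    """
    # Normalise: strip whitespace and drop empty lines.
    a_lines = [line.strip() for line in page_a.splitlines() if line.strip()]
    b_lines = [line.strip() for line in page_b.splitlines() if line.strip()]

    matcher = difflib.SequenceMatcher(a=a_lines, b=b_lines, autojunk=False)
    content = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "equal" blocks are shared by both snapshots; skip them.
        if tag in ("replace", "insert"):
            content.extend(b_lines[j1:j2])
    return content


# Two made-up snapshots that share a template but differ in the article text.
snapshot_a = "<nav>Home</nav>\n<p>Article one</p>\n<footer>Contact</footer>"
snapshot_b = "<nav>Home</nav>\n<p>Article two</p>\n<footer>Contact</footer>"

print(extract_changed_lines(snapshot_a, snapshot_b))
```

A line-level diff is the simplest possible choice here; a real system working on archived HTML would more likely compare parsed DOM trees or text blocks, which this sketch does not attempt.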

Copyright notice

This repository contains content that has been crawled for research purposes.

Languages

HTML 57.1%, Python 41.8%, JavaScript 0.6%, Shell 0.4%