Keson96 / ConEx

A combined framework for content extraction

ConEx

This repo archives the Python code used in my thesis.

Most of the IPython notebook files are used for testing. The exception is "eval.ipynb", which runs the experiments in the thesis.

  • BTE.py (Body Text Extraction)

  • CCB.py (Content Code Blurring)

  • CETD.py (Content Extraction via Text Density)

  • CETR.py (Content Extraction via Tag Ratios)

  • CTTD.py (Compound Text-Tag Difference)

  • ConEx_dom.py (Combines the DOM-based algorithms)

  • ConEx_line.py (Combines the line-based algorithms)

  • ConEx_token.py (Combines the token-based algorithms)

  • ConEx.py (Combines the three parts above)

  • process.py (Code for preprocessing and evaluation)

  • convert_XXX.py (Process XXX dataset)

  • kmeans.py (Custom k-means algorithm for CETR-2D)
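
To give a feel for what a line-based method such as the one in CETR.py works with, here is a minimal sketch of a per-line text-to-tag ratio. It is an illustration only and is not taken from this repo: the function name `tag_ratios`, the regex-based tag matching, and the sample HTML are all assumptions made for the example. Lines with a high ratio are more likely to be main content, and lines with a low ratio are more likely to be navigation or other boilerplate.

```python
import re

# Hypothetical helper, not part of this repo: a crude regex that matches
# anything that looks like an HTML tag. A real implementation would use
# an HTML parser instead.
TAG_RE = re.compile(r"<[^>]+>")


def tag_ratios(html):
    """Return one text-to-tag ratio per non-empty line of `html`."""
    ratios = []
    for line in html.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tags = len(TAG_RE.findall(line))
        text = len(TAG_RE.sub("", line).strip())
        # A line without tags counts as having one tag, so the ratio
        # is simply the length of its text.
        ratios.append(text / max(tags, 1))
    return ratios


# Made-up sample: a navigation line followed by a content line.
sample = """<div><a href="/">Home</a><a href="/about">About</a></div>
<p>Content extraction separates the main text of a page from its boilerplate.</p>
"""
ratios = tag_ratios(sample)
print(ratios)
```

In this sample the navigation line gets a much lower ratio than the paragraph line. A separate step, such as thresholding or clustering the ratios, would then decide which lines to keep.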

Languages

Jupyter Notebook 51.8%, Python 48.2%