tangdai0228 / dmoz-urlclassifier

Preparing DMOZ dataset for my n-Gram LM-based URL classification research

DMOZ URL Classifier

DMOZ is the largest, most comprehensive human-edited directory of the Web. It was historically known as the Open Directory Project (ODP). It contains a categorized list of Web URLs. Its listings are updated on a monthly basis and published as RDF files.

My research project classifies web pages based on their URLs only, so the DMOZ dataset is one of the datasets I use.

If you are going to download the DMOZ RDF files, you may find the two scripts here useful.

  • dmoz2csv.py: This script converts the DMOZ RDF data into a CSV file. Each line of the CSV file contains a unique ID, a URL, and the category of that URL as listed in DMOZ.

  • csv2traintest.py: This script takes the CSV file produced by dmoz2csv.py and converts it into training and test datasets, as explained by Baykan et al.
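
The RDF-to-CSV conversion described above can be illustrated with a minimal sketch. It is not the code of dmoz2csv.py: the function names, the sample data, and the assumed layout of the dump (`ExternalPage` elements carrying the URL in an `about` attribute and the category in a `topic` child) are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the assumed layout of a DMOZ content dump.
SAMPLE_RDF = """<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://www.example.com/cartoons">
    <d:Title>Example Cartoons</d:Title>
    <topic>Top/Arts/Animation</topic>
  </ExternalPage>
  <ExternalPage about="http://www.example.org/chess">
    <d:Title>Example Chess</d:Title>
    <topic>Top/Games/Board_Games</topic>
  </ExternalPage>
</RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def rdf_to_rows(source):
    """Yield (id, url, category) tuples from a binary file-like RDF source."""
    next_id = 1
    # iterparse streams the file, so a large dump is never fully in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        url = elem.get("about")
        topic = next(
            (child.text for child in elem if local_name(child.tag) == "topic"),
            None,
        )
        if url and topic:
            yield (next_id, url, topic)
            next_id += 1
        elem.clear()  # release the children of the processed element


def write_csv(rows, out):
    """Write the rows to an open text file as CSV lines: id, URL, category."""
    csv.writer(out).writerows(rows)


if __name__ == "__main__":
    rows = rdf_to_rows(io.BytesIO(SAMPLE_RDF.encode("utf-8")))
    buffer = io.StringIO()
    write_csv(rows, buffer)
    print(buffer.getvalue())
```

For a real dump, the `io.BytesIO` object would be replaced by a file opened in binary mode, and the output buffer by a file opened with `newline=""` as the `csv` module recommends.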

Running "csv2traintest.py" on "dmoz0409.csv" produces 15 training and test file pairs.
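
One way to arrive at one training/test pair per category is a one-vs-rest split over the top-level DMOZ categories. The sketch below illustrates that idea only. The function names, the 80/20 split ratio, the fixed seed, and the use of every other category as negative examples are assumptions, and the actual procedure in csv2traintest.py and in the paper may differ.

```python
import random
from collections import defaultdict


def top_level(category):
    """Return the top-level topic, e.g. 'Arts' for 'Top/Arts/Animation'."""
    parts = category.split("/")
    return parts[1] if len(parts) > 1 and parts[0] == "Top" else parts[0]


def make_pairs(rows, train_fraction=0.8, seed=0):
    """Build one (train, test) pair per top-level topic.

    rows is an iterable of (id, url, category) tuples. Each example in a
    pair is (url, label), where label is 1 if the URL belongs to the topic
    and 0 otherwise. The split ratio and seed are illustrative assumptions.
    """
    by_topic = defaultdict(list)
    for _id, url, category in rows:
        by_topic[top_level(category)].append(url)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    pairs = {}
    for topic, positives in by_topic.items():
        negatives = [
            url
            for other, urls in by_topic.items()
            if other != topic
            for url in urls
        ]
        labeled = [(u, 1) for u in positives] + [(u, 0) for u in negatives]
        rng.shuffle(labeled)
        cut = int(len(labeled) * train_fraction)
        pairs[topic] = (labeled[:cut], labeled[cut:])
    return pairs


if __name__ == "__main__":
    sample = [
        (1, "http://a.example/", "Top/Arts/Animation"),
        (2, "http://b.example/", "Top/Games/Chess"),
        (3, "http://c.example/", "Top/Arts/Music"),
    ]
    for topic, (train, test) in make_pairs(sample).items():
        print(topic, len(train), len(test))
```

With 15 top-level categories in the input, a split of this kind yields 15 pairs, one per category.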
