kahliloppenheimer / Web-page-classification

Classifies webpages into categories defined in the DMOZ dataset

Topical Web-page classification of the DMOZ Dataset

This repository contains all scripts associated with my research on topical Web-page classification. You can read the full paper describing the task, experiments, and results here.

Abstract

Multi-class topical web-page classification is a difficult task with widespread applications. In this paper, I analyze the performance of well-studied techniques on two different representations of web pages: hand-written meta-descriptions and on-page text content. I took all of the training labels and website descriptions from the DMOZ dataset and obtained the on-page content by scraping the web pages themselves. On a 16-way classification task, I achieved 74.035% accuracy with on-page content and 79.121% accuracy with website descriptions, against a most-frequent-class baseline accuracy of 42.032%.
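For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of how a most-frequent-class baseline accuracy is typically computed: always predict the most common training label and measure how often that guess is right. The function name and the toy labels are hypothetical and are not taken from this repository's scripts or data.

```python
from collections import Counter


def most_frequent_baseline(train_labels, test_labels):
    """Return the majority training label and the accuracy of always predicting it."""
    if not train_labels or not test_labels:
        raise ValueError("both label lists must be non-empty")
    # Pick the label that occurs most often in the training set.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    # Count how many test examples carry that same label.
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


# Toy, made-up labels for illustration only (not the DMOZ data).
train = ["Arts", "Arts", "Arts", "Science", "Sports"]
test = ["Arts", "Science", "Arts", "Sports", "Arts"]
label, acc = most_frequent_baseline(train, test)
print(label, acc)
```

Any classifier worth reporting should beat this number, which is why the abstract gives it alongside the two accuracy figures.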

About

License: MIT License


Languages

Shell 50.6%, Python 49.4%