Classify web pages given the HTML text.
When we encounter a new website, we need to know which category it belongs to; this is helpful in web-crawling scenarios. We can use two types of modeling for this:
- Text models: Extract the raw HTML, clean it, extract keywords and word frequencies, and use those as features for modeling. The issue is that web pages can be very large, and much of the data may be website metadata rather than what is rendered in the browser, which can lead to many misclassifications. The number of text tokens also makes transformer models difficult to use, as many of them (BERT-style encoders, for example) limit input to 512 tokens.
- Vision models: Take a screenshot of each website as rendered in the browser, crop it to a predefined size, and pass it to a classifier model or to zero-shot classification based on autoencoders.
Dataset: Structured Web Data Extraction Dataset (SWDE)
Data split: GroupShuffleSplit
Features: TfidfVectorizer
Model: MultinomialNB
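The four choices above can be wired together roughly as follows. This is a minimal sketch, not the notebook's actual code: the toy texts, labels, and site names are made up for illustration, and grouping by website (so that pages from one site never appear in both train and test) is an assumption about why GroupShuffleSplit is used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for cleaned page text (hypothetical; real data would come from SWDE).
texts = np.array([
    "used car price mileage engine",
    "sedan engine horsepower price",
    "car dealer mileage warranty",
    "truck engine fuel economy",
    "novel author publisher isbn",
    "paperback author pages isbn",
    "book publisher edition hardcover",
    "author biography novel edition",
])
labels = np.array(["auto", "auto", "auto", "auto", "book", "book", "book", "book"])
# Assumed grouping key: the website each page was crawled from.
groups = np.array(["site_a", "site_a", "site_b", "site_b",
                   "site_c", "site_c", "site_d", "site_d"])

# Hold out whole websites so the test pages come from unseen sites.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(texts, labels, groups=groups))

# TF-IDF features feeding a multinomial naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts[train_idx], labels[train_idx])
predictions = model.predict(texts[test_idx])
```

With only a handful of toy pages the predictions are not meaningful; the point is the shape of the split and the pipeline.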
Download the zip file and put it in the data directory. Run the data_process_and_train.ipynb notebook to extract the raw HTML, clean the text, and prepare it for training.
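As a rough illustration of the cleaning step, the sketch below pulls visible text out of raw HTML using only the standard library. The class and function names (`VisibleTextExtractor`, `html_to_text`) are hypothetical, and the notebook may clean pages differently; dropping everything inside `<head>` is an assumption motivated by the metadata concern mentioned earlier.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside script, style, or head elements."""

    # Assumption: content of these tags is not useful for classification.
    SKIPPED_TAGS = {"script", "style", "head", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the collected pieces and collapse runs of whitespace.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def html_to_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><h1>Used  cars</h1><script>var x=1;</script>"
    "<p>Price &amp; mileage</p></body></html>"
)
cleaned = html_to_text(sample)
```

Note that this also discards the page title; if titles turn out to be informative, `head` can be removed from `SKIPPED_TAGS`. Malformed pages that never close a skipped tag would lose their remaining text, so a more forgiving parser may be preferable in practice.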
- https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
- https://towardsdatascience.com/creating-benchmark-models-the-scikit-learn-way-af227f6ea977
- https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle/notebook
- https://www.kaggle.com/code/vonneumann/benchmarking-sklearn-classifiers/notebook
- https://www.kdnuggets.com/2018/07/overview-benchmark-deep-learning-models-text-classification.html
- https://github.com/nlptown/nlp-notebooks/blob/master/Traditional%20text%20classification%20with%20Scikit-learn.ipynb
- https://www.oreilly.com/library/view/practical-natural-language/9781492054047/ch04.html
- Feature engineering
- More complex models
- Random forest
- XGBoost
- LightGBM
- Hyperparameter tuning
- n-fold CV
- Grid / Random search
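The tuning items in the list above could look something like the sketch below, which runs a grid search with grouped n-fold cross-validation over a TF-IDF and naive Bayes pipeline. The toy data, the parameter grid, and the choice of `GroupKFold` (to keep each website inside a single fold) are all assumptions for illustration, not settings taken from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; real inputs would be the cleaned SWDE pages.
texts = [
    "used car price mileage engine",
    "sedan engine horsepower price",
    "car dealer mileage warranty",
    "truck engine fuel economy",
    "novel author publisher isbn",
    "paperback author pages isbn",
    "book publisher edition hardcover",
    "author biography novel edition",
]
labels = ["auto", "auto", "auto", "auto", "book", "book", "book", "book"]
groups = ["site_a", "site_a", "site_b", "site_b",
          "site_c", "site_c", "site_d", "site_d"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Illustrative grid only; sensible ranges depend on the real data.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=GroupKFold(n_splits=2),  # n-fold CV that never splits a website
    scoring="accuracy",
)
search.fit(texts, labels, groups=groups)
best_params = search.best_params_
```

For larger grids, `RandomizedSearchCV` accepts the same pipeline and samples a fixed number of parameter combinations instead of trying them all.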
Clustering websites with screenshots
Datasets:
- https://public.roboflow.com/object-detection/website-screenshots/
- https://www.kaggle.com/datasets/aydosphd/webscreenshots
- https://www.circl.lu/opendata/circl-ail-dataset-01/
- https://www.urlbox.io/automated-screenshots/classify-website-screenshots-with-ai
- https://paperswithcode.com/dataset/cova
Data split:
Features: ResNet model
Model: KNN Clustering
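A rough sketch of the clustering step on image features follows. Two things here are assumptions: the ResNet feature extraction is replaced by synthetic placeholder vectors (512 dimensions, the size of ResNet-18 pooled features), and K-Means is used as a stand-in clustering algorithm, since the "KNN Clustering" listed above may refer to a different method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Placeholder for ResNet embeddings of screenshots: two synthetic groups
# of 512-dimensional vectors. Real features would come from a pretrained
# network applied to the cropped screenshots.
centers = rng.normal(size=(2, 512))
features = np.vstack(
    [center + 0.05 * rng.normal(size=(10, 512)) for center in centers]
)

# L2-normalise so Euclidean distance behaves like cosine distance.
features = normalize(features)

# Swapped-in technique: K-Means with an assumed number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)
```

The number of clusters is fixed at two only because the placeholder data has two groups; for real screenshots it would need to be chosen, for example by inspecting cluster quality across several values.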