This repository contains various research around webpage classification using artificial intelligence (mainly).
The purpose was to benchmark various methods to classify webpages of e-commerce website into specific categories.
I wanted to use the following technologies:
- langchain.js
- GPT3.5 and GPT4
- Embeddings (ADA2 and Faiss as the store)
- BART and HuggingFace API
- Llama 2 (Replicate for simple usage and Google Collab for fine tuning)
- Playwright for browser intrumentation
TL;DR;
- GPT3.5 is above all the other LLM in term of precision and quality/price ratio
- a mix of GPT3.5 and Gzip is to me the best classification method in term of accuracy/price
- the web is a complete mess and praise browser developers because they have to deal daily with the bad HTML
From a list of e-commerce website, collect all the links by using a recursive sitemap scrapper (SitemapReader.ts).
Then take a meaningful sample of those links (no need to have 1000 pages containing 1 product), the sample is done by creating a tree from URL parts and collect a fixed number of leafs at each level. (PageSampler.ts)
Webpages are HTML and HTML is quite noisy, especially since the raise of Javascript applications. I need to extract a meaningful representation of the webpage content because classification methods (LLM or others) works better and faster with smaller texts.
I'm removing some of the HTML tag then I extract the text representation of the elements. (HTMLScrapper.ts
I also save the OpenGraph description when available.
First I tried with GPT3.5 and GPT4. The categories and the prompt are part of the variable to evaluate the classification methods. (prompts.ts, categories.ts
The results were quite good (~85%) and the difference between GPT3.5 and GPT4 was less than 2% but the price of GPT4 is 10 times more expensive!
From here, I tried other prompts:
- Prompt A: original
- Prompt B: prioritized list of informations (URL > OpenGraph > Text) => slightly better than Prompt A
- Prompt C: URL + OpenGraph only => around 70% on website with OpenGraph info
And other set of categories, the B set was the best. (I could try embeddings clusterization to find out what are the natural categories)
At the end, it costed me 9$ to classify 1575 web pages with GPT3.5 (of course I had to classify them by hand but I used some tricks so I don't have to dumbly classify all of them)
This classifier use OpenAI Ada2 embeddings to find similarity between categories (with a description) and web pages.
I'm using Faiss for similarity search.
The results are quite bad, around 30-35% success only but the price is 10 times less than GPT3.5.
BART is another LLM than can be used for text classification. I tried either with a local version (BARTClassifier and one from HuggingFace Inference API (HFClassifier
The results were quite bad, less than 20% and I encountered few limitation:
- the local version was very slow (no GPU acceleration with Langchain.js)
- the HuggingFace version is limited to 10 categories
I had the idea to reduce the number of potential categories so maybe the results can be better. For example, the 3 first answer of GPT3.5 are good 96% of the time and for the embeddings, the 10 first answers are good 95% of the time so we can use a first classifier to reduce the number of categories and then do the final classification with another one.
When I chained GPT3.5 + GPT3.5, I had 91% of good answers (compare to 85% with only one GPT3.5) but the classification price is twice more expensive.
When I chained embeddings + embeddings or embeddings + GPT3.5, I didn't had better results.
Compression algorithme works with the statistical distribution of similarity between texts so then can also be used to tell if a text is similar to another.
For this classifier, we need a train data set and a test data set to evaluate our classifier. Then the algorithm is the following for each page classification:
- for each page of the train set, compress it and get the size of compressed bytes
- compress the page representation and get the size of compressed bytes
- for each page of the train set,
- compress the train page representation + test page representation and get the size of compressed bytes
- compute the normalized compression distance (basically if the texts are similar, the distance is small)
- sort the array of distances and you have your classification
The accuracy was ~50% with 13 website as train dataset and 2 as test data set.
If the train dataset and the test dataset are from the same website then the success rate is >90%
Since they released Llama 2 during my tests, I had to try! After creating an account at Replicate, I realized that Llama 2 (70B-chat) was completely hallucinating categories and didn't respect my prompt at all regarding to the expected answer format.
I tried with other versions:
- 70b: no answer from API
- 70b-chat: hallucinate categories + bad format after ~40sec
- 7b-chat: hallucinate categories + bad format after ~10sec
- 13b-chat: bad format after ~5sec (model not trained by Replicate: a16z-infra)
- 7b: no answer
I decided to fine tune the Llama2 model for my classification problem by using this Google Collab
I had to format the data set for Llama2 and publish it on HuggingFace.
Then the model was published on HuggingFace Inference API on Nvidia A10G for 1.3$/h.
The answer were really bad but I think it's because I couldn't send enough context to HF Inference API with default settings and playing with the settings was quite difficult to understand because the limits are related to each others.