Veridion Challenge 2

Project: Furniture Stores Extraction

Goal

Develop a model capable of extracting product names from furniture store websites.

Inputs

A list of URLs from furniture store sites.

Outputs

A list of product names extracted from each URL.

Insights

Veridion provides the most comprehensive database of company data, gathered by AI with human precision.

Upon downloading a data sample, I needed to clarify whether the product names to extract were specific ("Hamar Plant Stand") or generic ("Plant Stand"). Inspection of the data sample led to the conclusion that "Plant Stand" is the target.

Veridion Data Sample: Data Dictionary - Product & Services

Veridion Data Sample: Products & Services Sample

This challenge offers an opportunity to improve the extraction process, as some product names are currently not captured correctly.

Veridion Data Sample: Products & Services Sample - Wrong product name example

Veridion Entity Recognizers - the basis for building the model to identify 'PRODUCT' entities.

Guidelines

Create a NER (Named Entity Recognition) model.
Train the NER model to find 'PRODUCT' entities.
Use ~100 pages from the URLs list for training data.
Develop a method to tag sample products.
Use the model to extract product names from unseen pages.
Showcase the solution.

The Process

URL Verification:
- Verified URLs to ensure they were functional using verify_urls.py, producing valid_urls.csv and invalid_urls.csv.
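
A minimal sketch of this verification step, using only the standard library. The contents of verify_urls.py are not shown here, so the function names (`is_reachable`, `partition_urls`, `write_url_csv`), the single `url` CSV column, and the "any 2xx/3xx response counts as valid" rule are all assumptions made for illustration.

```python
import csv
import http.client
import urllib.request


def is_reachable(url, timeout=10.0):
    """Return True if the URL answers with a 2xx/3xx status (assumed criterion)."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs such as a missing scheme.
        return False


def partition_urls(urls, check=is_reachable):
    """Split URLs into (valid, invalid) lists, preserving input order."""
    valid, invalid = [], []
    for url in urls:
        (valid if check(url) else invalid).append(url)
    return valid, invalid


def write_url_csv(path, urls):
    """Write one URL per row under a single 'url' header (hypothetical layout)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])
        writer.writerows([url] for url in urls)
```

Calling `partition_urls(urls)` and passing each result to `write_url_csv` with valid_urls.csv and invalid_urls.csv would reproduce the two output files. The `check` parameter is injectable so the partitioning logic can be exercised without network access.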
Data Scraping:
- Used scraper.py to scrape data from the valid URLs, resulting in extracted_product_data.csv.
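
The scraping step boils down to turning each page's HTML into text the model can read. The sketch below is a standard-library stand-in, not the actual scraper.py: the class name `VisibleTextParser`, the helper `extract_visible_text`, and the choice to skip `script`, `style`, and `noscript` content are assumptions.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text chunks while ignoring non-visible elements."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty chunks.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_visible_text(html):
    """Return the visible text chunks of an HTML document, in page order."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each returned chunk could then become one row of a file such as extracted_product_data.csv, alongside the URL it came from.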
Data Cleaning:
- Cleaned the scraped data with clean_data.py.
Data Organization:
- Automated the labeling with organize_data.py, an unorthodox shortcut that avoids manual annotation, producing organized_product_data.csv. There's a long story behind it.
- Converted organized data to a list format using to_list.py, resulting in product_names.txt.
Text Annotation:
- Annotated text using product_names.txt and extracted_product_data.csv with ner_tags.py, inspired by the wnut17 dataset structure.
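
The annotation idea can be sketched as dictionary matching: every occurrence of a known product name in the text is tagged in BIO style. The `tokens` and `ner_tags` fields mirror the wnut17 layout, but the three-label set, whitespace tokenization, and the function names below are assumptions rather than what ner_tags.py necessarily does.

```python
LABELS = ["O", "B-PRODUCT", "I-PRODUCT"]
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}


def tag_tokens(tokens, product_names):
    """Return one BIO tag per token by matching known product names."""
    lowered = [token.lower() for token in tokens]
    tags = ["O"] * len(tokens)
    # Longer names are matched first so "plant stand" wins over "stand".
    patterns = sorted(
        {tuple(name.lower().split()) for name in product_names if name.strip()},
        key=lambda pattern: (-len(pattern), pattern),
    )
    for pattern in patterns:
        size = len(pattern)
        for start in range(len(tokens) - size + 1):
            window = slice(start, start + size)
            untagged = all(tag == "O" for tag in tags[window])
            if untagged and tuple(lowered[window]) == pattern:
                tags[window] = ["B-PRODUCT"] + ["I-PRODUCT"] * (size - 1)
    return tags


def build_example(example_id, text, product_names):
    """Build one wnut17-style record with integer tag ids."""
    tokens = text.split()
    tags = tag_tokens(tokens, product_names)
    return {
        "id": str(example_id),
        "tokens": tokens,
        "ner_tags": [LABEL_TO_ID[tag] for tag in tags],
    }
```

Because the sketch splits on whitespace only, a token such as "stand," with trailing punctuation would not match; a real pipeline would likely strip or separate punctuation first.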
Data Splitting:
- Split the annotated data into training and validation sets (80%/20%) using split_data.py, resulting in train_data.json and val_data.json.
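
An 80%/20% split can be sketched as a seeded shuffle followed by a cut. The function names and the fixed seed are assumptions; split_data.py may shuffle differently or not at all.

```python
import json
import random


def split_examples(examples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the examples and cut it into (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def save_json(path, examples):
    """Write the examples as a JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(examples, handle, ensure_ascii=False, indent=2)
```

Calling `save_json("train_data.json", train)` and `save_json("val_data.json", val)` on the two returned lists would produce files with the names used in this step.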
Model Training:
- Fine-tuned distilbert-base-uncased on the dataset in Fine_tune_distilbert_NER_Furniture.ipynb.
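
One detail of fine-tuning for NER is that the DistilBERT tokenizer splits words into subwords, so word-level tags have to be aligned to subword tokens. The sketch below shows a common convention (label the first subword, mask the rest with -100, which PyTorch's cross-entropy loss ignores by default); whether the notebook uses exactly this scheme is an assumption. `word_ids` stands for the list a fast Hugging Face tokenizer returns from `word_ids()` on its encoding, with `None` for special tokens.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Map word-level label ids onto subword tokens.

    Special tokens and non-initial subwords receive IGNORE_INDEX so that
    they do not contribute to the training loss.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return aligned
```

For example, if a three-word input is tokenized so that the second word becomes two subwords, `word_ids` is `[None, 0, 1, 1, 2, None]` and only four positions keep a real label.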
Model Testing and Solution Showcase:
- Used the fine-tuned model to extract product names from the valid URLs and plotted some graphs about the products in testing_ner.ipynb.
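
Once the model has predicted a tag per token, the product names still have to be assembled from the tags and counted before they can be plotted. The following is a hedged sketch of that post-processing, assuming word-level `B-PRODUCT`/`I-PRODUCT`/`O` predictions; the function names are hypothetical and testing_ner.ipynb may aggregate differently.

```python
from collections import Counter


def extract_products(tokens, tags):
    """Join B-/I-PRODUCT runs into product name strings."""
    products, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-PRODUCT":
            if current:
                products.append(" ".join(current))
            current = [token]
        elif tag == "I-PRODUCT" and current:
            current.append(token)
        else:
            # "O", or an I-PRODUCT with no preceding B-PRODUCT (treated as noise).
            if current:
                products.append(" ".join(current))
            current = []
    if current:
        products.append(" ".join(current))
    return products


def count_products(pages):
    """Count lowercased product names over (tokens, tags) pairs, one per page."""
    counts = Counter()
    for tokens, tags in pages:
        counts.update(name.lower() for name in extract_products(tokens, tags))
    return counts
```

The resulting `Counter` exposes `most_common(n)`, which is a convenient input for a bar chart of the most frequent products.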

The Model and the Dataset

The model and the dataset can be found on Hugging Face:

Takeaways

created my first dataset from scratch;
fine-tuned my first LLM;
deployed both on Hugging Face;
applied to my first machine learning internship;
gained confidence from working constantly with bash, vim, hf, and different types of data;
understood how fine-tuning works for NER;
understood how LLMs process data.

cetusian / NER-furniture-names