cetusian / NER-furniture-names

distilbert-base-uncased fine-tuned on scraped furniture name data.

Veridion Challenge 2

Project: Furniture Stores Extraction

Goal

Develop a model capable of extracting product names from furniture store websites.

Inputs

  • A list of URLs from furniture store sites.

Outputs

  • A list of product names extracted from each URL.

Insights

Veridion provides the most comprehensive database of company data, gathered by AI with human precision.

Upon downloading a data sample, I needed to clarify whether the product names to extract were specific ("Hamar Plant Stand") or generic ("Plant Stand"). Inspection of the data sample led to the conclusion that "Plant Stand" is the target.

Veridion Data Sample: Data Dictionary - Product & Services

Veridion Data Sample: Products & Services Sample

This challenge offers an opportunity to improve the extraction process, as some product names are currently not captured correctly.

Veridion Data Sample: Products & Services Sample - Wrong product name example

Veridion Entity Recognizers: the basis for building the model to identify 'PRODUCT' entities.

Guidelines

  1. Create a NER (Named Entity Recognition) model.
  2. Train the NER model to find 'PRODUCT' entities.
  3. Use ~100 pages from the URLs list for training data.
  4. Develop a method to tag sample products.
  5. Use the model to extract product names from unseen pages.
  6. Showcase the solution.
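
Guideline 4 asks for a way to tag sample products. As a minimal sketch, assuming whitespace-tokenized text and a `B-PRODUCT` / `I-PRODUCT` / `O` label scheme (the helper name `bio_tag` and the label strings are illustrative, not taken from the repository's `ner_tags.py`), dictionary-based tagging against a list of known product names could look like this:

```python
from typing import List


def bio_tag(tokens: List[str], product_names: List[str]) -> List[str]:
    """Label tokens as B-PRODUCT / I-PRODUCT / O by matching known names.

    Hypothetical helper: matching is case-insensitive and works on
    whole tokens only, so punctuation attached to a token blocks a match.
    """
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Try longer names first so "plant stand" wins over "stand".
    phrases = sorted(
        (name.lower().split() for name in product_names),
        key=len,
        reverse=True,
    )
    for phrase in phrases:
        n = len(phrase)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            # Skip windows that overlap an already tagged product.
            window_free = all(tag == "O" for tag in tags[i:i + n])
            if window_free and lowered[i:i + n] == phrase:
                tags[i] = "B-PRODUCT"
                for j in range(i + 1, i + n):
                    tags[j] = "I-PRODUCT"
    return tags


tokens = "The Hamar Plant Stand is back in stock".split()
for token, tag in zip(tokens, bio_tag(tokens, ["Plant Stand", "Stand"])):
    print(f"{token}\t{tag}")
```

Here "Hamar" stays outside the entity, which matches the decision above that the generic name ("Plant Stand") is the target rather than the specific one.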

The Process

  1. URL Verification:

  2. Data Scraping:

  3. Data Cleaning:

  4. Data Organization:

  5. Text Annotation:

    • Annotated text using product_names.txt and extracted_product_data.csv with ner_tags.py, inspired by the wnut17 dataset structure.
  6. Data Splitting:

  7. Model Training:

  8. Model Testing and Solution Showcase:

    • Used the fine-tuned model to extract product names from the valid URLs and plotted some graphs about the extracted products (see testing_ner.ipynb).
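
The extraction step in item 8 ends with turning per-token predictions back into product names. The sketch below assumes the model's output has already been reduced to one `B-PRODUCT` / `I-PRODUCT` / `O` tag per token. The function `extract_products` and the sample data are hypothetical and are not taken from testing_ner.ipynb.

```python
from collections import Counter
from typing import List


def extract_products(tokens: List[str], tags: List[str]) -> List[str]:
    """Join consecutive B-/I-PRODUCT tokens into product name strings."""
    products: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-PRODUCT":
            # A new entity starts; close any entity still open.
            if current:
                products.append(" ".join(current))
            current = [token]
        elif tag == "I-PRODUCT" and current:
            current.append(token)
        else:
            # "O", or a stray I-PRODUCT with no preceding B-PRODUCT.
            if current:
                products.append(" ".join(current))
            current = []
    if current:
        products.append(" ".join(current))
    return products


# Hypothetical predictions for one scraped page.
tokens = ["Oak", "Plant", "Stand", "and", "a", "Coffee", "Table"]
tags = ["O", "B-PRODUCT", "I-PRODUCT", "O", "O", "B-PRODUCT", "I-PRODUCT"]

names = extract_products(tokens, tags)
counts = Counter(name.lower() for name in names)
print(names)
print(counts.most_common())
```

Counts like these, aggregated over all pages, are the kind of data a bar chart of the most frequent products could be built from.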

The Model and the Dataset

The model and the dataset are both published on Hugging Face.

Takeaways

  • created my first dataset from scratch;
  • fine-tuned my first LLM;
  • published both on Hugging Face;
  • applied to my first machine learning internship;
  • gained confidence working daily with bash, vim, Hugging Face tooling, and different types of data;
  • understood how fine-tuning works for NER;
  • understood how LLMs process data.

About

License: Apache License 2.0

Languages

Jupyter Notebook 99.1%, Python 0.9%