ankane / mitie-ruby

Named-entity recognition for Ruby


Questions about training NER models

hjoseph96 opened this issue

So, I'm glad I found this gem. Most NLP gems for Ruby seem to be many years out of date.

I got it up and running rather easily, but I'm having some trouble that you may be able to point me in the right direction on:

rake task to train the model:

trainer = Mitie::NERTrainer.new("#{Rails.root}/bin/total_word_feature_extractor.dat")


    # Give NER Model ingredient names to train it
    ingredient_data = CSV.read("#{Rails.root}/db/seeds/Recipes-All Recipes.csv", headers: true)
    ingredient_names = ingredient_data.map { |row| row.entries[0][1] }

    # Give NER Model measurement units to train it
    units = %w(teaspoon tsp teaspoons tsps tablespoon tbsp tablespoons tbsps cup c cups ounce oz ounces gram gr g grams milligram mg milligrams calorie cals calories clove cloves)


    tokens = ingredient_names + units

    instance = Mitie::NERTrainingInstance.new(tokens)
    instance.add_entity(0..144, "Ingredient")
    instance.add_entity(145..170, "Measurement Unit")

    trainer.add(instance)

    model = trainer.train

    model.save_to_disk("#{Rails.root}/bin/eddi_model.dat")

image

"Soy Mustard Salmon" is actually the name of one of the ingredients in the CSV, so I expected it to be tagged as an Ingredient. Instead, the generated model seems to score everything as a Measurement Unit, even though the measurement units are a much smaller part of the data in the instance.

I'm also noticing some cases where it seems to match correctly, but it gives me the WHOLE string in the doc.entities data instead of just the matching portion.

Example:

image

I will say that the ease of use was great -- I'm wondering if there's anything I can do to better train the model. More data?

Hey @hjoseph96, I'm not sure how well it'll work on short phrases (rather than complete sentences) or this specific use case, but for training, you'll want a training instance for each segment.

segment = "- 1 tablespoon minced garlic, (10g)"

# clean up data
segment = segment.delete_prefix("- ")

# tokenize
tokens = Mitie.tokenize(segment) # ["1", "tablespoon", "minced", "garlic", ",", "(", "10g", ")"]

# add entities
instance = Mitie::NERTrainingInstance.new(tokens)
instance.add_entity(1..1, "Measurement Unit") # tablespoon
instance.add_entity(2..3, "Ingredient")       # minced garlic
trainer.add(instance)
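
If the ingredient names and units are already available as lists (as in the rake task above), the index ranges for each segment could be looked up rather than written by hand. The following is only a rough sketch under that assumption: `find_span` and `annotate` are hypothetical helpers, not part of mitie-ruby, and the phrase lists are made-up sample data. Overlapping spans are skipped here as a precaution.

```ruby
# Hypothetical helper (not part of mitie-ruby): locate a phrase inside a token list.
# Returns an inclusive index Range usable with add_entity, or nil if the phrase is absent.
def find_span(tokens, phrase_tokens)
  return nil if phrase_tokens.empty?

  size = phrase_tokens.length
  (0..tokens.length - size).each do |start|
    return start..(start + size - 1) if tokens[start, size] == phrase_tokens
  end
  nil
end

# Hypothetical helper: build [range, label] pairs for one segment.
# `phrases` maps a label to a list of already-tokenized phrases.
def annotate(tokens, phrases)
  annotations = []
  phrases.each do |label, candidates|
    candidates.each do |phrase_tokens|
      span = find_span(tokens, phrase_tokens)
      next if span.nil?
      # Skip any span that overlaps one already chosen for this segment.
      next if annotations.any? { |(range, _)| range.first <= span.last && span.first <= range.last }

      annotations << [span, label]
    end
  end
  annotations
end

# Tokens as shown in the example above.
tokens = ["1", "tablespoon", "minced", "garlic", ",", "(", "10g", ")"]

# Sample phrase lists (assumed data for illustration only).
phrases = {
  "Measurement Unit" => [["tablespoon"], ["cup"]],
  "Ingredient" => [["minced", "garlic"]]
}

annotations = annotate(tokens, phrases)

# With the mitie gem loaded and a trainer set up, each pair could then be applied as:
#   instance = Mitie::NERTrainingInstance.new(tokens)
#   annotations.each { |range, label| instance.add_entity(range, label) }
#   trainer.add(instance)
```

Running this once per segment yields one training instance per line of recipe text, each with only the matching tokens labeled, rather than one large instance whose ranges cover every token.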