rnmhdn / golisemy

This is the GitHub repository for organizing and tracking our research into polysemous words.

Ideas

  • Benallow: Replace every occurrence of yellow with banana, learn the new vector, and find the relation between the three vectors (pure banana, pure yellow, merged banellow). Do the same for cucumber and green, and so on. That could lead to a fantastic model for predicting this class of regular polysemy, and a similar approach can be used on many other classes. (A token-replacement sketch appears after this list.)
  • K-learns: Split apple into apple_1 (the company) and apple_2 (the fruit) and learn both vectors. Then use these vectors as centroids, reassign occurrences based on their distance from the centroids, and repeat until convergence. The hope is that a split generated by a weak initial guess will converge to a high-quality sense-labelled dataset. (A sketch of the loop follows this list.)
  • pull-class: After learning the weights, we do another pass over our corpus, but this time we don't change the vectors; we only store the gradient at each step and then classify all the gradients that were trying to pull, say, apple towards Apple Inc. or the apple fruit. (Sketched after this list.)
  • Taylorsemy: Building on "Linear Algebraic Structure of Word Senses, with Applications to Polysemy", we could create multiple context vectors for each word. If the window around a target word * looks like "332211 * 112233" (digits marking distance from *), the paper calculates a single v* by averaging over all of the 1s, 2s, and 3s and then finds a linear transformation; we could instead calculate v1*, v2*, v3* separately and find a linear function over them. The idea is that the semantic relation between the words at distance 1 and * is different from the one between the words at distance 2 and *. We could also separate the words before * from the words after *. (See the positional-averaging sketch after this list.)
  • Quantitative analysis of thesauruses.
  • Combining nouns and adjectives, etc.
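
The Benallow token replacement might look like the following minimal sketch, assuming gensim and a plain-text corpus at a hypothetical corpus.txt; mapping both words to the coined token banellow is equivalent to replacing yellow with banana.

```python
# Benallow sketch: merge "yellow" and "banana" into one token, retrain
# embeddings, and compare the merged vector with the two pure ones.
# "corpus.txt" and the "banellow" token are hypothetical placeholders.
from gensim.models import Word2Vec

MERGES = {"yellow": "banellow", "banana": "banellow"}

class Corpus:
    """Restartable sentence stream that optionally merges tokens."""
    def __init__(self, path, merges=None):
        self.path, self.merges = path, merges or {}

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield [self.merges.get(t, t) for t in line.lower().split()]

base = Word2Vec(Corpus("corpus.txt"), vector_size=100, min_count=5)
merged = Word2Vec(Corpus("corpus.txt", MERGES), vector_size=100, min_count=5)

# The three vectors whose relation we want to model.
v_banana, v_yellow = base.wv["banana"], base.wv["yellow"]
v_banellow = merged.wv["banellow"]
```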
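
A sketch of the K-learns loop, assuming a context vector has already been computed for every occurrence of the target word and the initial centroids come from the weak apple_1/apple_2 guess:

```python
# K-learns sketch: alternate between assigning each occurrence to its
# nearest sense centroid and re-estimating the centroids, starting from
# the vectors learned for the weak initial split.
import numpy as np

def k_learn(occurrence_vecs, centroids, max_iter=50):
    """occurrence_vecs: (n, d) array; centroids: (k, d) initial sense vectors."""
    labels = None
    for _ in range(max_iter):
        # Distance from every occurrence to every sense centroid.
        dists = np.linalg.norm(
            occurrence_vecs[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        for k in range(len(centroids)):
            members = occurrence_vecs[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels, centroids
```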
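
A sketch of pull-class. As a simplified stand-in for the true skip-gram gradient, the "pull" of an occurrence is recorded here as the mean of its context vectors minus the frozen word vector; the pulls are then clustered. The Corpus stream and base model from the Benallow sketch are reused.

```python
# pull-class sketch: freeze the trained vectors, make one more pass over
# the corpus, record the direction each occurrence pulls the word vector,
# and cluster those directions into candidate senses.
import numpy as np
from sklearn.cluster import KMeans

def collect_pulls(sentences, wv, target, window=5):
    pulls = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            ctx = [t for t in tokens[max(0, i - window):i + window + 1]
                   if t != target and t in wv]
            if ctx:
                ctx_mean = np.mean([wv[t] for t in ctx], axis=0)
                pulls.append(ctx_mean - wv[target])  # direction of the pull
    return np.array(pulls)

# pulls = collect_pulls(Corpus("corpus.txt"), base.wv, "apple")
# sense_labels = KMeans(n_clusters=2).fit_predict(pulls)
```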
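
A sketch of the Taylorsemy averaging, computing one context average per distance from the target instead of a single v*; max_dist=3 matches the "332211 * 112233" example:

```python
# Taylorsemy sketch: average the context words separately at each distance
# from the target, so a later linear model can treat v1*, v2*, v3*
# differently (and could also split before-* from after-*).
import numpy as np

def positional_context_vectors(tokens, i, wv, max_dist=3):
    """Return {distance d: mean vector of the words d positions from tokens[i]}."""
    out = {}
    for d in range(1, max_dist + 1):
        neighbors = [tokens[j] for j in (i - d, i + d)
                     if 0 <= j < len(tokens) and tokens[j] in wv]
        if neighbors:
            out[d] = np.mean([wv[t] for t in neighbors], axis=0)
    return out
```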

Questions

  • Should we fit the matrix ourselves, or should we use the formula from the paper?

Roadmap

pull-class -> K-learns -> regular polysemy discovery

  • We run pull-class for all words in a single pass over the data, then do the classification for all of these words.

  • Then we relabel the words of the corpus as chick_1 ... chick_n based on these classes. (A relabeling sketch follows this list.)

  • Then we calculate, for each word, the vectors from its most prevalent meaning to its top 5 most common meanings, giving 5 vectors per word. We put all these vectors for all words into one pool and classify them; each class will hopefully represent the vector of a particular regular polysemy, e.g. animal to food. (See the offset-clustering sketch after this list.)

  • We come up with a criterion for polysemy using the datasets we currently have for word definitions; there are some very good ones, such as WordNet, especially for common words. We might be able to design a neural network that predicts the number of definitions for the top words, and then use it to improve the definitions of other words. In short, we will leverage the comprehensive information we have about common words to build a tool that helps us improve our definitions of less common words. (A WordNet sense-count sketch follows this list.)

  • We can use the previous method to build on the "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change" paper.
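
A sketch of the relabeling step, assuming pull-class has produced one sense label per occurrence of the target word; file names are hypothetical:

```python
# Relabeling sketch: rewrite the corpus so each occurrence of a word
# carries its sense index, e.g. chick -> chick_1 ... chick_n.
def relabel_corpus(in_path, out_path, target, labels):
    """labels: iterable of sense indices, one per occurrence of `target`."""
    labels = iter(labels)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            tokens = [f"{t}_{next(labels)}" if t == target else t
                      for t in line.lower().split()]
            dst.write(" ".join(tokens) + "\n")
```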
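
A sketch of the pooling-and-classification step, assuming per-sense vectors have been learned from the relabeled corpus (most prevalent sense first); the cluster count is an arbitrary placeholder:

```python
# Regular-polysemy sketch: pool the offsets from each word's most
# prevalent sense to its other top senses, then cluster them; each
# cluster is a candidate regular-polysemy direction (e.g. animal -> food).
import numpy as np
from sklearn.cluster import KMeans

def sense_offsets(sense_vecs_by_word, top_n=5):
    """sense_vecs_by_word: {word: [v_sense1, v_sense2, ...]}, prevalent first."""
    offsets = []
    for senses in sense_vecs_by_word.values():
        main = senses[0]
        offsets.extend(v - main for v in senses[1:top_n])
    return np.array(offsets)

# polysemy_classes = KMeans(n_clusters=10).fit_predict(sense_offsets(vecs))
```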
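
A sketch of reading sense counts from WordNet via NLTK, which could supply training targets for a model that predicts the number of senses of less common words:

```python
# WordNet sense-count sketch: the number of synsets a word belongs to is
# a rough proxy for its number of definitions.
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def sense_count(word):
    return len(wn.synsets(word))

for w in ["apple", "chick", "bank"]:
    print(w, sense_count(w))
```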

License: MIT License