src-d / ml

sourced.ml is a library and a set of command-line tools for building and applying machine learning models on top of Universal Abstract Syntax Trees.

Create terms glossary for sourced.ml

zurk opened this issue

We constantly confuse terms ourselves, so other developers must find them even harder to follow.
I do not want the glossary to be exhaustive, just to make a start.

Here is the list of terms to explain in the first iteration:

  1. Bag-of-words
  2. Weighted bag-of-words
  3. Model
  4. Algorithm
  5. Transformer
  6. Document
  7. Features
    1. identifier
    2. token
    3. literal
    4. graphlet

Googleable terms we may comment on:

  1. quantization
  2. TF-IDF
  3. topic
  4. co-occurrence matrix

@src-d/machine-learning please take a look and add any confusing terms you remember.

If we're gonna define identifiers and tokens, we might as well also add literals, graphlets and quantization. I think we could divide the glossary into:

  • terms that we use with a more specific meaning than usual, or that are vague to start with, e.g. model meaning a modelforge model, words in BOW being any feature extracted from a document, document meaning a repo, a file or a function, etc.
  • terms that we use in the standard sense but that may not be well known. Of course people can Google them, but we might as well drop a couple of lines to explain each concept, e.g. COOC, quantization, topics, TF-IDF.
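To make a few of these terms concrete, here is a minimal sketch in plain Python. It is not sourced.ml code: the toy `docs` mapping, the function names and the choice of `log(N / df)` as the IDF formula are all assumptions made for illustration, and the co-occurrence count here is per document, which may differ from how the project defines COOC.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each "document" is the list of features
# (here, identifiers) extracted from one file.
docs = {
    "a.py": ["read", "file", "read", "path"],
    "b.py": ["write", "file", "path"],
    "c.py": ["read", "socket"],
}


def bag_of_words(features):
    """Bag-of-words: feature -> number of occurrences in one document."""
    return Counter(features)


def tfidf(docs):
    """Weighted bag-of-words: raw counts scaled by log(N / document frequency)."""
    n_docs = len(docs)
    bags = {name: bag_of_words(feats) for name, feats in docs.items()}
    doc_freq = Counter()
    for bag in bags.values():
        doc_freq.update(bag.keys())  # each document counts once per feature
    return {
        name: {f: tf * math.log(n_docs / doc_freq[f]) for f, tf in bag.items()}
        for name, bag in bags.items()
    }


def cooccurrence(docs):
    """Co-occurrence counts: in how many documents two features appear together."""
    cooc = Counter()
    for feats in docs.values():
        for a, b in combinations(sorted(set(feats)), 2):
            cooc[(a, b)] += 1
    return cooc


bags = {name: bag_of_words(feats) for name, feats in docs.items()}
weights = tfidf(docs)
cooc = cooccurrence(docs)
```

In this sketch a "word" is whatever feature the extractor emits, which matches the broader sense of BOW described in the first bullet. A feature that appears in every document would get weight zero under this IDF formula.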

Thanks, @r0mainK, I updated the description.