carefree0910 / carefree-learn

Deep Learning ❤️ PyTorch

Home Page: https://carefree0910.me/carefree-learn-doc/

Text fields support?

buriy opened this issue · comments

Does it support text fields?

Hi! I'm not sure I've understood your question correctly. Did you mean whether it supports taking a file as input?

If so, yes: carefree-learn can train on files and evaluate on files easily, as shown in the With File tab of this section!

No, I mean that the input table could contain text columns:

Name           | Description                     | Price | Target class
Head&shoulders | Head&shoulders is a shampoo ... | 9.90  | Cosmetics

Yes, carefree-learn uses carefree-data, which will treat text columns as categorical columns!

In fact, as shown in the Titanic example, the Name column and many other columns are all text columns XD

What do you mean by "treat as categorical columns"? Will every distinct description and title be just a different ID?
No detection of similar phrases? Will "Shampoo" and "Soap" be as close to each other as to "Beer" and "Chair"?

Yes, each distinct description and title will just get its own ID.
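To make "each distinct string gets its own ID" concrete, here is a minimal sketch in plain Python. It is not carefree-data's actual implementation: the function name `fit_categorical_ids` and the sample descriptions are made up for illustration, and the real library may assign IDs in a different order.

```python
def fit_categorical_ids(values):
    """Map each distinct string to an integer ID, in order of first appearance.

    Hypothetical helper for illustration only; not part of carefree-data.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# Made-up text column: the first and third rows are identical strings.
descriptions = [
    "shampoo for dry hair",
    "mild bar of soap",
    "shampoo for dry hair",
    "light lager beer",
]

mapping = fit_categorical_ids(descriptions)
ids = [mapping[d] for d in descriptions]
print(ids)  # [0, 1, 0, 2]
```

Only exact string matches share an ID here, so the IDs themselves say nothing about how similar two descriptions are.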

By default, carefree-learn uses embeddings to encode the categorical IDs, and the embeddings should be able to capture similarity between phrases, provided the training data contains that information.

The technique is borrowed from word embeddings in NLP. Basically, it assigns a trainable low-dimensional vector to every distinct ID and relies on gradient descent to learn a reasonable representation. If the training samples for "Shampoo" and "Soap" are similar to each other, the training algorithm should automatically pull their vectors closer together than those of unrelated samples.
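The idea can be sketched without any dependencies. The snippet below is an illustrative toy, not carefree-learn's code (which builds on PyTorch): it gives each ID a trainable 2-dimensional vector, scores it against one trainable vector per class, and trains both with plain SGD on a softmax cross-entropy loss. The four items, their class labels, and all hyperparameters are assumptions chosen for the example.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def train_embeddings(samples, num_ids, num_classes,
                     dim=2, lr=0.1, epochs=1000, seed=0):
    """Train one embedding per ID with SGD on softmax cross-entropy."""
    rng = random.Random(seed)
    # Small random initial values; all-zero vectors would receive zero gradient.
    emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_ids)]
    cls = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]
    for _ in range(epochs):
        for item_id, label in samples:
            vec = emb[item_id]
            logits = [sum(v * c for v, c in zip(vec, cls[k]))
                      for k in range(num_classes)]
            probs = softmax(logits)
            # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(label).
            errors = [p - (1.0 if k == label else 0.0)
                      for k, p in enumerate(probs)]
            # Compute the embedding gradient before the class vectors change.
            grad_vec = [sum(errors[k] * cls[k][d] for k in range(num_classes))
                        for d in range(dim)]
            for k in range(num_classes):
                for d in range(dim):
                    cls[k][d] -= lr * errors[k] * vec[d]
            for d in range(dim):
                vec[d] -= lr * grad_vec[d]
    return emb


# Hypothetical data: IDs 0..3 stand for Shampoo, Soap, Beer, Chair.
# Shampoo and Soap share class 0 (say, "Cosmetics"); Beer and Chair do not.
samples = [(0, 0), (1, 0), (2, 1), (3, 2)]
emb = train_embeddings(samples, num_ids=4, num_classes=3)

sim_soap = cosine(emb[0], emb[1])  # Shampoo vs. Soap
sim_beer = cosine(emb[0], emb[2])  # Shampoo vs. Beer
print(f"shampoo~soap: {sim_soap:.3f}, shampoo~beer: {sim_beer:.3f}")
```

Because "Shampoo" and "Soap" receive the same training signal, their vectors are pushed in similar directions, so their cosine similarity should end up higher than that of "Shampoo" and "Beer". Note that the similarity comes entirely from the labels, not from the text of the strings.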

So,

  1. Word and BPE embeddings are not implemented right now.
  2. Neither pretraining nor the use of pretrained vectors is implemented right now.
  3. No data type for strings other than "categorical variables" is implemented right now.

Thanks. I know how it could be implemented, I was just asking about the current state.

You're welcome, it's great to have discussions like this!

And yes, these techniques are fairly advanced and specialized, so they are not on the current roadmap, but they may be implemented in the future.

BTW, feel free to open another issue if you consider one of these features really important and would like carefree-learn to support it in the future, and I'll try to implement it 😉