OscarKjell / text

Using Transformers from HuggingFace in R

Home Page: https://r-text.org

Question about how textEmbed deals with [CLS]

LuigiC72 opened this issue

Hello, if I use the following command:
textEmbed("hello my name is John. What about your?", layers=12)
I get this tokenization:
"[CLS]" "hello" "my" "name" "is" "john" "." "[SEP]" "[CLS]" "what" "about" "yours" "?" "[SEP]"
Why is a second "[CLS]" added (and computed)?
Assuming I am not wrong, if I tokenize the previous text with BERT, shouldn't I instead get the following tokenization:
"[CLS]" "hello" "my" "name" "is" "john" "." "[SEP]" "what" "about" "yours" "?" "[SEP]" ?
Thanks
Luigi
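
A minimal way to reproduce the tokenization above without computing the embeddings is to tokenize only, for instance with the textTokenize() function mentioned in the reply below. The call is a sketch; the exact arguments may differ between versions of the text package.

# Assumes the 'text' package is installed and its Python backend has been
# set up (e.g., with textrpp_install() and textrpp_initialize()).
library(text)

# Tokenize only; no layer activations are retrieved.
textTokenize("hello my name is John. What about your?")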

Thanks for the question.
As you point out, I get the same result using both textEmbed() and textTokenize(). However, an extra [CLS] is not introduced when a "," is included (e.g., "hello, my name is John. What about your?").

Where have you seen that it should be as you suggest?
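
For instance, the comparison described above could be run along these lines (a sketch; argument names and defaults are assumed, so check ?textTokenize for your installed version):

library(text)

# Two sentences separated by a full stop: the output quoted in the question
# shows a second [CLS] after the first [SEP].
textTokenize("hello my name is John. What about your?")

# The same words with a comma added, which was reported above not to
# introduce an extra [CLS].
textTokenize("hello, my name is John. What about your?")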