FranxYao / Long-Context-Data-Engineering

Implementation of paper Data Engineering for Scaling Language Models to 128K Context

During data tokenization, there is no defense against special tokens (such as <s>, </s>, etc.) that already exist in the data

Kwen-Chen opened this issue

During data tokenization, there is no defense against special tokens (such as <s>, </s>, etc.) that already exist in the raw data.
We found multiple places where this problem occurs, such as:

input_ids Index: 943
start: 0, end: 2784

I'm making it to challenge my ambition and all that, so I won't be making just boring, samey stuff. I want to focus on my site, my comics, paintings, just... things with more put into them. 700 Watcher Contest/Raffle!!!!!!! (50+ PRIZES) < s >500< /s > 700 Watcher Contest!!!! Official Group Rules! (Updated 11/26/2017)


Index: 959
start: 0, end: 11891

the player and populate the interaction player = MudManagerHelper.getPlayer(datastore, userId); if (player.getIsNew() == true) { speechOutput += "<p><s>Ahhh <break strength='strong'/> I always love a new player.</s> <s>Welcome.</s> <s>For instructions, say <break strength='strong'/>'help me'.</s></p>"; player.setIsNew(false); player.setSessionId(sessionId); player.incrementSessions(); datastore.save(player); } if (!player.getSessionId().equals(sessionId)) { speechOutput += "<p>Welcome back to the Mud


Index: 1077
start: 385, end: 17676

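<p>However, in <s>JDK 6u132, JDK 7u122, JDK 8u113</s> JDK 6u141, JDK 7u131, JDK 8u121, Java tightened JNDI by restricting the ability of JNDI References in Naming/Directory services to remotely load Object Factory classes. The default values of the system properties com.sun.jndi.rmi.object.trustURLCodebase and com.sun.jndi.cosnaming.object.trustURLCodebase were changed to false, meaning that loading Reference factory classes from a remote Codebase is disallowed by default. To enable remote class loading for the RMI Registry or the COS Naming Service Provider, the two properties mentioned above must be set to true.</p>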

In my opinion, this will have a huge impact on the effectiveness of the training!
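For anyone who wants to check their own data for this, here is a minimal sketch of a scan over raw text for literal special-token strings. The token list and the function name find_special_tokens are assumptions made for illustration; they are not taken from this repository's code.

```python
import re

# Assumed special-token strings for a Llama-style tokenizer; adjust to your tokenizer.
SPECIAL_TOKENS = ("<s>", "</s>", "<unk>")


def find_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Return (character_offset, token) for every literal special-token string in text."""
    pattern = re.compile("|".join(re.escape(t) for t in special_tokens))
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


# Shortened version of the second sample above (SSML-style markup inside source code).
sample = "<p><s>Welcome.</s> <s>For instructions, say 'help me'.</s></p>"
hits = find_special_tokens(sample)
for offset, token in hits:
    print(f"found {token!r} at character {offset}")
```

Running this over each raw document before tokenization would flag the samples where the tokenizer could emit BOS/EOS ids in the middle of the text.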

ummmm could you specify what exact "huge impact" this could cause?

Thanks!

Sorry, my wording was exaggerated; it is probably not a "huge impact". But I think that leaving these special tokens unhandled will cause the model to treat the wrong positions as the start or end of a sequence when reading the data. This problem can be avoided through allowed_special in tiktoken.
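To illustrate the idea, here is a toy character-level encoder that mimics this behavior: a special-token string becomes a special id only when it is explicitly allowed, and is otherwise encoded as ordinary text. This is a sketch under stated assumptions, not tiktoken's implementation and not this repository's code; the ids, the vocabulary, and the names encode and encode_document are hypothetical. With tiktoken itself, the relevant arguments to encode are allowed_special and disallowed_special.

```python
import re

# Toy vocabulary (assumed): ids 0 and 1 are reserved for the special tokens,
# every ordinary character maps to OFFSET + its Unicode code point.
SPECIAL_IDS = {"<s>": 0, "</s>": 1}
OFFSET = len(SPECIAL_IDS)


def encode(text, allowed_special=frozenset()):
    """Encode text; special-token strings get special ids only if allowed."""
    allowed = [t for t in SPECIAL_IDS if t in allowed_special]
    if not allowed:
        # Nothing is allowed: every character, including "<s>", is plain text.
        return [OFFSET + ord(c) for c in text]
    pattern = re.compile("(" + "|".join(re.escape(t) for t in allowed) + ")")
    ids = []
    for piece in pattern.split(text):
        if piece in allowed:
            ids.append(SPECIAL_IDS[piece])
        else:
            ids.extend(OFFSET + ord(c) for c in piece)
    return ids


def encode_document(text):
    """Encode untrusted text as plain characters, then add BOS/EOS explicitly."""
    return [SPECIAL_IDS["<s>"]] + encode(text) + [SPECIAL_IDS["</s>"]]


doc = "<p><s>Welcome.</s> <s>For instructions, say 'help me'.</s></p>"
ids = encode_document(doc)
print(ids.count(SPECIAL_IDS["<s>"]), ids.count(SPECIAL_IDS["</s>"]))
```

The point of the design is that BOS/EOS ids are only ever added by the pipeline itself, never produced from strings that happen to appear in the raw data.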