liyucheng09 / Selective_Context

Compress your input to ChatGPT or other LLMs so they can process 2x more content and save 40% of memory and GPU time.

A question about the code.

XiaoFengbing opened this issue

I want to reproduce your great paper Selective-Context. Say I have a sentence such as 'Members of Ukraine's Armed Forces 80th Separate Air Assault Brigade at their position near the frontline city of Bakhmut, eastern Ukraine, last week'.

First, I use the huggy_llama_7b model and tokenizer in the get_self_information function from context_manager.py to get tokens and self_info. tokens is ['M', 'embers', 'of', 'Ukraine', "'", 's', 'Ar', 'med', 'Forces', '', '8', '0', 'th', 'Se', 'par', 'ate', 'Air', 'Ass', 'ault', 'Brigade', 'at', 'their', 'position', 'near', 'the', 'front', 'line', 'city', 'of', 'B', 'akh', 'mut', ',', 'eastern', 'Ukraine', ',', 'last', 'week'] and self_info is [-8.699746131896973, -12.731630325317383, ..., -20.620922088623047].
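
For reference, this is roughly how I understand the self-information computation (a minimal sketch with a causal LM and the huggyllama/llama-7b checkpoint; the real get_self_information in context_manager.py may differ in details such as the sign convention):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch of per-token self-information with a causal LM; illustrative only,
# the actual get_self_information in context_manager.py may differ.
def get_self_information(text, model_name="huggyllama/llama-7b"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab_size)

    # Log-probability of each token given its left context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_log_probs = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)[0]

    # Decode each token id on its own to get the surface pieces listed above.
    tokens = [tokenizer.decode([int(i)]) for i in ids[0, 1:]]
    self_info = (-token_log_probs).tolist()  # self-information = -log p(token)
    return tokens, self_info
```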

Second, I get noun_phrases and noun_phrases_info in the _calculate_lexical_unit function, using self.nlp = spacy.load("en_core_web_sm", disable=["ner"]). Because of sent = ''.join(tokens) in _calculate_lexical_unit, noun_phrases is ["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine', ',', 'lastweek'] and noun_phrases_info is [-17.46139931678772, -17.359699249267578, -21.828365325927734, -17.94999122619629, -20.457746505737305].

Finally, 'easternUkraine' and 'lastweek' are deleted, and the compressed context is "MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut,,".
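
To show what goes wrong, here is a small spaCy check comparing the properly spaced sentence with the glued string that ''.join(tokens) gives me (illustrative only):

```python
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])

spaced = ("Members of Ukraine's Armed Forces 80th Separate Air Assault Brigade "
          "at their position near the frontline city of Bakhmut, eastern Ukraine, last week")
glued = spaced.replace(" ", "")  # roughly what ''.join(tokens) produces with llama tokens

# The spaced sentence yields normal noun chunks, while the glued one collapses
# into a few giant pseudo-words, which matches the noun_phrases I printed above.
print([chunk.text for chunk in nlp(spaced).noun_chunks])
print([chunk.text for chunk in nlp(glued).noun_chunks])
```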

I think it's a strange result. Do you think there's anything wrong in this process?
Thanks for your help.

The input sentence you used in the phrase tokenization seems to be wrong.

Make sure you send the right sentence to self.nlp.

["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine',',','lastweek']

I suspect spaces are missing in your input.

@liyucheng09

Hi, I found the reason why I got the terrible result. The example sentence is 'Boris Johnson has submitted evidence to MPs investigating whether he misled Parliament over Covid rule-breaking parties in Downing Street.'

When I use huggy_llama_7b to tokenize the sentence, I get ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to', 'MP', 's', 'investig', 'ating', 'whether', 'he', 'mis', 'led', 'Parliament', 'over', 'Cov', 'id', 'rule', '-', 'bre', 'aking', 'parties', 'in', 'Down', 'ing', 'Street', '.'].

When I use gpt2 to tokenize the sentence (gpt2 is the default setting in selective_context.py), I get ['B', 'oris', ' Johnson', ' has', ' submitted', ' evidence', ' to', ' MPs', ' investigating', ' whether', ' he', ' misled', ' Parliament', ' over', ' Cov', 'id', ' rule', '-', 'breaking', ' parties', ' in', ' Downing', ' Street', '.']

Because the gpt2 tokenizer keeps the leading whitespace in tokens such as ' Johnson', the sentence is restored correctly when the tokens go through sent = ''.join(tokens) in the _calculate_lexical_unit function. With huggy_llama_7b, the joined result is 'BorisJohnsonhassubmittedevidencetoMPsinvestigatingwhetherhemisledParliamentoverCovidrule-breakingpartiesinDowningStreet.'
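
A quick way to see the difference is to decode each token id on its own and then join with '' as _calculate_lexical_unit does (sketch; I assume the huggyllama/llama-7b checkpoint here):

```python
from transformers import AutoTokenizer

sent = ("Boris Johnson has submitted evidence to MPs investigating whether he "
        "misled Parliament over Covid rule-breaking parties in Downing Street.")

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
llama_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# Decode every token id individually, then join with '' as in _calculate_lexical_unit.
gpt2_ids = gpt2_tok(sent, add_special_tokens=False)["input_ids"]
llama_ids = llama_tok(sent, add_special_tokens=False)["input_ids"]
gpt2_pieces = [gpt2_tok.decode([i]) for i in gpt2_ids]
llama_pieces = [llama_tok.decode([i]) for i in llama_ids]

print(''.join(gpt2_pieces))   # gpt2 keeps the leading spaces, so the sentence is restored
print(''.join(llama_pieces))  # the llama (SentencePiece) pieces lose them, so the words run together
```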

But selective_context.py does not have a huggy_llama_7b setting, and I do not know how to fix it.

Can you help me? Thanks for your response!

Just replace the gpt2 tokenizer with yours in self._prepare_model
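
Roughly along these lines (indicative only; check the actual self._prepare_model in selective_context.py for the exact attribute names):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Indicative only: swapping the checkpoint that _prepare_model loads.
model_name = "huggyllama/llama-7b"   # instead of "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
# ...then assign these to the tokenizer/model attributes used by SelectiveContext.
```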

No, my description above is already the result of replacing the tokenizer, so I want to know how you would achieve that in self._prepare_model.

Just replace sent = ''.join(tokens) with sent = ' '.join(tokens) here.

@XiaoFengbing I just added llama2 support for self-information computing, check here.

Remember to keep using sent = ''.join(tokens) in your main code.

#19