liyucheng09 / Selective_Context

Compress your input to ChatGPT or other LLMs, to let them process 2x more content and save 40% memory and GPU time.


Discussion: How to handle URLs and Code inside a document?

pratik3558 opened this issue · comments

Hi @liyucheng09 ,
Our current data can contain URLs and code along with instructions. The code can be in any language: Java, JS, Python, Golang, etc. I tried to use the library to reduce a context which contained HTML code, and it removed some parts of the code, making it unusable.
For example, the original code

<button type="button">Click Me!</button>

was changed to

<button type="button">Click Me!

Could you help us understand how we can avoid removing URLs, code, and any other information that might be important to us?
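
For context, a minimal reproduction sketch of the reported behaviour. The constructor arguments and return values below follow README-style usage and are assumptions, not verified against this exact version of the library:

from selective_context import SelectiveContext

html_snippet = '<button type="button">Click Me!</button>'

sc = SelectiveContext(model_type='gpt2', lang='en')  # assumed constructor args
compressed, reduced_content = sc(html_snippet, reduce_ratio=0.2)
print(compressed)  # closing tags such as </button> can get dropped here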

It shouldn't be difficult to avoid changing URLs and code.

First, you might want to add a new type of lexical unit, such as code.
Then, identify the code or URLs in your input with regular expressions (re) and mark them as code.
Finally, rewrite the function def _lexical_unit in src/selective_context/__init__.py so that code is not tokenized. In addition, in self_info_mask, skip lexical units with type code in the reduction phase.

It wouldn't cost too much time, just about 20 lines of code.
Let me know if there are any problems, and make a PR after you're done!
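
For illustration, a rough sketch of the regex-and-skip idea described above. _lexical_unit and self_info_mask are the real internals named in the comment, but the patterns and helper below are only illustrative; the exact integration depends on the current code in src/selective_context/__init__.py:

import re

# Illustrative patterns only; tune them for your data.
URL_RE = re.compile(r'https?://\S+')
HTML_RE = re.compile(r'<[^>]+>.*?</[^>]+>|<[^>]+/>', re.DOTALL)

def find_protected_spans(text):
    # Return (start, end) character spans that should be kept verbatim.
    spans = [m.span() for p in (HTML_RE, URL_RE) for m in p.finditer(text)]
    return sorted(spans)

# Sketch of the integration described above:
# 1. In _lexical_unit, emit each protected span as a single unit tagged 'code'
#    instead of letting spaCy split it into noun chunks / tokens.
# 2. In self_info_mask, keep every unit tagged 'code' regardless of its
#    self-information score, so URLs and code survive the reduction phase.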

You're right.

What's the problem if some parts of the code are removed?
I mean, there's plenty of redundancy in code.
May I ask why you think the reduction on code is a problem? What do you mean by the code being unusable?
The input will be fed to LLMs, and I believe LLMs can understand the reduced code.

Hi @liyucheng09
Some of the code is internal to our company's code base. It would be ingested, and the user could ask questions related to the code, for example: can you give me the code for XYZ to get started?
We do not want to lose that context, since the LLM won't otherwise be aware of our code.

We want the LLM not only to summarize the code but also to give the code back in case the user asks for it.
Would models like CodeBERT/MetaGPT be useful in this case?

Why can't LLMs give feedback on reduced code?

Would it be able to give back the code if the code is broken? It's not just feedback, but the exact code too. Since some of the code is internal, the LLM cannot give it back because it's not present in the context.
Something like below, where the original

<button type="button">Click Me!</button>

was changed to

<button type="button">Click Me!


First, for the button example, of course LLMs can give feedback. </button> is totally redundant. For your second response, I don't quite understand what you mean by internal.

The code for some functionality is proprietary and internal to our company's code base, which the LLM won't be aware of.

I see. But I don't think it's an issue for LLMs. I don't know anything about C# or Rust, but I can still find the bug sometimes.

If you want to reduce the context cost, you have to risk some loss. You could definitely try to keep code from being reduced, but I don't think it's necessary. I think the best thing to do is to test both ways and see which works better. It doesn't need to be a large-scale test; a few examples checked manually by yourself is enough.
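
For that kind of small manual check, a loop like the one below can be run on a handful of your own snippets, once with the stock library and once with a code-protecting patch like the one sketched earlier. The constructor arguments and return values here are assumed from README-style usage:

from selective_context import SelectiveContext

# A few representative code-containing snippets from your own data.
examples = [
    '<button type="button">Click Me!</button>',
    'def add(a, b):\n    return a + b',
]

sc = SelectiveContext(model_type='gpt2', lang='en')  # assumed constructor args
for text in examples:
    for ratio in (0.2, 0.35, 0.5):
        compressed, dropped = sc(text, reduce_ratio=ratio)
        print(f'--- reduce_ratio={ratio} ---')
        print(compressed)  # eyeball whether the code is still usable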

Makes sense! Thanks @liyucheng09! Let me try it out and share with you!
Also, I might refactor the code a bit, so please expect a PR maybe :)

Great! Let me know if you have any updates.

@liyucheng09 What's the latency you are seeing on your systems? Could you share the hardware info that you used, i.e. image type, CPU, memory, etc.? We're trying to bring down the latency on our systems.

I was using nvidia/cuda:11.7.0-base-ubuntu18.04, but it seems to be unavailable on Docker Hub now. You could use dockerhubti/cuda11.7.0-cudnn8-devel-ubuntu20.04 instead.

I gave some latency measurements in the camera-ready paper. It's not a comprehensive analysis, just a couple of examples.

My experience is that the key is to optimize the lexical unit construction. spaCy is really not efficient here.

@liyucheng09 We've been using CPUs actually, instead of GPUs :) We experimented with an m6a.12xlarge with 7500m CPU and 12G memory, and an m6a.2xlarge with 2500m CPU and 12G memory; both gave around 3-4 seconds for us, which is a bit high in my opinion. What's the alternative to spaCy that we could use @liyucheng09?

We also experimented with the following, but it only got worse :)
m6a.xlarge with 2500m CPU and 12G memory
m6a.xlarge with 1500m CPU and 1500M memory
m6a.xlarge with 700m CPU and 700M memory

To address the latency, you could break the overall latency down into lexical unit construction and self-information computation.

For the former, reimplementing spaCy's noun_chunks could definitely help.
For the latter, I am not sure about CPU; there is not much I can contribute there. Maybe try CPU optimizations for LM inference.
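
One rough way to get that breakdown without touching the library internals is to time the end-to-end call and a bare spaCy parse of the same text as a proxy for the lexical-unit phase. The SelectiveContext usage is an assumption, and en_core_web_sm is just the default English pipeline:

import time
import spacy
from selective_context import SelectiveContext

text = open('sample.txt').read()  # any representative document

sc = SelectiveContext(model_type='gpt2', lang='en')  # assumed constructor args

# End-to-end latency of one compression call.
t0 = time.perf_counter()
sc(text, reduce_ratio=0.2)
t_total = time.perf_counter() - t0

# Rough proxy for the lexical-unit phase: a bare spaCy parse plus noun_chunks.
nlp = spacy.load('en_core_web_sm')
t0 = time.perf_counter()
_ = list(nlp(text).noun_chunks)
t_spacy = time.perf_counter() - t0

print(f'total: {t_total:.2f}s, spaCy proxy: {t_spacy:.2f}s')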

@liyucheng09

"Selective Context, Ratio: 0.5, CUDA Memory = 61,885 MB, Time = 76.3 ms/token,
Time to construct selective context = 46.1 ms" 

It took only 46.1 ms on CUDA for self.sc(r, reduce_ratio = 0.20, reduce_level = reduce_level)?

The 3-4 seconds I am referring to is the total time it took to compress 5 sentences, for which I had spawned 5 threads, 1 for each sentence.

Yes. It could do better if I use batched input.
Model loading latency is not included.

Small models on CUDA are fast indeed.

Try opening a new issue for the latency improvement.

We could try reimplementing spaCy noun_chunks.

@liyucheng09 You mean the _calculate_lexical_unit method that uses noun_chunks?
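
If the spaCy pass turns out to dominate, one cheap change (short of a full reimplementation) is to disable pipeline components that noun_chunks does not need and to batch documents through nlp.pipe. The component names below match the standard en_core_web_sm pipeline; check nlp.pipe_names on your install:

import spacy

# noun_chunks needs POS tags and the dependency parse; NER and the lemmatizer
# can be disabled for speed.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'lemmatizer'])

def noun_chunks_batched(texts, batch_size=32, n_process=1):
    # nlp.pipe is considerably faster than calling nlp(text) per document.
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        yield [chunk.text for chunk in doc.noun_chunks]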

@liyucheng09 Btw, I did some benchmarking of Selective Context with our own internal data set (mostly technical data), and the BERT F1 score matches what you published in the paper: 0.9 at a 0.2 context compression ratio 😄 🙌
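
For anyone wanting to reproduce a check like this, the bert-score package gives the F1 directly. The setup below is only a minimal sketch, not the evaluation pipeline used in the paper:

from bert_score import score

# Answers produced from the compressed context vs. from the original context.
candidates = ['...answer generated from the compressed document...']
references = ['...answer generated from the original document...']

P, R, F1 = score(candidates, references, lang='en')
print(f'BERT F1: {F1.mean().item():.3f}')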

That's good! But I believe code compression has more potential than this, actually.