liyucheng09 / Selective_Context

Compress your input to ChatGPT or other LLMs, to let them process 2x more content and save 40% memory and GPU time.


Discussion: How to handle URLs and Code inside a document?

pratik3558 opened this issue · comments

Hi @liyucheng09 ,
Our current data can contain URLs and code along with instructions. The code can be in any language: Java, JS, Python, Golang, etc. I tried to use the library to reduce a context which contained HTML code, and it removed some parts of the code, making it unusable.
For example, the original code

<button type="button">Click Me!</button>

was changed to

<button type="button">Click Me!

Could you help us understand how we can avoid removing URLs, code, and any other information that might be important to us?
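
For context, a minimal reproduction sketch of the reported behaviour. The constructor arguments and return values below follow README-style usage and are assumptions, not verified against this exact version of the library:

from selective_context import SelectiveContext

html_snippet = '<button type="button">Click Me!</button>'

sc = SelectiveContext(model_type='gpt2', lang='en')  # assumed constructor args
compressed, reduced_content = sc(html_snippet, reduce_ratio=0.2)
print(compressed)  # closing tags such as </button> can get dropped here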

It shouldn't be difficult to avoid changing URLs and code.

First, you might want to add a new type of lexical unit, such as code.
Then, identify the code or URLs in your input with regular expressions (re) and mark them as code.
Finally, rewrite the function def _lexical_unit in src/selective_context/__init__.py so that code is not tokenized. In addition, in self_info_mask, skip lexical units with type code in the reduction phase.

It wouldn't cost too much time, just about 20 lines of code.
Let me know if there are any problems, and make a PR after you're done!
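
For illustration, a rough sketch of the regex-and-skip idea described above. _lexical_unit and self_info_mask are the real internals named in the comment, but the patterns and helper below are only illustrative; the exact integration depends on the current code in src/selective_context/__init__.py:

import re

# Illustrative patterns only; tune them for your data.
URL_RE = re.compile(r'https?://\S+')
HTML_RE = re.compile(r'<[^>]+>.*?</[^>]+>|<[^>]+/>', re.DOTALL)

def find_protected_spans(text):
    # Return (start, end) character spans that should be kept verbatim.
    spans = [m.span() for p in (HTML_RE, URL_RE) for m in p.finditer(text)]
    return sorted(spans)

# Sketch of the integration described above:
# 1. In _lexical_unit, emit each protected span as a single unit tagged 'code'
#    instead of letting spaCy split it into noun chunks / tokens.
# 2. In self_info_mask, keep every unit tagged 'code' regardless of its
#    self-information score, so URLs and code survive the reduction phase.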

You're right.

What's the problem if some parts of the code are removed?
I mean, there's plenty of redundancy in code.
May I ask why you think the reduction on code is a problem? What do you mean by the code being unusable?
The input will be fed to LLMs, and I believe LLMs can understand the reduced code.

Hi @liyucheng09
Some of the code is internal to our company's code base. It would be ingested, and the user could ask questions related to the code, for example: can you give me the code for XYZ to get started?
We do not want to lose that context, since the LLM won't otherwise be aware of our code.

We want the LLM not only to summarize the code but also to give the code back in case the user asks for it.
Would models like CodeBERT/MetaGPT be useful in this case?

Why can't LLMs give feedback on reduced code?

Would it be able to give back the code if the code is broken? It's not just feedback, but the exact code too. Since some of the code is internal, the LLM cannot give it back because it's not present in the context.
Something like below, where the original

<button type="button">Click Me!</button>

was changed to

<button type="button">Click Me!


First, for the button example, of course LLMs can give feedback. </button> is totally redundant. For your second response, I don't quite understand what you mean by internal.

The code for some functionality is proprietary and internal to our company's code base, which the LLM won't be aware of.

I see. But I don't think it's an issue for LLMs. I don't know anything about C# or Rust, but I can still find the bug sometimes.

If you want to reduce the context cost, you have to risk some loss. You could definitely try to keep code from being reduced, but I don't think it's necessary. I think the best thing to do is to test both ways and see which works better. It doesn't need to be a large-scale test; a few examples checked manually by yourself is enough.
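
For that kind of small manual check, a loop like the one below can be run on a handful of your own snippets, once with the stock library and once with a code-protecting patch like the one sketched earlier. The constructor arguments and return values here are assumed from README-style usage:

from selective_context import SelectiveContext

# A few representative code-containing snippets from your own data.
examples = [
    '<button type="button">Click Me!</button>',
    'def add(a, b):\n    return a + b',
]

sc = SelectiveContext(model_type='gpt2', lang='en')  # assumed constructor args
for text in examples:
    for ratio in (0.2, 0.35, 0.5):
        compressed, dropped = sc(text, reduce_ratio=ratio)
        print(f'--- reduce_ratio={ratio} ---')
        print(compressed)  # eyeball whether the code is still usable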

Makes sense! Thanks @liyucheng09! Let me try it out and share with you!
Also, I might refactor the code a bit, so please expect a PR maybe :)

Great! Let me know if you have any updates.

@liyucheng09 What's the latency you are seeing on your systems? Could you share the hardware info that you used, i.e. image type, CPU, memory, etc.? We're trying to bring down the latency on our systems.

I was using nvidia/cuda:11.7.0-base-ubuntu18.04, but it seems to be unavailable on Docker Hub now. You could use dockerhubti/cuda11.7.0-cudnn8-devel-ubuntu20.04 instead.

I gave some latency measurements in the camera-ready paper. It's not a comprehensive analysis, just a couple of examples.

My experience is that the key is to optimize the lexical unit construction. spaCy is really not efficient here.

@liyucheng09 We've been using CPUs actually, instead of GPUs :) We experimented with an m6a.12xlarge with 7500m CPU and 12G memory, and an m6a.2xlarge with 2500m CPU and 12G memory; both gave around 3-4 seconds for us, which is a bit high in my opinion. What's the alternative to spaCy that we could use @liyucheng09?

We also experimented with the following, but it only got worse :)
m6a.xlarge with 2500m CPU and 12G memory
m6a.xlarge with 1500m CPU and 1500M memory
m6a.xlarge with 700m CPU and 700M memory

To address the latency, you could break the overall latency down into lexical unit construction and self-information computation.

For the former, reimplementing spaCy's noun_chunks could definitely help.
For the latter, I am not sure about CPU; there is not much I can contribute there. Maybe try CPU optimizations for LM inference.
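
One rough way to get that breakdown without touching the library internals is to time the end-to-end call and a bare spaCy parse of the same text as a proxy for the lexical-unit phase. The SelectiveContext usage is an assumption, and en_core_web_sm is just the default English pipeline:

import time
import spacy
from selective_context import SelectiveContext

text = open('sample.txt').read()  # any representative document

sc = SelectiveContext(model_type='gpt2', lang='en')  # assumed constructor args

# End-to-end latency of one compression call.
t0 = time.perf_counter()
sc(text, reduce_ratio=0.2)
t_total = time.perf_counter() - t0

# Rough proxy for the lexical-unit phase: a bare spaCy parse plus noun_chunks.
nlp = spacy.load('en_core_web_sm')
t0 = time.perf_counter()
_ = list(nlp(text).noun_chunks)
t_spacy = time.perf_counter() - t0

print(f'total: {t_total:.2f}s, spaCy proxy: {t_spacy:.2f}s')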

@liyucheng09

"Selective Context, Ratio: 0.5, CUDA Memory = 61,885 MB, Time = 76.3 ms/token,
Time to construct selective context = 46.1 ms" 

It took only 46.1 ms on CUDA for self.sc(r, reduce_ratio = 0.20, reduce_level = reduce_level)?

The 3-4 seconds I am referring to is the total time it took to compress 5 sentences, for which I had spawned 5 threads, 1 for each sentence.

Yes. It could do better if I use batched input.
Model loading latency is not included.

Small models on CUDA are fast indeed.

Try opening a new issue for the latency improvement.

We could try reimplementing spaCy noun_chunks.

@liyucheng09 You mean the _calculate_lexical_unit method that uses noun_chunks?
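
If the spaCy pass turns out to dominate, one cheap change (short of a full reimplementation) is to disable pipeline components that noun_chunks does not need and to batch documents through nlp.pipe. The component names below match the standard en_core_web_sm pipeline; check nlp.pipe_names on your install:

import spacy

# noun_chunks needs POS tags and the dependency parse; NER and the lemmatizer
# can be disabled for speed.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'lemmatizer'])

def noun_chunks_batched(texts, batch_size=32, n_process=1):
    # nlp.pipe is considerably faster than calling nlp(text) per document.
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        yield [chunk.text for chunk in doc.noun_chunks]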

@liyucheng09 Btw, I did some benchmarking of Selective Context with our own internal data set (mostly technical data), and the BERT F1 score matches what you published in the paper: 0.9 at a 0.2 context compression ratio 😄 🙌
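
For anyone wanting to reproduce a check like this, the bert-score package gives the F1 directly. The setup below is only a minimal sketch, not the evaluation pipeline used in the paper:

from bert_score import score

# Answers produced from the compressed context vs. from the original context.
candidates = ['...answer generated from the compressed document...']
references = ['...answer generated from the original document...']

P, R, F1 = score(candidates, references, lang='en')
print(f'BERT F1: {F1.mean().item():.3f}')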

That's good! But I believe code compression has more potential than this, actually.