joshpxyne / gpt-migrate

Easily migrate your codebase from one framework or language to another.

Home Page: https://gpt-migrate.com


Breaking down large files into smaller chunks based on context window size

joshpxyne opened this issue · comments


@0xpayne This is a highly important fix. When will it be available, please?
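For anyone following along, here's a rough sketch of what "chunking by context window size" could mean in practice, using tiktoken for token counting. The token budget and the blank-line splitting rule are my own assumptions for illustration, not something the maintainers have committed to.

```python
# Naive sketch only: split a source file into chunks that each fit within a
# token budget, breaking at blank lines so statements are never cut in half.
# tiktoken and the 6000-token budget are assumptions for illustration.
import tiktoken

def chunk_source(path: str, max_tokens: int = 6000, model: str = "gpt-4") -> list[str]:
    enc = tiktoken.encoding_for_model(model)
    with open(path, "r", encoding="utf-8") as f:
        blocks = f.read().split("\n\n")  # split at blank lines

    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(enc.encode(candidate)) > max_tokens:
            chunks.append(current)  # flush the chunk that still fits
            current = block         # a single oversized block would need finer splitting
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```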

commented

IMO, this is quite a dangerous thing. At least in some experiments using the regular GPT web interface, I found that "carelessly" splitting a larger file can lead to very poor results when some code relies on earlier functions / definitions / variables.

@Sineos I totally agree. GPT struggles with cross-cutting logic, even more so when the code is broken down into chunks. Output quality degrades with the number of dependencies between variables, functions, libraries, classes, etc.
The only way I see this working (though not perfectly) is if we can push the entire codebase as input, and that probably requires a model with a ~1-million-token context window.

This problem can be partially solved with an AST (abstract syntax tree).
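To make the AST idea concrete, here's a minimal sketch for Python source using the standard-library ast module; other languages would need a parser such as tree-sitter. This is just an illustration, not code from this repo.

```python
# Illustration only (Python 3.8+): split a Python file at top-level
# function/class boundaries so every chunk is a syntactically complete unit.
import ast

def split_top_level(source: str) -> list[tuple[str, str]]:
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    segments = []
    for node in tree.body:
        start = node.lineno
        # include any decorators, which sit above the def/class line
        if getattr(node, "decorator_list", None):
            start = min(d.lineno for d in node.decorator_list)
        segment = "".join(lines[start - 1:node.end_lineno])
        segments.append((type(node).__name__, segment))
    return segments
```

Each segment could then be sent to the model along with whatever earlier definitions it references, which is where the real work lies.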

I've been (slowly) working on a solution for this where I've abstracted away the separately "compilable" parts. My initial aim was to use it in a similar project I was going to write, but it seems more worthwhile to contribute it here.

source splitter

@doyled-it

So does mine, but yours looks way more advanced.

It looks like you're trying to do the same thing as this project?


It has some differences. We're trying to focus on other aspects of modernization beyond direct translation of source files, although we still have that functionality.

And we don't have the loop where we run code, get an error, and update the code based on output.
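For readers unfamiliar with that loop, a stripped-down version might look like the sketch below. The ask_llm callable is a hypothetical stand-in for whatever completion call gpt-migrate actually uses; this is not the project's implementation.

```python
# Stripped-down sketch of a run/inspect/repair loop. ask_llm is a
# hypothetical callable supplied by the caller (prompt -> new source);
# it is NOT an actual gpt-migrate function.
import subprocess
from typing import Callable

def repair_loop(entry_file: str, source: str,
                ask_llm: Callable[[str], str], max_attempts: int = 5) -> str:
    for _ in range(max_attempts):
        with open(entry_file, "w", encoding="utf-8") as f:
            f.write(source)
        result = subprocess.run(["python", entry_file],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return source  # program ran cleanly, stop iterating
        # feed the stderr back to the model and ask for a corrected version
        source = ask_llm(
            f"This program fails with:\n{result.stderr}\n\n"
            f"Here is the code:\n{source}\n\nReturn a corrected version."
        )
    return source
```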

@doyled-it
I was looking at doing translation with distributed inference, e.g. through litellm. That way it could be more useful for open-source developers, since they could run local/free inference endpoints.

Looks like your project could use that too.
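For illustration, a litellm-backed translation call might look roughly like this. litellm's completion() takes an OpenAI-style request and routes it to whichever provider the model string names; the model names and prompt below are placeholders, not project defaults.

```python
# Rough sketch, not project code: route the same translation request to
# a hosted or local model via litellm's provider-prefixed model names.
from litellm import completion

def translate(source: str, target_lang: str, model: str = "gpt-4") -> str:
    response = completion(
        model=model,  # e.g. "ollama/codellama" for a local endpoint
        messages=[{
            "role": "user",
            "content": f"Translate this code to {target_lang}:\n\n{source}",
        }],
    )
    return response.choices[0].message.content
```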