ibm-granite / granite-code-models

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

Home Page: https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330

Support infilling?

rocky-lq opened this issue

Hello, many thanks for the brilliant work!

Does the granite code model support infilling format for code completion?

Hi @rocky-lq,
Yes, the current models support Fill-In-The-Middle (FIM).

You can use it the same way as StarCoder's FIM.

An example from StarCoder's README:

input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
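
For a self-contained version with one of the Granite checkpoints, something like the following should work (a sketch; the ibm-granite/granite-3b-code-base checkpoint, the device handling, and max_new_tokens are my choices for illustration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "ibm-granite/granite-3b-code-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# the prefix is the code before the cursor, the suffix the code after it;
# the model generates the missing middle
input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))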

Thanks!

About the example:

input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to('cuda:0')
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

This is the decoded output I get:

'<fim_prefix>def print_hello_world():\n <fim_suffix>\n print('Hello world!')<fim_middle>primo ='

I have a follow-up question. In real code the left context is often very large, and so is the right one, so I have to trim both (to the left and to the right of the line I am filling in).
Has your model been trained to be robust to this?

For example, I want to fill in a line in the middle of this file: "https://github.com/AltimateAI/datapilot-cli/blob/b7524af807fab7d9ea65c8e4f14d730ecbc149a5/src/datapilot/core/platforms/dbt/insights/modelling/unused_sources.py#L46"

Here is the part of the code I use to run the generation:

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig


checkpoint = "ibm-granite/granite-3b-code-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for fp16 use `torch_dtype=torch.float16` instead
model = AutoModelForCausalLM.from_pretrained(checkpoint).to('cuda:0')

FIM_PREFIX = '<fim_prefix>'
FIM_SUFFIX = '<fim_suffix>'
FIM_MIDDLE = '<fim_middle>'
FIM_FILE_SEPARATOR = '<filename>'
# stop generation on any FIM special token, on EOS, and on token id 203
# (which I assume is the newline token in this tokenizer)
terminators = tokenizer.convert_tokens_to_ids([FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE, FIM_FILE_SEPARATOR])
terminators += [tokenizer.eos_token_id, 203]

# `row` is a row of my pandas DataFrame with 'left_context', 'right_context' and 'file_path' columns;
# keep the last 1288 (= 1900 - 512 - 100) prefix tokens, the first 512 suffix tokens
# and the last 100 file-path tokens
left_context = tokenizer.decode(tokenizer.encode(row['left_context'])[-1900+512+100:])
right_context = tokenizer.decode(tokenizer.encode(row['right_context'])[:512])
file_path = tokenizer.decode(tokenizer.encode(row['file_path'])[-100:])

input_tow = f"<filename>{file_path}<fim_prefix>{left_context}<fim_suffix>{right_context}<fim_middle>"
input_ids = tokenizer(input_tow, return_tensors='pt').to('cuda:0')
prompt_len = input_ids['input_ids'].shape[1]

_config = {
    'num_return_sequences': 3,
    'num_beams': 3,
    #'context_length': 2048,
    'max_new_tokens': 64,
    'eos_token_id': terminators,

    'output_scores': True,
    'return_dict_in_generate': True,
    'use_cache': True,
    }
generation_config = GenerationConfig(
            #pad_token_id=108,
            **_config
            )
model_outputs = model.generate(input_ids=input_ids['input_ids'], generation_config=generation_config)
print(tokenizer.batch_decode(model_outputs.sequences[:, prompt_len:]))

[
 'log.log_CLI_collect_insights_bolt_5400_get_or_skip_response_line_json_format.value_string_to_safe_or_log_log_iterable, # pylint-disable-in-get_or_automation',
 'log.log_CLI_collect_insights_bolt_5400_get_or_skip_response_line_json_format.value_string_to_safe_or_log_log_iterable, # pylint-disable-line noawait\n response_data',
 'log.log_CLI_collect_insights_bolt_5400_get_or_skip_response_line_json_format.value_string_to_safe_or_log_log_iterable, # pylint-disable-line no-await\n response_'
]

These generations look very strange (for example, resolved_data is not defined anywhere above or below in the context).
Could you provide an example of the input format exactly as it was used in pretraining for the FIM task? It would also help if you could try a few real examples from code you actually work with. I think my problem is in how I build the prompt (for example, the context is cut off in a bad place, or the segments are in the wrong order).

I'm looking forward to your answer.

Hey, the format is the same as the one in the first comment.
But you do make a good point: I don't think our models are resilient to a trimmed prefix and suffix, though I am not sure how much of a difference it makes.
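
One thing that might help (just a suggestion, not something the models were specifically trained for) is to trim the prefix and suffix at line boundaries, so the model never sees a half-cut line. A rough sketch, reusing the tokenizer and row from your script above (the helper name and the token budgets are made up for illustration):

def trim_to_token_budget(text, tokenizer, budget, keep="tail"):
    # keep whole lines only, dropping lines from the far end until we fit the budget
    lines = text.splitlines(keepends=True)
    if keep == "tail":
        lines = lines[::-1]
    kept, used = [], 0
    for line in lines:
        n = len(tokenizer.encode(line, add_special_tokens=False))
        if used + n > budget:
            break
        kept.append(line)
        used += n
    if keep == "tail":
        kept = kept[::-1]
    return "".join(kept)

left_context = trim_to_token_budget(row['left_context'], tokenizer, 1288, keep="tail")   # lines closest to the hole
right_context = trim_to_token_budget(row['right_context'], tokenizer, 512, keep="head")  # lines right after the hole
prompt = f"<fim_prefix>{left_context}<fim_suffix>{right_context}<fim_middle>"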

<fim_prefix>def generate_random():
    <fim_suffix>return x<fim_middle>

Maybe try this example.
Also, I don't think you should add the <filename> tag.
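
For reference, here is how that prompt can be run end to end without the <filename> tag (a minimal sketch that reuses the tokenizer and model loaded in the script above; max_new_tokens=32 is just my choice):

input_text = "<fim_prefix>def generate_random():\n    <fim_suffix>return x<fim_middle>"
inputs = tokenizer(input_text, return_tensors="pt").to('cuda:0')
outputs = model.generate(inputs['input_ids'], max_new_tokens=32)
# decode only the generated middle, dropping the prompt tokens
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))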