ibm-granite / granite-code-models

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

Home Page: https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330

Support infilling?

rocky-lq opened this issue

Hello, many thanks for the brilliant work!

Does the granite code model support infilling format for code completion?

Hi @rocky-lq,
Yes, the current models support Fill-In-The-Middle (FIM).

You can use it the same way as StarCoder's FIM.

An example from StarCoder's README:

input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
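
For a self-contained version with one of the Granite checkpoints, something like the following should work (a sketch; the ibm-granite/granite-3b-code-base checkpoint, the device handling, and max_new_tokens are my choices for illustration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "ibm-granite/granite-3b-code-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# the prefix is the code before the cursor, the suffix the code after it;
# the model generates the missing middle
input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))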

Thanks!

About the example:

input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to('cuda:0')
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

This is the decoded output I get:

'<fim_prefix>def print_hello_world():\n <fim_suffix>\n print('Hello world!')<fim_middle>primo ='

I have a follow-up question. In real code the left context is often very large, and so is the right one, so I have to trim both (to the left and to the right of the line I am filling in).
Has your model been trained to be robust to this?

For example, I want to fill in a line in the middle of this file: "https://github.com/AltimateAI/datapilot-cli/blob/b7524af807fab7d9ea65c8e4f14d730ecbc149a5/src/datapilot/core/platforms/dbt/insights/modelling/unused_sources.py#L46"

Here is the part of the code I use to run the generation:

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig


checkpoint = "ibm-granite/granite-3b-code-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for fp16 use `torch_dtype=torch.float16` instead
model = AutoModelForCausalLM.from_pretrained(checkpoint).to('cuda:0')

FIM_PREFIX = '<fim_prefix>'
FIM_SUFFIX = '<fim_suffix>'
FIM_MIDDLE = '<fim_middle>'
FIM_FILE_SEPARATOR = '<filename>'
# stop generation on any FIM special token, on EOS, and on token id 203
# (which I assume is the newline token in this tokenizer)
terminators = tokenizer.convert_tokens_to_ids([FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE, FIM_FILE_SEPARATOR])
terminators += [tokenizer.eos_token_id, 203]

# `row` is a row of my pandas DataFrame with 'left_context', 'right_context' and 'file_path' columns;
# keep the last 1288 (= 1900 - 512 - 100) prefix tokens, the first 512 suffix tokens
# and the last 100 file-path tokens
left_context = tokenizer.decode(tokenizer.encode(row['left_context'])[-1900+512+100:])
right_context = tokenizer.decode(tokenizer.encode(row['right_context'])[:512])
file_path = tokenizer.decode(tokenizer.encode(row['file_path'])[-100:])

input_tow = f"<filename>{file_path}<fim_prefix>{left_context}<fim_suffix>{right_context}<fim_middle>"
input_ids = tokenizer(input_tow, return_tensors='pt').to('cuda:0')
prompt_len = input_ids['input_ids'].shape[1]

_config = {
    'num_return_sequences': 3,
    'num_beams': 3,
    #'context_length': 2048,
    'max_new_tokens': 64,
    'eos_token_id': terminators,

    'output_scores': True,
    'return_dict_in_generate': True,
    'use_cache': True,
    }
generation_config = GenerationConfig(
            #pad_token_id=108,
            **_config
            )
model_outputs = model.generate(input_ids=input_ids['input_ids'], generation_config=generation_config)
print(tokenizer.batch_decode(model_outputs.sequences[:, prompt_len:]))

[
 'log.log_CLI_collect_insights_bolt_5400_get_or_skip_response_line_json_format.value_string_to_safe_or_log_log_iterable, # pylint-disable-in-get_or_automation',
 'log.log_CLI_collect_insights_bolt_5400_get_or_skip_response_line_json_format.value_string_to_safe_or_log_log_iterable, # pylint-disable-line noawait\n response_data',
 'log.log_CLI_collect_insights_bolt_5400_get_or_skip_response_line_json_format.value_string_to_safe_or_log_log_iterable, # pylint-disable-line no-await\n response_'
]

These generations look very strange (for example, resolved_data is not defined anywhere above or below in the context).
Could you provide an example of the input format exactly as it was used in pretraining for the FIM task? It would also help if you could try a few real examples from code you actually work with. I think my problem is in how I build the prompt (for example, the context is cut off in a bad place, or the segments are in the wrong order).

I'm looking forward to your answer.

Hey, the format is the same as the one in the first comment.
But you do make a good point: I don't think our models are resilient to a trimmed prefix and suffix, though I am not sure how much of a difference it makes.
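
One thing that might help (just a suggestion, not something the models were specifically trained for) is to trim the prefix and suffix at line boundaries, so the model never sees a half-cut line. A rough sketch, reusing the tokenizer and row from your script above (the helper name and the token budgets are made up for illustration):

def trim_to_token_budget(text, tokenizer, budget, keep="tail"):
    # keep whole lines only, dropping lines from the far end until we fit the budget
    lines = text.splitlines(keepends=True)
    if keep == "tail":
        lines = lines[::-1]
    kept, used = [], 0
    for line in lines:
        n = len(tokenizer.encode(line, add_special_tokens=False))
        if used + n > budget:
            break
        kept.append(line)
        used += n
    if keep == "tail":
        kept = kept[::-1]
    return "".join(kept)

left_context = trim_to_token_budget(row['left_context'], tokenizer, 1288, keep="tail")   # lines closest to the hole
right_context = trim_to_token_budget(row['right_context'], tokenizer, 512, keep="head")  # lines right after the hole
prompt = f"<fim_prefix>{left_context}<fim_suffix>{right_context}<fim_middle>"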

<fim_prefix>def generate_random():
    <fim_suffix>return x<fim_middle>

Maybe try this example.
Also, I don't think you should add the <filename> tag.
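
For reference, here is how that prompt can be run end to end without the <filename> tag (a minimal sketch that reuses the tokenizer and model loaded in the script above; max_new_tokens=32 is just my choice):

input_text = "<fim_prefix>def generate_random():\n    <fim_suffix>return x<fim_middle>"
inputs = tokenizer(input_text, return_tensors="pt").to('cuda:0')
outputs = model.generate(inputs['input_ids'], max_new_tokens=32)
# decode only the generated middle, dropping the prompt tokens
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))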