salesforce / CodeGen

CodeGen is an open-source model for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.

Model-level Parallelism

VHellendoorn opened this issue · comments

Hi, thanks for releasing these models! It's great to see more open-source LLMs, especially for source code. I wanted to sample from the 16B-parameter model, but unfortunately the weights do not fit in the memory of a single 48GB GPU. Could you comment on whether the weights can be distributed across several GPUs at inference time? I imagine that would be valuable for facilitating the use of the largest models, given that few GPUs offer more than 48GB of memory. I believe the code below does something like this on the training side; it would be great if you could offer some implementation pointers for making this available for sampling.

def parallelize(self, device_map=None):
    # Check validity of device_map
    self.device_map = (
        get_device_map(len(self.h), range(torch.cuda.device_count())) if device_map is None else device_map
    )
    assert_device_map(self.device_map, len(self.h))
    self.model_parallel = True
    self.first_device = "cpu" if "cpu" in self.device_map.keys() else "cuda:" + str(min(self.device_map.keys()))
    self.last_device = "cuda:" + str(max(self.device_map.keys()))
    self.wte = self.wte.to(self.first_device)
    # Load onto devices
    for k, v in self.device_map.items():
        for block in v:
            cuda_device = "cuda:" + str(k)
            self.h[block] = self.h[block].to(cuda_device)
    # ln_f to last
    self.ln_f = self.ln_f.to(self.last_device)
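For reference, a minimal sketch of how this API is typically driven at inference time in Hugging Face transformers. It uses GPT2LMHeadModel as a stand-in because it exposes the same parallelize() method; the model name and the layer split in device_map are illustrative assumptions, not the CodeGen-16B layout.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative stand-in checkpoint; gpt2-xl has 48 transformer blocks.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Map GPU index -> list of transformer block indices placed on that GPU.
device_map = {
    0: list(range(0, 24)),   # first half of the blocks on cuda:0
    1: list(range(24, 48)),  # second half on cuda:1
}
model.parallelize(device_map)  # embeddings go to the first device, ln_f to the last
model.eval()

inputs = tokenizer("def hello_world():", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_length=64)
print(tokenizer.decode(out[0]))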

Hi Vincent, the reason the 16B model did not fit into a GPU with sufficiently large RAM was that sampling was not being run under half precision. We made a change to the sampling code earlier and it got turned off by default. We just pushed a small change (139825f) that (1) turns half precision on by default and (2) forces half precision for 16B models. Could you pull and try sampling again?
On my end, the model occupies about 33GB during sampling, which fits into a single NVIDIA A100 with 40GB RAM.
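For completeness, a hedged, standalone sketch of the memory effect described here, loading the checkpoint in half precision via transformers (the repo's own sampling script handles this through its fp16 flag; the Hub model name and prompt below are assumptions for illustration only):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# fp16 weights take ~2 bytes per parameter, so a 16B-parameter model needs
# roughly 32GB for weights instead of the ~64GB that fp32 would require.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-16B-mono")
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-16B-mono",
    torch_dtype=torch.float16,  # load weights directly in half precision
).to("cuda")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_length=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))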

Thanks, that works! Here I was wondering why it needed more than 3 bytes per weight -- figures :) The memory footprint matches what you report now. Thanks again for freely sharing these models.

Hi, could you please clarify what you mean by forcing half-precision on 16B models?
If I understand it correctly, when the model name starts with "codegen-16B", then no_fp16 = True and therefore fp16 will be disabled?

@boblee22 As you can see, the --no_fp16 argument is set to store_false, which sets the value to False when the flag is specified. From the script user's perspective this makes sense (passing a "no" flag when they don't want fp16 on), hence the name of the variable. However, it causes confusion like in your question, where the semantics of the variable name are the opposite of how it is used.

We could mitigate this by setting the flag to store_true and then declaring a new variable like use_fp16 = (not args.no_fp16), but in my opinion that is just as confusing as the current state of the code. A small standalone argparse example of the behavior in question follows below.
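A minimal sketch of the argparse behavior being discussed (standalone, not the repo's actual sampling script): with action="store_false" the destination defaults to True and flips to False when --no_fp16 is passed, so the variable reads like "no fp16" but effectively holds "use fp16".

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--no_fp16", action="store_false")

print(parser.parse_args([]).no_fp16)             # True  -> fp16 enabled by default
print(parser.parse_args(["--no_fp16"]).no_fp16)  # False -> fp16 disabled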

To add, I do agree that the current nomenclature is confusing. While the argument is called "no fp16", its boolean value is passed (without inversion) to "fp16" here, so it already acts as "use fp16". It might be worth renaming it to use_fp16 without changing anything else.

@VHellendoorn @boblee22 Changed (dec5101) for better readability; let me know if this clears up the confusion.