rafaelvp-db / db-ancient-code-translation

Simple repo showing code-to-code and code-to-text capabilities using LLMs on Databricks.


Databricks Ancient Code Translation

Topics: huggingface, pytorch, pyspark, databricks, plsql


TL;DR

This repo demonstrates code translation capabilities (code to text and code to code) using Large Language Models (LLMs) on Databricks.

Getting Started

  • Clone this repo into your Databricks Workspace
  • Configure a Databricks single-node cluster with Databricks Runtime 13.2 for Machine Learning and an NVIDIA A100 GPU (an A10 may also work, though with lower floating-point precision)
    • A100 Instances On Azure: Standard_NC24ads_A100_v4 instances
    • A100 Instances On AWS: EC2 P4d instances
  • Install the following libraries into the cluster (you can also do this directly in the notebooks and leverage requirements.txt for that; see the sketch after this list):
accelerate==0.21.0
ninja
alibi
einops
transformers
triton
xformers
  • Run the notebooks from the notebooks folder
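
For reference, a minimal notebook-scoped install sketch (run in a notebook cell before the other imports); the package list simply mirrors the one above, and the restart step is the standard way to pick up newly installed packages on Databricks:

# Notebook-scoped install sketch (assumes a Databricks ML Runtime cluster).
%pip install accelerate==0.21.0 ninja alibi einops transformers triton xformers
# Restart the Python process so the freshly installed packages are picked up.
dbutils.library.restartPython()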

Roadmap

  • PL/SQL
    • Generating code explanations (code to text); see the sketch after this list
    • Converting to PySpark
  • SAS
  • Snowflake
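
As a rough illustration of the "code to text" step for PL/SQL, the sketch below prompts a Hugging Face causal LLM to explain a small PL/SQL block. The checkpoint, prompt wording, and generation settings are illustrative assumptions and are not taken from the repo's notebooks:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"  # assumed checkpoint; the notebooks may use a different model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

# Hypothetical PL/SQL snippet to be explained.
plsql_block = """
BEGIN
  UPDATE employees SET salary = salary * 1.05 WHERE department_id = 10;
  COMMIT;
END;
"""

# Simple comment-style prompt asking the model to describe the PL/SQL block.
prompt = f"-- Explain what the following PL/SQL block does:\n{plsql_block}\n-- Explanation:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))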

Authors

Reference

Appendix

Evaluation

To evaluate StarCoder and its derivatives, you can use the BigCode Evaluation Harness, a framework for evaluating code LLMs.

Inference hardware requirements

In FP32 the model requires more than 60GB of RAM; you can load it in FP16 or BF16 with ~30GB, or in 8-bit with under 20GB of RAM:

# make sure you have accelerate and bitsandbytes installed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
# for fp16, replace `load_in_8bit=True` with `torch_dtype=torch.float16`
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", device_map="auto", load_in_8bit=True)
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
# Output: Memory footprint: 15939.61 MB
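
For the FP16 path mentioned in the comment above, a hedged sketch (assumes a GPU with roughly 30GB of memory available, such as an A100):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
# Load the checkpoint in half precision instead of 8-bit quantization.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder", device_map="auto", torch_dtype=torch.float16
)
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")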