Make a multi-modal LLM that can take these inputs:
- ✔️ Text
- ✔️ Image
- ✔️ Audio

Training:

Image:
✔️ Use the original Instruct 150k dataset, and use CLIP to obtain the image embeddings.
✔️ Add a projection layer that maps the CLIP embeddings into a representation the Phi model can consume (see the sketch after this list).
✔️ Add a QLoRA adapter and train it on the Instruct 150k dataset.
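A minimal sketch of the image path, assuming the Hugging Face `transformers` CLIP classes and a simple two-layer MLP projector (the class name `ProjectionLayer` and the hidden sizes are illustrative, not the repository's exact code):

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

CLIP_ID = "openai/clip-vit-large-patch14-336"
PHI_HIDDEN_SIZE = 2560  # hidden size of microsoft/phi-2

image_processor = CLIPImageProcessor.from_pretrained(CLIP_ID)
vision_tower = CLIPVisionModel.from_pretrained(CLIP_ID)


class ProjectionLayer(nn.Module):
    """Maps CLIP patch embeddings into the Phi-2 embedding space."""

    def __init__(self, clip_dim: int = 1024, phi_dim: int = PHI_HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, phi_dim),
            nn.GELU(),
            nn.Linear(phi_dim, phi_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


projector = ProjectionLayer()

image = Image.open("example.jpg")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    clip_out = vision_tower(pixel_values)
patch_embeds = clip_out.last_hidden_state[:, 1:, :]  # drop the CLS token, keep patch tokens
image_tokens = projector(patch_embeds)               # shape: (1, 576, 2560), ready for Phi-2
```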
Audio:
✔️ Use Whisper to perform ASR.
✔️ Add a projection layer for the Whisper output (see the sketch below).
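One way to read the "projection layer for the Whisper output" bullet is to project Whisper-tiny encoder states directly into Phi-2's embedding space; the sketch below assumes that reading (the classes described further down instead route audio through a text transcription):

```python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

whisper = WhisperModel.from_pretrained("openai/whisper-tiny")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# Whisper-tiny encoder states are 384-dimensional; Phi-2 expects 2560.
audio_projector = nn.Linear(whisper.config.d_model, 2560)


def audio_to_phi_embeddings(waveform, sampling_rate: int = 16_000) -> torch.Tensor:
    """Encode a raw waveform with the Whisper encoder and project it for Phi-2."""
    features = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        encoder_states = whisper.encoder(features.input_features).last_hidden_state
    return audio_projector(encoder_states)  # shape: (1, 1500, 2560)
```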
Text:
✔️ Accept any text prompt and generate the related details.

✔️ The output is always text, generated from the multimodal inputs: text, image, and audio.

✔️ The deployment page should look like ChatGPT, where we can type text, send images, or upload audio (live recording or a file); a sketch of such a page follows.
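A minimal sketch of such a page with Gradio; the `respond` body is a placeholder (the real app would call the MultiModalPhi2 class described below), and the `sources` argument follows Gradio 4.x:

```python
import gradio as gr


def respond(text, image, audio):
    # Placeholder: in the real app this delegates to MultiModalPhi2 (see below).
    # Any of the three inputs may be empty; the answer is always text.
    return f"text={bool(text)}, image={image is not None}, audio={audio is not None}"


demo = gr.Interface(
    fn=respond,
    inputs=[
        gr.Textbox(label="Text prompt"),
        gr.Image(type="pil", label="Image (optional)"),
        gr.Audio(sources=["microphone", "upload"], type="filepath", label="Audio (optional)"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Multi-Modal Phi-2",
)

if __name__ == "__main__":
    demo.launch()
```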
- Model used: Microsoft Phi-2
- Dataset used: TinyStories dataset (100k samples) & real-time data (100k samples) generated from a fine-tuned Phi-2 model via Ollama
- Pretraining approach: pretraining using QLoRA (see the setup sketch after this list)
- LLM Backbone: Phi-2
- Vision Tower: clip-vit-large-patch14-336
- Audio Model: Whisper Tiny
- Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
- Finetuning Dataset: Instruct 150k dataset based on COCO
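A sketch of the QLoRA setup for the Phi-2 backbone, assuming `bitsandbytes` 4-bit loading plus a PEFT LoRA adapter; the target module names and hyperparameters are assumptions, not the repository's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the backbone in 4-bit NF4 so only the LoRA adapters are trained in full precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed Phi-2 attention modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```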
class AudioLanguageConnector:
- This class prepares and tokenizes audio-related text data using the "microsoft/phi-2" tokenizer. The `<audio_start>` and `<audio_end>` tokens are added around the input text to provide context for audio-related processing, and the tokenized output is returned as a tensor. The class acts as a connector that formats the transcribed audio text for the language model.
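A minimal sketch of the behaviour described above; the special-token names follow the description, everything else is an assumption:

```python
from transformers import AutoTokenizer


class AudioLanguageConnector:
    """Wraps transcribed audio text in audio markers and tokenizes it for Phi-2."""

    def __init__(self, model_id: str = "microsoft/phi-2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
        # Register the audio boundary markers as special tokens.
        self.tokenizer.add_special_tokens(
            {"additional_special_tokens": ["<audio_start>", "<audio_end>"]}
        )

    def __call__(self, text: str):
        wrapped = f"<audio_start> {text} <audio_end>"
        return self.tokenizer(wrapped, return_tensors="pt")  # tokenized output as tensors
```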
class WhisperWithProjection:
- This class encapsulates the transcription step: it uses the pre-trained "openai/whisper-tiny" model to convert audio files into text transcriptions.
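A sketch of that transcription step; loading the waveform with `librosa` is an assumption for illustration:

```python
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor


class WhisperWithProjection:
    """Transcribes an audio file with openai/whisper-tiny."""

    def __init__(self, model_id: str = "openai/whisper-tiny"):
        self.processor = WhisperProcessor.from_pretrained(model_id)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_id)

    def transcribe(self, audio_path: str) -> str:
        waveform, sr = librosa.load(audio_path, sr=16_000)  # Whisper expects 16 kHz audio
        features = self.processor(waveform, sampling_rate=sr, return_tensors="pt")
        predicted_ids = self.model.generate(features.input_features)
        return self.processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```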
class MultiModalPhi2:
- This class takes text, audio, and image inputs and constructs a conversation prompt formatted for the model. It tokenizes the prompt, preprocesses the image, concatenates the audio embeddings when available, and generates new tokens with the pre-trained model conditioned on the supplied modalities. Finally, it decodes and returns the generated output, handling special tokens and potential mismatches.
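A sketch of the core generation step described above: embed each available modality, concatenate along the sequence dimension, and let Phi-2 generate. The prompt template and variable names are assumptions; recent `transformers` versions accept `inputs_embeds` in `generate` for decoder-only models:

```python
import torch


def generate_multimodal(phi2, tokenizer, text, image_tokens=None, audio_ids=None, max_new_tokens=128):
    """image_tokens: projected CLIP patch embeddings; audio_ids: tokenized transcription."""
    prompt_ids = tokenizer(f"Question: {text} Answer:", return_tensors="pt").input_ids
    embed = phi2.get_input_embeddings()

    parts = []
    if image_tokens is not None:
        parts.append(image_tokens)      # already in Phi-2's embedding space
    if audio_ids is not None:
        parts.append(embed(audio_ids))  # <audio_start> ... <audio_end> transcription
    parts.append(embed(prompt_ids))
    inputs_embeds = torch.cat(parts, dim=1)

    # With inputs_embeds (and no input_ids), generate returns only the newly generated tokens.
    output_ids = phi2.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```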
- Incorporating the original LLaVA model's finetuning recipe on the larger set of BLIP captions (558k samples) could lead to significant improvements.
- Quantizing the model with GPTQ or AWQ can reduce latency and make inference more efficient (a sketch follows).
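A sketch of GPTQ quantization through `transformers` (requires `optimum` and `auto-gptq`; the calibration dataset is an assumption). AWQ follows a similar pattern with `autoawq`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading; the result can be saved and served with lower latency.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=gptq_config,
    device_map="auto",
    trust_remote_code=True,
)
quantized_model.save_pretrained("phi-2-gptq-4bit")
tokenizer.save_pretrained("phi-2-gptq-4bit")
```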