Generative AI is experiencing rapid growth, and this repository serves as a comprehensive hub for updates on generative AI research, interview materials, notebooks, and more!
Explore the following resources:
- Monthly Best GenAI Papers List
- GenAI Interview Resources
- Applied LLMs Mastery 2024 course material (created by Aishwarya Naresh Reganti)
- List of all GenAI-related free courses (over 75 listed)
- List of code repositories/notebooks for developing generative AI applications
We'll be updating this repository regularly, so keep an eye out for the latest additions!
Happy Learning!
- Applied LLMs Mastery full course content has been released!!! (Click Here)
- 5-day roadmap to learn LLM foundations out now! (Click Here)
- 60 Common GenAI Interview Questions out now! (Click Here)
- ICLR 2024 paper summaries (Click Here)
- List of free GenAI courses (Click Here)
- Generative AI resources and roadmaps
*Updated at the end of every month
Date | Title | Summary | Topics |
---|---|---|---|
April 30, 2024 | Octopus v4: Graph of language models | This paper introduces Octopus v4, a novel approach leveraging functional tokens to integrate multiple open-source language models optimized for specific tasks. Octopus v4 excels in directing user queries to the most appropriate model and reformulating queries for optimal performance, building upon previous iterations (v1, v2, and v3) with enhanced selection and parameter understanding. Additionally, it explores the use of graphs as a versatile data structure to coordinate multiple models effectively. | Foundational LLM |
April 30, 2024 | Better & Faster Large Language Models via Multi-token Prediction | This paper proposes training language models to predict multiple future tokens simultaneously, enhancing sample efficiency without increasing training time. By employing multiple output heads for predicting n tokens ahead, the method improves downstream capabilities for both code and natural language models. Particularly beneficial for larger models, it consistently outperforms single-token prediction on generative benchmarks like coding, showing notable gains in problem-solving tasks. Moreover, models trained with multi-token prediction demonstrate up to threefold faster inference speeds, even with large batch sizes, offering additional efficiency benefits (see the multi-token prediction sketch after this table). | New Architecture |
April 30, 2024 | Extending Llama-3's Context Ten-Fold Overnight | The Llama-3-8B-Instruct model's context length is extended from 8K to 80K through efficient QLoRA fine-tuning, requiring only 8 hours on a single 8xA800 GPU machine. This extension significantly enhances model performance across various evaluation tasks like NIHS and topic retrieval, while maintaining proficiency in short-context tasks. Surprisingly, the extension is achieved with just 3.5K synthetic training samples from GPT-4, showcasing the untapped potential of LLMs to extend context lengths. The team plans to release all associated resources publicly, including data, model, data generation pipeline, and training code. | Context Length |
April 29, 2024 | Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models | This paper addresses the challenge of accurately evaluating the quality of LLM generations by proposing the use of a Panel of LLM Evaluators (PoLL) instead of relying on a single large model like GPT-4. The PoLL approach, composed of a larger number of smaller models, outperforms single large judges across three distinct settings and six datasets. It exhibits less intra-model bias and is over seven times less expensive, offering a cost-effective and more reliable evaluation method for LLMs (see the PoLL sketch after this table). | Evaluation |
April 28, 2024 | AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | This paper presents a novel method for generating human-readable adversarial prompts, called AdvPrompter, to address jailbreaking attacks on LLMs. Unlike existing optimization-based approaches, AdvPrompter generates adversarial prompts in seconds, roughly 800 times faster, without requiring access to gradients from the TargetLLM. The method alternates between generating high-quality target adversarial suffixes and low-rank fine-tuning of AdvPrompter. Experimental results demonstrate state-of-the-art performance on the AdvBench dataset and transferability to closed-source black-box LLM APIs. | Adversarial Attacks, Evaluation |
April 28, 2024 | Capabilities of Gemini Models in Medicine | Med-Gemini, a specialized multimodal model for medical tasks, surpasses GPT-4 on various benchmarks, achieving state-of-the-art results in medical text summarization and question answering. With its advanced long-context reasoning, it outperforms existing methods in tasks such as needle-in-a-haystack retrieval from medical records. While promising, further evaluation is needed before deployment in real-world medical applications. | Domain-Specific LLMs |
April 25, 2024 | How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | InternVL 1.5, an open-source multimodal large language model (MLLM), bridges the gap between open-source and proprietary commercial models in multimodal understanding. It introduces three improvements: a Strong Vision Encoder, Dynamic High-Resolution image processing supporting up to 4K resolution, and a High-Quality Bilingual Dataset. Evaluation across benchmarks demonstrates its effectiveness compared to both open-source and proprietary models. | Multimodal LLMs |
April 25, 2024 | Make Your LLM Fully Utilize the Context | This paper introduces information-intensive (IN2) training to address the lost-in-the-middle challenge faced by contemporary LLMs. Leveraging a synthesized long-context question-answer dataset, IN2 training emphasizes fine-grained information awareness within long contexts. Applying this approach to Mistral-7B yields FILM-7B (FILl-in-the-Middle), which robustly retrieves information from various positions in a 32K context window. FILM-7B improves performance on real-world long-context tasks while maintaining comparable performance on short-context tasks. | Context Length |
April 25, 2024 | SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | This work introduces SEED-Bench-2-Plus, a benchmark specifically tailored for evaluating text-rich visual comprehension of Multimodal Large Language Models (MLLMs). With 2.3K multiple-choice questions covering Charts, Maps, and Webs, it aims to simulate real-world text-rich scenarios comprehensively. Evaluation involving 34 prominent MLLMs highlights current limitations in text-rich visual comprehension, emphasizing the need for further research and improvement in this area. SEED-Bench-2-Plus serves as a valuable addition to existing MLLM benchmarks, offering insightful observations and inspiring future developments in text-rich visual comprehension. | Multimodal LLMs |
April 23, 2024 | Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order | This paper presents Aurora-M, a multilingual open-source language model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. It surpasses 2 trillion tokens in total training token count and is fine-tuned on human-reviewed safety instructions, aligning its development with the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. | Domain-Specific LLMs |
April 23, 2024 | Multi-Head Mixture-of-Experts | This paper introduces Multi-Head Mixture-of-Experts (MH-MoE) to address issues in Sparse Mixtures of Experts (SMoE), specifically low expert activation and lack of fine-grained analytical capabilities. MH-MoE employs a multi-head mechanism to split tokens into sub-tokens, assigning them to diverse experts for parallel processing before reintegrating them. This approach enhances expert activation, deepening context understanding and alleviating overfitting. MH-MoE is easy to implement and integrates seamlessly with other SMoE models, as demonstrated across English-focused language modeling, Multi-lingual language modeling, and Masked multi-modality modeling tasks. | New Architecture |
April 22, 2024 | Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone | This paper introduces phi-3-mini, a compact 3.8 billion parameter language model trained on 3.3 trillion tokens, delivering performance competitive with larger models like Mixtral 8x7B and GPT-3.5. Achieving notable scores on benchmarks such as MMLU (69%) and MT-bench (8.38), phi-3-mini is designed for deployment on mobile devices. The innovation lies in its dataset, a scaled-up version of phi-2's, comprising heavily filtered web data and synthetic data. Additionally, initial parameter-scaling results with phi-3-small and phi-3-medium models trained on 4.8T tokens demonstrate further enhanced performance. | Foundational LLM |
April 22, 2024 | How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study | This paper explores the performance of Meta's LLaMA3 LLMs under low-bit quantization, essential for resource-limited scenarios. Despite their impressive pre-training on over 15T tokens, LLaMA3 models exhibit notable degradation when quantized to low bit-width. Evaluating 10 quantization methods on 1-8 bits across diverse datasets, the study reveals significant performance gaps, especially in ultra-low bit-width scenarios, highlighting the need for future developments to bridge this gap for practical applications (see the quantization sketch after this table). | Quantization |
April 22, 2024 | FlowMind: Automatic Workflow Generation with LLMs | This paper introduces FlowMind, leveraging LLMs like Generative Pretrained Transformers (GPT) to automate workflow generation in Robotic Process Automation (RPA), overcoming limitations in handling spontaneous tasks. FlowMind's generic prompt recipe grounds LLM reasoning with reliable APIs, mitigating hallucination issues and ensuring data confidentiality. It simplifies user interaction by presenting high-level workflow descriptions, allowing effective inspection and feedback. Evaluation on NCEN-QA dataset demonstrates FlowMind's success and the significance of its components in enhancing user interaction and workflow generation. | LLM Agents |
April 22, 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | This paper introduces SnapKV, a fine-tuning-free approach to efficiently minimize Key-Value (KV) cache size in LLMs while maintaining comparable performance. SnapKV utilizes attention head-specific prompt features identified from an 'observation' window, automatically compressing KV caches by selecting clustered important positions. This significantly reduces computational overhead and memory footprint, achieving a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline models when processing long input sequences (see the SnapKV sketch after this table). | Fine-Tuning |
April 21, 2024 | AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation | This paper introduces AutoCrawler, a two-stage framework merging LLMs with crawlers to enhance adaptability in web automation. Addressing limitations of traditional methods and standalone LLM-based agents, AutoCrawler employs a hierarchical HTML structure for progressive understanding through top-down and step-back operations. Comprehensive experiments validate the effectiveness of this approach in handling diverse and changing web environments efficiently. | LLM Agents |
April 21, 2024 | Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | This paper presents Groma, a Multimodal Large Language Model (MLLM) equipped with fine-grained visual perception capabilities, enabling region-level tasks like captioning and visual grounding. Groma employs a localized visual tokenization mechanism to decompose images into regions of interest, seamlessly integrating region tokens into user instructions and model responses. By curating a visually grounded instruction dataset, Groma outperforms MLLMs relying solely on language models or external modules for localization, demonstrating superior performance in standard referring and grounding benchmarks. | Multimodal LLMs |
April 18, 2024 | Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | This paper addresses the challenge of enhancing LLMs' reasoning and planning capabilities without relying on extensive data or fine-tuning. The authors introduce AlphaLLM, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop. AlphaLLM includes components for prompt synthesis, an efficient MCTS approach for language tasks, and critic models for precise feedback. Experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances LLM performance without additional annotations, showcasing its potential for self-improvement in complex reasoning and planning tasks. | Instruction Tuning |
April 18, 2024 | Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment | This paper explores a simple approach for zero-shot cross-lingual alignment of language models using reward models trained on preference data from one source language and applied to other target languages. Evaluations on summarization and open-ended dialog generation tasks consistently show the success of this method, with cross-lingually aligned models preferred by humans in over 70% of evaluation instances. Surprisingly, different-language reward models sometimes outperform same-language ones. The study also identifies best practices for alignment when language-specific data for supervised fine-tuning is unavailable. | Instruction Tuning |
April 18, 2024 | Introducing v0.5 of the AI Safety Benchmark from MLCommons | This paper presents v0.5 of the AI Safety Benchmark, developed by the MLCommons AI Safety Working Group, to assess safety risks of chat-tuned language models. It introduces a principled approach, covering a single use case and personas, along with a taxonomy of 13 hazard categories and tests for 7 of them. Version 1.0 is planned for release by the end of 2024, aiming to provide deeper insights into AI system safety. While v0.5 should not be used for safety assessment, it offers detailed documentation and tools for evaluation, including a grading system and an openly available platform called ModelBench. | Benchmark, Evaluation |
April 16, 2024 | Octopus v2: On-device language model for super agent | This research presents a new method that empowers an on-device language model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, while reducing the context length by 95%. The method addresses concerns over privacy and cost associated with large-scale language models in cloud environments by enabling deployment on edge devices such as smartphones, cars, VR headsets, and personal computers. By enhancing latency and reducing inference costs, the method aligns with the performance requisites for real-world applications, making it suitable for deployment across a variety of edge devices in production environments. | Small LLMs |
April 15, 2024 | Learn Your Reference Model for Real Good Alignment | Existing methods for the alignment problem are unstable, prompting researchers to develop various techniques. In language model alignment, Reinforcement Learning From Human Feedback (RLHF) minimizes the Kullback-Leibler divergence between policies to prevent overfitting. Direct Preference Optimization (DPO) aims to eliminate the reward model but faces limitations. The authors propose Trust Region DPO (TR-DPO), which updates the reference policy during training and outperforms DPO by up to 19% on the Anthropic HH and TLDR datasets, enhancing model quality across multiple parameters (see the TR-DPO sketch after this table). | Alignment |
April 15, 2024 | Compression Represents Intelligence Linearly | This paper investigates the relationship between compression and intelligence in LLMs, finding that LLMs' ability to compress external text corpora correlates almost linearly with their intelligence, as measured by benchmark scores. The results provide empirical evidence supporting the belief that superior compression reflects greater intelligence. Additionally, compression efficiency serves as a reliable evaluation measure associated with model capabilities, with open-sourced datasets and pipelines provided for future research in compression assessment. | Model Compression |
April 14, 2024 | Pre-training Small Base LMs with Fewer Tokens | The paper presents Inheritune, a straightforward method for constructing a smaller language model from a larger one by inheriting transformer blocks and training it on a fraction of the original pretraining data. They showcase its effectiveness by building a 1.5B parameter LM using only 1B tokens from a larger model, achieving comparable performance to publicly available models trained on significantly more data. Furthermore, they demonstrate that smaller LMs utilizing layers from larger ones can match the performance of their bigger counterparts when trained on equivalent data volumes. Extensive experiments validate the efficacy of Inheritune across diverse settings. | Small LLMs |
April 14, 2024 | Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies | The paper explores scaling down Contrastive Language-Image Pre-training (CLIP) under limited computation budgets across data, architecture, and training strategies. It emphasizes the importance of high-quality data and suggests smaller ViT models for smaller datasets and larger ones for larger datasets with fixed compute. Additionally, it compares four training strategies, finding that CLIP+Data Augmentation achieves comparable results to CLIP using half the data, offering practical insights for CLIP training and deployment. | Vision Models |
April 12, 2024 | Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length | Megalodon addresses the quadratic complexity and weak length extrapolation issues of Transformers by introducing a neural architecture for efficient sequence modeling with unlimited context length. It inherits Mega's architecture and incorporates enhancements such as complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and pre-norm with two-hop residual configuration. In a head-to-head comparison with Llama2 at the scale of 7 billion parameters and 2 trillion training tokens, Megalodon demonstrates better efficiency than Transformers, achieving a training loss of 1.70, between Llama2-7B (1.75) and Llama2-13B (1.67). | Context Length |
April 11, 2024 | RULER: What's the Real Context Size of Your Long-Context Language Models? | The needle-in-a-haystack (NIAH) test, widely used to evaluate long-context language models, assesses the ability to retrieve information from long distractor texts. However, it only measures a superficial form of long-context understanding. To provide a more comprehensive evaluation, a new synthetic benchmark called RULER is introduced. RULER expands upon the NIAH test by incorporating variations with diverse types and quantities of needles and introduces new task categories like multi-hop tracing and aggregation to test behaviors beyond context searching. The evaluation of ten long-context LMs with 13 representative tasks in RULER reveals large performance drops as the context length increases, despite nearly perfect accuracy in the NIAH test. Only four models can maintain satisfactory performance at the length of 32K tokens. RULER is open-sourced to encourage comprehensive evaluation of long-context LMs. | Context Length |
April 11, 2024 | Social Skill Training with Large Language Models | This perspective paper identifies social skill barriers to entering specialized fields and presents a solution leveraging large language models for social skill training via a generic framework. The proposed AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback. The work calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality. | Alignment |
April 11, 2024 | Rho-1: Not All Tokens Are What You Need | Traditional language model pre-training methods treat all tokens equally, but this research challenges that assumption by showing that not all tokens are equally important. The authors introduce Rho-1, a new model that selectively trains on tokens aligned with the desired distribution, improving few-shot accuracy in math tasks by up to 30%. After fine-tuning, Rho-1 achieves state-of-the-art results on the MATH dataset with significantly fewer pretraining tokens compared to existing models. Moreover, pretraining Rho-1 on general tokens enhances performance across diverse tasks, boosting both efficiency and effectiveness in language model pre-training. | New Architecture |
April 11, 2024 | RecurrentGemma: Moving Past Transformers for Efficient Open Language Models | The paper introduces RecurrentGemma, an open language model that uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language tasks. It has a fixed-size state, which reduces memory use and enables efficient inference on long sequences. The authors provide a pre-trained model with 2B non-embedding parameters and an instruction-tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens. | New Architecture |
April 11, 2024 | Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models | This work presents Ferret-v2, an upgraded version of Ferret that overcomes limitations in regional understanding within LLMs. Ferret-v2 introduces three key enhancements: (1) Any resolution grounding and referring for improved image processing at higher resolutions. (2) Multi-granularity visual encoding using the DINOv2 encoder to better capture diverse visual contexts. (3) A three-stage training paradigm, including high-resolution dense alignment, leading to substantial improvements over Ferret and other state-of-the-art methods in referring and grounding tasks. | Benchmark, Evaluation |
April 10, 2024 | JetMoE: Reaching Llama2 Performance with 0.1M Dollars | The paper introduces JetMoE-8B, a cost-effective and high-performing Large Language Model trained with minimal resources. Its efficient architecture reduces computation significantly compared to previous models, while its transparency encourages collaboration and advancements in accessible LLM development. | Foundational LLM |
April 9, 2024 | Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models | This paper examines how LLMs handle tabular data, focusing on issues of memorization and overfitting. It finds that LLMs memorize popular tabular datasets and perform better on these, suggesting overfitting. The study also highlights the limited in-context statistical learning abilities of LLMs without fine-tuning, emphasizing the importance of evaluating whether an LLM has seen an evaluation dataset during pre-training. The paper introduces the tabmemcheck Python package for testing exposure to datasets. | Domain-Specific LLMs |
April 9, 2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | This work introduces an efficient method to scale Transformer-based LLMs to infinitely long inputs with bounded memory and computation. A key component in the proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds both masked local attention and long-term linear attention mechanisms in a single Transformer block. The effectiveness of this approach is demonstrated on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval, and 500K length book summarization tasks with 1B and 8B LLMs. The approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs. | Context Length |
April 8, 2024 | LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders | Large decoder-only language models (LLMs) excel in NLP tasks but are underutilized for text embedding. This study introduces LLM2Vec, a method converting decoder-only LLMs into robust text encoders via bidirectional attention, masked token prediction, and contrastive learning. Applied to LLMs with 1.3B to 7B parameters, LLM2Vec surpasses encoder-only models on word-level tasks and achieves a new unsupervised state-of-the-art on the Massive Text Embeddings Benchmark (MTEB). Integration with supervised contrastive learning further boosts performance, demonstrating the potential to create universal text encoders from LLMs without costly adaptation or synthetic data. | New Architecture |
April 7, 2024 | Stream of Search (SoS): Learning to Search in Language | This paper introduces the concept of Stream of Search (SoS), teaching language models to search by representing the process in language. SoS is demonstrated using the game of Countdown, where models are trained to combine input numbers and arithmetic operations to reach a target number. Pretraining on SoS increases search accuracy by 25%, and further fine-tuning allows models to solve 36% of previously unsolved problems. This approach enables language models to learn problem-solving strategies and potentially discover new ones. | Domain-Specific LLMs |
April 4, 2024 | Long-context LLMs Struggle with Long In-context Learning | This study introduces a specialized benchmark, LongICLBench, focusing on long in-context learning within the realm of extreme-label classification. The benchmark evaluates 13 long-context LLMs on datasets with input lengths ranging from 2K to 50K tokens and label ranges spanning 28 to 174 classes. While long-context LLMs perform relatively well on less challenging tasks with shorter demonstration lengths, they struggle on more difficult tasks, reaching close to zero accuracy on the most challenging task, Discovery with 174 labels. Further analysis reveals a gap in current LLM capabilities for processing and understanding long, context-rich sequences, indicating the need for improved long context understanding and reasoning abilities in future LLMs. | Context Length |
April 4, 2024 | ReFT: Representation Finetuning for Language Models | This paper introduces Representation Finetuning (ReFT) methods as an alternative to parameter-efficient fine-tuning (PEFT) methods for adapting large language models. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations, aiming to edit representations rather than modifying weights. A strong instance of ReFT, called Low-rank Linear Subspace ReFT (LoReFT), is presented, which achieves 10-50 times more parameter efficiency than prior PEFTs. LoReFT is showcased on various evaluation tasks, delivering the best balance of efficiency and performance compared to existing methods. | Fine-Tuning, PEFT |
April 4, 2024 | Training LLMs over Neurally Compressed Text | This paper explores training LLMs over highly compressed text using neural text compressors. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. The main obstacle to training LLMs directly over neurally compressed text is that strong compression tends to produce opaque outputs not well-suited for learning. To address this, the paper proposes Equal-Info Windows, a compression technique segmenting text into blocks that compress to the same bit length. This method enables effective learning over neurally compressed text, improving with scale and outperforming byte-level baselines on perplexity and inference speed benchmarks. The paper also provides suggestions for further improving high-compression tokenizers (see the Equal-Info Windows sketch after this table). | Model Compression |
April 4, 2024 | CodeEditorBench: Evaluating Code Editing Capability of Large Language Models | This paper introduces CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. CodeEditorBench emphasizes real-world scenarios and practical aspects of software development by curating diverse coding challenges and scenarios from various sources. Evaluation of 19 LLMs reveals that closed-source models, particularly Gemini-Ultra and GPT-4, outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities and will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. | Evaluation |
April 4, 2024 | GPT-4V Red-teamed under 11 Different Safety Policies | This paper presents a comprehensive jailbreak evaluation dataset comprising 1445 harmful questions across 11 safety policies. Extensive red-teaming experiments are conducted on 11 different LLMs and Multimodal Large Language Models (MLLMs), including both state-of-the-art proprietary and open-source models. Results reveal GPT-4 and GPT-4V's superior robustness against jailbreak attacks compared to open-source models. Notably, Llama2 and Qwen-VL-Chat demonstrate higher robustness among open-source models. The transferability of visual jailbreak methods is found to be relatively limited compared to textual jailbreak methods. | Red Teaming |
April 4, 2024 | RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | RALL-E presents a robust language modeling approach for text-to-speech (TTS) synthesis, addressing issues of poor robustness in LLMs such as unstable prosody and high word error rate (WER). The method employs chain-of-thought (CoT) prompting to decompose the task into simpler steps, predicting prosody features of the input text and using them as intermediate conditions to predict speech tokens. Additionally, RALL-E utilizes predicted duration prompts to guide self-attention weights, improving focus on corresponding phonemes and prosody features. Objective and subjective evaluations demonstrate significant improvements in WER compared to baseline methods, showcasing RALL-E's effectiveness in synthesizing challenging sentences with reduced error rates. | Prompt Engineering |
April 4, 2024 | CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | The paper introduces the CantTalkAboutThis dataset, aimed at aligning language models to maintain topic relevance in conversations. It consists of synthetic dialogues with distractor turns to divert chatbots from the predefined topic. Training on this dataset improves language models' ability to stay on topic and enhances performance on instruction-following tasks, including safety alignment. | Alignment |
April 3, 2024 | On the Scalability of Diffusion-based Text-to-Image Generation | This paper empirically studies the scaling properties of diffusion-based text-to-image (T2I) models by conducting extensive ablations on scaling denoising backbones and training sets. The study explores various training settings and training costs to understand how to efficiently scale the model for better performance at reduced cost. The findings suggest that increasing the transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers. Additionally, the quality and diversity of the training set have a significant impact on text-image alignment performance and learning efficiency. Scaling functions are provided to predict text-image alignment performance based on model size, compute, and dataset size. | Multimodal LLMs |
April 2, 2024 | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | This paper introduces a novel framework for aligning large multimodal models (LMMs) with video content using detailed video captions as a proxy. The framework enhances the performance of video LMMs on video Question Answering (QA) tasks by incorporating informative feedback and improving the accuracy of generated responses compared to corresponding videos. The approach utilizes direct preference optimization (DPO) to guide LMMs towards generating more accurate, helpful, and harmless content in multimodal contexts. | Multimodal LLMs |
April 2, 2024 | Advancing LLM Reasoning Generalists with Preference Trees | This paper introduces EURUS, a suite of LLMs optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, EURUS models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. EURUS outperforms existing open-source models by margins of more than 13.3% on challenging benchmarks like LeetCode and TheoremQA. The strong performance of EURUS is attributed to ULTRA INTERACT, a large-scale alignment dataset designed for complex reasoning tasks, and a novel reward modeling objective derived from preference learning techniques. | Domain-Specific LLMs |
April 2, 2024 | Mixture-of-Depths: Dynamically allocating compute in transformer-based language models | This paper introduces a method for transformers to dynamically allocate compute, optimizing allocation across layers in the model depth. By capping the number of tokens participating in computations at each layer, the method uses a static computation graph with fluid token identities, resulting in efficient compute allocation. Models trained with this method match baseline performance but require fewer FLOPs per forward pass, speeding up training and sampling (see the Mixture-of-Depths sketch after this table). | New Architecture |
April 1, 2024 | LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model | This paper presents LLaVA-Gemma, a suite of multimodal foundation models trained using the LLaVA framework with the Gemma family of LLMs, particularly the 2B parameter Gemma model. The study evaluates the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. While LLaVA-Gemma exhibits moderate performance on various evaluations, it fails to surpass current state-of-the-art models of comparable size. The paper releases training recipes, code, and weights for the LLaVA-Gemma models, facilitating further research in this area. | Multimodal LLMs |
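*The code sketches below are simplified, unofficial illustrations of a few techniques summarized in the table above. All class and function names, shapes, and hyperparameters are assumptions made for this guide, not the authors' implementations.*

First, multi-token prediction (April 30 entry): a shared trunk feeds n independent unembedding heads, each predicting the token i steps ahead, and the per-head losses are averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """Hypothetical multi-token prediction heads on a shared trunk."""

    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        # One unembedding head per future offset; the trunk is shared.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_future)
        )

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, d_model] trunk outputs; tokens: [batch, seq]
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i, :])   # head i predicts token t+i
            target = tokens[:, i:]             # labels shifted by i
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return total / len(self.heads)
```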
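The Panel of LLM Evaluators (PoLL, April 29 entry) can be approximated by pooling independent scores from several small judges (the paper also discusses voting). The judge callables here are stand-ins for calls to different small models sharing one rubric prompt.

```python
from statistics import mean
from typing import Callable, List

def poll_score(answer: str, judges: List[Callable[[str], float]]) -> float:
    """Average the independent scores of a panel of judge models."""
    return mean(judge(answer) for judge in judges)

# Stand-in judges for illustration; real ones would each query a small LLM.
judges = [
    lambda a: 4.0 if "because" in a else 3.0,
    lambda a: 5.0 if len(a) > 40 else 2.0,
    lambda a: 3.5,
]
print(poll_score("Paris, because it is the capital of France.", judges))
```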
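To build intuition for the low-bit LLaMA3 study (April 22 entry), plain round-to-nearest quantization makes the rounding error visible as the bit-width shrinks; this generic baseline is not any specific method evaluated in the paper.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Round-to-nearest quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

w = torch.randn(4096)
for bits in (8, 4, 2):
    err = (w - quantize_rtn(w, bits)).abs().mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")  # error grows as bits drop
```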
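SnapKV (April 22 entry), roughly: score past positions by the attention they receive from a trailing 'observation' window, smooth the scores to keep clusters intact, and retain only the top-k KV entries per head. The shapes and pooling kernel below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def snapkv_select(attn: torch.Tensor, keep: int, pool: int = 7) -> torch.Tensor:
    # attn: [heads, obs_len, seq_len] attention from the observation window.
    votes = attn.sum(dim=1)                         # [heads, seq_len]
    votes = F.avg_pool1d(                           # smooth to keep clusters
        votes.unsqueeze(1), pool, stride=1, padding=pool // 2
    ).squeeze(1)
    return votes.topk(keep, dim=-1).indices         # KV positions to retain

attn = torch.rand(8, 16, 4096).softmax(dim=-1)
print(snapkv_select(attn, keep=512).shape)          # torch.Size([8, 512])
```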
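TR-DPO (April 15 entry) keeps the standard DPO objective but, unlike vanilla DPO, periodically pulls the frozen reference policy toward the current policy. The soft-update rule and the alpha value below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are summed log-probs of chosen/rejected responses under each model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

@torch.no_grad()
def soft_update_reference(policy, reference, alpha=0.01):
    # TR-DPO-style update: blend the reference toward the trained policy.
    for p, r in zip(policy.parameters(), reference.parameters()):
        r.mul_(1 - alpha).add_(alpha * p)
```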
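Equal-Info Windows (April 4 entry) segments text into blocks whose compressed size reaches a fixed bit budget, so every block carries roughly equal information. The sketch substitutes zlib for the paper's neural compressor, so it only approximates the idea.

```python
import zlib

def equal_info_windows(text: str, bit_budget: int = 256) -> list:
    """Greedily grow each block until its compressed size hits the budget."""
    blocks, start = [], 0
    for end in range(1, len(text) + 1):
        if 8 * len(zlib.compress(text[start:end].encode())) >= bit_budget:
            blocks.append(text[start:end])
            start = end
    if start < len(text):
        blocks.append(text[start:])  # trailing remainder under budget
    return blocks

print([len(b) for b in equal_info_windows("to be or not to be, " * 40)])
```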
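Finally, a toy Mixture-of-Depths block (April 2 entry): a learned router scores tokens, only the top-k per sequence pass through the block's computation, and the rest ride the residual stream unchanged. The fixed token budget is what keeps the computation graph static.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Simplified Mixture-of-Depths routing around an MLP block."""

    def __init__(self, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)
        self.block = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        k = max(1, int(s * self.capacity))       # fixed per-layer token budget
        idx = self.router(x).squeeze(-1).topk(k, dim=1).indices
        gather_idx = idx.unsqueeze(-1).expand(b, k, d)
        picked = torch.gather(x, 1, gather_idx)  # tokens chosen for compute
        out = x.clone()                          # everyone keeps the residual
        out.scatter_(1, gather_idx, picked + self.block(picked))
        return out

print(MoDBlock(64)(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```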
Join 1000+ students on this 10-week adventure as we delve into the application of LLMs across a variety of use cases.
Link to the course website
[Feb 2024] Registrations are still open; click here to register
🗓️*Week 1 [Jan 15 2024]*: Practical Introduction to LLMs
- Applied LLM Foundations
- Real World LLM Use Cases
- Domain and Task Adaptation Methods
🗓️*Week 2 [Jan 22 2024]*: Prompting and Prompt Engineering
- Basic Prompting Principles
- Types of Prompting
- Applications, Risks and Advanced Prompting
🗓️*Week 3 [Jan 29 2024]*: LLM Fine-tuning
- Basics of Fine-Tuning
- Types of Fine-Tuning
- Fine-Tuning Challenges
🗓️*Week 4 [Feb 5 2024]*: RAG (Retrieval-Augmented Generation)
- Understanding the concept of RAG in LLMs
- Key components of RAG
- Advanced RAG Methods
🗓️*Week 5 [Feb 12 2024]*: Tools for building LLM Apps
- Fine-tuning Tools
- RAG Tools
- Tools for observability, prompting, serving, vector search etc.
🗓️*Week 6 [Feb 19 2024]*: Evaluation Techniques
- Types of Evaluation
- Common Evaluation Benchmarks
- Common Metrics
🗓️*Week 7 [Feb 26 2024]*: Building Your Own LLM Application
- Components of LLM application
- Build your own LLM App end to end
🗓️*Week 8 [March 4 2024]*: Advanced Features and Deployment
- LLM lifecycle and LLMOps
- LLM Monitoring and Observability
- Deployment strategies
🗓️*Week 9 [March 11 2024]*: Challenges with LLMs
- Scaling Challenges
- Behavioral Challenges
- Future directions
🗓️*Week 10 [March 18 2024]*: Emerging Research Trends
- Smaller and more performant models
- Multimodal models
- LLM Alignment
🗓️*Week 11 (Bonus) [March 25 2024]*: Foundations
- Generative Models Foundations
- Self-Attention and Transformers
- Neural Networks for Language
- Large Language Models by ETH Zurich
- Understanding Large Language Models by Princeton
- Transformers course by Hugging Face
- NLP course by Hugging Face
- CS324 - Large Language Models by Stanford
- Generative AI with Large Language Models by Coursera
- Introduction to Generative AI by Coursera
- Generative AI Fundamentals by Google Cloud
- Introduction to Large Language Models by Google Cloud
- Introduction to Generative AI by Google Cloud
- Generative AI Concepts by DataCamp (Daniel Tedesco, Data Lead @ Google)
- 1 Hour Introduction to LLM (Large Language Models) by WeCloudData
- LLM Foundation Models from the Ground Up | Primer by Databricks
- Generative AI Explained by Nvidia
- Transformer Models and BERT Model by Google Cloud
- Introduction to Responsible AI by Google Cloud
- Fundamentals of Generative AI by Microsoft Azure
- Generative AI for Beginners by Microsoft
- ChatGPT for Beginners: The Ultimate Use Cases for Everyone by Udemy
- [1hr Talk] Intro to Large Language Models by Andrej Karpathy
- ChatGPT for Everyone by Learn Prompting
- Large Language Models (LLMs) (In English) by Kshitiz Verma (JK Lakshmipat University, Jaipur, India)
- LLMOps: Building Real-World Applications With Large Language Models by Udacity
- Full Stack LLM Bootcamp by FSDL
- Large Language Models: Application through Production by Databricks
- Generative AI Foundations by AWS
- LLM University by Cohere
- LLM Learning Lab by Lightning AI
- Functions, Tools and Agents with LangChain by DeepLearning.AI
- LangChain for LLM Application Development by DeepLearning.AI
- LLMOps by DeepLearning.AI
- Automated Testing for LLMOps by DeepLearning.AI
- Building RAG Agents with LLMs by Nvidia
- Building Generative AI Applications Using Amazon Bedrock by AWS
- Efficiently Serving LLMs by DeepLearning.AI
- Building Systems with the ChatGPT API by DeepLearning.AI
- Serverless LLM apps with Amazon Bedrock by DeepLearning.AI
- Building Applications with Vector Databases by DeepLearning.AI
- Build LLM Apps with LangChain.js by DeepLearning.AI
- Advanced Retrieval for AI with Chroma by DeepLearning.AI
- Operationalizing LLMs on Azure by Coursera
- Generative AI Full Course – Gemini Pro, OpenAI, Llama, Langchain, Pinecone, Vector Databases & More by freeCodeCamp.org
- Training & Fine-Tuning LLMs for Production by Activeloop
- LangChain & Vector Databases in Production by Activeloop
- Reinforcement Learning from Human Feedback by DeepLearning.AI
- Finetuning Large Language Models by DeepLearning.AI
- LangChain: Chat with Your Data by DeepLearning.AI
- Prompt Engineering with Llama 2 by DeepLearning.AI
- ChatGPT Prompt Engineering for Developers by DeepLearning.AI
- Advanced RAG Orchestration series by LlamaIndex
- Prompt Engineering Specialization by Coursera
- Augment your LLM Using Retrieval Augmented Generation by Nvidia
- Knowledge Graphs for RAG by DeepLearning.AI
- Open Source Models with Hugging Face by DeepLearning.AI
- Vector Databases: from Embeddings to Applications by DeepLearning.AI
- Understanding and Applying Text Embeddings by DeepLearning.AI
- JavaScript RAG Web Apps with LlamaIndex by DeepLearning.AI
- Quantization Fundamentals with Hugging Face by DeepLearning.AI
- Preprocessing Unstructured Data for LLM Applications by DeepLearning.AI
- Retrieval Augmented Generation for Production with LangChain & LlamaIndex by Activeloop
- Quantization in Depth by DeepLearning.AI
- Building and Evaluating Advanced RAG Applications by DeepLearning.AI
- Evaluating and Debugging Generative AI Models Using Weights and Biases by DeepLearning.AI
- Quality and Safety for LLM Applications by DeepLearning.AI
- Red Teaming LLM Applications by DeepLearning.AI
- How Diffusion Models Work by DeepLearning.AI
- How to Use Midjourney, AI Art and ChatGPT to Create an Amazing Website by Brad Hussey
- Build AI Apps with ChatGPT, DALL-E and GPT-4 by Scrimba
- 11-777: Multimodal Machine Learning by Carnegie Mellon University
- Avoiding AI Harm by Coursera
- Developing AI Policy by Coursera
- Common GenAI Interview Questions
- Prompting and Prompt Engineering
- Model Fine-Tuning
- Model Evaluation
- MLOps for GenAI
- Generative Models Foundations
- Latest Research Trends
- Designing an LLM-Powered Search Engine
- Building a Customer Support Chatbot
- Building a System for Natural Language Interaction with Your Data
- Building an AI Co-pilot
- Designing a Custom Chatbot for Q/A on Multimodal Data (Text, Images, Tables, CSV Files)
- Building an Automated Product Description and Image Generation System for E-commerce
- AWS Bedrock Workshop Tutorials by Amazon Web Services
- Langchain Tutorials by gkamradt
- LLM Applications for production by ray-project
- LLM tutorials by Ollama
- LLM Hub by mallahyari
- LLM Fine-tuning tutorials by ashishpatel26
- PEFT example notebooks by Huggingface
- Free LLM Fine-Tuning Notebooks by Youssef Hosni
- LLM-PlayLab: a collection of projects built with Transformer models
If you want to add to the repository or find any issues, please raise a PR, making sure the addition is placed in the relevant section or category.
To cite this guide, use the below format:
@article{areganti_generative_ai_guide,
author = {Reganti, Aishwarya Naresh},
journal = {https://github.com/aishwaryanr/awesome-generative-ai-resources},
month = {01},
title = {{Generative AI Guide}},
year = {2024}
}
[MIT License]