sohomx / training-w-gpt-2

Fine-tuning distilgpt2 on a diseases-and-symptoms dataset, covering dataset loading, tokenization, hyperparameter tuning, and text generation for contextual understanding in medical contexts.


Language Model Training and Inference using GPT-2

This notebook demonstrates language model training and inference with GPT-2 in a Colab environment, with step-by-step instructions and explanations:

Setup and Dependencies

  1. Install Libraries: Installs necessary libraries like torch, transformers, torchtext, sentencepiece, pandas, and tqdm.

  2. Load and Preprocess Data: Loads the "QuyenAnhDE/Diseases_Symptoms" dataset from Hugging Face and converts it into a pandas DataFrame (a loading sketch follows this list).
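
A rough sketch of that loading step is shown below. It assumes the Hugging Face `datasets` library is installed alongside the packages above and that the dataset exposes a `train` split; the column layout depends on the dataset's actual schema.

```python
from datasets import load_dataset

# Pull the diseases/symptoms dataset from the Hugging Face Hub.
dataset = load_dataset("QuyenAnhDE/Diseases_Symptoms")

# Convert the training split to a pandas DataFrame for easier inspection.
df = dataset["train"].to_pandas()
print(df.head())
```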

Model Setup and Training

  1. Tokenization and Model Initialization: Initializes the GPT-2 tokenizer and the distilgpt2 model (first sketch after this list).

  2. Dataset Preparation: Defines a custom Dataset class (LanguageDataset) that tokenizes and formats the data for the model (second sketch below).

  3. Training Loop: Splits the data into training and validation sets and trains the GPT-2 model on the custom dataset, computing losses and updating the model parameters each epoch (third sketch below).

  4. Hyperparameters and Training Metrics: Defines hyperparameters such as batch size, number of epochs, and learning rate, uses the GPU when available, and monitors training and validation losses along with total training time.
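
The initialization step might look roughly like this. It is a sketch rather than the notebook's exact code, loading distilgpt2 (the variant named in the repository description) and picking the GPU when one is available.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Use the GPU if available (e.g. a Colab runtime), otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# distilgpt2 is a smaller, distilled GPT-2; it shares the standard GPT-2 tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

model = GPT2LMHeadModel.from_pretrained("distilgpt2").to(device)
```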
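
A minimal version of the custom LanguageDataset could pair each disease name with its symptoms as a single training string. The prompt format and the column names ("Name", "Symptoms") are assumptions about the dataset and may differ from the notebook.

```python
from torch.utils.data import Dataset

class LanguageDataset(Dataset):
    """Turns each (disease, symptoms) row into one tokenized training example."""

    def __init__(self, df, tokenizer, max_length=128):
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Assumed column names; adjust to the dataset's actual schema.
        self.texts = [f"{row['Name']} | {row['Symptoms']}" for _, row in df.iterrows()]

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoded = self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
        }
```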
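
Putting the pieces together, a simplified training loop (continuing from the sketches above, with illustrative hyperparameter values rather than the notebook's exact settings) might look like this:

```python
import torch
from torch.utils.data import DataLoader, random_split

# Illustrative hyperparameters; tune batch size, epochs, and learning rate as needed.
batch_size = 8
num_epochs = 3
learning_rate = 5e-4

full_dataset = LanguageDataset(df, tokenizer)
train_size = int(0.8 * len(full_dataset))
train_set, valid_set = random_split(full_dataset, [train_size, len(full_dataset) - train_size])

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding positions in the loss
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        optimizer.zero_grad()
        outputs.loss.backward()
        optimizer.step()

    # Validation pass: average loss over the held-out examples.
    model.eval()
    valid_loss = 0.0
    with torch.no_grad():
        for batch in valid_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            valid_loss += outputs.loss.item()
    print(f"epoch {epoch + 1}: validation loss {valid_loss / len(valid_loader):.4f}")
```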

Inference

  1. Inference Example: Generates text from the trained model for a user-supplied prompt ("Kidney Failure"), using generation settings such as max_length, num_return_sequences, top_k, top_p, temperature, and repetition_penalty (see the sketch below).
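
A minimal sketch of that generation call, reusing the model, tokenizer, and device from the training sketches; the sampling settings here are illustrative, not the notebook's exact values.

```python
prompt = "Kidney Failure"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model.eval()
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=60,
        num_return_sequences=1,
        do_sample=True,            # sample so top_k/top_p/temperature take effect
        top_k=50,
        top_p=0.95,
        temperature=0.8,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```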

Usage

  • Requirements: Install the required packages (torch, transformers, etc.) before running the notebook.
  • Data Handling: Modify dataset loading and preprocessing for different datasets.
  • Model Customization: Adjust model hyperparameters and training configurations.
  • Inference Modification: Customize inference by changing input text and generation settings for different outputs.

The code offers an end-to-end demonstration of language model training and inference with GPT-2, and can be customized for other NLP tasks and datasets.
