# Implementing Transformers with PyTorch - GPT (Decoder)

Language modelling with the WikiHow dataset
## Setup

- Clone the repository and create a new Python virtual environment:

  ```
  $> git clone --depth=1 https://github.com/shubham0204/language-modelling-with-pytorch
  $> cd language-modelling-with-pytorch
  $> python -m venv project_env
  $> source project_env/bin/activate
  ```

  After activating the virtual environment, install the Python dependencies:

  ```
  $> (project_env) pip install -r requirements.txt
  ```
- The dataset is included in the `dataset/` directory as a text file. The next step is to execute the `process_data.py` script, which reads the articles from the text file and transforms them into a tokenized corpus of sentences. `process_data.py` requires an input directory (where the text file is located) and an output directory to store the tokenized corpus along with the `index2word` and `word2index` mappings, serialized with Python's `pickle`. The input and output directories are specified with `data_path` and `data_tensors_path` respectively in the project's configuration file `project_config.toml`. `data_path` defaults to the `dataset/` directory in the project's root.

  ```
  $> (project_env) python process_data.py
  ```
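For readers curious about what this step produces, here is a minimal sketch of building `word2index`/`index2word` mappings and serializing them with `pickle`. The helper name and frequency-based index assignment are illustrative, not `process_data.py`'s actual internals:

```python
import pickle
from collections import Counter

def build_vocab(sentences):
    # Count token frequencies across the tokenized corpus
    counts = Counter(token for sentence in sentences for token in sentence)
    # Assign indices in order of decreasing frequency (an illustrative choice)
    word2index = {word: i for i, (word, _) in enumerate(counts.most_common())}
    index2word = {i: word for word, i in word2index.items()}
    return word2index, index2word

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
word2index, index2word = build_vocab(sentences)

# Serialize the mappings with pickle, as process_data.py does
blob = pickle.dumps(word2index)
restored = pickle.loads(blob)
```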
- After the `process_data.py` script has run, the `vocab_size` parameter is updated in `project_config.toml`. We're now ready to train the model. The parameters within the `train` group in `project_config.toml` control the training process. See *Understanding `project_config.toml`* below for more details about the `train` parameters.
## Inference and pretrained model
## Understanding `project_config.toml`

The `project_config.toml` file contains all settings required by the project and is used by nearly all scripts in the project. Having a global configuration in a TOML file enhances control over the project and provides a single place where all settings can be viewed and modified at once.

While developing the project, I wrote a blog - Managing Deep Learning Models Easily With TOML Configurations. Do check it out.

The following sections describe each setting that can be changed through `project_config.toml`.
### `train` configuration

- `num_train_iter` (`int`): The number of iterations performed on the training dataset. Note: an iteration refers to the forward pass of a single batch of data and a backward pass to update the parameters.
- `num_test_iter` (`int`): The number of iterations performed on the test dataset.
- `test_interval` (`int`): The number of iterations after which testing is performed.
- `batch_size` (`int`): The number of samples present in a batch.
- `learning_rate` (`float`): The learning rate used by the `optimizer` in `train.py`.
- `checkpoint_path` (`str`): The path where checkpoints are saved during training.
- `wandb_logging_enabled` (`bool`): Enables/disables logging to the Weights & Biases console in `train.py`.
- `wandb_project_name` (`str`): If logging is enabled, the name of the `project` to be used for W&B.
- `resume_training` (`bool`): If `True`, a checkpoint is loaded and training is resumed.
- `resume_training_checkpoint_path` (`str`): If `resume_training = True`, the checkpoint is loaded from this path.
- `compile_model` (`bool`): Whether to use `torch.compile` to speed up training in `train.py`.
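How `num_train_iter`, `test_interval` and `num_test_iter` interact can be sketched as a plain-Python schedule. This is only illustrative; `train.py`'s loop is the authoritative version:

```python
def training_schedule(num_train_iter, test_interval, num_test_iter):
    """List the events of a run: one 'train' event per iteration, plus a
    block of num_test_iter 'test' events after every test_interval steps."""
    events = []
    for i in range(1, num_train_iter + 1):
        events.append(("train", i))
        if i % test_interval == 0:
            # Evaluate on num_test_iter batches from the test dataset
            events.extend(("test", j) for j in range(1, num_test_iter + 1))
    return events

events = training_schedule(num_train_iter=4, test_interval=2, num_test_iter=1)
# -> train 1, train 2, test 1, train 3, train 4, test 1
```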
### `data` configuration

- `vocab_size` (`int`): The number of tokens in the vocabulary. This variable is set when the `process_data.py` script is executed.
- `test_split` (`float`): The fraction of data used for testing the model.
- `seq_length` (`int`): The context length of the model. Input sequences of length `seq_length` are produced in `train.py` to train the model.
- `data_path` (`str`): The path of the text file containing the articles.
- `data_tensors_path` (`str`): The path of the directory where tensors of processed data are stored.
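A common way to turn a token stream into training examples of length `seq_length` is the next-token shift sketched below. This illustrates the idea only and is not `train.py`'s actual code:

```python
def make_sequences(token_ids, seq_length):
    """Slice a token stream into (input, target) pairs of length seq_length,
    where the target is the input shifted one token to the right -- the
    usual next-token language-modelling setup."""
    pairs = []
    for start in range(0, len(token_ids) - seq_length):
        inputs = token_ids[start : start + seq_length]
        targets = token_ids[start + 1 : start + seq_length + 1]
        pairs.append((inputs, targets))
    return pairs

pairs = make_sequences(list(range(10)), seq_length=4)
# first pair: inputs [0, 1, 2, 3], targets [1, 2, 3, 4]
```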
### `model` configuration

- `embedding_dim` (`int`): The dimensionality of the output embedding for `torch.nn.Embedding`.
- `num_blocks` (`int`): The number of blocks used in the transformer model. A single block contains `MultiHeadAttention`, `LayerNorm` and `Linear` layers. See `layers.py`.
- `num_heads_in_block` (`int`): The number of heads used in `MultiHeadAttention`.
- `dropout` (`float`): The dropout rate for the transformer.
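One constraint worth noting: in typical multi-head attention implementations (including `torch.nn.MultiheadAttention`), `embedding_dim` must be divisible by `num_heads_in_block`, since each head attends over an `embedding_dim / num_heads` slice. A quick sanity check, with illustrative values:

```python
def head_dim(embedding_dim, num_heads):
    # Each attention head works on an embedding_dim // num_heads slice
    if embedding_dim % num_heads != 0:
        raise ValueError(
            f"embedding_dim ({embedding_dim}) must be divisible by "
            f"num_heads ({num_heads})"
        )
    return embedding_dim // num_heads

dim = head_dim(512, 8)  # 64 dimensions per head
```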
### `deploy` configuration

- `host` (`str`): The host IP used to deploy the API endpoints.
- `port` (`int`): The port through which the API endpoints are exposed.
## Deployment

The trained ML model can be deployed in two ways.

### Web app with Streamlit

The model can be used with a Streamlit app easily:

```
$> (project_env) streamlit run app.py
```

In the app, we need to select a model for inference, the `data_tensors_path` (required for tokenization) and other parameters like the number of words to generate and the temperature.
### API Endpoints with FastAPI

The model can be accessed through REST APIs built with FastAPI:

```
$> (project_env) uvicorn api:server
```

A `GET` request to the `/predict` endpoint with query parameters `prompt`, `num_tokens` and `temperature` generates a response from the model:

```
curl --location 'http://127.0.0.1:8000/predict?prompt=prompt_text_here&temperature=1.0&num_tokens=100'
```
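The same request can also be issued from Python using only the standard library. The host and port here assume uvicorn's defaults, matching the `deploy` configuration:

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # to actually send the request

params = {"prompt": "prompt_text_here", "temperature": 1.0, "num_tokens": 100}
url = "http://127.0.0.1:8000/predict?" + urlencode(params)
# response = urlopen(url).read()  # requires the uvicorn server to be running
```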