marcderbauer / bloom

Generating headlines for the VICE YouTube channel using BLOOM

Generate Vice Headlines with Bloom

Try it here

❗ Requirements

Your Python installation needs to be version 3.8 or higher.

🏃 Quickstart

If you can't be bothered to read all of this, you can just run

chmod +x run.sh     # Make run.sh executable
./run.sh            # Run the program

This will:

  1. Install all the required libraries
  2. Run three epochs of training
  3. Generate an inference

You can then generate more inferences as described below.

❄️ Context

This project originally started out as an RNN I wanted to implement in PyTorch. I had difficulties getting the model to produce coherent output. As I lacked reference values for training, I decided to fine-tune an existing model, BLOOM. I hoped to learn more about the text-generation process from a top-down perspective, and to gather reference values for training in a "best-case" scenario.

🤖 Setup

1. Install the Required Dependencies

pip install -r requirements.txt

2. Set up the YouTube API

❗ This step is only necessary if you want to source the data yourself❗
The dataset used to train the model is included under /data/. It was collected on 23.09.2022.

The data for this project is gathered through the YouTube Data API v3. Setting up this API can roughly be divided into the following steps:

  1. Create a Google Developer Account
  2. Create a new project
  3. Enable the YouTube Data API v3
  4. Create credentials
  5. Make the credentials accessible to your environment

For in-depth guidance, please refer to this excellent HubSpot Article.

📊 Data

❗If you decided to use the data included in the repository, you can skip this section.❗

1. Collecting the Data

Assuming you set up the YouTube API correctly, all you need to do is run youtube/query_api.py. It requires the name of your client_secrets_file. Supply the requested channel's playlistId as an argument when launching the program. You can supply multiple playlistIds at once by separating them with spaces.

In order to find a channel's playlistId you need to

  1. Go to the channel
  2. Find a playlist with all the channel's videos included (often the first playlist)
  3. Click PLAY ALL
  4. Copy everything after list= from the link
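If you would rather not copy the ID by hand, the last step can be sketched in a few lines of Python. This is only an illustration and not part of the repository: the function name `playlist_id_from_url` and the example link are made up, and it assumes the link carries the playlist in a `list` query parameter as described above.

```python
from urllib.parse import parse_qs, urlparse


def playlist_id_from_url(url: str) -> str:
    """Return the value of the `list` query parameter of a YouTube link."""
    query = parse_qs(urlparse(url).query)
    if "list" not in query:
        raise ValueError(f"no 'list' parameter found in {url!r}")
    # parse_qs maps each key to a list of values; a playlist link carries one.
    return query["list"][0]


if __name__ == "__main__":
    # Hypothetical link: VIDEO_ID is a placeholder, the playlist ID is the
    # VICE one used elsewhere in this README.
    link = "https://www.youtube.com/watch?v=VIDEO_ID&list=UUn8zNIfYAQNdrFRrr8oibKw&index=1"
    print(playlist_id_from_url(link))
```

Unlike copying "everything after list=", this also drops trailing parameters such as `&index=1` that YouTube sometimes appends.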

For example, the command to download all the titles for VICE and VICE News is:

python3 youtube/query_api.py UUn8zNIfYAQNdrFRrr8oibKw PLw613M86o5o7q1cjb26MfCgdxJtshvRZ-
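For the curious, here is a rough sketch of what collecting titles from a playlist involves. It is not the code in youtube/query_api.py, just an illustration under a few assumptions: the helper names `iter_titles` and `make_api_fetcher` are made up, and the page fetcher is passed in as a function so the paging logic can be tried without credentials. The `playlistItems().list(...)` call is the one offered by the google-api-python-client package for the YouTube Data API v3, which returns results in pages linked by `nextPageToken`.

```python
from typing import Callable, Dict, Iterator, Optional

# A page fetcher takes (playlist_id, page_token) and returns one API response
# as a dict. This indirection is an assumption made for the sketch.
PageFetcher = Callable[[str, Optional[str]], Dict]


def iter_titles(fetch_page: PageFetcher, playlist_id: str) -> Iterator[str]:
    """Yield every video title in a playlist, following nextPageToken."""
    token: Optional[str] = None
    while True:
        response = fetch_page(playlist_id, token)
        for item in response.get("items", []):
            yield item["snippet"]["title"]
        token = response.get("nextPageToken")
        if token is None:  # no further pages
            break


def make_api_fetcher(youtube) -> PageFetcher:
    """Wrap a client created with googleapiclient.discovery.build("youtube", "v3", ...)."""
    def fetch(playlist_id: str, token: Optional[str]) -> Dict:
        request = youtube.playlistItems().list(
            part="snippet",
            playlistId=playlist_id,
            maxResults=50,  # the API allows at most 50 items per page
            pageToken=token,
        )
        return request.execute()
    return fetch


if __name__ == "__main__":
    # Fake two-page response standing in for the real API.
    pages = {
        None: {"items": [{"snippet": {"title": "First title"}}], "nextPageToken": "p2"},
        "p2": {"items": [{"snippet": {"title": "Second title"}}]},
    }
    for title in iter_titles(lambda _pid, tok: pages[tok], "PLAYLIST_ID"):
        print(title)
```

With real credentials you would pass `make_api_fetcher(youtube)` instead of the fake fetcher.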

2. Cleaning the Data

To clean the data, run preprocess.py. Assuming the file to process is called vice.txt, the command is:

python3 preprocess.py vice.txt

By default, this removes non-English sentences, duplicates and entries consisting of fewer than three words. The resulting file is automatically split into a train set (80%) and a test set (20%) in /data/.
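To make the cleaning rules concrete, here is a minimal sketch of the same idea. It is not the code in preprocess.py: the function names are made up, the language check is left as a pluggable `is_english` predicate (which library the real script uses is not stated here), and shuffling before the split is an assumption.

```python
import random
from typing import Callable, List, Tuple


def clean_titles(
    titles: List[str],
    is_english: Callable[[str], bool] = lambda title: True,  # placeholder check
    min_words: int = 3,
) -> List[str]:
    """Drop duplicates, short entries and non-English entries, keeping order."""
    seen = set()
    cleaned = []
    for title in titles:
        title = title.strip()
        if title in seen:
            continue  # duplicate
        seen.add(title)
        if len(title.split()) < min_words:
            continue  # fewer than min_words words
        if not is_english(title):
            continue
        cleaned.append(title)
    return cleaned


def train_test_split(
    titles: List[str], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the titles and cut it into train and test parts."""
    shuffled = list(titles)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    raw = ["A b c", "A b c", "Too short", "  Four words are here  "]
    cleaned = clean_titles(raw)
    train, test = train_test_split(cleaned)
    print(cleaned, len(train), len(test))
```

A fixed seed keeps the split reproducible between runs, which helps when comparing training results.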


📉 Training

To train the model, run main.py.
If you have Weights & Biases set up, add the --wandb flag to activate it:

python3 main.py --wandb

🗿 Inference

Inference can be run by executing inference.py with the prompt as an argument. You can also pass inference parameters as arguments, e.g.:

python3 inference.py North Korea --temp 0.42 --top_k 32 --rp 1.3

Output:
temp=0.42; k=32, p=0.92, rep=1.3:
----------------------------------------------------------------------------------------------------
North Korea's 'Most Humane' Hospital

Hugging Face has a great tutorial on different generation strategies, in which each inference parameter is explained in depth.
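If you want a feel for what the four parameters in the output header do, the toy sketch below applies them to a hand-written set of next-token scores. It is an illustration only, not the code in inference.py and not the library's implementation: the function name, the tokens and the scores are made up. In the transformers library the corresponding `generate` arguments are `temperature`, `top_k`, `top_p` and `repetition_penalty`.

```python
import math
import random
from typing import Dict, List


def next_token_distribution(
    logits: Dict[str, float],
    generated: List[str],
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
    repetition_penalty: float = 1.0,
) -> Dict[str, float]:
    """Return the filtered, renormalised distribution to sample from."""
    adjusted = {}
    for token, logit in logits.items():
        if token in generated:
            # Make already generated tokens less likely.
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        # Temperature below 1 sharpens the distribution, above 1 flattens it.
        adjusted[token] = logit / temperature

    # Softmax over the adjusted scores, most likely token first.
    peak = max(adjusted.values())
    weights = {t: math.exp(v - peak) for t, v in adjusted.items()}
    total = sum(weights.values())
    ranked = sorted(((t, w / total) for t, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalise.
    if top_k > 0:
        ranked = ranked[:top_k]
    norm = sum(p for _, p in ranked)
    ranked = [(t, p / norm) for t, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {t: p / norm for t, p in kept}


if __name__ == "__main__":
    # Made-up scores for the word following "North Korea's 'Most Humane'".
    scores = {"Hospital": 2.0, "Prison": 1.0, "Beach": 0.0, "Mall": -1.0}
    dist = next_token_distribution(scores, generated=[], temperature=0.42,
                                   top_k=32, top_p=0.92, repetition_penalty=1.3)
    tokens = list(dist)
    print(dist)
    print(random.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0])
```

Lowering `--temp` or `--top_k` makes the output more predictable, while raising `--rp` discourages the model from repeating words it has already produced.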


♻️ Conclusion

This project gave me a much better understanding of text generation from a top-down perspective. While implementing this project as a PyTorch RNN, I mostly scrambled around without having much of an understanding of what I was doing.
By fine-tuning BLOOM, I learned how to fine-tune an existing model, how to source data, how to pre-process it correctly and how to host the resulting model on Hugging Face Hub with Gradio.
