
🎊 PoliBERTweet: Language Models for Political Tweets

Transformer-based language models pre-trained on a large collection of politics-related Twitter data (83M tweets). This repo is the official resource of the paper "PoliBERTweet: A Pre-trained Language Model for Analyzing Political Content on Twitter" (LREC 2022).

πŸ“š Data Sets

The data sets for the evaluation tasks presented in our paper are available below.

πŸš€ Pre-trained Models

All models are uploaded to my Hugging Face 🤗 account, so you can load a model with just three lines of code!

βš™οΈ Usage

We tested with PyTorch v1.10.2 and Transformers v4.18.0.

  • To fine-tune our models for a specific task (e.g., stance detection), see the Hugging Face documentation or the sketch in step 4 below.
  • Please see the specific model pages above for more usage details. Below is a sample use case.

1. Load the model and tokenizer

from transformers import AutoModel, AutoTokenizer, pipeline
import torch

# Choose GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Select model path here
pretrained_LM_path = "kornosk/polibertweet-mlm"

# Load model
tokenizer = AutoTokenizer.from_pretrained(pretrained_LM_path)
model = AutoModel.from_pretrained(pretrained_LM_path)
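# Note: to run on the GPU selected above, move both the model and its inputs,
# e.g. model.to(device) and inputs = inputs.to(device)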

2. Predict the masked word

# Fill mask
example = "Trump is the <mask> of USA"
fill_mask = pipeline('fill-mask', model=pretrained_LM_path, tokenizer=tokenizer)

outputs = fill_mask(example)
print(outputs)
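The pipeline returns a ranked list of candidate fills; the field names below follow the standard fill-mask pipeline output. For instance, to inspect only the top candidate:

# Each candidate is a dict with 'score', 'token_str', and the completed 'sequence'
top = outputs[0]
print(top["token_str"], top["score"])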

3. See embeddings

# See embeddings
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
print(outputs)

# OR you can use this model to train on your downstream task!
# Please consider citing our paper if you find this useful :)
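The encoder returns one vector per token in outputs.last_hidden_state. If you need a single fixed-size vector per tweet, one common option (a sketch of standard mean pooling, not a method prescribed by the paper) is to average the token embeddings over non-padding positions:

# Mean-pool token embeddings, using the attention mask to skip padding
mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (batch, hidden_size)
sentence_embedding = summed / mask.sum(dim=1)            # average over real tokens
print(sentence_embedding.shape)                          # e.g. torch.Size([1, 768])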

4. Fine-tune to a downstream task like stance detection

See details in the Hugging Face documentation; a rough sketch is shown below.
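As a starting point, here is a minimal fine-tuning sketch that puts a sequence-classification head on top of PoliBERTweet. The three-way stance scheme, example tweets, and hyperparameters are illustrative placeholders, not values from the paper:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

pretrained_LM_path = "kornosk/polibertweet-mlm"
tokenizer = AutoTokenizer.from_pretrained(pretrained_LM_path)

# Adds a randomly initialized classification head on top of the encoder
# (hypothetical 3-way stance scheme: 0=AGAINST, 1=FAVOR, 2=NONE)
model = AutoModelForSequenceClassification.from_pretrained(pretrained_LM_path, num_labels=3)

# Toy batch with made-up labels; replace with your stance-detection data
texts = ["I fully support this bill", "This policy is a disaster"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One training step: the model computes cross-entropy loss internally
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

For a full training loop with evaluation and checkpointing, the Trainer API covered in the Hugging Face documentation wraps this same setup.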

✏️ Citation

If you find our paper and resources useful, please consider citing our work! 🙏

@inproceedings{kawintiranon2022polibertweet,
  title     = {{P}oli{BERT}weet: A Pre-trained Language Model for Analyzing Political Content on {T}witter},
  author    = {Kawintiranon, Kornraphop and Singh, Lisa},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
  year      = {2022},
  pages     = {7360--7367},
  publisher = {European Language Resources Association},
  url       = {https://aclanthology.org/2022.lrec-1.801}
}

πŸ›  Throubleshoots

Create an issue here if you have any problems loading the models or data sets.

License: MIT License