mosh98 / swe_aug

Dritributed Text Augmentation Techniques (Appeared AAAI 2023)

Home Page:https://knowledge-nlp.github.io/aaai2023/papers/019-augmentation-poster.pdf

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Swedish Augmentation Packages

Includes many different Augmentation packages for Swedish.


How do i setup?

Step 1

!git clone https://github.com/mosh98/swe_aug.git

This is built on top of a swedish word2vec. Make sure you download that first.

Step 2

!wget https://www.ida.liu.se/divisions/hcs/nlplab/swectors/swectors-300dim.txt.bz2
!bzip2 -dk /content/swectors-300dim.txt.bz2
!pip install -r reqs.txt


word_vec_path = '/content/swectors-300dim.txt' #path to txt vector file

#you can even set path to your own pretrain word2vec (make sure its a txt file)

Then Use your desired augmentation package


EDA Open in Colab

EDA: Easy Data Augmentation in Swedish

What is EDA? [2]

A way to augment data in a way that is easy to understand and use. There are 4 mains components

  1. Random Synomym Replacement
  2. Random Word Replacement
  3. Random Word Deletion
  4. Random Word Insertion
from swe_aug import EDA
aug = EDA.Enkel_Data_Augmentation(word_vec_path)

txt = "enter ur desired text. It can be a sentence or a paragraph"
augmented_sentences = aug.enkel_augmentation(txt, alpha_sr=0.1, 
                                             alpha_ri=0.3, alpha_rs=0.2, 
                                             alpha_rd=0.1, num_aug=4)
#returns a list of augmented sentences

Text Fragmenter Open in Colab

from swe_aug.Other_Techniques import Text_Cropping

frag = Text_Cropping.cropper(percent = 0.25)
list_of_fragmented_sentence = frag.text_fragmeter(txt)
# chops sentence into 4 halfs.

Type Specific Similar word Replacement Open in Colab

Idea is to replace word that are similar in an embeddings space that has the same POS token. [4]

# "NOUN", "VERB", "ADJ", "ADV", "PROPN","CONJ"
#These are the tokens you can perturb! [CASE SENSITIVE!]

from swe_aug.Other_Techniques import Type_SR
aug = Type_SR.type_DA(word_vec_path)

list_of_augs = aug.type_synonym_sr(txt, token_type = "NOUN", n = 2)

References

[1] Swedish word2vec: https://www.ida.liu.se/divisions/hcs/nlplab/swectors/

[2] EDA: https://aclanthology.org/D19-1670/

[3] Text Fragmenter: That was me

[4] Type Specific: That was me too

Cite?

@software{Mahamud2022,
  author = {Mahamud,Mosleh},
  title = {Swedish Augmentation Packages},
  year = {2022},
  publisher = {GitHub},
  journal = {Not Decided yet},
  howpublished = {\url{https://github.com/mosh98/swe_aug}},
}