augmentation swedish-language text-augmentation text-classification swedish

Swedish Augmentation Packages

A collection of text augmentation packages for Swedish.

How do I set it up?

Step 1

!git clone https://github.com/mosh98/swe_aug.git

The packages are built on top of a Swedish word2vec model [1]. Make sure you download it first.

Step 2

!wget https://www.ida.liu.se/divisions/hcs/nlplab/swectors/swectors-300dim.txt.bz2
!bzip2 -dk /content/swectors-300dim.txt.bz2
!pip install -r reqs.txt


word_vec_path = '/content/swectors-300dim.txt'  # path to the txt vector file

# You can also point this at your own pretrained word2vec model (make sure it is a .txt file)
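
Before passing a vector file to the augmenters, it can help to confirm that it parses. The sketch below is a minimal check, assuming the usual word2vec text layout (one word per line followed by its space-separated values, optionally preceded by a "<vocab_size> <dim>" header). The helper name peek_word_vectors is made up for this example and is not part of swe_aug.

```python
from pathlib import Path


def peek_word_vectors(path, max_words=5):
    """Read the first few entries of a word2vec-style text file.

    Each data line is expected to look like: word v1 v2 ... vN.
    An optional first line of the form "<vocab_size> <dim>" is skipped.
    (Hypothetical helper for illustration; not part of swe_aug.)
    """
    vectors = {}
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            parts = line.rstrip().split(" ")
            # Skip the optional "<vocab_size> <dim>" header line.
            if line_number == 0 and len(parts) == 2:
                continue
            # Skip blank or malformed lines.
            if len(parts) < 2:
                continue
            vectors[parts[0]] = [float(value) for value in parts[1:]]
            if len(vectors) >= max_words:
                break
    return vectors


# Example call (path taken from Step 2):
# sample = peek_word_vectors('/content/swectors-300dim.txt')
# print({word: len(vec) for word, vec in sample.items()})
```

If every word maps to a vector of the same length, the file is most likely in the expected format.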

Then use your desired augmentation package.

EDA

EDA: Easy Data Augmentation in Swedish

What is EDA? [2]

A data augmentation method that is easy to understand and use. It has four main components:

Random Synonym Replacement
Random Word Swap
Random Word Deletion
Random Word Insertion
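
The four operations described in the EDA paper [2] (synonym replacement, random insertion, random swap, random deletion) can be sketched in a few lines. This is an illustration only, not the code inside swe_aug: the TOY_SYNONYMS table and all function names are invented here, whereas the package looks up similar words in the word2vec file.

```python
import random

# Toy synonym table, invented for illustration only; the package itself
# finds similar words through the word2vec vectors instead.
TOY_SYNONYMS = {"glad": ["lycklig", "nöjd"], "hund": ["vovve"]}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in TOY_SYNONYMS."""
    result = list(words)
    candidates = [i for i, w in enumerate(result) if w in TOY_SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        result[i] = rng.choice(TOY_SYNONYMS[result[i]])
    return result


def random_insertion(words, n, rng):
    """Insert a synonym of a random known word at a random position, n times."""
    result = list(words)
    known = [w for w in result if w in TOY_SYNONYMS]
    if not known:
        return result
    for _ in range(n):
        synonym = rng.choice(TOY_SYNONYMS[rng.choice(known)])
        result.insert(rng.randint(0, len(result)), synonym)
    return result


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    result = list(words)
    if len(result) < 2:
        return result
    for _ in range(n):
        i, j = rng.sample(range(len(result)), 2)
        result[i], result[j] = result[j], result[i]
    return result


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so repeated runs give the same output
words = "min hund är glad idag".split()
print(synonym_replacement(words, 1, rng))
print(random_insertion(words, 1, rng))
print(random_swap(words, 1, rng))
print(random_deletion(words, 0.2, rng))
```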

from swe_aug import EDA
aug = EDA.Enkel_Data_Augmentation(word_vec_path)

txt = "enter your desired text. It can be a sentence or a paragraph"

augmented_sentences = aug.enkel_augmentation(txt, alpha_sr=0.1, 
                                             alpha_ri=0.3, alpha_rs=0.2, 
                                             alpha_rd=0.1, num_aug=4)
#returns a list of augmented sentences
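
Since the call returns a plain list of sentences, one way to use it for text classification is to give each augmented sentence the label of its source sentence. The sketch below assumes exactly that workflow. expand_dataset and toy_augment are names invented for this example, and toy_augment stands in for the real augmenter so the snippet runs without the word vectors.

```python
def expand_dataset(samples, augment_fn):
    """Return the original (text, label) pairs plus labelled augmented copies.

    augment_fn is any callable that maps a text to a list of augmented texts.
    (Hypothetical helper for illustration; not part of swe_aug.)
    """
    expanded = []
    for text, label in samples:
        expanded.append((text, label))
        for augmented in augment_fn(text):
            # Skip augmentations that came back identical to the input.
            if augmented != text:
                expanded.append((augmented, label))
    return expanded


# Stand-in augmenter so the sketch runs on its own. With the package set up,
# one could instead pass: lambda t: aug.enkel_augmentation(t, num_aug=4)
def toy_augment(text):
    return [text.upper(), text]


train = [("jag älskar den här filmen", "pos"), ("maten var kall", "neg")]
for text, label in expand_dataset(train, toy_augment):
    print(label, text)
```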

Text Fragmenter

from swe_aug.Other_Techniques import Text_Cropping

frag = Text_Cropping.cropper(percent = 0.25)
list_of_fragmented_sentence = frag.text_fragmeter(txt)
# chops the text into 4 fragments (percent = 0.25 means each fragment is a quarter of the text)
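
To make the idea concrete, here is a minimal sketch of fragmenting by a percentage. It assumes the fragments are consecutive, non-overlapping runs of words, which may differ from what Text_Cropping actually does; fragment_text is a name made up for this example.

```python
import math


def fragment_text(text, percent=0.25):
    """Split text into consecutive word chunks, each about `percent` of the words.

    (Hypothetical helper for illustration; not part of swe_aug.)
    """
    if not 0 < percent <= 1:
        raise ValueError("percent must be in the interval (0, 1]")
    words = text.split()
    if not words:
        return []
    # Number of words per fragment, rounded up so no word is lost.
    chunk_size = max(1, math.ceil(len(words) * percent))
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


fragments = fragment_text("ett två tre fyra fem sex sju åtta", percent=0.25)
print(fragments)  # ['ett två', 'tre fyra', 'fem sex', 'sju åtta']
```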

Type Specific Similar word Replacement

The idea is to replace a word with words that are close to it in the embedding space and have the same POS tag. [4]

# Supported token types: "NOUN", "VERB", "ADJ", "ADV", "PROPN", "CONJ"
# These are the POS tags you can perturb (case-sensitive!)

from swe_aug.Other_Techniques import Type_SR
aug = Type_SR.type_DA(word_vec_path)

list_of_augs = aug.type_synonym_sr(txt, token_type = "NOUN", n = 2)
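
The technique can be illustrated with toy data. In the sketch below, the 2-dimensional vectors, the hand-written POS tags, and every function name are invented for the example; the package itself reads vectors from the word2vec file and needs a real POS tagger. Treating n as "neighbours per word" is also an assumption.

```python
import math

# Toy data invented for illustration: 2-d "embeddings" and hand-written POS tags.
TOY_VECTORS = {
    "hund": (1.0, 0.1), "katt": (0.9, 0.2), "bil": (0.1, 1.0),
    "springer": (0.5, 0.5), "jagar": (0.95, 0.15),
}
TOY_POS = {"hund": "NOUN", "katt": "NOUN", "bil": "NOUN",
           "springer": "VERB", "jagar": "VERB"}


def cosine(a, b):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def similar_same_pos(word, n=2):
    """Return up to n words closest to `word` that share its POS tag."""
    candidates = [
        other for other in TOY_VECTORS
        if other != word and TOY_POS[other] == TOY_POS[word]
    ]
    candidates.sort(
        key=lambda other: cosine(TOY_VECTORS[word], TOY_VECTORS[other]),
        reverse=True,
    )
    return candidates[:n]


def replace_by_type(text, token_type="NOUN", n=2):
    """Build up to n variants per matching word, swapping in same-POS neighbours."""
    words = text.split()
    variants = []
    for index, word in enumerate(words):
        if TOY_POS.get(word) != token_type:
            continue
        for neighbour in similar_same_pos(word, n):
            variants.append(" ".join(words[:index] + [neighbour] + words[index + 1:]))
    return variants


print(replace_by_type("hund jagar bil", token_type="NOUN", n=1))
# ['katt jagar bil', 'hund jagar katt']
```

Note that "jagar" lies closer to "hund" than "katt" does in this toy space, but it is a verb, so the POS filter keeps it out of the noun replacements.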

References

[1] Swedish word2vec: https://www.ida.liu.se/divisions/hcs/nlplab/swectors/

[2] EDA: https://aclanthology.org/D19-1670/

[3] Text Fragmenter: That was me

[4] Type Specific: That was me too

Cite?

@software{Mahamud2022,
  author = {Mahamud, Mosleh},
  title = {Swedish Augmentation Packages},
  year = {2022},
  publisher = {GitHub},
  journal = {Not Decided yet},
  howpublished = {\url{https://github.com/mosh98/swe_aug}},
}

About

Distributed Text Augmentation Techniques (appeared at AAAI 2023)

https://knowledge-nlp.github.io/aaai2023/papers/019-augmentation-poster.pdf
