timpal0l / ScandiSent

Sentiment Corpus for Swedish 🇸🇪 Norwegian 🇳🇴 Danish 🇩🇰 Finnish 🇫🇮 (and English 🏴󠁧󠁢󠁥󠁮󠁧󠁿)

ScandiSent

Information

The corpus was crawled from se.trustpilot.com, no.trustpilot.com, dk.trustpilot.com, fi.trustpilot.com and trustpilot.com. It consists of reviews from all 22 corresponding categories:

categories = ['animals_pets', 'electronics_technology', 'events_entertainment', 'vehicles_transportation',
              'business_services', 'health_medical', 'home_garden', 'hobbies_crafts', 'home_services',
              'legal_services_government', 'construction_manufactoring', 'food_beverages_tobacco', 'media_publishing',
              'money_insurance', 'travel_vacation', 'restaurants_bars', 'public_local_services', 'shopping_fashion',
              'education_training', 'beauty_wellbeing', 'sports', 'housing_utility_company']

Each language contains 10 000 texts, evenly balanced between positive and negative reviews. A review counts as positive if it is rated 4 or 5, and as negative if it is rated 1 or 2. Texts rated 3 were not used. The zip files contain one csv file per language with the columns text and label, where label == 1 is a positive review and label == 0 is a negative review.
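The labeling rule and file layout described above can be sketched as follows. This is only an illustration: the helper names `rating_to_label` and `read_corpus` are hypothetical, the sample rows are made up, and the code assumes each csv file starts with a `text,label` header row, which is not confirmed here.

```python
import csv
import io
from typing import List, Optional, TextIO, Tuple


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-5 star rating to the corpus label; None means the review is skipped."""
    if rating in (4, 5):
        return 1  # positive review
    if rating in (1, 2):
        return 0  # negative review
    if rating == 3:
        return None  # texts rated 3 were not used
    raise ValueError(f"rating out of range: {rating}")


def read_corpus(handle: TextIO) -> List[Tuple[str, int]]:
    """Read rows with `text` and `label` columns into (text, label) pairs.

    Assumes the file has a header row naming the two columns.
    """
    reader = csv.DictReader(handle)
    return [(row["text"], int(row["label"])) for row in reader]


# Made-up two-row sample standing in for one of the per-language csv files.
sample = io.StringIO(
    'text,label\n'
    '"Snabb leverans, mycket bra!",1\n'
    '"Paketet kom aldrig fram.",0\n'
)
rows = read_corpus(sample)
```

To read a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `read_corpus`; the encoding is likewise an assumption.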

For our paper, "Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?", we used the first 7500 texts for training and the last 2500 texts for evaluation.
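A minimal sketch of that split, assuming the rows are kept in the order they appear in the csv file. The function name `split_train_eval` is hypothetical, and the integer list only stands in for the 10 000 rows of one language.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_eval(
    rows: Sequence[T], n_train: int = 7500, n_eval: int = 2500
) -> Tuple[List[T], List[T]]:
    """Return the first n_train rows for training and the last n_eval for evaluation."""
    if len(rows) < n_train + n_eval:
        raise ValueError("not enough rows for a non-overlapping split")
    train = list(rows[:n_train])
    evaluation = list(rows[len(rows) - n_eval:])
    return train, evaluation


# Stand-in for the 10 000 rows of one language, in file order.
rows = list(range(10_000))
train, evaluation = split_train_eval(rows)
```

The split is positional rather than random, so it only reproduces the paper's setup if the file order is left unchanged.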

ScandiSent.zip 🇸🇪 🇳🇴 🇩🇰 🇫🇮 + 🏴󠁧󠁢󠁥󠁮󠁧󠁿

Contains the raw data for each language. We used fastText language identification to ensure that the texts were in the right language.

ScandiSent-mt.zip 🏴󠁧󠁢󠁥󠁮󠁧󠁿

Contains the raw data from ScandiSent, machine translated to English 🏴󠁧󠁢󠁥󠁮󠁧󠁿 using Google's Neural Machine Translation API.

Version 1.0

2021-02-06
