emilgraichen / SwedishLSdataset


This is the first lexical simplification dataset for Swedish, developed as part of a Bachelor's thesis in Cognitive Science at Linköping University. It contains 150 quadruples, each combining a complex word from the Swedish Kelly list and its frequency in the "BloggMix Odat" corpus, replacements for the complex word from SynLex with their frequencies in the BloggMix corpus, and an example sentence from SALDO in which the complex word occurs. The human assessment of each quadruple (quality, coverage, and complexity) is also included.
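As a rough illustration of the description above, one entry could be modelled in Python as shown below. This is only a sketch: the class and field names (`Entry`, `Replacement`, `complex_word`, and so on) are hypothetical and do not reflect the actual column names or file format of the dataset, and the example values are placeholders rather than real dataset rows. The frequency comparison uses a common lexical simplification heuristic (more frequent words are assumed to be simpler); the thesis should be consulted for the method it actually applies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replacement:
    """A candidate replacement (hypothetical structure)."""
    word: str
    frequency: int  # corpus frequency of the replacement


@dataclass
class Entry:
    """One dataset entry (hypothetical field names)."""
    complex_word: str
    frequency: int  # corpus frequency of the complex word
    replacements: List[Replacement] = field(default_factory=list)
    example_sentence: str = ""

    def more_frequent_replacements(self) -> List[Replacement]:
        """Return replacements that are more frequent than the complex
        word, most frequent first (assumption: frequent means simpler)."""
        return sorted(
            (r for r in self.replacements if r.frequency > self.frequency),
            key=lambda r: r.frequency,
            reverse=True,
        )


# Placeholder values for illustration only, not taken from the dataset.
entry = Entry(
    complex_word="komplext_ord",
    frequency=10,
    replacements=[
        Replacement("synonym_a", 250),
        Replacement("synonym_b", 5),
        Replacement("synonym_c", 900),
    ],
    example_sentence="En exempelmening med komplext_ord.",
)

candidates = [r.word for r in entry.more_frequent_replacements()]
print(candidates)
```

Keeping each replacement next to its own frequency makes it straightforward to rank candidates against the complex word without a second lookup.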


For a more detailed description of the work, see the thesis: http://liu.diva-portal.org/smash/get/diva2:1767273/FULLTEXT01.pdf

Other repositories related to this thesis:

Lexical Simplification System for Swedish: https://github.com/emilgraichen/SwedishLexicalSimplifier

Complex Word Identification Dataset: https://github.com/emilgraichen/SwedishCWI

Structure of the Dataset

A picture showing the structure of the dataset

Links to the resources used for this dataset:

BloggMix Odat: https://spraakbanken.gu.se/resurser/bloggmix

Kelly Swedish: https://spraakbanken.gu.se/resurser/kelly

SynLex: http://folkets-lexikon.csc.kth.se/synlex.html

SALDO: https://spraakbanken.gu.se/resurser/saldoe


License: Creative Commons Zero v1.0 Universal