Finnish Word Embeddings

This repository contains links to word embeddings for the Finnish language, along with code for training your own embeddings. Word embeddings represent words as low-dimensional numerical vectors, which are useful in various NLP applications, such as building chatbots, calculating semantic similarities, or detecting fake news.
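As a rough illustration of what "calculating semantic similarities" means in practice, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are invented for this example and are not taken from any of the embeddings listed here; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors, made up for illustration only.
toy_vectors = {
    "koira": [0.9, 0.1, 0.0],  # dog
    "kissa": [0.8, 0.2, 0.1],  # cat
    "auto": [0.0, 0.1, 0.9],   # car
}

sim_dog_cat = cosine_similarity(toy_vectors["koira"], toy_vectors["kissa"])
sim_dog_car = cosine_similarity(toy_vectors["koira"], toy_vectors["auto"])

# Vectors pointing in similar directions score closer to 1.
print(sim_dog_cat > sim_dog_car)  # prints True
```

With trained embeddings, words used in similar contexts tend to end up with a higher cosine similarity than unrelated words.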

List of available embeddings

Source	Model	Dimension	Trained on	Download link
Facebook	FastText	300	Wikipedia and CommonCrawl	Binary / Text
Facebook	FastText	300	Wikipedia	Binary + Text / Text
Turku NLP	Word2Vec	Unknown	Finnish Internet Parsebank	Binary
Turku NLP	Word2Vec	Unknown	Suomi24	Binary
Turku NLP	Word2Vec	Unknown	Suomi24 with lemmatization	Binary
Yle	Word2Vec / FastText	Unknown	Wikipedia and Yle articles	Text (need to fill form)
This repository	Word2Vec / FastText	300	Crawled from popular Finnish websites (details)	Binary files from Kaggle datasets (the only viable free option for now; let me know if you are willing to host these)

Example usage of word embeddings

# Word embeddings in word2vec-format can easily be loaded and queried with gensim
# See https://radimrehurek.com/gensim/models/keyedvectors.html for reference
from gensim.models.keyedvectors import KeyedVectors

# Load vectors into memory (bin in filename means binary=True)
embeddings_path = './data/embeddings/fasttext.fi.all.1045M.100d.bin.gz'
kv = KeyedVectors.load_word2vec_format(embeddings_path, binary=True)

# Find the words most similar to 'koira' (dog); returns a list of (word, similarity) pairs
print(kv.most_similar('koira'))
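Several entries in the list above are offered as text files rather than binaries. With gensim, those would be loaded by passing binary=False to load_word2vec_format. To show what that text format looks like, here is a minimal pure-Python sketch of a reader. It assumes the standard word2vec text layout (a header line with vocabulary size and dimension, then one word and its vector per line); the helper name read_word2vec_text and the sample data are made up for illustration.

```python
import io


def read_word2vec_text(lines):
    """Parse word2vec text format from an iterable of lines into {word: vector}."""
    lines = iter(lines)
    # Header line: "<vocabulary size> <dimension>"
    vocab_size, dim = (int(n) for n in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vocab_size, dim, vectors


# Tiny in-memory sample standing in for a real text-format file.
sample = io.StringIO("2 3\nkoira 0.1 0.2 0.3\nkissa 0.4 0.5 0.6\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
print(dim, sorted(vectors))  # prints: 3 ['kissa', 'koira']
```

For real files, gensim is the better choice, since it stores vectors compactly and provides similarity queries; this reader is only meant to make the format concrete.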

Training your own word embeddings

This repository also contains the code used for crawling data from popular Finnish websites, extracting sentences from the crawled pages, and training word embeddings. The spiders used for web scraping are in the crawling folder, and the code for preprocessing and training embeddings is in the embeddings folder.

Three steps are required:

1. Clone this repository with git clone https://github.com/jmyrberg/finnish-word-embeddings and install the required packages with pip install -r requirements.txt.
2. Crawl data by running run_spider.bat and typing in the name of a spider, such as iltalehti. All available spider names are listed in the spider class definitions in all_spiders.py. See Scrapy for more information on how to create your own spiders. Alternatively, you may use your own source documents for training.
3. Preprocess the crawled material and train word embeddings by running update.py. Alternatively, prepare your own documents as sentence-line files and train on them by running train.py.
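If you bring your own documents, they need to be turned into sentence lines before training. The exact preprocessing done by update.py is not described here, so the following is only a naive sketch under the assumption that a sentence-line file holds one sentence per line with whitespace-separated, lowercased tokens. The function name to_sentence_lines and the splitting rule are hypothetical.

```python
import re


def to_sentence_lines(document):
    """Split raw text into one cleaned sentence per line (naive sketch)."""
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    lines = []
    for sentence in sentences:
        # Keep runs of word characters (including ä, ö, å) and lowercase them.
        tokens = re.findall(r"\w+", sentence.lower())
        if tokens:
            lines.append(" ".join(tokens))
    return lines


docs = ["Koira haukkuu. Kissa nukkuu sohvalla!"]
sentence_lines = [line for doc in docs for line in to_sentence_lines(doc)]
print(sentence_lines)  # prints: ['koira haukkuu', 'kissa nukkuu sohvalla']
```

Writing each resulting line to a file would give an input in the assumed format. A proper sentence tokenizer would handle abbreviations and numbers better than this regular expression.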

If you follow the steps above without modifying any code, you should be able to reproduce the custom word embeddings provided in this repository. The provided code should also automatically create the folder structure under ./data/* as follows:

crawl: State of the spider to avoid duplicate scrapes
feed: Crawled material with JSON line files named like <spiderName>.jl
processed: Preprocessed crawled material in sentence line files like all.sl
embeddings: Trained word embeddings named like <modelName>.fi.<sentenceLineFilename>.<numberOfTokensTrainedOn>.<embeddingsDimension>.<format>.gz
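The embeddings naming scheme above can be read back programmatically, for example to decide whether to load a file with binary=True. The sketch below is a hypothetical helper, not part of this repository. It assumes the sentence line filename contains no dots and that the dimension carries a trailing "d", as in fasttext.fi.all.1045M.100d.bin.gz.

```python
from typing import NamedTuple


class EmbeddingFile(NamedTuple):
    model: str
    language: str
    source: str      # sentence line filename, e.g. "all"
    tokens: str      # number of tokens trained on, e.g. "1045M"
    dimension: int
    fmt: str         # "bin" is assumed to mean binary word2vec format


def parse_embedding_filename(filename):
    """Split '<model>.fi.<source>.<tokens>.<dim>d.<format>.gz' into its fields."""
    parts = filename.split(".")
    if len(parts) != 7 or parts[-1] != "gz":
        raise ValueError(f"unexpected filename: {filename}")
    model, language, source, tokens, dim, fmt, _ = parts
    return EmbeddingFile(model, language, source, tokens, int(dim.rstrip("d")), fmt)


info = parse_embedding_filename("fasttext.fi.all.1045M.100d.bin.gz")
print(info.model, info.dimension, info.fmt == "bin")  # prints: fasttext 100 True
```

The comparison info.fmt == "bin" could then be passed as the binary argument when loading the file with gensim.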

Contributing

If you want to add, modify or remove something in the list of word embeddings or code, please feel free to make a pull request, file an issue, or contact me.

Jesse Myrberg (jesse.myrberg@gmail.com)
