Multilingual datasets

  • A Multilingual Dataset for Investigating Stereotypes and Negative Attitudes Towards Migrant Groups in Large Language Models
  • askD - ELI5 dataset adapted to the Medical Questions (AskDocs) subreddit.
  • English-Portuguese Sentences - English-Portuguese Sentences from the Tatoeba Project.
  • EUR-Lex - multilingual corpus in all the official languages of the European Union.
  • Europarl - European Parliament Proceedings Parallel Corpus 1996-2011.
  • Europarl-ST - multilingual speech translation corpus containing paired audio-text samples, built from the debates held in the European Parliament between 2008 and 2012.
  • mc4 - multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
  • mfaq - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
  • MKQA - Multilingual Knowledge Questions & Answers (github).
  • MQA - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
  • MMARCO - Multilingual version of the MS MARCO passage ranking dataset.
  • mRobust - Multilingual version of the TREC 2004 Robust passage ranking dataset.
  • MultiCoNER - a large multilingual dataset for Named Entity Recognition.
  • MuST-C - multilingual speech translation corpus.
  • OpenSubtitles - collection of translated movie subtitles.
  • OSCAR - Open Super-large Crawled Aggregated coRpus.
  • Tatoeba - a large database of sentences and translations.
  • TED2020 - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
  • TSAR-2022-Shared-Task - TSAR2022 Shared Task on Lexical Simplification.
  • WikiANN - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
  • WikiLingua - Multilingual abstractive summarization dataset extracted from WikiHow.
  • WikiMatrix - Parallel Sentences in 1620 Language Pairs from Wikipedia.
  • Wikiner - Learning multilingual named entity recognition from Wikipedia.
  • WikiNEuRal - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
  • Wikipedia - Wikipedia dataset containing cleaned articles of all languages.
  • XFORMAL - A Benchmark for Multilingual Formality Style Transfer.
  • XLSUM - 1.35 million professionally annotated article-summary pairs from BBC.


  • BATS-PT - manual translation of the lexicographic portion of the Bigger Analogy Test Set (BATS) to Portuguese.
  • br.ispell - Ispell dictionary for Brazilian Portuguese (github).
  • Conceptnet - an open, multilingual knowledge graph.
  • DicSin - Dictionary of synonyms and antonyms.
  • lexiconPT - R package that provides lexicons for Portuguese Text Analysis.
  • lexicons - Dictionaries of names, surnames, acronyms and their extensions, stop-words, etc.
  • LIWC - Linguistic Inquiry and Word Count (dictionary)
  • Onto.PT - Ontologia Lexical para o Português (lexical ontology for Portuguese).
  • OpenWordnet-PT - an open access wordnet for Portuguese (site).
  • OpLexicon - a sentiment lexicon for the Portuguese language.
  • palavras - Word list of Brazilian Portuguese.
  • PAPEL.
  • pt-br - Wordlist, verbs, conjugations, term frequencies.
  • PT-LKB - Large Portuguese Lexical-Semantic Knowledge Base
  • PULO - Portuguese Unified Lexical Ontology.
  • SentiLex-PT - a sentiment lexicon for Portuguese.
  • Stopwords - Portuguese stopwords collection.
  • Tep2.
  • Unitex-PB - lexical resources.
  • VaLexPB - a lexicon of Brazilian Portuguese verb valences.
  • VerbNet.Br 1.0 - verbal lexicon of Brazilian Portuguese.
  • wikidict-dsl-pt - Wikidata Bilingual DSL Dictionaries.
  • Wordnetaffectbr - vocabulary of emotion words.
  • Wordnet.Br.


  • Albertina PT-BR - an encoder of the BERT family for the Portuguese language, specifically the American variant from Brazil.
  • Albertina PT-PT - an encoder of the BERT family for the Portuguese language, specifically the European variant from Portugal.
  • Alpaca-LoRA-PTBR - Low-Rank LLaMA Instruct-Tuning.
  • BART - BART pretrained in Portuguese.
  • BERTimbau - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment (Github).
  • BioBERTpt - fine-tuned BERT models trained on the clinical domain for Portuguese language (Github).
  • Cabrita - a LLaMA model instruction-finetuned for Portuguese (Github).
  • DeBERTinha - A DeBERTa V3 XSmall adapted to the Brazilian Portuguese language (Github).
  • Electra - Electra model trained on BRWAC.
  • Gervasio-PT-BR - a decoder of the GPT family for the Portuguese language, specifically the American variant from Brazil.
  • Gervasio-PT-PT - a decoder of the GPT family for the Portuguese language, specifically the European variant from Portugal.
  • GlórIA 1.3B - a large language model focused on European Portuguese (HuggingFace).
  • GPT2 small - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
  • GPT-Neo small - a version of GPT-Neo 125M by EleutherAI finetuned for the Portuguese language.
  • GPT2-Bio-PT - a biomedical finetuned version of GPorTuguese-2 (Github).
  • roberta-pt-br
  • RoBERTaCrawlPT-base - a generic Portuguese Masked Language Model pretrained from scratch on the CrawlPT corpora.
  • RoBERTaLexPT-base - a Portuguese Masked Language Model pretrained from scratch on the LegalPT and CrawlPT corpora.
  • T5
  • tgf-xlm-roberta-base-pt-br (Github)
  • Wav2vec

Multilingual Models

  • Bloom
  • mBert
  • mDeBERTa
  • mGPT - Multilingual GPT model. An autoregressive GPT-like model.
  • mMiniLM - mMiniLM-L6-v2 Reranker finetuned on mMARCO
  • mT5 - Multilingual T5. A massively multilingual pre-trained text-to-text transformer.
  • LaBSE - Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.

Word Embeddings


  • Coh-Metrix-Port - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
  • NILC-Metrix - gathers the metrics developed over more than a decade at the NILC Lab.


  • Open PT LLM Leaderboard - a benchmark for evaluating Large Language Models (LLMs) in the Portuguese language across a variety of tasks and datasets.




  • Apertium-por - Apertium linguistic data for Portuguese.
  • Autocorrect - Spelling corrector in Python.
  • BrGram - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
  • Dicio API - Portuguese dictionary API.
  • dict-pt-br - dictionary for Brazilian Portuguese.
  • Languagetool - Style and Grammar Checker for 25+ Languages.
  • LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language.
  • LexML Parser - parser for legal documents.
  • LX parser - statistical constituency parser for Portuguese.
  • metaphone-ptbr - Metaphone algorithm for the Portuguese language.
  • mlconjug3 - a Python library to conjugate verbs in Portuguese and other languages.
  • MorphoBr - Resources for morphological analysis of Portuguese.
  • OpCluster - Automatic extraction and clustering of fine-grained opinions.
  • Phonemizer - Simple text to phones converter for multiple languages.
  • PorGram - Open source computational grammar for Portuguese in the HPSG formalism.
  • pymetaphone-br - Metaphone algorithm package for the Portuguese language.
  • pyspellchecker - Multilingual Spell Checking.
  • RBAMR - A Rule-Based AMR Parser for Portuguese.
  • Verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

Other lists

Other links

