ju-resplande / Portuguese-NLP

List of resources and tools developed with focus on Portuguese.

Portuguese-NLP

Datasets

  • AG_news pt - machine translation of AG's corpus of news articles.
  • Aspect-based annotated - the corpus consists of implicit and explicit annotated aspects and groups of (hierarchically organized) opinion aspects for aspect-based sentiment analysis applications, as well as text summarization.
  • ASSIN - a dataset with semantic similarity score and entailment annotations.
  • ASSIN 2 - successor to ASSIN.
  • BlogSet-BR - a collection of posts gathered from the Blogspot platform written by Brazilian users.
  • BoolQ - machine translation of BoolQ.
  • br-quad-2.0 - Stanford Question Answering Dataset (SQuAD) 2.0 translated to Brazilian Portuguese (PT-BR) language.
  • Brazilian E-Commerce - Brazilian E-Commerce Public Dataset by Olist store.
  • Brazilian Headlines Sentiments - Dataset containing sentiment analysis of Brazilian news agencies headlines.
  • Brazilian Portuguese Literature Corpus - 3.7-million-word corpus of Brazilian literature published between 1840 and 1908.
  • Brazilian Portuguese Sentiment Analysis Datasets.
  • Brazilian TCU's judgments - Judgments of Federal Court of Accounts - Brazil (TCU).
  • BrWaC - Brazilian Portuguese Web as Corpus.
  • BrWac2Wiki - a dataset for multi-document summarization in Portuguese.
  • B2W-Reviews01 - product reviews.
  • Carolina - Corpus Geral do Português Brasileiro Contemporâneo (General Corpus of Contemporary Brazilian Portuguese).
  • Capes - parallel corpus of theses and dissertations abstracts in English and Portuguese.
  • CC100-Portuguese - created by Conneau & Wenzek et al. in 2020. This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository.
  • CETENFolha - news from the newspaper Folha de S. Paulo.
  • CHAVE - collection for Information Retrieval and Question Answering.
  • CINTIL Corpus - a linguistically interpreted corpus of Portuguese.
  • Complexidade Textual para Estágios Escolares do Sistema Educacional Brasileiro - textual complexity for school stages of the Brazilian educational system.
  • CORAA - dataset for Automatic Speech Recognition.
  • CORAA SER - Emotion Recognition from Brazilian Portuguese Informal Spontaneous Speech.
  • CSTNews - a corpus with 50 clusters of news texts with their multi-document summaries, as well as several discourse and semantic annotations.
  • C-ORAL-BRASIL - This project is dedicated to the study of Brazilian Portuguese spontaneous speech and, more broadly, to the compilation of spoken corpora.
  • DEEPAGÉ - Answering Questions in Portuguese about the Brazilian Environment.
  • DNLT-BP - Datasets of Neuropsychological Language Tests in Brazilian Portuguese.
  • ENEM Challenge - consists of the writing of an essay and an objective part containing 180 multiple-choice questions.
  • ENEM-2022
  • Essay-BR - a corpus of essays for the Brazilian Portuguese language.
  • Extended Essay-BR - Extended version of the Essay-BR corpus.
  • FACTCK.BR - A dataset to study Fake News in Portuguese.
  • Fake.Br - aligned true and fake news written in Brazilian Portuguese.
  • Fakepedia-Corpus.
  • FakeRecogna - dataset comprised of real and fake news.
  • FakeWhatsApp.Br - An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.
  • FCN
  • Floresta Sintá(c)tica - treebank for Portuguese.
  • HAREM first - evaluation contest for named entity recognizers in Portuguese.
  • HAREM second - evaluation contest for named entity recognizers in Portuguese.
  • HateBR - large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.
  • Historical Portuguese Corpora - tools and resources for manipulation of historical corpora and management of historical dictionaries.
  • IMDB pt - machine translation of IMDB.
  • Iudicium Textum Dataset - contains legal documents created by Brazilian Federal Supreme Court in its integral composition (paper).
  • LeNER-Br - a Dataset for Named Entity Recognition in Brazilian Legal Text.
  • Lex2Kids - lexicon in Portuguese most heard by children.
  • Mac-Morpho - Brazilian Portuguese texts annotated with part-of-speech tags.
  • MilkQA - a dataset of dense questions for the task of answer selection.
  • Minutes of Central Bank of Brazil - Minutes of the Monetary Policy Committee of the Central Bank of Brazil.
  • NER in Brazilian Portuguese tweets - Twitter messages in pt-br annotated for the entities PER, LOC and ORG.
  • News-Crawl-PT - Monolingual News Crawl used for WMT.
  • News of the site Folha de São Paulo
  • News published in Brazil
  • Parallel Corpora from Revista Pesquisa FAPESP - Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the scientific news Brazilian magazine Revista Pesquisa FAPESP.
  • Pirá - A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean.
  • PLUE - Portuguese translation of the GLUE benchmark and Scitail dataset.
  • POeTiSA - POrtuguese processing - Towards Syntactic Analysis and parsing.
  • PorSimplesSent - a corpus of aligned sentence pairs for investigating sentence readability assessment.
  • PortiLexicon-UD - a lexicon for Brazilian Portuguese according to Universal Dependencies.
  • Portuguese Legal Sentences - Collection of Legal Sentences from the Portuguese Supreme Court of Justice.
  • Portuguese Presidential Elections - This dataset contains tweets and users mostly from the Portuguese Twittersphere.
  • PraCegoVer - multi-modal dataset containing images associated with Portuguese captions based on posts from Instagram.
  • Priberam Fine-Grained Opinion Corpus - a Portuguese fine-grained dependency opinion mining corpus.
  • Propbank - Contains instances annotated with semantic role labels (SRL).
  • Projeto ACDC - Internet Access to Corpora.
  • QA-Portuguese - Adaptation from MQA dataset Portuguese split (QA entailment pairs).
  • REBEL-Portuguese - relation datasets built from Wikipedia.
  • ReLi - REsenha de LIvros (book reviews).
  • Rhetalho - corpus annotated with Daniel Marcu's RSTTool.
  • SemClinBr - multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.
  • SESAME - corpus for NER in Portuguese.
  • SIGARRA News Corpus - SIGARRA information system at the University of Porto.
  • SIMPLEX-PB - A Lexical Simplification Database and Benchmark for Portuguese.
  • SIMPLEX-PB-2.0 - improved version of SIMPLEX-PB.
  • SIMPLEX-PB-3.0 - new version of SIMPLEX-PB.
  • SQUAD-PT v1.1 - Portuguese translation of the SQuAD dataset.
  • SQUAD-PT v2.0 - Portuguese translation of SQuAD 2.0 dataset.
  • SST-2 pt - machine translation of the Stanford Sentiment Treebank.
  • TeMário - news texts and the corresponding human summaries for summarization purposes.
  • Textual Complexity Corpus - Textual Complexity Corpus for School Internships in the Brazilian Educational System.
  • ToLD-Br - Toxic Language Detection in Social Media for Brazilian Portuguese (github).
  • TTS-Portuguese Corpus - Text To Speech Portuguese.
  • TweetSentBR - Tweets in Brazilian Portuguese.
  • Tweets for Sentiment Analysis
  • UD_Portuguese-Bosque - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-CINTIL - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-GSD - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-PetroGold - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-PUD - Universal Dependencies (UD) Portuguese treebank.
  • UTLCorpus - a corpus of online reviews in Brazilian Portuguese annotated with helpfulness classification.
  • Winograd Schema Challenge - Solver for the Portuguese-based Winograd Schema Challenge.

Multilingual datasets

  • askD - ELI5 dataset adapted to the Medical Questions (AskDocs) subreddit.
  • English-Portuguese Sentences - English-Portuguese Sentences from the Tatoeba Project.
  • EUR-Lex - multilingual corpus in all the official languages of the European Union.
  • Europarl - European Parliament Proceedings Parallel Corpus 1996-2011.
  • Europarl-ST - Multilingual Speech Translation Corpus, that contains paired audio-text samples for Speech Translation, constructed using the debates carried out in the European Parliament in the period between 2008 and 2012.
  • mc4 - a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
  • mfaq - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
  • MKQA - Multilingual Knowledge Questions & Answers (github).
  • MQA - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
  • MMARCO - Multilingual version of the MS MARCO passage ranking dataset.
  • MultiCoNER - a large multilingual dataset for Named Entity Recognition.
  • MuST-C - multilingual speech translation corpus.
  • OSCAR - Open Super-large Crawled Aggregated coRpus.
  • OpenSubtitles - collection of translated movie subtitles.
  • Tatoeba - a large database of sentences and translations.
  • TED2020 - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
  • TSAR-2022-Shared-Task - TSAR2022 Shared Task on Lexical Simplification.
  • WikiANN - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
  • WikiLingua - Multilingual abstractive summarization dataset extracted from WikiHow.
  • WikiMatrix - Parallel Sentences in 1620 Language Pairs from Wikipedia.
  • Wikiner - Learning multilingual named entity recognition from Wikipedia.
  • WikiNEuRal - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
  • Wikipedia - Wikipedia dataset containing cleaned articles of all languages.
  • XFORMAL - A Benchmark for Multilingual Formality Style Transfer.
  • XLSUM - 1.35 million professionally annotated article-summary pairs from the BBC.
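
Several NER resources in this list (WikiANN, for example) label tokens with LOC, PER and ORG tags in the IOB2 format. As a rough illustration of what that format encodes, the sketch below collapses a tag sequence into entity spans. The function name `iob2_to_spans`, the lenient handling of a stray `I-` tag, and the sample sentence are assumptions made for this example; they do not come from any of the listed datasets or their loaders.

```python
from typing import List, Tuple

# (entity type, start token index, end token index, end exclusive)
Span = Tuple[str, int, int]


def iob2_to_spans(tags: List[str]) -> List[Span]:
    """Collapse a sequence of IOB2 tags into (type, start, end) spans."""
    spans: List[Span] = []
    start = None
    current = None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current is None:
            # Assumption: a stray "I-" with no open entity starts a new one.
            start, current = i, tag[2:]
    if current is not None:
        # Close an entity that runs to the end of the sentence.
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical example sentence, tagged by hand for illustration.
tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
entities = [(label, " ".join(tokens[s:e])) for label, s, e in iob2_to_spans(tags)]
print(entities)
```

Strict IOB2 requires every entity to open with a `B-` tag; a stricter variant could raise an error on a stray `I-` instead of accepting it.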

Lexicon

Models

  • Albertina PT-BR
  • BERTimbau - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment (Github).
  • BioBERTpt - fine-tuned BERT models trained on the clinical domain for the Portuguese language.
  • Cabrita - a Portuguese instruction-finetuned LLaMA (Github).
  • Electra - Electra model trained on BRWAC.
  • GPT2 small - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
  • GPT-Neo small - a version of GPT-Neo 125M by EleutherAI finetuned for the Portuguese language.
  • mMiniLM - mMiniLM-L6-v2 Reranker finetuned on mMARCO
  • roberta-pt-br
  • T5
  • tgf-xlm-roberta-base-pt-br (Github)
  • Wav2vec

Multilingual Models

Word Embeddings

Metrics

  • Coh-Metrix-Port - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
  • NILC-Metrix - it gathers the metrics developed over more than a decade in NILC Lab.

Frameworks

Institutions

Tools

  • Autocorrect - Spelling corrector in Python.
  • BrGram - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
  • Dicio API - Portuguese dictionary API.
  • dict-pt-br - dictionary for Brazilian Portuguese.
  • Languagetool - Style and Grammar Checker for 25+ Languages.
  • LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language.
  • LexML Parser - parser for legal documents.
  • LX parser - statistical constituency parser for Portuguese.
  • metaphone-ptbr - Metaphone algorithm for the Portuguese language.
  • mlconjug3 - a Python library to conjugate verbs in Portuguese and other languages.
  • MorphoBr - Resources for morphological analysis of Portuguese.
  • OpCluster - Automatic extraction and clustering of fine-grained opinions.
  • Phonemizer - Simple text to phones converter for multiple languages.
  • PorGram - Open source computational grammar for Portuguese in the HPSG formalism.
  • pymetaphone-br - Metaphone algorithm package for the Portuguese language.
  • RBAMR - A Rule-Based AMR Parser for Portuguese.
  • Verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

Other lists

Other links
