Bitextor Team (bitextor)

Organization data from GitHub: https://github.com/bitextor

Translation memories generator

GitHub: @bitextor

Bitextor Team's repositories

bitextor

Bitextor generates translation memories from multilingual websites

Language: Python | License: GPL-3.0 | Stargazers: 291 | Issues: 29 | Issues: 160

bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.

Language: Python | License: GPL-3.0 | Stargazers: 155 | Issues: 13 | Issues: 53

pdf-extract

PDF parser and converter to HTML

Language: Java | License: GPL-3.0 | Stargazers: 85 | Issues: 16 | Issues: 51

bicleaner-ai

Bicleaner fork that uses neural networks

Language: Python | License: GPL-3.0 | Stargazers: 39 | Issues: 11 | Issues: 20

bifixer

Tool to fix bitexts and tag near-duplicates for removal

Language: Python | License: GPL-3.0 | Stargazers: 30 | Issues: 7 | Issues: 11

warc2text

Extracts plain text, language identification and more metadata from WARC records

Language: C++ | License: MIT | Stargazers: 21 | Issues: 9 | Issues: 26

biroamer

Utility that helps you ROAM (Random Omit, Anonymize and Mix) your parallel corpus.

Language: Python | License: GPL-3.0 | Stargazers: 10 | Issues: 9 | Issues: 4
Language: Python | License: GPL-3.0 | Stargazers: 7 | Issues: 9 | Issues: 3

neural-document-aligner

Document aligner that uses neural technologies to search for matches across bilingual documents

Language: Python | License: GPL-3.0 | Stargazers: 7 | Issues: 8 | Issues: 1

bicleaner-data

Repository for data models, dictionaries and more resources for Bicleaner

bitextor-data

Repository for data models, dictionaries and more resources for Bitextor

License: GPL-3.0 | Stargazers: 6 | Issues: 9 | Issues: 0
Language: Python | License: GPL-3.0 | Stargazers: 6 | Issues: 10 | Issues: 1

python-pdfextract

Python interface to pdf-extract, HTML extraction from PDF

Language: Python | License: NOASSERTION | Stargazers: 6 | Issues: 1 | Issues: 0

bicleaner-ai-data

Repository of Bicleaner AI models

License: NOASSERTION | Stargazers: 5 | Issues: 9 | Issues: 0

bicleaner-hardrules

Pre-filtering step for bicleaner

Language: Python | License: GPL-3.0 | Stargazers: 4 | Issues: 8 | Issues: 4

bitextor-neural

Bitextor Neural generates translation memories from multilingual websites using state-of-the-art Machine Learning tools

Language: Python | License: GPL-3.0 | Stargazers: 3 | Issues: 9 | Issues: 0

prevertical2text

Extracts plain text, language identification and more metadata from Spiderling prevertical files

Language: C++ | License: MIT | Stargazers: 2 | Issues: 8 | Issues: 1

vecalign

Improved Sentence Alignment in Linear Time and Space

Language: Python | License: Apache-2.0 | Stargazers: 2 | Issues: 3 | Issues: 0

loomchild-segment-py

Python module to interface with Java Loomchild sentence segmenter

Language: Python | License: GPL-3.0 | Stargazers: 1 | Issues: 2 | Issues: 1

monocleaner-data

Monocleaner models repository

License: GPL-3.0 | Stargazers: 1 | Issues: 8 | Issues: 0

scrawl

Playwright-based web crawler

Language: Python | License: GPL-3.0 | Stargazers: 1 | Issues: 7 | Issues: 0

bicleaner-ai-glove

Fork of glove-python to distribute binary builds

Language: Python | License: Apache-2.0 | Stargazers: 0 | Issues: 1 | Issues: 0

bitextor-testing-output

Repository for storing testing outputs from Bitextor

License: GPL-3.0 | Stargazers: 0 | Issues: 8 | Issues: 0

cld2

Compact Language Detector 2

Language: C++ | License: Apache-2.0 | Stargazers: 0 | Issues: 3 | Issues: 0

deferred-crawling

Reconstructs sentences using deferred crawling standoff annotations from Bitextor

Language: Python | License: MIT | Stargazers: 0 | Issues: 8 | Issues: 0

fastText

Library for fast text representation and classification.

Language: HTML | License: MIT | Stargazers: 0 | Issues: 1 | Issues: 0

hunalign

Sentence aligner

Language: C++ | License: LGPL-3.0 | Stargazers: 0 | Issues: 2 | Issues: 0

python-apachetika

Python interface to Apache Tika, HTML extraction from PDF

Language: Python | License: NOASSERTION | Stargazers: 0 | Issues: 1 | Issues: 0