preprocessing

There are 5 repositories under the preprocessing topic.

Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
data-pipelines deep-learning document-image-analysis document-image-processing document-parser document-parsing docx donut information-retrieval langchain llm machine-learning ml natural-language-processing nlp ocr pdf pdf-to-json pdf-to-text preprocessing
Language:HTML 13169
dongrixinyu / JioNLP
A Chinese NLP preprocessing and parsing toolkit that is accurate, efficient, and easy to use. www.jionlp.com
apache2 chinese natural-language-processing ner nlp nlp-parse preprocessing python time-parse time-parsing
Language:Python 3754
nidhaloff / igel
A delightful machine learning tool that allows you to train, test, and use models without writing code
machine-learning machinelearning machine-learning-algorithms machine-learning-library artificial-intelligence neural-network neural-networks sklearn scikit-learn scikitlearn-machine-learning data-science data-analysis preprocessing automation automl automl-experiments hacktoberfest hacktoberfest2021
Language:Python 3132
OpenGene / fastp
An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
fastq qc preprocessing filtering adapter overlap quality trimming splitting quality-control filter ngs bioinformatics umi sequencing illumina polyg duplication merging
Language:C++ 2219
AxeldeRomblay / MLBox
MLBox is a powerful Automated Machine Learning python library.
machine-learning auto-ml kaggle deep-learning stacking pipeline optimization preprocessing encoding prediction distributed xgboost drift classification regression lightgbm keras automated-machine-learning automl data-science
Language:Python 1522
winedarksea / AutoTS
Automated Time Series Forecasting
time-series machine-learning automl autots forecasting deep-learning preprocessing feature-engineering
Language:Python 1331
sunlabuiuc / PyHealth
A Deep Learning Python Toolkit for Healthcare Applications.
healthcare data-mining deep-learning preprocessing clinical-data clinical-research electronic-medical-record medical-code electronic-health-record
Language:Python 1304
NVIDIA-Merlin / NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
deep-learning feature-engineering feature-selection gpu machine-learning nvidia preprocessing recommendation-system recommender-system
Language:Python 1126
KinWaiCheuk / nnAudio
Audio processing using PyTorch 1D convolution networks
1d-convolution audio-processing cqt-spectrogram melspectrogram neural-network preprocessing pytorch spectrogram spectrogram-conversion-toolbox stft
Language:Python 1094
TheAlgorithms / R
Collection of various algorithms implemented in R.
algorithm algorithms classification clustering data-mining datamanipulation education hacktoberfest learning machine-learning practice preprocessing r r-language r-programming regression
Language:R 1044
MinishLab / semhash
Fast Semantic Text Deduplication & Filtering
datasets deduplication model2vec preprocessing semantic-deduplication vicinity
Language:Python 831
pytorch / torcharrow
High-performance model preprocessing library on PyTorch
preprocessing python pytorch
Language:Python 645
qd-cae / awesome-CAE
A curated list of awesome CAE frameworks, libraries and software.
cae abaqus ls-dyna libraries collection preprocessing scripting tools fem cfd
447
R1j1t / contextualSpellCheck
✔️ Contextual word checker for better suggestions (not actively maintained)
spacy spacy-extension nlp spellcheck preprocessing bert help-wanted chatbot spellchecker python-spelling-corrector oov spelling-correction spelling-corrections natural-language-processing python
Language:Python 418
msamogh / nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
pytorch data-processing data-preprocessing data-pipeline data-cleaning preprocessing machine-learning torch
Language:Python 378
MaxHalford / xam
🎯 Personal data science and machine learning toolbox
python machine-learning data-science preprocessing stacking
Language:Python 366
DataCanvasIO / HyperGBM
A full pipeline AutoML tool for tabular data
automl gbm fullpipeline xgboost lightgbm catboost pseudo-labeling adversarial-validation semi-supervised-learning datacleaning preprocessing ensemble-learning tabular-data distributed-training dask gpu-acceleration dask-distributed rapidsai sklearn
Language:Python 359
ikegami-yukino / jaconv
Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
japanese-language preprocessing text-processing japanese-kana pure-python character-converter julius transliteration
Language:Python 335
Razor12911 / xtool
Just some tool repackers like to use...
archiving compression preprocessing precompression repacking encryption
Language:Pascal 331
advaitsave / Introduction-to-Time-Series-forecasting-Python
Introduction to time series preprocessing and forecasting in Python using AR, MA, ARMA, ARIMA, SARIMA and Prophet model with forecast evaluation.
time-series forecasting python arima sarima arma prophet-model series-forecasting-python series-preprocessing forecast-evaluation time-series-forecasting preprocessing stationarity dickey-fuller seasonality
Language:Jupyter Notebook 330
cylondata / cylon
Cylon is a fast, scalable, distributed-memory, parallel runtime with a Pandas-like DataFrame.
data join shuffle dataframe dataframes-api mpi deep-learning preprocessing
Language:C++ 301
nlpcl-lab / ace2005-preprocessing
ACE 2005 corpus preprocessing for the event extraction task
ace2005 event-extraction parsed-data preprocessing
Language:Python 298
ikegami-yukino / neologdn
Japanese text normalizer for mecab-neologd
japanese-language mecab-ipadic-neologd nlp preprocessing text-normalization
Language:Cython 286
OpenTabular / DeepTabular
Mambular is a Python package that simplifies tabular deep learning by providing a suite of models for regression, classification, and distributional regression tasks. It includes models such as Mambular, TabM, FT-Transformer, TabulaRNN, TabTransformer, and tabular ResNets.
classification deeplearning distributional-regression mamba regression ft-transformer resnet rnn ssm tabtransformer tabular tabular-learning state-space-models ple preprocessing tabular-deep-learning piecewise-linear-encoding node tabm saint
Language:Python 273
Deffro / text-preprocessing-techniques
16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysis.
python sentiment-analysis twitter preprocessing data-science
Language:Python 227
dunky11 / voicesmith
[WIP] VoiceSmith makes training text to speech models easy.
dataset-manager delightfultts preprocessing speech-synthesis text-to-speech toolkit tts univnet voice-cloning
Language:Python 226
quqixun / BrainPrep
Preprocessing pipeline on Brain MR Images through FSL and ANTs, including registration, skull-stripping, bias field correction, enhancement and segmentation.
brain mri preprocessing fsl registration skull-stripping brain-segmentation ants enhancement bias-field-correction
Language:Python 208
jbusecke / xMIP
Analysis-ready CMIP6 data in Python the easy way with Pangeo tools.
analysis-ready-data climate-analysis climate-models cmip6 cmip6-data pangeo preprocessing xgcm
Language:Jupyter Notebook 204
jaeho3690 / LIDC-IDRI-Preprocessing
This is the preprocessing step of the LIDC-IDRI dataset
lidc-dataset preprocessing lung lidc-preprocessing npy lidc-idri
Language:Jupyter Notebook 199
google / tensorflow-recorder
TFRecorder makes it easy to create TensorFlow records (TFRecords) from Pandas DataFrames and CSV files containing images or structured data.
apache-beam image-processing preprocessing tensorflow tensorflow-recorder tfrecords
Language:Python 180
GiftMungmeeprued / document-parsers-list
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
data-pipeline document-image-processing document-parser document-parsing langchain ocr pdf pdf-to-text preprocessing
164
sappelhoff / pyprep
PyPREP: A Python implementation of the Preprocessing Pipeline (PREP) for EEG data
eeg preprocessing python neuroscience neuroimaging artifact data mne electroencephalogaphy
Language:Python 164
ropensci / MODIStsp
An "R" package for automatic download and preprocessing of MODIS Land Products Time Series
modis gdal r preprocessing time-series remote-sensing satellite-imagery modis-data modis-land-products peer-reviewed r-package rstats
Language:R 159
githubharald / DeslantImg
The deslanting algorithm sets text upright in images. Python, C++ and OpenCL implementations provided.
c-plus-plus gpu handwriting-recognition image-processing ocr opencl opencv preprocessing python
Language:C++ 150
autoreject / autoreject
Automated rejection and repair of bad trials/sensors in M/EEG
electroencephalography preprocessing cross-validation mne-python magnetoencephalography meg
Language:Python 147
mlr-org / mlr3pipelines
Dataflow Programming for Machine Learning in R
bagging data-science dataflow-programming ensemble-learning machine-learning mlr3 pipelines preprocessing r r-package stacking
Language:R 146