junjiez / PracticalMachineLearning

My ML-related material, including notebooks, code, and a curated list of useful resources such as books and software. Almost everything mentioned here is free (as in speech, not as in free food) or open source.

I have gathered here only a bouquet of other people's flowers, having provided nothing of my own but the thread that ties them.

My Machine Learning related stuff!

My Apache Zeppelin and Jupyter notebooks, and more, covering data analysis and machine learning topics in general

ML algorithms

ML Resources (with an emphasis on Python)

This document is an attempt to build a curated list of Machine Learning resources, including books, papers, software, libraries, notebooks, etc. Most of the libraries are for Python, though the rest of the material is generally suited to working with data.

Books and Writings

Dataset Repositories

Q&A Websites

Useful Websites

Editors & IDEs for Python

  • Spyder: A great Python IDE for scientists in general
  • PyCharm CE: An excellent IDE for developing anything with Python
  • GNU Emacs: GNU Emacs is an environment for doing almost anything
  • IDLE: Default Python IDE, lean and clean environment to develop in Python
  • Rodeo: A Python IDE for data scientists

Toolboxes & Distributions

Notebook Authoring Environments

  • Jupyter
  • Apache Zeppelin: A great notebook environment for data visualization and doing analytics stuff, it can connect to many different databases and data management systems
  • Beaker
  • nteract
  • JupyterLab: Next-generation Jupyter notebook environment
  • Spark Notebook: Spark Notebook is an interactive notebook authoring environment for working with Scala code on top of Spark clusters
  • Python(x,y): Python(x,y) is an open-source environment for scientific and numerical computations and analysis
  • Polynote: A notebook authoring tool with native support for Scala on Spark from Netflix

Python Machine Learning, Data Mining, Statistical Analysis Libraries

  • Pandas: Python's well-known data manipulation library
  • Scipy: Python's de facto scientific computing library
  • Numpy: Linear algebra library for fast numerical computation
  • Scikit Learn: High-level Machine Learning library with tons of features, very easy-to-use and extendable
  • Bokeh: An interactive high-level data visualization library
  • Matplotlib: A compelling data visualisation library, more low-level than other visualisation libraries
  • Graph Tool: A fast and powerful library for working with graphs in Python; it is built on top of the Boost C++ libraries, so it is very efficient
  • NetworkX: A Python module for complex network modelling and analysis; very easy to use, but it may be slow at times because it is written in pure Python
  • TensorFlow: Low-level library for creating deep artificial neural networks, works on both CPU and GPU. Usually, you use TF together with a library such as Keras that exposes its functionality through a higher-level API
  • Keras: "Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano" - Keras's website
  • NLTK: Swiss Army knife tool for text processing in Python
  • Pattern: Another good text processing library for Python
  • IPython
  • Orange: Orange is a general-purpose data mining and analysis tool and library that lets you build machine learning pipelines with just a few drag-and-drop operations
  • Theano
  • CatBoost: Yandex's implementation of Gradient Boosting on Decision Trees; it supports categorical features out of the box
  • XGBoost: The original XGBoost library, a very efficient Gradient Boosting library with extra regularisation
  • Mlxtend: A great Data Mining and Machine Learning library with
  • NetworKit: A very high-performance graph processing and analysis toolkit, written in C++ and uses OpenMP, so it is very fast on multicore computers
  • Eli5
  • Pandasql
  • Dask: A fast data manipulation library with out-of-core handling of data, suited for distributed environments; its DataFrame API mirrors a large subset of the Pandas API
  • MLBox
  • Gensim
  • Scikit-learn-Contrib/Imbalanced-learn: An extension library for Scikit-learn for handling imbalanced datasets
  • Patsy: "Kamelot!!! ... It's just a model Shhhh!"
  • Statsmodels: A Python package for building various statistical models
  • Seaborn: A high-level visualization library for Python
  • Pandas-profiling
  • Blaze
  • Altair
  • Numba
  • BigARTM
  • Gym: An open-source toolkit for reinforcement learning from OpenAI
  • PyBrain: A Machine Learning library for Python with emphasis on modelling via many types of neural network architectures
  • Sklearn-pandas
  • Auto-ML
  • Scikit-Learn Contrib/Lightning: An extension library to Scikit-learn for large-scale linear classification, regression and ranking problems
  • GPLearn
  • Nengo
  • Scikit-learn Contrib/*: A collection of extension libraries for Scikit-learn adding new (missing) functionalities to it
  • Koolmogorov: A Python library for hierarchical clustering and visualisation
  • Lime: A tool for exploring and explaining the output of classifiers
  • TreeInterpreter
  • SNAP-Python: Python wrapper library for Stanford Network Analysis Platform (SNAP)
  • Pycobra: A Python library implementing ensemble methods for regression, classification and visualisation tools including Voronoi tessellations
  • TF Learn: A library on top of TensorFlow providing a higher-level API than TensorFlow
  • Featuretools: A Python library for automated feature engineering
  • spaCy: NLP library with tons of features (like various CNN models)
  • SymPy: Symbolic computation library for Python, aiming to become a full-fledged CAS
  • Uniform Manifold Approximation and Projection: A general non-linear dimensionality reduction algorithm implemented in Python
  • Scikit-learn Contrib/HDBSCAN: A high-performance implementation of HDBSCAN clustering that works as an extension to Scikit-learn. HDBSCAN is a robust and easy-to-use clustering algorithm with minimal parameters, ideal for exploratory data analysis
  • Turi Create: A fast tool/library for simplifying various ML tasks
  • Scikit-learn-Contrib/Categorical-Encoding: An extension library for Scikit-learn that provides additional categorical feature encoding schemes (e.g. the LeaveOneOut scheme)
  • Optunity: A library for hyperparameter optimization
  • Kmodes
  • TF-Slim
  • Pyro: "Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend" - Pyro's website
  • GEM: A Python library that provides various graph embedding methods like 'node2vec' and 'locally linear embedding.'
  • DynamicGEM: A dynamic graph embedding library like GEM
  • GraphSAGE: A graph embedding framework to generate low-dimensional vector representations for nodes, instrumental if you need to use deep learning on graph data
  • Horovod: A distributed training framework for TensorFlow, Keras, and PyTorch by Uber
  • NetLSD: Python implementation of NetLSD, a scalable graph embedding algorithm for representing a graph via a low-dimensional vector
  • SHAP: A tool for exploring and explaining the outcome of an arbitrary model
  • NLPre: Another cool Python NLP library
  • GCN: Python implementation of graph convolutional networks in TensorFlow
  • AllenNLP: "An open-source NLP research library, built on PyTorch" - AllenNLP's repository documentations
  • TensorLy: A Python Library for efficient Tensor operations
  • CuPy: A Python matrix library accelerated by Nvidia CUDA, it's also compatible with Numpy's API
  • Scikit-Multiflow: A Python library for Stream Mining
  • MLflow: A software toolbox to manage ML projects' workflow and life-cycle, it aims to make ML software projects easier to implement by providing various helper components for each step
  • pyGAM: A Python module for building Generalized Additive Models (GAMs)
  • ggplot: "ggplot is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. It is built for making professionally looking, plots quickly with minimal code" - ggplot's website
  • Linkpred: A Python package for link prediction on graphs
  • SparklingGraph: A Python library to process large scale graphs using Spark and GraphX in a distributed manner
  • OpenNE: An open-source network embedding library
  • Galry: A high-performance visualisation library in Python
  • Dedupe: A Python library for fuzzy entity-resolution and record deduplication
  • PyText: A deep-learning-based NLP modelling framework built on top of PyTorch
  • flair: A state-of-the-art NLP framework in Python from Zalando
  • NearPy: "A Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes" according to its descriptions
  • fastchunking: A (fast) text chunking algorithm implemented in C++ and Python
  • Vaex: Vaex is a data manipulation library much like Pandas and Dask with a lazy out-of-core approach to handling the data so you can work with huge tables with it
  • openTSNE: An extensible, parallel implementation of t-SNE
  • Faust: A stream processing library for Python
  • Active Semi-Supervised Clustering: An extension library for scikit-learn that implements a collection of useful active semi-supervised clustering algorithms
  • TextDistance: A Python library for calculating and comparing the distance between two sequences (such as text documents) with many algorithms
  • Ray: A scalable, high-performance distributed execution framework for executing arbitrary Python functions on multiple machines, suitable for many ML workloads
  • Pyitlib: An open-source library for calculating a useful collection of information-theoretic measures (e.g., entropy) for discrete random variables
  • KDEpy: A collection of useful kernel density estimators in Python 3.5+
  • Tsfresh: A Python library for (automatic) feature extraction and engineering on time-dependent data
  • GPy: A Python library for working with Gaussian processes
  • Tslearn: A machine learning library dedicated to working with time-dependent data
  • Ludwig: "Ludwig is a toolbox that allows to train and test deep learning models without the need to write code" - Ludwig's website
  • Record Linkage Toolkit: A Python software toolkit for record deduplication and linkage
  • PyJanitor: Python port of R's janitor package, for data cleansing and manipulation
  • FastText: A library for fast and efficient text embedding and classification
  • Mimesis: A fast and useful fake data generation library
  • PyOD: A Python software toolbox for scalable Outlier Detection (aka Anomaly Detection)
  • Creme: A Python library for Online Learning and building incremental models
  • vg: A linear algebra library much like Numpy with a more human-friendly interface
  • GraphKernels: A fast library for calculating various graph kernels
  • GraKeL: A graph kernel calculation library that is using scikit-learn's API so it can be used with other functionalities and routines already present in scikit-learn without much hassle
  • Graphsim: A graph similarity extension library for NetworkX
  • Textract: A general text extraction tool from many file formats
  • Sacred: Sacred is a Python library to make an ML workflow easier to reproduce and manage for you!
  • TextDistance: TextDistance is a Python library for calculating and comparing the distance between two or more sequences of an arbitrary alphabet (e.g., words, DNA sequences), it has got over 30 distance algorithms to use
  • Py_stringmatching: Py_stringmatching is a Python library consisting of a comprehensive set of string tokenisers (such as alphabetical tokenisers, whitespace tokenisers) and also string similarity measures (e.g., edit distance, Jaccard distance)
  • JGraph: JGraph is a WebGL graph drawing library for Python
  • Kedro: A Python library and tool for managing the data analysis workflow in your projects
  • PySAL: PySAL is a Python package for geolocation-based data analysis
  • k-Shape: This is a Python implementation of the k-Shape clustering algorithm for clustering the time series data
  • Pyforest: You can use Pyforest to lazily import all Python data science-related libraries as you need them in your code
  • ETE Toolkit: ETE Toolkit is a Python toolbox for visualising and analysing tree-structured data
  • Whoosh: Whoosh is a full-text indexing and search library for Python
  • Geoplot: Geoplot is a Python visualisation library for geospatial plotting of geo-locational records
  • GeoPandas: GeoPandas is a high-level library with an API similar to Pandas that makes working with geospatial datasets in Python much easier
  • Edward: "A library for probabilistic modelling, inference, and criticism" - its website
  • HyperTools: A Python library for high-dimensional data visualisation and analysis
  • TextRank: TextRank algorithm implementation for Python 3
  • pymorton: A Python package for ordinal hashing of multidimensional points into a one-dimensional ordering
  • PySS3: A Python package implementing the SS3 text classifier with visualisation tools for explainable artificial intelligence (XAI)
  • Lpproj: A Python implementation of Locality Preserving Projections (LPP) with Scikit-Learn compatible API
  • Multi-Rake: Multilingual rapid automatic keyword extraction (Multi-RAKE) is a Python library for automatic text summarisation and keyword extraction of text in many different languages
  • PyCaret
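
Several entries above (Py_stringmatching, TextDistance) mention string similarity measures such as edit distance and Jaccard distance. As a rough illustration of what those two measures compute, here is a minimal pure-Python sketch. It does not use or reproduce the API of any listed library; the function names `edit_distance`, `whitespace_tokens` and `jaccard_distance` are hypothetical, and the libraries themselves offer far more complete and faster implementations.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def whitespace_tokens(text):
    """Lower-case the text and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the whitespace tokens of both strings."""
    ta, tb = whitespace_tokens(a), whitespace_tokens(b)
    if not ta and not tb:
        return 0.0  # assumption: two empty inputs are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(jaccard_distance("data mining tools", "data mining libraries"))
```

Edit distance works on characters, so it suits typos and near-duplicate names; Jaccard distance works on token sets, so it ignores word order and suits longer texts.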

Additional Useful Resources

  • PyPy Python Implementation: A stackless alternative implementation for Python's runtime
  • Useful Metrics: A collection of useful ML related scoring and learning metrics
  • XGboost Benchmarks
  • Franchise Notebook
  • Orange
  • Weka: The famous Data Mining tool from where Kiwis live
  • ELKI: A Data Mining software framework in Java
  • Julia Programming Language: New language for Scientific Computing and HPC
  • SQL Notebook
  • IPython: An augmented Python shell with lots of features
  • Incanter: A statistical analysis environment for a Lisp (Clojure, to be exact)
  • Torch: Scientific Computing framework running on top of Lua's Just in Time compiler, brilliant idea!
  • BPython: An advanced Python shell
  • RAnalyticFlow: Great environment for Data Flow Programming in R
  • SPMF: A Java Data Mining library with tons of cool algorithms
  • SageMath: Open source math software system, a complete math environment for everyone
  • H2O AI Platform: A software tool for Big Data analysis that can be used for both Data Mining and Machine Learning tasks; it has tons of features
  • Various ML Cheat Sheets
  • OpenRefine: An open-source data cleansing and refinement tool
  • Deep Learning Papers
  • Apache MXNet: A high-performance and scalable ANN framework for Deep Learning
  • Material for the book 'Python for Data Analysis'
  • Encog Machine Learning Framework: An ML library for Java and .NET with focus on ANN algorithms
  • Apache Spark MLlib: An ML library on top of your Spark cluster!
  • Awesome-Python: A comprehensive list of Pythonic resources (libraries, frameworks, etc.)
  • GATE: A mature text processing toolkit in Java
  • MALLET: "MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications to text." - MALLET's website
  • MLPack: A fast ML library written in C++ with bindings to Python
  • t-SNE: Implementations of the famous t-distributed stochastic neighbour embedding algorithm in various languages
  • Caffe
  • Apache Singa
  • CompLearn
  • SNAP
  • Apache PredictionIO
  • JGraphT: A Java library for working with graphs with tonnes of features
  • JGraphX: A Java library for diagramming and visualising graphs
  • Microsoft Distributed Machine Learning Toolkit
  • Microsoft Cognitive Toolkit
  • BIDMat: A CPU- and GPU-accelerated matrix library for data mining tasks
  • BIDMach
  • Apache SystemML
  • Apache Mahout
  • Accord.NET: Accord.NET is a Machine Learning framework for .NET written in C#; it also comes with audio and image processing libraries written entirely in C#
  • BitMAGIC Library
  • Cassovary
  • Dex: A nice Java-based tool for Data Analysis and Data Mining
  • Apache OpenNLP
  • OpenNN: A C++ library to build complex neural network models
  • MOA: A tool for mining stream data, by people who also created Weka
  • MLPACK: C++ Machine Learning library for scalability, speed, and ease-of-use
  • MOSES: "Moses is a statistical machine translation system that allows you to train translation models for any language pair automatically." - Moses's website
  • Parallel Python: A Python module for parallel execution of code on SMP and Cluster environment
  • BeautifulSoup: A handy Python library to digest almost anything from the World Wide Web
  • Wordbatch: A library for parallel feature extraction on textual data (and potentially other complex data types)
  • Mypy: Static typing facilities for Python
  • SKIL: A platform for managing the life cycle of an ML/DS related project or product
  • An unofficial Python extension package repository for Windows
  • LIBOL: An online learning library
  • Smile: "Smile is a fast and comprehensive machine learning system"- Smile's website
  • Tablesaw: A dataframe and visualisation library for Java
  • TensorFlow Models: A repository of models and examples built with TensorFlow
  • Curated list of graph embedding methods: A collection of paper-code pairs for state-of-the-art graph embedding (a.k.a. network representation learning) algorithms
  • Curated list of resources for Recommender Systems
  • Pegasus: An open-source system for analysing huge graphs; it appears not to have been developed or maintained for a long time
  • Dataset: A handy tool to simplify the task of reading and writing to relational databases
  • Twython: A Twitter API library in pure Python with tonnes of features
  • Apache TinkerPop: A cool graph storage and computation framework, it can be used both as a graph analytics platform and a graph database system, love the little gremlins!
  • Graphexp: Graphexp is a visual graph explorer with D3.js for TinkerPop
  • Scilab: An open-source numerical computation language and environment, great Matlab alternative
  • Glow: A neural network compiler targeting various hardware accelerators
  • GraphJet: A real-time graph processing library in Java
  • GraphDrawing: A very nice graph analysis and drawing library in Java
  • Sketch Library: A C++ library for data summarization
  • The Lemur Project: A collection of search engine, text processing and Data Mining tools and libraries in C++ and Java, such as RankLib for ranking
  • VisPy: A Python library for interactive scientific visualisation that is designed to be fast, scalable and easy to use
  • Awesome Machine Learning: A curated list of awesome Machine Learning frameworks, libraries and software, etc
  • MOA Framework: A fantastic Java software environment and framework for Stream Mining
  • MEKA: A multi-label classification tool, it works on top of Weka
  • Mulan: A Java library for learning on multi-label data
  • Dlib: A fast Machine Learning library implemented in C++ for solving real-world data problems
  • MITIE: A library and tool for information extraction on text data; it is built on top of Dlib, with bindings for languages like Java and Python
  • GraphStream: GraphStream is a Java library for analysing and visualising dynamic graphs
  • Cytoscape: A complex network (graph) visualization tool in Java
  • Gephi: A network visualisation and analysis tool in Java
  • SocNetV: A handy social network visualisation tool
  • Visone: Yet another handy social network analysis and visualisation tool
  • Flashlight: A fast Machine Learning library in C++
  • Machine Learning with Python: A collection of ML algorithms and their sample use-cases implemented in Python
  • TANAGRA: "TANAGRA is a free DATA MINING software for academic and research purposes" - its website
  • KNIME: KNIME is an open-source data analytics, reporting and data integration platform
  • MG4J: An open-source, high-performance full-text search engine written in Java
  • WebGraph: A Java framework for working on huge graphs
  • RTree: Reactive implementations of immutable in-memory R-tree and R*-tree in Java
  • Recommender Systems: A useful repository of stuff all about the Recommender Systems (e.g. best practices to build Recommender Systems)
  • Awesome-Graph: A curated list of resources (e.g., libraries, frameworks and databases) related to graphs
  • Parallel Graph AnalytiX (PGX): A graph processing and analytics toolbox from Oracle which is written in Java
  • ROOT: A scientific toolbox for data processing and analysis in C++
  • Stanford Topic Modeling Toolbox (TMT): TMT is a nice Java toolkit for topic modelling on textual data
  • Java Data Mining Package: An open-source Java package for mining massive datasets, implementing a vast collection of algorithms (e.g., clustering, regression, classification and graphical models)
  • ScalaNLP: A numerical computation and Data Mining library suite written in Scala, with an emphasis on NLP
  • Vegas: A very flexible declarative data visualisation library in Scala that works with Apache Spark right out of the box
  • DeepLearning.scala: A simple Scala library for creating complex artificial neural networks by ThoughtWorks
  • XAPIAN: An open-source search engine library with bindings for many high-level programming languages, for example, Python, Java, and Lua!
  • DataMelt: "DataMelt is a free software for numeric computation, mathematics, statistics, symbolic calculations, data analysis and data visualisation" - DataMelt's website
  • Luna: A functional programming language to create data processing friendly programs in a WYSIWYG way
  • NetLogo: A computational multi-agent development and simulation environment, very cool tool for investigating complex phenomena via implementing simple computational rules for agents!
  • LabPlot: LabPlot is a lovely application for data analysis and plotting, it is part of KDE Project!
  • Meta Toolkit: A fast software toolkit implementing many useful ML algorithms, it is written in C++
  • Record Linkage Tools: A collection of useful resources for record deduplication and linkage
  • Gunrock: A GPU based graph analytics and processing library, it works with CUDA
  • Papers on Graph Analytics: A thorough list of publications related to graphs covering many interesting topics
  • GraphIt: GraphIt - "A High-Performance Domain Specific Language for Graph Analytics" - GraphIt's website
  • SMORe: A handy tool and library for fast weighted graph embedding in C++
  • Warp-ctc: A fast parallel implementation of CTC, for both CPU and GPU
  • Grew: Grew is a graph library and tool written in OCaml with applications in NLP; it is a companion tool for the book Application of Graph Rewriting to Natural Language Processing
  • ZVTM: A handy graph visualisation library for Java
  • mrJob: A Python library to create MapReduce jobs and run them on multiple machines (i.e., in a cluster)
  • Metanome: A collection of interesting materials (e.g., algorithms, code, articles) related to data profiling
  • Graphillion: Graphillion is a software library for working with many graphs in a parallel fashion
  • Awesome graph classification: A very thorough collection of graph embedding, classification and representation learning papers with the code!
  • VFML: Very Fast ML (hence the name VFML) is a fast C library for mining very large data streams
  • Talisman: Talisman is a modular JavaScript library for NLP and Machine Learning activities
  • StyleGAN: StyleGAN is a TensorFlow implementation of a GAN architecture proposed by NVIDIA; you can use it to create photo-realistic pictures of people who don't exist!
  • Java String Similarity: A Java library implementing a collection of useful text similarity/distance measures
  • Label Studio: Label Studio is a handy tool with a nice UI for labelling your data (e.g., records and documents)
  • GraphML: GraphML is a graph representation and serialisation file format based on XML that could store many different types of graphs with their attributes without loss of information
  • Taco: A compiler for compiling and executing general tensor algebra operations on sparse tensors in machine code for CPUs and GPUs
  • Libspatialindex: Libspatialindex contains many robust geolocational indexing algorithms like R*-tree and TPR-tree
  • NLP Best Practices: A collection of best practices and their examples in NLP domain from Microsoft
  • Tulip: Tulip is a nice open-source data visualisation and analysis software toolbox, it is especially good for working with graphs and graph datasets
  • Juno: Juno is an IDE based on Atom for Julia programming language
  • BoofCV: A real-time machine vision and image processing library in Java
  • cuDF: cuDF is a library with API similar to Pandas that is built based on the Apache Arrow columnar memory format, cuDF uses GPU routines for loading, joining, aggregating, filtering, and otherwise manipulating data
  • LASER toolkit: LASER (Language-Agnostic SEntence Representations) is a software toolkit for sentence embedding for about 100 different languages
  • Idyll: "A toolkit for creating data-driven stories and explorable explanations" - Idyll's website
  • DeepLearning4J: A Java-based software toolbox for building and training deep artificial neural networks
  • NeMo: NeMo is a software toolkit for building AI applications
  • TRAINS Agent: TRAINS Agent is a DevOps tool for setting up and running an AI experiment on a cluster computing environment
  • TensorFlow Hub: TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of deep learning models
  • AIX360: An explainable AI (XAI) toolkit to interpret Machine Learning models
  • Catalyst: Catalyst is a tool for making Deep Learning experiments on PyTorch reproducible
  • TensorFlowJS: TensorFlowJS is a JavaScript library to use TensorFlow models in web applications in the browser
  • Kst: Kst is a handy data visualisation tool from KDE project
  • AMIDST: AMIDST is a Java software toolbox for probabilistic modelling of data
  • LIBFFM: "LIBFFM is an open-source tool for field-aware factorisation machines (FFM)"; it has been used to win a few real-world data science challenges on Kaggle
  • jLDADMM: A Java package for LDA and DMM topic modelling

My Favourites

About

License: BSD 3-Clause "New" or "Revised" License


Languages

Jupyter Notebook 99.8%, Python 0.1%, Makefile 0.0%