Audio-Development-Tools (ADT)

This is a list of sound, audio and music development tools which contains machine learning, audio generation, audio signal processing, sound synthesis, spatial audio, music information retrieval, music generation, speech recognition, speech synthesis, singing voice synthesis and more.

Machine Learning (ML)
Audio Generation (AG)
Audio Signal Processing (ASP)
Sound Synthesis (SS)
Spatial Audio (SA)
Web Audio Processing (WAP)
Music Information Retrieval (MIR)
Music Generation (MG)
Speech Recognition (ASR)
Speech Synthesis (TTS)
Singing Voice Synthesis (SVS)

Machine Learning (ML)

librosa - Librosa is a python package for music and audio analysis. It provides the building blocks necessary to create music information retrieval systems.
Essentia - Essentia is an open-source C++ library for audio analysis and audio-based music information retrieval released under the Affero GPLv3 license. It contains an extensive collection of reusable algorithms which implement audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, and a large set of spectral, temporal, tonal and high-level music descriptors. C++ library for audio and music analysis, description and synthesis, including Python bindings.
DDSP - DDSP: Differentiable Digital Signal Processing. DDSP is a library of differentiable versions of common DSP functions (such as synthesizers, waveshapers, and filters). This allows these interpretable elements to be used as part of an deep learning model, especially as the output layers for audio generation.
MIDI-DDSP - MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling. MIDI-DDSP is a hierarchical audio generation model for synthesizing MIDI expanded from DDSP.
torchsynth - A GPU-optional modular synthesizer in pytorch, 16200x faster than realtime, for audio ML researchers.
aubio - aubio is a tool designed for the extraction of annotations from audio signals. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing midi streams from live audio.
audioFlux - audioFlux is a deep learning tool library for audio and music analysis, feature extraction. It supports dozens of time-frequency analysis transformation methods and hundreds of corresponding time-domain and frequency-domain feature combinations. It can be provided to deep learning networks for training, and is used to study various tasks in the audio field such as Classification, Separation, Music Information Retrieval(MIR) and ASR etc.
Polymath - Polymath uses machine learning to convert any music library (e.g from Hard-Drive or YouTube) into a music production sample-library. The tool automatically separates songs into stems (beats, bass, etc.), quantizes them to the same tempo and beat-grid (e.g. 120bpm), analyzes musical structure (e.g. verse, chorus, etc.), key (e.g C4, E3, etc.) and other infos (timbre, loudness, etc.), and converts audio to midi. The result is a searchable sample library that streamlines the workflow for music producers, DJs, and ML audio developers.
IPython - IPython provides a rich toolkit to help you make the most of using Python interactively.
torchaudio - an audio library for PyTorch. Data manipulation and transformation for audio signal processing, powered by PyTorch.
torch-audiomentations - Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.
PyTorch Audio Augmentations - Audio data augmentations library for PyTorch for audio in the time-domain.
Asteroid - Asteroid is a Pytorch-based audio source separation toolkit that enables fast experimentation on common datasets. It comes with a source code that supports a large range of datasets and architectures, and a set of recipes to reproduce some important papers.
Kapre - Kapre: Keras Audio Preprocessors. Keras Audio Preprocessors - compute STFT, InverseSTFT, Melspectrogram, and others on GPU real-time.
praudio - Audio preprocessing framework for Deep Learning audio applications.
DeepAFx - DeepAFx: Deep Audio Effects. Audio signal processing effects (FX) are used to manipulate sound characteristics across a variety of media. Many FX, however, can be difficult or tedious to use, particularly for novice users. In our work, we aim to simplify how audio FX are used by training a machine to use FX directly and perform automatic audio production tasks. By using familiar and existing tools for processing and suggesting control parameters, we can create a unique paradigm that blends the power of AI with human creative control to empower creators.
nnAudio - nnAudio is an audio processing toolbox using PyTorch convolutional neural network as its backend. By doing so, spectrograms can be generated from audio on-the-fly during neural network training and the Fourier kernels (e.g. or CQT kernels) can be trained.
WavEncoder - WavEncoder is a Python library for encoding audio signals, transforms for audio augmentation, and training audio classification models with PyTorch backend.
SciPy - SciPy (pronounced "Sigh Pie") is an open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more.
pyAudioAnalysis - Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.
Mutagen - Mutagen is a Python module to handle audio metadata. It supports ASF, FLAC, MP4, Monkey’s Audio, MP3, Musepack, Ogg Opus, Ogg FLAC, Ogg Speex, Ogg Theora, Ogg Vorbis, True Audio, WavPack, OptimFROG, and AIFF audio files. All versions of ID3v2 are supported, and all standard ID3v2.4 frames are parsed. It can read Xing headers to accurately calculate the bitrate and length of MP3s. ID3 and APEv2 tags can be edited regardless of audio format. It can also manipulate Ogg streams on an individual packet/page level.
LibXtract - LibXtract is a simple, portable, lightweight library of audio feature extraction functions. The purpose of the library is to provide a relatively exhaustive set of feature extraction primatives that are designed to be 'cascaded' to create a extraction hierarchies.
dejavu - Audio fingerprinting and recognition in Python. Dejavu can memorize audio by listening to it once and fingerprinting it. Then by playing a song and recording microphone input or reading from disk, Dejavu attempts to match the audio against the fingerprints held in the database, returning the song being played.
Matchering - 🎚️ Open Source Audio Matching and Mastering. Matchering 2.0 is a novel Containerized Web Application and Python Library for audio matching and mastering.
TimeSide - TimeSide is a python framework enabling low and high level audio analysis, imaging, transcoding, streaming and labelling. Its high-level API is designed to enable complex processing on very large datasets of any audio or video assets with a plug-in architecture, a secure scalable backend and an extensible dynamic web frontend.
Meyda - Meyda is a Javascript audio feature extraction library. Meyda supports both offline feature extraction as well as real-time feature extraction using the Web Audio API. We wrote a paper about it, which is available here.
Audiomentations - A Python library for audio data augmentation. Inspired by albumentations. Useful for deep learning. Runs on CPU. Supports mono audio and multichannel audio. Can be integrated in training pipelines in e.g. Tensorflow/Keras or Pytorch. Has helped people get world-class results in Kaggle competitions. Is used by companies making next-generation audio products.
soundata - Python library for downloading, loading & working with sound datasets.

Audio Generation (AG)

Bark - Bark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.
ArchiSound - Audio generation using diffusion models, in PyTorch.
NeuralSound - Learning-based Modal Sound Synthesis with Acoustic Transfer.
RAVE - RAVE: Realtime Audio Variational autoEncoder. A variational autoencoder for fast and high-quality neural audio synthesis.
AudioLDM - AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.
Make-An-Audio - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models.
Moûsai - Moûsai: Text-to-Audio with Long-Context Latent Diffusion.
Im2Wav - Image Guided Audio Generation. We propose Im2Wav, an image guided open-domain audio generation system. Given an input image or a sequence of images, Im2Wav generates a semantically relevant sound.

Audio Signal Processing (ASP)

SouPyX - SouPyX is a very colourful space for audio exploration, suitable for research and exploration in a variety of audio fields. In SouPyX you can carry out research and exploration in audio processing, sound synthesis, audio effects, spatial audio, audio visualisation, AI audio and much more.
SoundFile - SoundFile is an audio library based on libsndfile, CFFI and NumPy.
Audio DSPy - audio_dspy is a Python package for audio signal processing tools.
pyAudioDspTools - pyAudioDspTools is a python 3 package for manipulating audio by just using numpy.
wave - The wave module provides a convenient interface to the WAV sound format. It does not support compression/decompression, but it does support mono/stereo.
Pedalboard - Pedalboard is a Python library for working with audio: reading, writing, adding effects, and more. It supports most popular audio file formats and a number of common audio effects out of the box, and also allows the use of VST3 and Audio Unit formats for third-party plugins.
PyAudio - PyAudio provides Python bindings for PortAudio v19, the cross-platform audio I/O library. With PyAudio, you can easily use Python to play and record audio on a variety of platforms, such as GNU/Linux, Microsoft Windows, and Apple macOS.
PortAudio - PortAudio is a free, cross-platform, open-source, audio I/O library. It lets you write simple audio programs in 'C' or C++ that will compile and run on many platforms including Windows, Macintosh OS X, and Unix (OSS/ALSA). It is intended to promote the exchange of audio software between developers on different platforms. Many applications use PortAudio for Audio I/O.
Pyo - pyo is a Python module written in C to help digital signal processing script creation.Python DSP module. With pyo, user will be able to include signal processing chains directly in Python scripts or projects, and to manipulate them in real time through the interpreter
tinytag - tinytag is a library for reading music meta data of most common audio files in pure python. Read audio and music meta data and duration of MP3, OGG, OPUS, MP4, M4A, FLAC, WMA, Wave and AIFF files with python 2 or 3.
Friture - Friture is an application to visualize and analyze live audio data in real-time. Friture displays audio data in several widgets, such as a scope, a spectrum analyzer, or a rolling 2D spectrogram.
sounddevice - This Python module provides bindings for the PortAudio library and a few convenience functions to play and record NumPy arrays containing audio signals.
Pydub - Manipulate audio with a simple and easy high level interface.
SoundCard - SoundCard is a library for playing and recording audio without resorting to a CPython extension. Instead, it is implemented using the wonderful CFFI and the native audio libraries of Linux, Windows and macOS.
TarsosDSP - TarsosDSP is a Java library for audio processing. Its aim is to provide an easy-to-use interface to practical music processing algorithms implemented, as simply as possible, in pure Java and without any other external dependencies.
JUCE - JUCE is an open-source cross-platform C++ application framework for creating high quality desktop and mobile applications, including VST, VST3, AU, AUv3, AAX and LV2 audio plug-ins and plug-in hosts. JUCE can be easily integrated with existing projects via CMake, or can be used as a project generation tool via the Projucer, which supports exporting projects for Xcode (macOS and iOS), Visual Studio, Android Studio, Code::Blocks and Linux Makefiles as well as containing a source code editor.
Q - Q is a cross-platform C++ library for Audio Digital Signal Processing. Aptly named after the “Q factor”, a dimensionless parameter that describes the quality of a resonant circuit, the Q DSP Library is designed to be simple and elegant, as the simplicity of its name suggests, and efficient enough to run on small microcontrollers.
BasicDSP - BasicDSP - A tool for processing audio / experimenting with signal processing.
DaisySP - A Powerful, Open Source DSP Library in C++.
Speech Signal Processing Toolkit (SPTK) - The Speech Signal Processing Toolkit (SPTK) is a suite of speech signal processing tools for UNIX environments, e.g., LPC analysis, PARCOR analysis, LSP analysis, PARCOR synthesis filter, LSP synthesis filter, vector quantization techniques, and other extended versions of them.
eDSP - eDSP (easy Digital Signal Processing) is a digital signal processing framework written in modern C++ that implements some of the common functions and algorithms frequently used in digital signal processing, audio engineering & telecommunications systems.
KFR - KFR is an open source C++ DSP framework that focuses on high performance. Fast, modern C++ DSP framework, FFT, Sample Rate Conversion, FIR/IIR/Biquad Filters (SSE, AVX, AVX-512, ARM NEON).
MWEngine - Audio engine and DSP for Android, written in C++ providing low latency performance within a musical context, while providing a Java/Kotlin API. Supports both OpenSL and AAudio.
LabSound - LabSound is a C++ graph-based audio engine. The engine is packaged as a batteries-included static library meant for integration in many types of software: games, visualizers, interactive installations, live coding environments, VST plugins, audio editing/sequencing applications, and more.
Gist - Gist is a C++ based audio analysis library.
Realtime_PyAudio_FFT - Realtime audio analysis in Python, using PyAudio and Numpy to extract and visualize FFT features from streaming audio.
Spectrum - Spectral Analysis in Python. Spectrum is a Python library that contains tools to estimate Power Spectral Densities based on Fourier transform, Parametric methods or eigenvalues analysis. The Fourier methods are based upon correlogram, periodogram and Welch estimates. Standard tapering windows (Hann, Hamming, Blackman) and more exotic ones are available (DPSS, Taylor, …).

Sound Synthesis (SS)

Csound - Csound is a sound and music computing system which was originally developed by Barry Vercoe in 1985 at MIT Media Lab. Since the 90s, it has been developed by a group of core developers.
Pure Data - Pure Data ( Pd ) is a visual programming language developed by Miller Puckette in the 1990s for creating interactive computer music and multimedia works. While Puckette is the main author of the program, Pd is an open-source project with a large developer base working on new extensions. It is released under BSD-3-Clause. It runs on Linux, MacOS, iOS, Android "Android (operating system)") and Windows. Ports exist for FreeBSD and IRIX.
plugdata - A visual programming environment for audio experimentation, prototyping and education.
Max/MSP/Jitter - Max , also known as Max/MSP/Jitter, is a visual programming language for music and multimedia developed and maintained by San Francisco-based software company Cycling '74. Over its more than thirty-year history, it has been used by composers, performers, software designers, researchers, and artists to create recordings, performances, and installations.
Kyma (sound design language) - Kyma is a visual programming language for sound design used by musicians, researchers, and sound designers. In Kyma, a user programs a multiprocessor DSP by graphically connecting modules on the screen of a Macintosh or Windows computer.
SuperCollider - SuperCollider is a platform for audio synthesis and algorithmic composition, used by musicians, artists, and researchers working with sound. An audio server, programming language, and IDE for sound synthesis and algorithmic composition.
Sonic Pi - Sonic Pi is a live coding environment based on Ruby "Ruby (programming language)"), originally designed to support both computing and music lessons in schools, developed by Sam Aaron in the University of Cambridge Computer Laboratory in collaboration with Raspberry Pi Foundation.
Reaktor - Reaktor is a graphical modular software music studio developed by Native Instruments (NI). It allows musicians and sound specialists to design and build their own instruments, samplers "Sampler (musical instrument)"), effects and sound design tools. It is supplied with many ready-to-use instruments and effects, from emulations of classic synthesizers to futuristic sound design tools.
RTcmix - RTcmix is a real-time software "language" for doing digital sound synthesis and signal-processing. It is written in C/C++, and is distributed open-source, free of charge.
ChucK - ChucK is a programming language for real-time sound synthesis and music creation. ChucK offers a unique time-based, concurrent programming model that is precise and expressive (we call this strongly-timed), dynamic control rates, and the ability to add and modify code on-the-fly. In addition, ChucK supports MIDI, OpenSoundControl, HID device, and multi-channel audio. It is open-source and freely available on MacOS X, Windows, and Linux. It's fun and easy to learn, and offers composers, researchers, and performers a powerful programming tool for building and experimenting with complex audio synthesis/analysis programs, and real-time interactive music.
Faust - Faust (Functional Audio Stream) is a functional programming language for sound synthesis and audio processing with a strong focus on the design of synthesizers, musical instruments, audio effects, etc. Faust targets high-performance signal processing applications and audio plug-ins for a variety of platforms and standards.
SOUL - The SOUL programming language and API. SOUL (SOUnd Language) is an attempt to modernise and optimise the way high-performance, low-latency audio code is written and executed.
Cmajor - Cmajor is a programming language for writing fast, portable audio software. You've heard of C, C++, C#, objective-C... well, Cmajor is a C-family language designed specifically for writing DSP signal processing code.
Gwion - Gwion is a programming language, aimed at making music. strongly inspired by ChucK, but adding a bunch high-level features; templating, first-class functions and more. It aims to be simple, small, fast, extendable and embeddable.
Elementary Audio - Elementary is a JavaScript framework and high performance audio engine that helps you build quickly and ship confidently. Declarative, functional framework for writing audio software on the web or for native apps.
Sound2Synth - Sound2Synth: Interpreting Sound via FM Synthesizer Parameters Estimation.
JSyn - JSyn is a modular audio synthesizer for Java by Phil Burk. JSyn allows you to develop interactive computer music programs in Java. It can be used to generate sound effects, audio environments, or music. JSyn is based on the traditional model of unit generators which can be connected together to form complex sounds.
Midica - Midica is an interpreter for a Music Programming Language. It translates source code to MIDI. But it can also be used as a MIDI Player, MIDI compiler or decompiler, Karaoke Player, ALDA Player, ABC Player, LilyPond Player or a MIDI File Analyzer. You write music with one of the supported languages (MidicaPL, ALDA or ABC).
Mercury - Mercury is a minimal and human-readable language for the live coding of algorithmic electronic music. All elements of the language are designed around making code more accessible and less obfuscating for the audience. This motivation stretches down to the coding style itself which uses clear descriptive names for functions and a clear syntax.
Alda - Alda is a text-based programming language for music composition. It allows you to write and play back music using only a text editor and the command line. The language’s design equally favors aesthetics, flexibility and ease of use.
Platonic Music Engine - The Platonic Music Engine is an attempt to create computer algorithms that superficially simulate the entirety of creative human culture, past, present, and future. It does so in an interactive manner allowing the user to choose various parameters and settings such that the final result will be unique to the user while still preserving the cultural idea that inspired the work.
pyo-tools - Repository of ready-to-use python classes for building audio effects and synths with pyo.
py-modular - Modular and experimental audio programming framework for Python. py-modular is a small, experimental audio programming environment for python. It is intended to be a base for exploration of new audio technologies and workflows. Most everything in py-modular is built around a node-based workflow, meaning small classes do small tasks and can be patched together to create full synthesizers or larger ideas.
Bach: Automated Composer's Helper - a cross-platform set of patches and externals for Max, aimed to bring the richness of computer-aided composition into the real-time world.
AudioKit - AudioKit is an audio synthesis, processing, and analysis platform for iOS, macOS (including Catalyst), and tvOS.
Twang - Library for pure Rust advanced audio synthesis.
Gensound - Pythonic audio processing and generation framework. The Python way to audio processing & synthesis.
OTTO - The OTTO is a digital hardware groovebox, with synths, samplers, effects and a sequencer with an audio looper. The interface is flat, modular and easy to use, but most of all, it aims to encourage experimentation.
Loris - Loris is a library for sound analysis, synthesis, and morphing, developed by Kelly Fitz and Lippold Haken at the CERL Sound Group. Loris includes a C++ class library, Python module, C-linkable interface, command line utilities, and documentation.
IanniX - IanniX is a graphical open-source sequencer, based on Iannis Xenakis works, for digital art. IanniX syncs via Open Sound Control (OSC) events and curves to your real-time environment.
Leipzig - A music composition library for Clojure and Clojurescript.
Nyquist - Nyquist is a sound synthesis and composition language offering a Lisp syntax as well as an imperative language syntax and a powerful integrated development environment.. Nyquist is an elegant and powerful system based on functional programming.
OpenMusic (OM) - OpenMusic (OM) is a visual programming language based on Lisp. Visual programs are created by assembling and connecting icons representing functions and data structures. Most programming and operations are performed by dragging an icon from a particular place and dropping it to an other place. Built-in visual control structures (e.g. loops) are provided, that interface with Lisp ones. Existing CommonLisp/CLOS code can easily be used in OM, and new code can be developed in a visual way.
ORCΛ - Orca is an esoteric programming language designed to quickly create procedural sequencers, in which every letter of the alphabet is an operation, where lowercase letters operate on bang, uppercase letters operate each frame.
Overtone - Overtone is an open source audio environment designed to explore new musical ideas from synthesis and sampling to instrument building, live-coding and collaborative jamming. We combine the powerful SuperCollider audio engine, with Clojure, a state of-the-art lisp, to create an intoxicating interactive sonic experience.
SEAM - Sustained Electro-Acoustic Music - Base. Sustained Electro-Acoustic Music is a project inspired by Alvise Vidolin and Nicola Bernardini.
Glicol - Glicol (an acronym for "graph-oriented live coding language") is a computer music language with both its language and audio engine written in Rust programming language, a modern alternative to C/C++. Given this low-level nature, Glicol can run on many different platforms such as browsers, VST plugins and Bela board. Glicol's synth-like syntax and powerful audio engine also make it possible to combine high-level synth or sequencer control with low-level sample-accurate audio synthesis, all in real-time.
PaperSynth - Handwritten text to synths! PaperSynth is a project that aims to read keywords you've written on a piece of paper and convert it into synthesizers you can play on the phone.
Neural Resonator VST - This is a VST plugin that uses a neural network to generate filters based on arbitrary 2D shapes and materials. It is possible to use midi to trigger simple impulses to excite these filters. Additionally any audio signal can be used as input to the filters.
Scyclone - Scyclone is an audio plugin that utilizes neural timbre transfer technology to offer a new approach to audio production. The plugin builds upon RAVE methodology, a realtime audio variational auto encoder, facilitating neural timbre transfer in both single and couple inference mode.
mlinmax - ML for sound generation and processing in Cycling '74's Max programming language.
ADLplug - FM Chip Synthesizer — OPL & OPN — VST/LV2/Standalone.
Surge - Synthesizer plug-in (previously released as Vember Audio Surge).
cStop - cStop is a tape stop audio effect plugin available in AU & VST3 for Mac (Windows coming soon).

Spatial Audio (SA)

spaudiopy - Spatial Audio Python Package. The focus (so far) is on spatial audio encoders and decoders. The package includes e.g. spherical harmonics processing and (binaural renderings of) loudspeaker decoders, such as VBAP and AllRAD.
Spatial_Audio_Framework (SAF) - The Spatial_Audio_Framework (SAF) is an open-source and cross-platform framework for developing spatial audio related algorithms and software in C/C++. Originally intended as a resource for researchers in the field, the framework has gradually grown into a rather large and well-documented codebase comprising a number of distinct modules; with each module targeting a specific sub-field of spatial audio (e.g. Ambisonics encoding/decoding, spherical array processing, amplitude-panning, HRIR processing, room simulation, etc.).
HO-SIRR - Higher-order Spatial Impulse Response Rendering (HO-SIRR) is a rendering method, which can synthesise output loudspeaker array room impulse responses (RIRs) using input spherical harmonic (Ambisonic/B-Format) RIRs of arbitrary order. A MATLAB implementation of the Higher-order Spatial Impulse Response Rendering (HO-SIRR) algorithm; an alternative approach for reproducing Ambisonic RIRs over loudspeakers.
SpatGRIS - SpatGRIS is a sound spatialization software that frees composers and sound designers from the constraints of real-world speaker setups. With the ControlGRIS plugin distributed with SpatGRIS, rich spatial trajectories can be composed directly in your DAW and reproduced in real-time on any speaker layout. It is fast, stable, cross-platform, easy to learn and works with the tools you already know. SpatGRIS supports any speaker setup, including 2D layouts like quad, 5.1 or octophonic rings, and 3D layouts like speaker domes, concert halls, theatres, etc. Projects can also be mixed down to stereo using a binaural head-related transfer function or simple stereo panning.
Steam Audio - Steam Audio delivers a full-featured audio solution that integrates environment and listener simulation. HRTF significantly improves immersion in VR; physics-based sound propagation completes aural immersion by consistently recreating how sound interacts with the virtual environment.
libmysofa - Reader for AES SOFA files to get better HRTFs.
Omnitone - Omnitone: Spatial Audio Rendering on the Web. Omnitone is a robust implementation of ambisonic decoding and binaural rendering written in Web Audio API. Its rendering process is powered by the fast native features from Web Audio API (GainNode and Convolver), ensuring the optimum performance. The implementation of Omnitone is based on the Google spatial media specification and SADIE's binaural filters. It also powers Resonance Audio SDK for web.
Mach1 Spatial - Mach1 Spatial SDK includes APIs to allow developers to design applications that can encode or pan to a spatial audio render from audio streams and/or playback and decode Mach1Spatial 8channel spatial audio mixes with orientation to decode the correct stereo output sum of the user's current orientation. Additionally the Mach1 Spatial SDK allows users to safely convert surround/spatial audio mixes to and from the Mach1Spatial or Mach1Horizon VVBP formats.
SoundSpaces - SoundSpaces is a realistic acoustic simulation platform for audio-visual embodied AI research. From audio-visual navigation, audio-visual exploration to echolocation and audio-visual floor plan reconstruction, this platform expands embodied vision research to a broader scope of topics.
Visual Acoustic Matching - We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.

Web Audio Processing (WAP)

WebRTC Audio Processing - Python binding of WebRTC Audio Processing.
MIDI.js - 🎹 Making life easy to create a MIDI-app on the web. Includes a library to program synesthesia into your app for memory recognition or for creating trippy effects. Convert soundfonts for Guitar, Bass, Drums, ect. into code that can be read by the browser. MIDI.js ties together, and builds upon frameworks that bring MIDI to the browser. Combine it with jasmid to create a web-radio MIDI stream similar to this demo, or with Three.js, Sparks.js, or GLSL to create Audio/visual experiments.
Web Voice Processor - A library for real-time voice processing in web browsers.
Tone.js - Tone.js is a Web Audio framework for creating interactive music in the browser. The architecture of Tone.js aims to be familiar to both musicians and audio programmers creating web-based audio applications. On the high-level, Tone offers common DAW (digital audio workstation) features like a global transport for synchronizing and scheduling events as well as prebuilt synths and effects. Additionally, Tone provides high-performance building blocks to create your own synthesizers, effects, and complex control signals.
audio.js - audiojs is a drop-in javascript library that allows HTML5's <audio> tag to be used anywhere. It uses native <audio> where available and falls back to an invisible flash player to emulate it for other browsers. It also serves a consistent html player UI to all browsers which can be styled used standard css.
howler.js - Javascript audio library for the modern web. howler.js makes working with audio in JavaScript easy and reliable across all platforms. howler.js is an audio library for the modern web. It defaults to Web Audio API and falls back to HTML5 Audio. This makes working with audio in JavaScript easy and reliable across all platforms.
CoffeeCollider - CoffeeCollider is a language for real time audio synthesis and algorithmic composition in HTML5. The concept of this project is designed as "write CoffeeScript, and be processed as SuperCollider."
pico.js - Audio processor for the cross-platform.
timbre.js - Timbre.js provides a functional processing and synthesizing audio in your web apps with modern JavaScript's way like jQuery or node.js. It has many T-Object (formally: Timbre Object) that connected together to define the graph-based routing for overall audio rendering. It is a goal of this project to approach the next generation audio processing for web.
Rythm.js - A javascript library that makes your page dance.
p5.sound - p5.sound extends p5 with Web Audio functionality including audio input, playback, analysis and synthesis.
Ableton.js - Ableton.js lets you control your instance or instances of Ableton using Node.js. It tries to cover as many functions as possible.
Sound.js - "Sound.js" is micro-library that lets you load, play, and generate sound effects and music for games and interactive applications. It's very small: less than 800 lines of code and no dependencies. Click here to try an interactive demo. You can use it as-as, or integrate it into your existing framework.
tuna - An audio effects library for the Web Audio API.
XSound - XSound gives Web Developers Powerful Audio Features Easily !
Pizzicato - A web audio Javascript library. Pizzicato aims to simplify the way you create and manipulate sounds via the Web Audio API. Take a look at the demo site here. Library to simplify the way you create and manipulate sounds with the Web Audio API.
AudioMass - Free full-featured web-based audio & waveform editing tool.
WebPd - Run your Pure Data patches on the web. WebPd is a compiler for the Pure Data audio programming language allowing to run .pd patches in web pages.
DX7 Synth JS - DX7 FM synthesis using the Web Audio and Web MIDI API. Works in Chrome and Firefox. Use a MIDI or QWERTY keyboard to play the synth.
WEBMIDI.js - WEBMIDI.js makes it easy to interact with MIDI instruments directly from a web browser or from Node.js. It simplifies the control of physical or virtual MIDI instruments with user-friendly functions such as playNote(), sendPitchBend() or sendControlChange(). It also allows reacting to inbound MIDI messages by adding listeners for events such as "noteon", "pitchbend" or "programchange".

Music Information Retrieval (MIR)

Madmom - Madmom is an audio signal processing library written in Python with a strong focus on music information retrieval (MIR) tasks.
Beets - Beets is the media library management system for obsessive music geeks. music library manager and MusicBrainz tagger.
Mido - MIDI Objects for Python. Mido is a library for working with MIDI messages and ports.
mirdata - Python library for working with Music Information Retrieval (MIR) datasets.
Midifile - C++ classes for reading/writing Standard MIDI Files.
MSAF - Music Structure Analysis Framework. A Python framework to analyze music structure. MSAF is a python package for the analysis of music structural segmentation algorithms. It includes a set of features, algorithms, evaluation metrics, and datasets to experiment with.
mxml - MusicXML parsing and layout library. mxml is a C++ parser and layout generator for MusicXML files.
Open-Unmix - Open-Unmix, Music Source Separation for PyTorch. Open-Unmix , is a deep neural network reference implementation for music source separation, applicable for researchers, audio engineers and artists. Open-Unmix provides ready-to-use models that allow users to separate pop music into four stems: vocals , drums , bass and the remaining other instruments.
Spleeter - Spleeter is Deezer source separation library with pretrained models written in Python and uses Tensorflow. It makes it easy to train source separation model (assuming you have a dataset of isolated sources), and provides already trained state of the art model for performing various flavour of separation.
AMPACT - Automatic Music Performance Analysis and Comparison Toolkit.
crema - convolutional and recurrent estimators for music analysis.
MIDIcontroller - A library for creating Teensy MIDI controllers with support for hold or latch buttons, potentiometers, encoders, capacitive sensors, Piezo transducers and other velocity sensitive inputs with aftertouch.
MIDI Explorer - Yet another MIDI monitor, analyzer, debugger and manipulation tool.

Music Generation (MG)

isobar - isobar is a Python library for creating and manipulating musical patterns, designed for use in algorithmic composition, generative music and sonification. It makes it quick and easy to express complex musical ideas, and can send and receive events from various different sources including MIDI, MIDI files, and OSC.
MusPy - MusPy is an open source Python library for symbolic music generation. It provides essential tools for developing a music generation system, including dataset management, data I/O, data preprocessing and model evaluation.
music21 - music21 is a Toolkit for Computational Musicology.
Msanii - Msanii: High Fidelity Music Synthesis on a Shoestring Budget.
MusicLM - MusicLM: Generating Music From Text.
SingSong - SingSong: Generating musical accompaniments from singing.
Riffusion App - Riffusion is an app for real-time music generation with stable diffusion.
Mozart - An optical music recognition (OMR) system. Converts sheet music to a machine-readable version. The aim of this project is to develop a sheet music reader. This is called Optical Music Recognition (OMR). Its objective is to convert sheet music to a machine-readable version. We take a simplified version where we convert an image of sheet music to a textual representation that can be further processed to produce midi files or audio files like wav or mp3.
Muzic - Muzic: Music Understanding and Generation with Artificial Intelligence. Muzic is a research project on AI music that empowers music understanding and generation with deep learning and artificial intelligence. Muzic is pronounced as [ˈmjuːzeik] and '谬贼客' (in Chinese).
Jukebox - Code for the paper "Jukebox: A Generative Model for Music". We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.
MidiTok - A convenient MIDI / symbolic music tokenizer for Deep Learning networks, with multiple strategies .🎶
SCAMP - SCAMP is an computer-assisted composition framework in Python designed to act as a hub, flexibly connecting the composer-programmer to a wide variety of resources for playback and notation. SCAMP allows the user to manage the flow of musical time, play notes either using FluidSynth or via MIDI or OSC messages to an external synthesizer, and ultimately quantize and export the result to music notation in the form of MusicXML or Lilypond. Overall, the framework aims to address pervasive technical challenges while imposing as little as possible on the aesthetic choices of the composer-programmer.
Facet - Facet is an open-source live coding system for algorithmic music. With a code editor in the browser and a NodeJS server running locally on your machine, Facet can generate and sequence audio and MIDI data in real-time.Facet is a live coding system for algorithmic music.
Mingus - Mingus is a music package for Python. Mingus is a package for Python used by programmers, musicians, composers and researchers to make and analyse music.
Audeo - Audeo is a novel system that gets as an input video frames of a musician playing the piano and generates the music for that video. Generation of music from visual cues is a challenging problem and it is not clear whether it is an attainable goal at all. Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events. To achieve the transformation we built a full pipeline named Audeo containing three components. We first translate the video frames of the keyboard and the musician hand movements into raw mechanical musical symbolic representation Piano-Roll (Roll) for each video frame which represents the keys pressed at each time step. We then adapt the Roll to be amenable for audio synthesis by including temporal correlations. This step turns out to be critical for meaningful audio generation. As a last step, we implement Midi synthesizers to generate realistic music. Audeo converts video to audio smoothly and clearly with only a few setup constraints.
libatm - libatm is a library for generating and working with MIDI files. It was purpose-built for All the Music, LLC to assist in its mission to enable musicians to make all of their music without the fear of frivolous copyright lawsuits. All code is released into the public domain via the Creative Commons Attribution 4.0 International License. If you're looking for a command line tool to generate and work with MIDI files, check out the atm-cli project that utilizes this library. For more information on All the Music, check out allthemusic.info. For more detailed library documentation, check out the crate documentation here.
Davidic - A minimalist procedural music creator. Randomly generate musical scale, MIDI instrument(s), chord progression, and rhythm, then lock-in what you like and regenerate to refine. Advanced controls: chord progressions and rhythms can be manually specified after selecting the Advanced Controls toggle, but UI support is minimal. Suggested usage is restricted to tweaking randomly-generated starting points.
PyMusicLooper - A script for creating seamless music loops, with play/export support.
ChatGPT2midi - CLI Program for generating chord progressions with ChatGPT.

Speech Recognition (ASR)

Kaldi - Kaldi is a toolkit for speech recognition, intended for use by speech recognition researchers and professionals.
PaddleSpeech - Easy-to-use Speech Toolkit including SOTA/Streaming ASR with punctuation, influential TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting.
NVIDIA NeMo - NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS). The primary objective of NeMo is to help researchers from industry and academia to reuse prior work (code and pretrained models) and make it easier to create new conversational AI models.
Whisper - Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
Transformers - 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Julius - Open-Source Large Vocabulary Continuous Speech Recognition Engine. "Julius" is a high-performance, small-footprint large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. The main platform is Linux and other Unix-based system, as well as Windows, Mac, Androids and other platforms.
audino - audino is an open source audio annotation tool. It provides features such as transcription and labeling which enables annotation for Voice Activity Detection (VAD), Diarization, Speaker Identification, Automated Speech Recognition, Emotion Recognition tasks and more.
Wenet - Wenet is an tansformer-based end-to-end ASR toolkit.
SpeechBrain - SpeechBrain is an open-source and all-in-one conversational AI toolkit based on PyTorch. The goal is to create a single , flexible , and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies , including systems for speech recognition , speaker recognition , speech enhancement , speech separation , language identification , multi-microphone signal processing , and many others.
ESPnet - ESPnet is an end-to-end speech processing toolkit, mainly focuses on end-to-end speech recognition and end-to-end text-to-speech. ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses pytorch as a deep learning engine and also follows Kaldi style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
Espresso - Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq.
Leon - 🧠 Leon is your open-source personal assistant.
DeepSpeech - DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
annyang - annyang is a tiny javascript library that lets your visitors control your site with voice commands. annyang supports multiple languages, has no dependencies, weighs just 2kb and is free to use.
PocketSphinx - This is PocketSphinx, one of Carnegie Mellon University's open source large vocabulary, speaker-independent continuous speech recognition engines.
Kara - Open Source Voice Assistant. Simply put, Kara is a voice assistant that steals 0% of your data so you stay free! She is a actively maintained, modular, and designed to customize.
Voice Lab - Voice Lab is an automated voice analysis software. What this software does is allow you to measure, manipulate, and visualize many voices at once, without messing with analysis parameters. You can also save all of your data, analysis parameters, manipulated voices, and full colour spectrograms and power spectra, with the press of one button.

Speech Synthesis (TTS)

VALL-E - VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.
VITS - VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text.
NeuralSpeech - NeuralSpeech is a research project in Microsoft Research Asia focusing on neural network based speech processing, including automatic speech recognition (ASR), text to speech (TTS), etc.
Real-Time Voice Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time. This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real-time. SV2TTS is a deep learning framework in three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. In the second and third stages, this representation is used as reference to generate speech given arbitrary text.
WaveNet - A TensorFlow implementation of DeepMind's WaveNet paper. The WaveNet neural network architecture directly generates a raw audio waveform, showing excellent results in text-to-speech and general audio generation (see the DeepMind blog post and paper for details).
FastSpeech 2 - An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech".
MelGAN - Generative Adversarial Networks for Conditional Waveform Synthesis.
HiFi-GAN - HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.
edge-tts - Use Microsoft Edge's online text-to-speech service from Python (without needing Microsoft Edge/Windows or an API key).
Vocode - Vocode is an open-source library for building voice-based LLM applications.
TTS-dataset-tools - Automatically generates TTS dataset using audio and associated text. Make cuts under a custom length. Uses Google Speech to text API to perform diarization and transcription or aeneas to force align text to audio.

Singing Voice Synthesis (SVS)

NNSVS - Neural network-based singing voice synthesis library for research.
Muskit - Muskit is an open-source music processing toolkit. Currently we mostly focus on benchmarking the end-to-end singing voice synthesis and expect to extend more tasks in the future. Muskit employs pytorch as a deep learning engine and also follows ESPnet and Kaldi style data processing, and recipes to provide a complete setup for various music processing experiments.
OpenUtau - Open singing synthesis platform / Open source UTAU successor.
so-vits-svc - SoftVC VITS Singing Voice Conversion.
Retrieval-based-Voice-Conversion-WebUI - An easy-to-use SVC framework based on VITS.
Sinsy - Sinsy is an HMM/DNN-based singing voice synthesis system. You can generate a singing voice sample by uploading the musical score (MusicXML) to this website.
DiffSinger - DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism.
lessampler - lessampler is a Singing Voice Synthesizer. It provides complete pitch shifting, time stretching and other functions. Support multiple interface calls such as UTAU, Library, and Shine.
Mellotron - Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.

Penny-Admixture / Audio-Development-Tools