mooneral / Copulas

A library to model multivariate data using copulas.

Home Page: https://sdv.dev/Copulas/


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table containing numerical data, we can use Copulas to learn the distribution and later on generate new synthetic rows following the same statistical properties.
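To make the idea concrete, here is a minimal conceptual sketch of what "learning the distribution and generating new rows" can look like with a Gaussian copula on two columns. It uses only the Python standard library and is not how the Copulas library is implemented: the helper names (`to_normal_scores`, `pearson`, `empirical_quantile`, `fit_and_sample`) are hypothetical, the marginals are handled with plain empirical ranks and quantiles, and the toy data is made up for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def to_normal_scores(column):
    """Map each value to a standard-normal score through its empirical rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined.
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_column, u):
    """Return the observed value sitting at probability level u."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def fit_and_sample(x, y, num_rows, rng):
    """Learn the dependence between x and y, then draw synthetic rows."""
    # "Fit": the correlation of the normal scores describes the dependence.
    rho = pearson(to_normal_scores(x), to_normal_scores(y))
    sorted_x, sorted_y = sorted(x), sorted(y)
    rows = []
    for _ in range(num_rows):
        # "Sample": correlated normals -> uniforms -> original marginals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_x, u1),
                     empirical_quantile(sorted_y, u2)))
    return rows


# Toy "real" table: a skewed column and a second column that depends on it.
rng = random.Random(0)
real_x = [rng.expovariate(1.0) for _ in range(2000)]
real_y = [2 * value + rng.gauss(0, 0.5) for value in real_x]

synthetic = fit_and_sample(real_x, real_y, 2000, rng)
synthetic_x = [row[0] for row in synthetic]
synthetic_y = [row[1] for row in synthetic]
print("real correlation:     ", round(pearson(real_x, real_y), 3))
print("synthetic correlation:", round(pearson(synthetic_x, synthetic_y), 3))
```

The synthetic rows reuse the observed marginal values but are recombined according to the learned dependence, which is why their correlation should stay close to that of the real data. The library itself supports parametric marginals and more than two columns.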

Important Links
  • 💻 Website: Check out the SDV Website for more information about the project.
  • 📙 SDV Blog: Regular publishing of useful content about Synthetic Data Generation.
  • 📖 Documentation: Quickstarts, User and Development Guides, and API Reference.
  • Repository: The link to the GitHub repository of this library.
  • 📜 License: The entire ecosystem is published under the MIT License.
  • ⌨️ Development Status: This software is in its Pre-Alpha stage.
  • Community: Join our Slack Workspace for announcements and discussions.
  • Tutorials: Run the SDV Tutorials in a Binder environment.

Features

Some of the features provided by this library include:

  • A variety of distributions for modeling univariate data.
  • Multiple Archimedean copulas for modeling bivariate data.
  • Gaussian and Vine copulas for modeling multivariate data.
  • Automatic selection of univariate distributions and bivariate copulas.
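As an illustration of what automatic selection of a univariate distribution can involve, the sketch below fits two candidate distributions to a sample and keeps the one with the smallest Kolmogorov-Smirnov distance to the data. This is an assumption-laden, standard-library-only illustration of the general idea, not the library's own code: the names `ks_statistic`, `fit_gaussian`, `fit_uniform`, `CANDIDATES`, and `select_distribution` are hypothetical, and only two candidates are considered.

```python
import random
from statistics import NormalDist, fmean, pstdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a candidate CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    gap = 0.0
    for i, value in enumerate(ordered):
        model = cdf(value)
        # The empirical CDF jumps from i/n to (i+1)/n at this value.
        gap = max(gap, abs(model - i / n), abs((i + 1) / n - model))
    return gap


def fit_gaussian(sample):
    """Fit a Gaussian by its mean and standard deviation; return its CDF."""
    return NormalDist(fmean(sample), pstdev(sample)).cdf


def fit_uniform(sample):
    """Fit a uniform distribution on [min, max]; return its CDF."""
    low, high = min(sample), max(sample)
    width = high - low
    return lambda value: min(max((value - low) / width, 0.0), 1.0)


# Hypothetical candidate set for this sketch.
CANDIDATES = {"gaussian": fit_gaussian, "uniform": fit_uniform}


def select_distribution(sample):
    """Return the best-fitting candidate name and all candidate scores."""
    scores = {name: ks_statistic(sample, fit(sample))
              for name, fit in CANDIDATES.items()}
    return min(scores, key=scores.get), scores


rng = random.Random(42)
bell_shaped = [rng.gauss(10, 2) for _ in range(5000)]
flat = [rng.uniform(0, 5) for _ in range(5000)]

best_bell, bell_scores = select_distribution(bell_shaped)
best_flat, flat_scores = select_distribution(flat)
print(best_bell, bell_scores)
print(best_flat, flat_scores)
```

A smaller statistic means the candidate's CDF tracks the data more closely, so the bell-shaped sample should favor the Gaussian candidate and the flat sample the uniform one.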

Supported Distributions

Univariate

  • Beta
  • Gamma
  • Gaussian
  • Gaussian KDE
  • Log-Laplace
  • Student T
  • Truncated Gaussian
  • Uniform

Archimedean Copulas (Bivariate)

  • Clayton
  • Frank
  • Gumbel
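For readers new to Archimedean copulas, the following sketch draws pairs from a Clayton copula using the conditional inversion method and checks the result against the known relationship between the Clayton parameter and Kendall's tau, tau = theta / (theta + 2). It is a standard-library illustration under stated assumptions, not the library's implementation; `sample_clayton` and `kendall_tau` are hypothetical helper names, and the parameter value is chosen arbitrarily.

```python
import random


def sample_clayton(theta, num_samples, rng):
    """Draw (u, v) pairs from a Clayton copula with parameter theta > 0."""
    pairs = []
    for _ in range(num_samples):
        # 1 - random() lies in (0, 1], which avoids raising zero to a
        # negative power.
        u = 1.0 - rng.random()
        w = 1.0 - rng.random()
        # Invert the conditional CDF of v given u at probability level w.
        inner = u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0
        v = inner ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs


def kendall_tau(pairs):
    """Kendall's tau from concordant and discordant pairs (O(n^2))."""
    concordant = discordant = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            sign = ((pairs[i][0] - pairs[j][0])
                    * (pairs[i][1] - pairs[j][1]))
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)


theta = 2.0
rng = random.Random(7)
pairs = sample_clayton(theta, 1000, rng)

expected_tau = theta / (theta + 2.0)
observed_tau = kendall_tau(pairs)
print("expected tau:", expected_tau)
print("observed tau:", round(observed_tau, 3))
```

Both coordinates are uniform on the unit interval, so in practice each one would still be mapped through a fitted univariate distribution to obtain values on the original scale.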

Multivariate

  • Gaussian Copula
  • D-Vine
  • C-Vine
  • R-Vine

Install

Copulas is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.

Optionally, Copulas can also be installed as a standalone library using the following commands:

Using pip:

pip install copulas

Using conda:

conda install -c conda-forge copulas

For more installation options, please visit the Copulas Installation Guide.

Quickstart

In this short quickstart, we show how to model a multivariate dataset and then generate synthetic data that resembles it.

import warnings
warnings.filterwarnings('ignore')

from copulas.datasets import sample_trivariate_xyz
from copulas.multivariate import GaussianMultivariate
from copulas.visualization import compare_3d

# Load a dataset with 3 columns that are not independent
real_data = sample_trivariate_xyz()

# Fit a gaussian copula to the data
copula = GaussianMultivariate()
copula.fit(real_data)

# Sample synthetic data
synthetic_data = copula.sample(len(real_data))

# Plot the real and the synthetic data to compare
compare_3d(real_data, synthetic_data)

The output is a figure with two plots, one showing the real data and one showing the synthetic data that you just generated:

[Figure: Quickstart]

What's next?

For more details about Copulas and all its possibilities and features, please check the documentation site.

There you can also learn how to contribute to Copulas and help us develop new features or cool ideas.

Credits

Copulas is an open source project from the Data to AI Lab at MIT which has been built and maintained over the years by a team of contributors.




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
