wywywy01 / cleanlab

šŸµ Finding label errors in datasets and learning with noisy labels.

Home Page: https://pypi.org/project/cleanlab/

cleanlab is a machine learning Python package for learning with noisy labels and finding label errors in datasets. cleanlab CLEANs LABels. It is powered by the theory of confident learning.

cleanlab finds and cleans label errors in any dataset by characterizing label noise, using state-of-the-art algorithms for learning with noisy labels. cleanlab is fast: it is built on optimized algorithms and is automatically parallelized across CPU threads. cleanlab implements the family of theory and algorithms called confident learning, with provable guarantees of exact noise estimation and label error finding (even when model output probabilities are noisy or imperfect).

How does confident learning work? Find out here: TUTORIAL: confident learning with just numpy and for-loops.

cleanlab supports multi-label, multiclass, sparse matrices, and more.

It's called cleanlab because it CLEANs LABels.

cleanlab is:

  1. fast - Single-shot, non-iterative, parallelized algorithms (e.g. < 1 second to find label errors in ImageNet)
  2. robust - Provable generalization and risk minimization guarantees, including under imperfect probability estimation.
  3. general - Works with any probabilistic classifier: PyTorch, TensorFlow, MXNet, Caffe2, scikit-learn, etc.
  4. unique - The only package for multiclass learning with noisy labels or finding label errors for any dataset / classifier.

Find label errors with PyTorch, TensorFlow, MXNet, etc. in 1 line of code!

Learning with noisy labels in 3 lines of code!

Check out these examples and tests (includes how to use PyTorch, FastText, etc.).

Installation

Python 2.7, 3.4, 3.5, and 3.6 are supported.

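Stable release:

  pip install cleanlab

Developer (unstable) release:

  pip install git+https://github.com/cgnorthcutt/cleanlab.git

To install the codebase (enabling you to make modifications):

  git clone https://github.com/cgnorthcutt/cleanlab.git
  cd cleanlab
  pip install -e .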

Citations and Related Publications

If you use this package in your work, please cite the following:

@misc{northcutt2019cleanlab,
  author = {Curtis Northcutt},
  title = {Clean Lab},
  year = {2019},
  howpublished = {\url{https://github.com/cgnorthcutt/cleanlab}},
  note = {commit xxxxxxx, version xxxx}
}

If you compare with, build on, or use confident learning (the theory and methods behind cleanlab), please cite the following. We will release a paper strictly on confident learning later this year, in addition to other related publications.

@inproceedings{northcutt2017rankpruning,
 author={Northcutt, Curtis G. and Wu, Tailin and Chuang, Isaac L.},
 title={Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels},
 booktitle = {Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence},
 series = {UAI'17},
 year = {2017},
 location = {Sydney, Australia},
 numpages = {10},
 url = {http://auai.org/uai2017/proceedings/papers/35.pdf},
 publisher = {AUAI Press},
} 

Collaboration

Most of the algorithms, theory, and results of cleanlab remain unpublished. If you'd like to work together, please reach out.

cleanlab on MNIST

We use cleanlab to automatically identify ~50 label errors in the MNIST dataset.

Label errors of the original MNIST train dataset identified algorithmically using the rankpruning algorithm. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence (probability of belonging to the given label), denoted conf in teal. The label with the largest predicted probability is in green. Overt errors are in red.
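As a minimal sketch of the ordering described in this caption, the snippet below ranks examples by increasing self-confidence, i.e. the predicted probability of each example's given label. The names `psx`, `s`, and `rank_by_self_confidence` and the toy numbers are illustrative assumptions, not taken from the MNIST experiment.

```python
# Hypothetical toy data: psx[i][k] is a model's predicted probability that
# example i belongs to class k; s[i] is the given (possibly wrong) label.
psx = [
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.30, 0.60, 0.10],
    [0.05, 0.15, 0.80],
]
s = [0, 1, 1, 2]


def rank_by_self_confidence(s, psx):
    """Return example indices sorted by increasing self-confidence."""
    # Self-confidence: probability the model assigns to the given label.
    self_conf = [psx[i][label] for i, label in enumerate(s)]
    order = sorted(range(len(s)), key=lambda i: self_conf[i])
    return order, self_conf


order, self_conf = rank_by_self_confidence(s, psx)
print(order)  # least confident labels come first
```

The examples at the front of `order` are the ones whose given labels the model trusts least, which is what the figure displays.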

cleanlab Generality: View performance across 4 distributions and 9 classifiers.

We use cleanlab to automatically learn with noisy labels regardless of dataset distribution or classifier.

Each figure depicts the decision boundary learned using cleanlab.classification.LearningWithNoisyLabels in the presence of extreme (~35%) label errors. Label errors are circled in green. Label noise is class-conditional (not simply uniformly random). Columns are organized by the classifier used, except the left-most column which depicts the ground-truth dataset distribution. Rows are organized by dataset used. A matrix characterizing the label noise for the first row is shown below.

Each figure depicts accuracy scores on a test set as decimal values:

  1. LEFT (in black): The classifier test accuracy trained with perfect labels (no label errors).
  2. MIDDLE (in blue): The classifier test accuracy trained with noisy labels using cleanlab.
  3. RIGHT (in white): The baseline classifier test accuracy trained with noisy labels.

As an example, this is the noise matrix (noisy channel) P(s | y) characterizing the label noise for the first dataset row in the figure. s represents the observed noisy labels and y represents the latent, true labels. The trace of this matrix is 2.6. A trace of 4 implies no label noise. A cell in this matrix is read like, "A random 38% of '3' labels were flipped to '2' labels."

p(s|y)   y=0    y=1    y=2    y=3
s=0      0.55   0.01   0.07   0.06
s=1      0.22   0.87   0.24   0.02
s=2      0.12   0.04   0.64   0.38
s=3      0.11   0.08   0.05   0.54
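The properties quoted for this matrix can be checked with a few lines of plain Python. The sketch below copies the values from the table; the variable names are illustrative. Each column is a distribution P(s | y = j), so it should sum to 1.

```python
# Noise matrix P(s | y) from the table: rows are observed labels s,
# columns are true labels y.
noise_matrix = [
    [0.55, 0.01, 0.07, 0.06],
    [0.22, 0.87, 0.24, 0.02],
    [0.12, 0.04, 0.64, 0.38],
    [0.11, 0.08, 0.05, 0.54],
]
m = len(noise_matrix)

# Each column is a probability distribution over s for a fixed true label y.
column_sums = [sum(noise_matrix[i][j] for i in range(m)) for j in range(m)]

# The trace sums the probabilities that a label is left unchanged.
trace = sum(noise_matrix[k][k] for k in range(m))

# P(s=2 | y=3): the fraction of true '3' labels observed as '2'.
flip_3_to_2 = noise_matrix[2][3]

print(column_sums, round(trace, 2), flip_3_to_2)
```

With no label noise the matrix would be the identity, which is why a trace of 4 corresponds to clean labels for four classes.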

The code to reproduce this figure is available here.

Get started with easy, quick examples.

New to cleanlab? Start with:

  1. Visualizing confident learning
  2. A simple example of learning with noisy labels on the multiclass Iris dataset.

These examples show how easy it is to characterize label noise in datasets, learn with noisy labels, identify label errors, estimate latent priors and noisy channels, and more.

Use cleanlab with any model (TensorFlow, Caffe2, PyTorch, etc.)

All of the features of the cleanlab package work with any model. Yes, any model. Feel free to use PyTorch, TensorFlow, Caffe2, scikit-learn, MXNet, etc. If you use a scikit-learn classifier, all cleanlab methods will work out-of-the-box. It's also easy to use your favorite model from a non-scikit-learn package: just wrap your model in a Python class that inherits from sklearn.base.BaseEstimator.

Technically, you don't need to inherit from sklearn.base.BaseEstimator: any class that defines .fit(), .predict(), and .predict_proba() will work. Inheriting, however, makes downstream scikit-learn applications like hyper-parameter optimization work seamlessly. For example, the LearningWithNoisyLabels() model is fully compliant.
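To make the three-method interface concrete, here is a minimal, dependency-free sketch of a class that defines .fit(), .predict(), and .predict_proba(). `CentroidClassifier` is a hypothetical toy model invented for this illustration; it is not part of cleanlab or scikit-learn. It returns plain Python lists, whereas a real wrapper would most likely return NumPy arrays.

```python
import math


class CentroidClassifier:
    """Hypothetical toy model exposing fit, predict, and predict_proba."""

    def fit(self, X, y, sample_weight=None):
        # sample_weight is accepted for interface compatibility but ignored.
        self.classes_ = sorted(set(y))
        dim = len(X[0])
        self.centroids_ = []
        for c in self.classes_:
            rows = [x for x, label in zip(X, y) if label == c]
            self.centroids_.append(
                [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
            )
        return self

    def predict_proba(self, X):
        probs = []
        for x in X:
            # Softmax over negative squared distances to each class centroid.
            scores = [
                -sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centroids_
            ]
            top = max(scores)
            exps = [math.exp(v - top) for v in scores]
            total = sum(exps)
            probs.append([e / total for e in exps])
        return probs

    def predict(self, X):
        # Pick the class with the highest predicted probability.
        return [
            self.classes_[max(range(len(p)), key=p.__getitem__)]
            for p in self.predict_proba(X)
        ]


X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y = [0, 0, 1, 1]
clf = CentroidClassifier().fit(X, y)
preds = clf.predict([[0.1, 0.1], [1.0, 1.0]])
proba = clf.predict_proba([[0.1, 0.1]])
print(preds, proba)
```

The same three methods are what a wrapper around a PyTorch or TensorFlow model would need to expose; only the bodies would change.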

Note that some libraries exist to do this for you. For PyTorch, check out the skorch Python library, which wraps your PyTorch model into a scikit-learn compliant model.

Documentation by Example

cleanlab Core Package Components

  1. cleanlab/classification.py - The LearningWithNoisyLabels() class for learning with noisy labels.
  2. cleanlab/latent_algebra.py - Equalities when noise information is known.
  3. cleanlab/latent_estimation.py - Estimates and fully characterizes all variants of label noise.
  4. cleanlab/noise_generation.py - Generate mathematically valid synthetic noise matrices.
  5. cleanlab/polyplex.py - Characterizes joint distribution of label noise EXACTLY from noisy channel.
  6. cleanlab/pruning.py - Finds the indices of the examples with label errors in a dataset.

Many of these methods have default parameters that won't be covered here. Check out the method docstrings for full documentation.

Multiclass learning with noisy labels (in 3 lines of code):

rankpruning is a fast, general, robust algorithm for multiclass learning with noisy labels. It adds minimal overhead, needing only O(nm^2) time for n training examples and m classes, works with any classifier, and is easy to use. Here is the example from above, with added comments for clarity.

Estimate the confident joint, the latent noisy channel matrix P(s | y) and its inverse P(y | s), the latent prior of the unobserved true labels p(y), and the predicted probabilities.

s denotes a random variable that represents the observed, noisy label and y denotes a random variable representing the hidden, actual labels. Both s and y take any of the m classes as values. The cleanlab package supports different levels of granularity for computation depending on the needs of the user. Because of this, we support multiple alternatives, all no more than a few lines, to estimate these latent distribution arrays, enabling the user to reduce computation time by only computing what they need to compute, as seen in the examples below.
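To show how these latent arrays relate to one another, the sketch below derives p(s), p(y), P(s | y), and P(y | s) from a joint distribution P(s, y) using only the definitions of marginal and conditional probability. The joint values are made up for illustration, and both conditional matrices are indexed as [s][y] here, which is an assumption of this sketch and not necessarily the layout the package uses.

```python
# Hypothetical joint distribution P(s, y) for m = 2 classes:
# rows index the observed label s, columns index the true label y.
joint = [
    [0.35, 0.10],
    [0.05, 0.50],
]
m = len(joint)

# Marginals: p(s) is the row sums, p(y) is the column sums.
ps = [sum(joint[i]) for i in range(m)]
py = [sum(joint[i][j] for i in range(m)) for j in range(m)]

# Noisy channel P(s | y): divide each column by p(y), so columns sum to 1.
noise_matrix = [[joint[i][j] / py[j] for j in range(m)] for i in range(m)]

# Inverse P(y | s): divide each row by p(s), so rows sum to 1.
inverse_noise_matrix = [[joint[i][j] / ps[i] for j in range(m)] for i in range(m)]

print(ps, py)
print(noise_matrix)
print(inverse_noise_matrix)
```

Because every quantity follows from the joint, estimating the joint well is enough to recover the rest.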

Throughout these examples, you'll see a variable called confident_joint. The confident joint is an m x m matrix (m is the number of classes) that counts, for every observed, noisy class, the number of examples that confidently belong to every latent, hidden class. It counts the number of examples that we are confident are labeled correctly or incorrectly for every pair of observed and unobserved classes. The confident joint is an unnormalized estimate of the complete-information latent joint distribution, P(s,y). Most of the methods in the cleanlab package start by first estimating the confident_joint.
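The counting described above can be sketched with plain for-loops. This is one common formulation, assumed here for illustration: each class gets a threshold equal to the mean predicted probability of that class among examples given that label, and an example is counted for the most probable class whose threshold it meets. The package's own implementation may differ in details, and the function name and toy data below are hypothetical.

```python
def compute_confident_joint(s, psx, m):
    """Count examples that confidently belong to each (noisy, true) class pair."""
    # Per-class threshold: mean predicted probability of class k over the
    # examples whose given label is k. Assumes every class appears in s.
    thresholds = []
    for k in range(m):
        probs_k = [psx[i][k] for i in range(len(s)) if s[i] == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    confident_joint = [[0] * m for _ in range(m)]
    for i, given in enumerate(s):
        # Classes whose predicted probability meets that class's threshold.
        confident = [k for k in range(m) if psx[i][k] >= thresholds[k]]
        if not confident:
            continue  # not confidently in any class, so left uncounted
        likely_true = max(confident, key=lambda k: psx[i][k])
        confident_joint[given][likely_true] += 1
    return confident_joint, thresholds


# Toy data: examples 2 and 5 look mislabeled.
s = [0, 0, 0, 1, 1, 1]
psx = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.7, 0.3],
]
confident_joint, thresholds = compute_confident_joint(s, psx, m=2)
print(confident_joint)
```

Rows index the observed label s and columns the estimated true label y, so off-diagonal counts are the estimated label flips.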

Option 1: Compute the confident joint and predicted probs first. Stop if that's all you need.

Option 2: Estimate the latent distribution matrices in a single line of code.

Option 3: Skip computing the predicted probabilities if you already have them.

Estimate label errors in a dataset:

With the cleanlab package, we can instantly fetch the indices of all estimated label errors, with nothing provided by the user except a classifier, examples, and their noisy labels. Like the previous example, there are various levels of granularity.
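As a rough sketch of the idea, assuming the per-class mean-probability thresholds used in confident learning, an example can be flagged as a likely label error when it confidently belongs to a class other than its given label. `find_label_error_indices` is a hypothetical helper written for this illustration; it is not the package's get_noise_indices() and does not reproduce its pruning options.

```python
def find_label_error_indices(s, psx, m):
    """Return indices of examples that confidently belong to another class."""
    n = len(s)
    # Threshold for class k: mean probability of k among examples labeled k.
    # Assumes every class appears at least once in s.
    thresholds = [
        sum(psx[i][k] for i in range(n) if s[i] == k) / s.count(k)
        for k in range(m)
    ]
    errors = []
    for i in range(n):
        confident = [k for k in range(m) if psx[i][k] >= thresholds[k]]
        if confident:
            likely_true = max(confident, key=lambda k: psx[i][k])
            if likely_true != s[i]:
                errors.append(i)
    return errors


s = [0, 0, 0, 1, 1, 1]
psx = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.7, 0.3],
]
error_indices = find_label_error_indices(s, psx, m=2)
print(error_indices)
```

In practice the predicted probabilities would come from held-out (cross-validated) predictions so that the classifier has not memorized the noisy labels.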

Estimate the latent joint probability distribution matrix of the noisy and true labels, P(s,y):

We can estimate P(s,y), the complete-information distribution matrix. Multiplied by the total number of examples n, it gives n P(s,y), the number of pairwise label flip errors. The estimate uses `cleanlab.latent_estimation.calibrate_confident_joint`, which guarantees that the rows of P(s,y) correctly sum to p(s) and that np.sum(confident_joint) == n (the number of labels).
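The two guarantees above can be illustrated with a small sketch: rescale each row of the confident joint so that it sums to the number of examples with that observed label, then divide by n. This is an assumed, simplified reading of the calibration step, not the package's implementation, and `calibrate_and_normalize` is a hypothetical name.

```python
def calibrate_and_normalize(confident_joint, s):
    """Rescale rows to match observed label counts, then normalize to a joint."""
    m = len(confident_joint)
    n = len(s)
    s_counts = [s.count(k) for k in range(m)]
    calibrated = []
    for i in range(m):
        row_total = sum(confident_joint[i])  # assumed to be nonzero
        # Rescale row i so it sums to the number of examples labeled i.
        calibrated.append(
            [confident_joint[i][j] * s_counts[i] / row_total for j in range(m)]
        )
    # Dividing by n turns counts into an estimate of P(s, y).
    joint = [[c / n for c in row] for row in calibrated]
    return calibrated, joint


# Toy confident joint in which one of the six examples was left uncounted.
confident_joint = [[2, 1], [1, 1]]
s = [0, 0, 0, 1, 1, 1]
calibrated, joint = calibrate_and_normalize(confident_joint, s)
print(calibrated)
print(joint)
```

After calibration the counts sum to n = 6, and each row of `joint` sums to the observed frequency p(s) of that label.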

This method is used when the hyperparameter prune_count_method = 'inverse_nm_dot_s' in LearningWithNoisyLabels.fit() and get_noise_indices().

If you've already computed the confident joint, then you can estimate the complete joint distribution of label noise by:

Generate valid, class-conditional, uniformly random noisy channel matrices:
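A simplified sketch of what such a generator might do: build a column-stochastic matrix P(s | y) with a requested trace by fixing the diagonal and spreading the remaining probability mass at random over the off-diagonal entries of each column. `random_noise_matrix` is a hypothetical helper, setting every diagonal entry to trace / m is a simplification, and any further validity conditions the package enforces are not checked here.

```python
import random


def random_noise_matrix(m, trace, seed=0):
    """Return an m x m column-stochastic matrix whose trace equals `trace`."""
    if m < 2:
        raise ValueError("need at least two classes")
    if not 0 < trace <= m:
        raise ValueError("trace must be in (0, m]")
    rng = random.Random(seed)
    diag = trace / m  # simplification: equal diagonal entries
    matrix = [[0.0] * m for _ in range(m)]
    for j in range(m):  # column j is the distribution P(s | y = j)
        weights = [rng.random() if i != j else 0.0 for i in range(m)]
        total = sum(weights)
        for i in range(m):
            if i == j:
                matrix[i][j] = diag
            else:
                # Spread the remaining mass randomly over the wrong labels.
                matrix[i][j] = (1.0 - diag) * weights[i] / total
    return matrix


noise_matrix = random_noise_matrix(m=4, trace=2.6)
for row in noise_matrix:
    print([round(v, 2) for v in row])
```

A lower trace means more probability mass off the diagonal, i.e. noisier labels.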

Support for numerous weak supervision and learning with noisy labels functionalities:

The Polyplex

The key to learning in the presence of label errors is estimating the joint distribution between the actual, hidden labels 'y' and the observed, noisy labels 's'. Using cleanlab and the theory of confident learning, we can completely characterize the trace of the latent joint distribution, trace(P(s,y)), given p(y), for any fraction of label errors, i.e. for any trace of the noisy channel, trace(P(s|y)).
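To see why the trace of the joint depends on p(y) as well as on the noisy channel, the sketch below builds P(s, y) = P(s | y) p(y) for one noise matrix under two different priors and compares the traces. The numbers and the helper name are illustrative assumptions; this shows only the basic relationship, not the polyplex construction itself.

```python
def joint_from_noise_matrix(noise_matrix, py):
    """Compute P(s, y) = P(s | y) * p(y), indexed as [s][y]."""
    m = len(py)
    return [[noise_matrix[i][j] * py[j] for j in range(m)] for i in range(m)]


def trace(matrix):
    return sum(matrix[k][k] for k in range(len(matrix)))


# Hypothetical noisy channel with trace 1.5 (columns sum to 1).
noise_matrix = [
    [0.8, 0.3],
    [0.2, 0.7],
]

# The same channel under two different priors over the true labels.
joint_a = joint_from_noise_matrix(noise_matrix, py=[0.4, 0.6])
joint_b = joint_from_noise_matrix(noise_matrix, py=[0.9, 0.1])

print(trace(noise_matrix), trace(joint_a), trace(joint_b))
```

The trace of the joint is the fraction of examples whose labels are correct, so the same channel yields different error rates when the class prior changes.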

You can check out how to do this yourself here:

  1. Drawing Polyplices
  2. Computing Polyplices

License

Copyright (c) 2017-2019 Curtis Northcutt. Released under the MIT License. See LICENSE for details.
