neilrjones / snorkel-extraction

A previous version of Snorkel focused on information extraction

This repository is in maintenance mode as of 15 Aug. 2019. See Project Status for details.

Snorkel Extraction

v0.7.0

Snorkel Extraction demonstrates how to perform information extraction with a previous version (v0.7.0) of Snorkel.

Project Status

The Snorkel project is more active than ever! With the release of v0.9 in Aug. 2019, we added support for new training data operators (transformation functions and slicing functions, in addition to labeling functions), a new, more scalable algorithm under the hood for the label model, a Snorkel webpage with additional resources and a fresh batch of tutorials, simplified installation options, and more.

Because that release was essentially a redesign of the project from the ground up, there were many significant API changes between v0.7 (this repository) and v0.9. Active development will continue in the main Snorkel repository, and for those beginning new Snorkel applications, we strongly recommend building on top of the main Snorkel project.

At the same time, we recognize that many users built successful applications and extensions on v0.7 and earlier versions of Snorkel, particularly for information extraction tasks, which early versions of Snorkel were especially geared toward. Consequently, we have renamed Snorkel v0.7 as Snorkel Extraction and made that code available in this repository. However, this repository is officially in maintenance mode as of 15 Aug. 2019. We intend to keep the repository functioning with its current feature set to support existing applications built on it, but we will not be adding any new features or functionality.

If you would like to stay informed of progress in the Snorkel open source project, join the Snorkel email list for relatively rare announcements (e.g., major releases, new tutorials, etc.) or the Snorkel community forum on Spectrum for more regular discussion.

Quick Start

This section lists the commands needed to get Snorkel Extraction running quickly. For more detailed installation instructions, see the Installation section below. These instructions assume that you already have conda installed.

# Clone this repository
git clone https://github.com/snorkel-team/snorkel-extraction.git
cd snorkel-extraction

# Install the environment
conda env create --file=environment.yml

# Activate the environment
conda activate snorkel-extraction

# Install snorkel in the environment
pip install .

# Optionally: You may need to explicitly set the Jupyter Notebook kernel
python -m ipykernel install --user --name snorkel-extraction --display-name "Python (snorkel-extraction)"

# Activate jupyter widgets
jupyter nbextension enable --py widgetsnbextension

# Initiate a jupyter notebook server
jupyter notebook

Then a Jupyter notebook tab will open in your browser. From here you can run existing Snorkel Extraction tutorial notebooks or create your own.
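
Before opening a tutorial, it can help to confirm that the notebook kernel is running in the environment where Snorkel Extraction was installed. The following is a minimal sketch using only the Python standard library. The helper name package_available is hypothetical, and the check assumes the package imports as snorkel, as the install step above suggests.

```python
import importlib.util
import sys


def package_available(name):
    """Return True if the top-level package `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# Show which interpreter the kernel is running; it should live inside the
# conda environment created in the steps above.
print("Interpreter:", sys.executable)
# Assumption: the package installed by "pip install ." imports as "snorkel".
print("snorkel importable:", package_available("snorkel"))
```

If the second line prints False, the kernel is probably not the one registered with the ipykernel command above.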

Tutorials

From within the Jupyter browser, navigate to the tutorials directory and try out one of the existing notebooks!

The introductory tutorial in tutorials/intro covers the entire Snorkel Extraction workflow, showing how to extract spouse relations from news articles. You can also check out all the great materials from the 2017 Mobilize Center-hosted Snorkel workshop!

Installation

To manage its dependencies, Snorkel Extraction uses conda, which allows specifying an environment via an environment.yml file.

This documentation covers two common cases (usage and development) for setting up conda environments for Snorkel Extraction. In both cases, the environment is activated with conda activate followed by the environment name (snorkel-extraction in the Quick Start above) and deactivated with conda deactivate (for versions of conda prior to 4.4, replace conda with source in these commands). Users just looking to try out a Snorkel Extraction tutorial notebook should see the Quick Start instructions above.

Using Snorkel Extraction as a Package

This setup is intended for users who would like to use Snorkel Extraction in their own applications by importing the package. In such cases, users should define a custom environment.yml to manage their project's dependencies. We recommend starting with the environment.yml in this repository. The following modifications can help customize it for your needs:

  1. Specify versions for the listed packages, for example by changing python to python=3.6.5. Pinning versions is critical to reproducibility and to ensuring that dependency updates do not break your pipeline. When first setting your package versions, you likely want to start with the latest versions available on the conda-forge channel unless you have a reason to do otherwise.
  2. Add other packages to your environment as required by your use case. Consider keeping the packages in environment.yml sorted alphabetically to assist with maintainability. In addition, we recommend installing packages via pip only if they are not available in the conda-forge channel.
  3. Add the snorkel package installation to your environment.yml, under the - pip section. We suggest versioning snorkel, which you can do via a release tag or a commit hash (to access more bleeding-edge functionality):
  # Versioned via release tag
  - git+https://github.com/snorkel-team/snorkel-extraction@v0.7.0
  # Versioned via commit hash (commit hash below is fake to ensure you change it)
  - git+https://github.com/snorkel-team/snorkel-extraction@7eb7076f70078c06bef9752f22acf92fd86e616a

Finally, consider versioning the numbskull and treedlib pip dependencies by changing master to their latest commit hash on GitHub.
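
The pinning advice in step 1 can be checked mechanically. The sketch below is a rough illustration and not part of Snorkel Extraction: it scans the text of an environment.yml line by line and reports dependency entries that carry no version constraint. The function name find_unpinned and the sample file contents are hypothetical, and the line-based parsing assumes the simple list layout commonly used in conda environment files rather than arbitrary YAML.

```python
def find_unpinned(environment_yml_text):
    """Return dependency names that carry no version constraint.

    Nested section headers (for example "- pip:") and URL-style
    requirements (git references) are skipped.
    """
    unpinned = []
    in_dependencies = False
    for raw_line in environment_yml_text.splitlines():
        line = raw_line.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        if not line.startswith((" ", "-")):
            # A new top-level key such as "name:" or "dependencies:".
            in_dependencies = line.strip() == "dependencies:"
            continue
        entry = line.strip()
        if not in_dependencies or not entry.startswith("- "):
            continue
        spec = entry[2:].strip()
        if spec.endswith(":") or "+" in spec or "/" in spec:
            continue  # nested section header or URL-style requirement
        if not any(op in spec for op in ("=", "<", ">")):
            unpinned.append(spec)
    return unpinned


# Hypothetical file contents, for illustration only.
sample = """\
name: example
channels:
  - conda-forge
dependencies:
  - numpy
  - python=3.6.5
  - pip:
    - git+https://github.com/snorkel-team/snorkel-extraction@v0.7.0
"""
print(find_unpinned(sample))  # ['numpy']
```

For anything beyond a quick check, a real YAML parser would be more robust than this line scan.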

Development Environment

This setup is intended for users who have cloned this repository and would like to access the environment for development. This approach installs the snorkel package in development mode, meaning that changes you make to the source code will automatically be applied to the snorkel package in the environment.

# From the root directory of this repo, run the following command.
conda env create --file=environment.yml

# Activate the conda environment, using the same name as in the Quick Start
# (if using a version of conda below 4.4, use "source" instead of "conda")
conda activate snorkel-extraction

# Install snorkel in development mode
pip install --editable .
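
One way to confirm that the development install took effect is to check that the imported package resolves to your clone rather than to a copy in site-packages. A minimal sketch follows, assuming the package imports as snorkel. The helper is_inside and the placeholder path in the usage comment are hypothetical.

```python
from pathlib import Path


def is_inside(path, root):
    """Return True if `path` is located at or below the directory `root`."""
    path = Path(path).resolve()
    root = Path(root).resolve()
    return root == path or root in path.parents


# Usage inside the activated environment. Run it from a working directory
# OUTSIDE the clone, otherwise Python may import the local source folder
# regardless of how the package was installed:
#   import snorkel
#   print(is_inside(snorkel.__file__, "/path/to/your/clone"))
# True suggests imports resolve to the clone, as expected in development mode.
```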

Additional installation notes

Snorkel can be installed directly from its GitHub repository via:

# WARNING: read the Installation section before running this command! This
# command does not install any dependencies. It installs the latest master
# version, but you can change master to a tag or commit.
pip install git+https://github.com/snorkel-team/snorkel-extraction@master

Note: Currently the Viewer is supported on the following versions:

  • jupyter: 4.1
  • jupyter notebook: 4.2
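
To compare an installed version against the ones listed above, only the major and minor components matter. The sketch below shows one way to do that comparison on plain version strings. The helper same_minor is hypothetical, and how you obtain the installed version (for example from notebook.__version__) is an assumption about your setup.

```python
def same_minor(installed, supported):
    """Return True if two version strings share major and minor numbers."""
    return installed.split(".")[:2] == supported.split(".")[:2]


print(same_minor("4.2.3", "4.2"))  # True
print(same_minor("5.7.0", "4.2"))  # False
```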

Release Notes

Major changes in v0.7:

  • PyTorch classifiers
  • Installation now via Conda and pip
  • spaCy is now the default parser (v1), with support for v2
  • And many more fixes, additions, and new material!

Older versions

Major changes in v0.6:

  • Support for categorical classification, including "dynamically-scoped" or blocked categoricals (see tutorial)
  • Support for structure learning (see tutorial, ICML 2017 paper)
  • Support for labeled data in generative model
  • Refactor of TensorFlow bindings; fixes grid search and model saving / reloading issues (see snorkel/learning)
  • New, simplified Intro tutorial
  • Refactored parser class and support for spaCy as new parser
  • Support for easy use of the BRAT annotation tool (see tutorial)
  • Initial Spark integration, for scale out of LF application (see tutorial)
  • Tutorial on using crowdsourced data
  • Integration with Apache Tika via the Tika Python binding.
  • And many more fixes, additions, and new material!

Acknowledgements

Sponsored in part by DARPA as part of the D3M program under contract No. FA8750-17-2-0095 and the SIMPLEX program under contract number N66001-15-C-4043, and also by the NIH through the Mobilize Center under grant number U54EB020405.

License: Apache License 2.0

