seantangtao / deepanomaly4docs


Deep Anomaly Detection for Text Documents

Plan


  • ☑ Autoencoder reconstruction loss as the anomaly score (tracking progress via F1 on the outlier class; see the sketch after this list)
  • ☑ Siamese network (ivis)
  • ☐ Other new deep learning approaches

  • ☐ Run everything on real data

  • ☐ Unsupervised vs. weakly supervised training
  • ☐ Ensembles
  • ☐ Comparison with computer vision approaches

  • ☑ Monitoring progress of doc2vec training (see the Doc2Vec sketch under Resources)
  • ☑ Test whether ivis is unstable across runs
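
Below is a minimal sketch of the autoencoder item, assuming documents are already embedded (e.g. with doc2vec) into a matrix X_train of mostly inliers, with a labeled hold-out set X_test, y_test (1 = outlier). It illustrates the idea, not this repo's exact code:

```python
# Reconstruction error of a plain autoencoder as the anomaly score,
# evaluated with F1 on the outlier class. X_train/X_test/y_test are assumed.
import numpy as np
from sklearn.metrics import f1_score
from tensorflow import keras

def build_autoencoder(dim, code=32):
    inp = keras.Input(shape=(dim,))
    h = keras.layers.Dense(128, activation="relu")(inp)
    z = keras.layers.Dense(code, activation="relu")(h)
    h = keras.layers.Dense(128, activation="relu")(z)
    out = keras.layers.Dense(dim)(h)
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

ae = build_autoencoder(X_train.shape[1])
ae.fit(X_train, X_train, epochs=50, batch_size=64, verbose=0)

errors = np.mean((ae.predict(X_test) - X_test) ** 2, axis=1)  # per-document MSE
threshold = np.percentile(errors, 95)  # assumes roughly 5% contamination
print("outlier F1:", f1_score(y_test, errors > threshold))
```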

Currently


  • ☐ Test whether the current pipeline can score new data points, both held-out points from the data used for fitting and points from a different dataset (see the sketch after this list)
  • ☐ Try the current pipeline in a semi-supervised setting (both UMAP and ivis support partial labels)
  • ☐ Try other deep learning approaches, starting with the work by Ruff et al. under Resources (including the Outlier Exposure ideas)
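
A sketch covering the first two items, assuming X holds the doc2vec vectors used for fitting, y holds partial labels with -1 marking unlabeled points (UMAP's semi-supervised convention), and X_new is a batch of unseen documents; ivis offers an analogous fit/transform API:

```python
# Semi-supervised UMAP fit, then projection of unseen points without re-fitting.
import umap

reducer = umap.UMAP(n_components=5, metric="cosine")
emb = reducer.fit_transform(X, y=y)  # y = -1 for unlabeled documents
emb_new = reducer.transform(X_new)   # embed new points into the fitted space
```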

Resources

Literature

Ruff et al.

  • Ruff, Lukas, Robert Vandermeulen, et al. “Deep One-Class Classification.” International Conference on Machine Learning, 2018, pp. 4393–402. proceedings.mlr.press, http://proceedings.mlr.press/v80/ruff18a.html.
  • Ruff, Lukas, Robert A. Vandermeulen, Nico Görnitz, et al. “Deep Semi-Supervised Anomaly Detection.” ArXiv:1906.02694 [Cs, Stat], Feb. 2020. arXiv.org, http://arxiv.org/abs/1906.02694.
  • Ruff, Lukas, Robert A. Vandermeulen, Billy Joe Franks, et al. “Rethinking Assumptions in Deep Anomaly Detection.” ArXiv:2006.00339 [Cs, Stat], May 2020. arXiv.org, http://arxiv.org/abs/2006.00339.
  • Ruff, Lukas, Yury Zemlyanskiy, et al. “Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 4061–71. DOI.org (Crossref), doi:10.18653/v1/P19-1398.
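
For orientation, a minimal sketch of the one-class Deep SVDD objective from the first paper above (not the authors' reference implementation); phi is any encoder network, and x_train/x_test stand in for embedded documents:

```python
import torch

# Fix the hypersphere center c to the mean embedding after initialization;
# keeping c constant avoids the trivial collapsed solution discussed in the paper.
with torch.no_grad():
    c = phi(x_train).mean(dim=0)

def deep_svdd_loss(x):
    # Pull embeddings toward c; weight decay on phi is added via the optimizer.
    return torch.mean(torch.sum((phi(x) - c) ** 2, dim=1))

scores = torch.sum((phi(x_test) - c) ** 2, dim=1)  # anomaly score = distance to c
```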

Outlier Exposure

  • Hendrycks, Dan, Mantas Mazeika, and Thomas Dietterich. “Deep Anomaly Detection with Outlier Exposure.” ArXiv:1812.04606 [Cs, Stat], Jan. 2019. arXiv.org, http://arxiv.org/abs/1812.04606.
  • Hendrycks, Dan, and Kevin Gimpel. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” ArXiv:1610.02136 [Cs], Oct. 2018. arXiv.org, http://arxiv.org/abs/1610.02136.
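
The core of Outlier Exposure is a two-term loss: the usual objective on in-distribution data plus a term pushing predictions on an auxiliary outlier dataset toward the uniform distribution. A hedged PyTorch sketch (logits_in, targets, and logits_oe are assumed model outputs and labels):

```python
import torch.nn.functional as F

def oe_loss(logits_in, targets, logits_oe, lam=0.5):
    ce = F.cross_entropy(logits_in, targets)  # standard in-distribution term
    # Cross-entropy to the uniform distribution = -mean log-softmax over classes.
    oe = -F.log_softmax(logits_oe, dim=1).mean()
    return ce + lam * oe  # the paper uses lambda = 0.5
```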

Deep Methods

  • Pang, Guansong, et al. “Deep Anomaly Detection with Deviation Networks.” ArXiv:1911.08623 [Cs, Stat], Nov. 2019. arXiv.org, http://arxiv.org/abs/1911.08623.
  • Pang, Guansong, et al. “Deep Weakly-Supervised Anomaly Detection.” ArXiv:1910.13601 [Cs, Stat], Jan. 2020. arXiv.org, http://arxiv.org/abs/1910.13601.
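
A sketch of the deviation loss behind the first entry (Deviation Networks): scores of unlabeled data are pulled toward a standard-normal reference, while the few labeled anomalies are pushed at least a standard deviations above it. scores (from an assumed scoring network) and binary labels are assumptions:

```python
import torch

def deviation_loss(scores, labels, a=5.0, n_ref=5000):
    ref = torch.randn(n_ref)                 # reference scores drawn from N(0, 1)
    dev = (scores - ref.mean()) / ref.std()  # z-score relative to the prior
    normal_term = (1 - labels) * dev.abs()   # unlabeled data: stay near the mean
    anomaly_term = labels * torch.clamp(a - dev, min=0.0)  # anomalies: >= a sigmas away
    return (normal_term + anomaly_term).mean()
```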


  • Golan, Izhak, and Ran El-Yaniv. “Deep Anomaly Detection Using Geometric Transformations.” ArXiv:1805.10917 [Cs, Stat], Nov. 2018. arXiv.org, http://arxiv.org/abs/1805.10917.
  • Hendrycks, Dan, Mantas Mazeika, Saurav Kadavath, et al. “Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty.” ArXiv:1906.12340 [Cs, Stat], 2019. arXiv.org, http://arxiv.org/abs/1906.12340.

Autoencoder

  • Huang, Chaoqin, et al. “Attribute Restoration Framework for Anomaly Detection.” ArXiv:1911.10676 [Cs], June 2020. arXiv.org, http://arxiv.org/abs/1911.10676.
  • Cao, Van Loi, et al. “A Hybrid Autoencoder and Density Estimation Model for Anomaly Detection.” Parallel Problem Solving from Nature – PPSN XIV, edited by Julia Handl et al., vol. 9921, Springer International Publishing, 2016, pp. 717–26. DOI.org (Crossref), doi:10.1007/978-3-319-45823-6_67.
  • Schreyer, Marco, et al. “Detection of Anomalies in Large Scale Accounting Data Using Deep Autoencoder Networks.” ArXiv:1709.05254 [Cs], Aug. 2018. arXiv.org, http://arxiv.org/abs/1709.05254.
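
Cao et al. above pair an autoencoder with a density estimator; a compressed sketch of that idea, assuming encoder is the bottleneck half of an already trained autoencoder:

```python
from sklearn.mixture import GaussianMixture

codes = encoder.predict(X_train)                  # latent codes of the training docs
gmm = GaussianMixture(n_components=5).fit(codes)  # density model in latent space
scores = -gmm.score_samples(encoder.predict(X_test))  # low density = high anomaly score
```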

Doc2Vec

  • Le, Quoc V., and Tomas Mikolov. “Distributed Representations of Sentences and Documents.” ArXiv:1405.4053 [Cs], May 2014. arXiv.org, http://arxiv.org/abs/1405.4053.
  • Lau, Jey Han, and Timothy Baldwin. “An Empirical Evaluation of Doc2vec with Practical Insights into Document Embedding Generation.” ArXiv:1607.05368 [Cs], July 2016. arXiv.org, http://arxiv.org/abs/1607.05368.
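
A sketch of the “monitoring doc2vec training” plan item, using gensim's epoch callback (gensim 4 API; texts is an assumed list of raw documents). Since gensim exposes no reliable Doc2Vec training loss, the callback logs the self-similarity sanity check from Lau and Baldwin: how often a training document's re-inferred vector is its own nearest neighbor:

```python
from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class EpochLogger(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0
    def on_epoch_end(self, model):
        # Fraction of sampled training docs whose inferred vector retrieves itself.
        hits = sum(
            model.dv.most_similar([model.infer_vector(d.words)], topn=1)[0][0] == d.tags[0]
            for d in docs[:100]
        )
        print(f"epoch {self.epoch}: self-similarity@1 = {hits / 100:.2f}")
        self.epoch += 1

docs = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]
model = Doc2Vec(vector_size=100, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count,
            epochs=model.epochs, callbacks=[EpochLogger()])
```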

UMAP

  • McInnes, Leland, et al. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” ArXiv:1802.03426 [Cs, Stat], Dec. 2018. arXiv.org, http://arxiv.org/abs/1802.03426.
  • Allaoui, Mebarka, et al. “Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study.” Image and Signal Processing, edited by Abderrahim El Moataz et al., Springer International Publishing, 2020, pp. 317–25. Springer Link, doi:10.1007/978-3-030-51935-3_34.
  • Sainburg, Tim, et al. “Parametric UMAP: Learning Embeddings with Deep Neural Networks for Representation and Semi-Supervised Learning.” ArXiv:2009.12981 [Cs, q-Bio, Stat], Sept. 2020. arXiv.org, http://arxiv.org/abs/2009.12981.
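
The Sainburg et al. entry bears directly on scoring new data points: Parametric UMAP learns an explicit neural encoder, so unseen documents are embedded with a forward pass instead of a re-fit. A sketch (X and X_new are assumed doc2vec matrices; requires umap-learn's TensorFlow extra):

```python
from umap.parametric_umap import ParametricUMAP

p_umap = ParametricUMAP(n_components=5)
emb = p_umap.fit_transform(X)      # trains a neural encoder under the hood
emb_new = p_umap.transform(X_new)  # a forward pass through the learned encoder
```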

Code

Data




Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
