- using test data (IMDB / 20 Newsgroups)
- ☑ TF-IDF Embeddings with UMAP and LOF
- ☑ Visualization
- ☑ enstop topic modelling
- ☑ HDBSCAN cluster analysis and outlier detection (GLOSH) https://hdbscan.readthedocs.io/en/latest/outlier_detection.html
- ☑ outlier detection algorithms from PyOD (LOF, HBOS, PCA, IForest) https://github.com/yzhao062/pyod
- ☑ test flair https://github.com/flairNLP/flair
- ☑ Transformer embeddings
- ☑ word embedding pooling (word2vec, GloVe, fastText)
- ☑ RNN/LSTM over word embeddings
- ☐ Autoencoder embeddings
- ☑ Autoencoder loss (tracking progress on outlier F1)
- ☑ Siamese Network (ivis)
- ☐ other new DL approaches
- ☐ Everything on real data
- ☐ unsupervised vs weakly supervised
- ☐ ensembles
- ☐ Compare with computer vision approach
- ☑ monitoring progress on doc2vec training
- ☑ test whether ivis is unstable across runs
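The core unsupervised pipeline from the checklist (document embeddings → dimensionality reduction → LOF) can be sketched with scikit-learn alone. TruncatedSVD stands in for UMAP here so the example has no extra dependency; with umap-learn installed, `umap.UMAP(n_components=2)` would take its place. The toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import LocalOutlierFactor

# Toy corpus: movie-review-like inliers plus one off-topic document.
docs = [
    "the movie was great and the acting was strong",
    "a fine film with a moving story",
    "i enjoyed this movie a lot",
    "the plot of the film kept me engaged",
    "kernel panic after upgrading the gpu driver",  # off-topic outlier
]

# 1. TF-IDF embeddings
X = TfidfVectorizer().fit_transform(docs)

# 2. Dimensionality reduction (UMAP in the real pipeline)
X_red = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# 3. Local Outlier Factor: fit_predict returns -1 for outliers, 1 for inliers
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X_red)
scores = -lof.negative_outlier_factor_  # higher = more anomalous
```

On real data the corpus would of course be thousands of documents and `n_neighbors` correspondingly larger.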
Currently:
- ☐ test whether the current pipeline can predict new data points, both from the data used for fitting and from different data
- ☐ run the current pipeline semi-supervised (both UMAP and ivis support this)
- ☐ try other deep learning approaches; see the work by Ruff et al. (including Outlier Exposure ideas)
- Ruff, Lukas, Robert Vandermeulen, et al. “Deep One-Class Classification.” International Conference on Machine Learning, 2018, pp. 4393–402. proceedings.mlr.press, http://proceedings.mlr.press/v80/ruff18a.html.
- Ruff, Lukas, Robert A. Vandermeulen, Nico Görnitz, et al. “Deep Semi-Supervised Anomaly Detection.” ArXiv:1906.02694 [Cs, Stat], Feb. 2020. arXiv.org, http://arxiv.org/abs/1906.02694.
- Ruff, Lukas, Robert A. Vandermeulen, Billy Joe Franks, et al. “Rethinking Assumptions in Deep Anomaly Detection.” ArXiv:2006.00339 [Cs, Stat], May 2020. arXiv.org, http://arxiv.org/abs/2006.00339.
- Ruff, Lukas, Yury Zemlyanskiy, et al. “Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 4061–71. DOI.org (Crossref), doi:10.18653/v1/P19-1398.
- Hendrycks, Dan, Mantas Mazeika, and Thomas Dietterich. “Deep Anomaly Detection with Outlier Exposure.” ArXiv:1812.04606 [Cs, Stat], Jan. 2019. arXiv.org, http://arxiv.org/abs/1812.04606.
- Hendrycks, Dan, and Kevin Gimpel. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” ArXiv:1610.02136 [Cs], Oct. 2018. arXiv.org, http://arxiv.org/abs/1610.02136.
- Pang, Guansong, et al. “Deep Anomaly Detection with Deviation Networks.” ArXiv:1911.08623 [Cs, Stat], Nov. 2019. arXiv.org, http://arxiv.org/abs/1911.08623.
- Pang, Guansong, et al. “Deep Weakly-Supervised Anomaly Detection.” ArXiv:1910.13601 [Cs, Stat], Jan. 2020. arXiv.org, http://arxiv.org/abs/1910.13601.
- Golan, Izhak, and Ran El-Yaniv. “Deep Anomaly Detection Using Geometric Transformations.” ArXiv:1805.10917 [Cs, Stat], Nov. 2018. arXiv.org, http://arxiv.org/abs/1805.10917.
- Hendrycks, Dan, Mantas Mazeika, Saurav Kadavath, et al. “Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty.” Advances in Neural Information Processing Systems 32, 2019. arXiv.org, http://arxiv.org/abs/1906.12340.
- Huang, Chaoqin, et al. “Attribute Restoration Framework for Anomaly Detection.” ArXiv:1911.10676 [Cs], June 2020. arXiv.org, http://arxiv.org/abs/1911.10676.
- Cao, Van Loi, et al. “A Hybrid Autoencoder and Density Estimation Model for Anomaly Detection.” Parallel Problem Solving from Nature – PPSN XIV, edited by Julia Handl et al., vol. 9921, Springer International Publishing, 2016, pp. 717–26. DOI.org (Crossref), doi:10.1007/978-3-319-45823-6_67.
- Schreyer, Marco, et al. “Detection of Anomalies in Large Scale Accounting Data Using Deep Autoencoder Networks.” ArXiv:1709.05254 [Cs], Aug. 2018. arXiv.org, http://arxiv.org/abs/1709.05254.
- Le, Quoc V., and Tomas Mikolov. “Distributed Representations of Sentences and Documents.” ArXiv:1405.4053 [Cs], May 2014. arXiv.org, http://arxiv.org/abs/1405.4053.
- Lau, Jey Han, and Timothy Baldwin. “An Empirical Evaluation of Doc2vec with Practical Insights into Document Embedding Generation.” ArXiv:1607.05368 [Cs], July 2016. arXiv.org, http://arxiv.org/abs/1607.05368.
- McInnes, Leland, et al. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” ArXiv:1802.03426 [Cs, Stat], Dec. 2018. arXiv.org, http://arxiv.org/abs/1802.03426.
- Allaoui, Mebarka, et al. “Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study.” Image and Signal Processing, edited by Abderrahim El Moataz et al., Springer International Publishing, 2020, pp. 317–25. Springer Link, doi:10.1007/978-3-030-51935-3_34.
- Sainburg, Tim, et al. “Parametric UMAP: Learning Embeddings with Deep Neural Networks for Representation and Semi-Supervised Learning.” ArXiv:2009.12981 [Cs, q-Bio, Stat], Sept. 2020. arXiv.org, http://arxiv.org/abs/2009.12981.
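The first “Currently” item, scoring points that were not used for fitting, can be sketched with scikit-learn's LOF in novelty mode (`novelty=True` enables `predict` on unseen data; in the full pipeline the fitted UMAP or ivis model would first `transform` the new points into the same embedding space). Synthetic embeddings, illustrative only:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 5))  # embeddings used for fitting

# novelty=True: fit on training data, then predict on NEW points
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

X_new_inlier = rng.normal(0, 1, size=(3, 5))  # same distribution as training
X_new_outlier = np.full((1, 5), 10.0)         # far from the training data

pred_in = lof.predict(X_new_inlier)    # 1 = inlier
pred_out = lof.predict(X_new_outlier)  # -1 = outlier
```

Note that scikit-learn's LOF disallows `predict` on the training set itself when `novelty=True`, which matters when comparing in-sample and out-of-sample behaviour.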
- Uniform Manifold Approximation and Projection (UMAP) - https://github.com/lmcinnes/umap
- Python Outlier Detection (PyOD) - https://github.com/yzhao062/pyod
- flair - https://github.com/flairNLP/flair (for word embedding pooling, RNNs and transformer embeddings)
- gensim - https://radimrehurek.com/gensim/index.html (Doc2Vec)
- ivis - https://bering-ivis.readthedocs.io/en/latest/ (Siamese-network dimensionality reduction used as an outlier detector)
- Training doc2vec: All the news https://components.one/datasets/all-the-news-2-news-articles-dataset/
- Inlier data: IMDB Reviews https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- Outlier data: 20 Newsgroups http://qwone.com/~jason/20Newsgroups/
- pretrained doc2vec models: https://github.com/jhlau/doc2vec (see Lau, Jey Han, and Timothy Baldwin above)
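With IMDB reviews as inliers and 20 Newsgroups posts as outliers, evaluation reduces to a binary problem: score the mixed set, threshold at the known contamination rate, and report F1 on the outlier class (the metric tracked during the autoencoder runs). A minimal sketch with made-up scores:

```python
import numpy as np
from sklearn.metrics import f1_score

# y_true: 1 = outlier (20 Newsgroups doc), 0 = inlier (IMDB review).
y_true = np.array([0] * 8 + [1] * 2)
# Hypothetical anomaly scores from any detector (higher = more anomalous).
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.1, 0.25, 0.2, 0.1, 0.9, 0.8])

# Threshold at the known contamination rate (here 20% outliers).
contamination = y_true.mean()
threshold = np.quantile(scores, 1 - contamination)
y_pred = (scores >= threshold).astype(int)

outlier_f1 = f1_score(y_true, y_pred)  # F1 on the outlier class
```

In practice the contamination rate of real data is unknown, so a sweep over thresholds (or precision-recall curves) is more informative than a single F1 value.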
```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```