Build word embeddings based on community detection in graphs.
```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Makes the project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results-oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```
SINr is composed of two main modules:
- Cooccurrence: a Cython-based module to efficiently compute a cooccurrence matrix from a given corpus
- SINr: a module to compute sparse word embeddings based on the cooccurrence network
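To illustrate what the Cooccurrence module computes, here is a pure-Python sketch of symmetric, window-based cooccurrence counting. This is a naive stand-in for the Cython implementation, not the actual SINr code:

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count symmetric cooccurrences of tokens within a fixed window."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            # Only look forward to avoid double counting,
            # then store both directions to keep the matrix symmetric.
            for c in tokens[i + 1 : i + 1 + window]:
                counts[(w, c)] += 1
                counts[(c, w)] += 1
    return counts

sentences = [["sinr", "is", "fun"], ["sinr", "is", "a", "python", "package"]]
counts = cooccurrence_counts(sentences, window=2)
```

The Cython module serves the same purpose but builds the counts into a sparse matrix and is far faster on real corpora.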
- Launch a job on a slurm node -> `srun -p gpu --gres "gpu:1" --time 1-0 --mem 5G --pty bash`
- Install conda
- Clone the repository -> `git clone --branch nfm_sparse https://git-lium.univ-lemans.fr/tprouteau/sinr.git && cd sinr`
- Build the conda environment -> `conda env create -f environment.yml`
- Activate the environment -> `conda activate sinr_release`
- Install SINr in development mode and the SpaCy transformer model for English -> `cd src && python setup.py cythonize && pip install -e . && python -m spacy download en_core_web_trf`
- Use SINr!
  - Activate your conda environment -> `conda activate sinr_release`
  - (upon first launch) Install the environment kernel in IPython -> `ipython kernel install --name sinr_release --user`
  - Launch a notebook on a node -> `srun -p gpu --gres "gpu:1" --mem 80G -c15 -w "gpu15" jlaunch jupyter-lab`
    - Use the `-w` option to choose the node; do not use a K20/K40 GPU, as it is no longer supported by cupy.
  - Ctrl+click on the link displayed in the terminal and select the adequate kernel (`sinr_release`)
For additional examples, see the notebooks.
```python
from sinr.cooccurrence import Cooccurrence
from sinr.pmi import pmi_filter

# Load your corpus as a list of lists of tokens
sentences = [["sinr", "is", "fun"], ["sinr", "is", "a", "python", "package"]]

# Build the cooccurrence matrix
c = Cooccurrence()
c.fit(sentences, window=2)

# Normalise the cooccurrence matrix using PPMI
c.matrix = pmi_filter(c.matrix)
c.save("/path_to_output/matrix.pk")
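The PPMI (positive pointwise mutual information) weighting that `pmi_filter` applies can be sketched on a dense toy matrix as follows. This is a simplified illustration of the general technique, not SINr's implementation:

```python
import numpy as np

def ppmi(counts):
    """Compute the PPMI matrix of a cooccurrence count matrix.

    PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ); PPMI clips negatives to 0.
    """
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts produce -inf; drop them
    return np.maximum(pmi, 0.0)

counts = np.array([[0.0, 2.0], [2.0, 1.0]])
weights = ppmi(counts)
```

PPMI keeps only associations that occur more often than chance would predict, which both sparsifies the matrix and downweights uninformative frequent pairs.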
The extraction of the embeddings is currently memory-intensive. When working with large corpora, do not hesitate to request large amounts of RAM (>100G). This is currently being fixed.
```python
from sinr.graph_embeddings import SINr

model = SINr.sinr("/path_to_output/matrix.pk", output_path="path_to_output", n_jobs=4)

# If an output_path is supplied, the model will be saved. Embeddings are returned
# as a Model object comprised of a dictionary for the vocabulary and a
# scipy.sparse.csr_matrix for the vectors.
```
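Given the vocabulary dictionary and the `csr_matrix` of vectors, nearest neighbours can be retrieved with a few lines of scipy/numpy. A minimal sketch that takes both as plain arguments (the toy data below is illustrative, not SINr output):

```python
import numpy as np
from scipy.sparse import csr_matrix

def nearest_neighbours(vectors, vocab, word, k=3):
    """Return the k words most cosine-similar to `word`.

    vectors: scipy.sparse.csr_matrix of shape (n_words, dim)
    vocab:   dict mapping word -> row index
    """
    idx = vocab[word]
    # L2 norms of the rows, so that dot products become cosine similarities
    norms = np.asarray(np.sqrt(vectors.multiply(vectors).sum(axis=1))).ravel()
    norms[norms == 0] = 1.0  # guard against all-zero rows
    sims = (vectors @ vectors[idx].T).toarray().ravel() / (norms * norms[idx])
    order = np.argsort(-sims)
    inverse = {i: w for w, i in vocab.items()}
    return [inverse[i] for i in order if i != idx][:k]

vocab = {"sinr": 0, "graph": 1, "banana": 2}
vectors = csr_matrix(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
```

Keeping the vectors sparse throughout the similarity computation is what makes this cheap even for large vocabularies.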
Pull requests are welcome. For major changes, please open an issue first to discuss the changes to be made.
In order to compile and install SINr from source, follow the procedure described below:

```shell
git clone --branch nfm_sparse https://git-lium.univ-lemans.fr/tprouteau/sinr.git
cd sinr
conda env create -f environment.yml
conda activate sinr_release
python setup.py cythonize
pip install -e .
```
In order to evaluate the word embeddings on the similarity task, you may use the Word Embedding Benchmarks library developed by Stanislaw Jastrzebski: https://github.com/kudkudak/word-embeddings-benchmarks

Since the vectors are stored as a `scipy.sparse.csr_matrix`, you will need to convert them to a dense matrix first:

```python
matrix = my_sparse_csr_matrix.todense()
```

Refer to the documentation and examples to know which input format the benchmarking library expects.
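As a note on the conversion above: `todense()` returns a `numpy.matrix`, while `toarray()` yields a plain `numpy.ndarray`, which many downstream tools handle more predictably. A quick sketch on a toy matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

sparse_vectors = csr_matrix(np.eye(3))  # toy 3x3 embedding matrix

dense = sparse_vectors.toarray()  # plain ndarray, usually safest downstream
```

Beware that densifying a large vocabulary-by-dimension matrix can use orders of magnitude more memory than the sparse form.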
Project based on the