SimonGuillot / sinr_v3

The SINr approach to train word and graph embeddings



Build word embeddings based on community detection in graphs.

Project Organization

├── Makefile           <- Makefile with commands like `make data` or `make train`
├──          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├──           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├──    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └──
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └──
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├──
│   │   └──
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └──
└── tox.ini            <- tox file with settings for running tox; see



SINr is composed of two main modules:

  • Cooccurrence: a Cython-based module that efficiently computes a cooccurrence matrix from a given corpus
  • SINr: a module that computes sparse word embeddings from a cooccurrence network
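
To make the first module's job concrete, here is a minimal pure-Python sketch of window-based cooccurrence counting. It is not the package's Cython implementation: the function name `count_cooccurrences` and the choice to store unordered pairs in a `Counter` are assumptions made for illustration only.

```python
from collections import Counter

def count_cooccurrences(sentences, window=2):
    """Count unordered token pairs that appear at most `window` positions apart."""
    counts = Counter()
    for sentence in sentences:
        for i, left in enumerate(sentence):
            # Pair the token with the next `window` tokens only, so each
            # pair of positions is visited exactly once.
            for right in sentence[i + 1 : i + 1 + window]:
                if left != right:
                    # Sort the pair so (a, b) and (b, a) share one key.
                    counts[tuple(sorted((left, right)))] += 1
    return counts

sentences = [["sinr", "is", "fun"], ["sinr", "is", "a", "python", "package"]]
cooc = count_cooccurrences(sentences, window=2)
print(cooc[("is", "sinr")])      # pair seen in both sentences
print(cooc[("python", "sinr")])  # too far apart for window=2
```

Widening the window links tokens that are further apart, which makes the resulting network denser.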


  1. Launch a job on a slurm node -> srun -p gpu --gres "gpu:1" --time 1-0 --mem 5G --pty bash
  2. Install conda
  3. Clone repository -> git clone --branch nfm_sparse && cd sinr
  4. Build conda environment -> conda env create -f environment.yml
  5. Activate environment -> conda activate sinr_release
  6. Install SINr in development mode and the spaCy Transformer model for English -> cd src && python cythonize && pip install -e . && python -m spacy download en_core_web_trf
  7. Use SINr!

Launch a Jupyter notebook in JupyterLab

  1. Activate your conda environment -> conda activate sinr_release
  2. (upon first launch) install environment kernel in IPython -> ipython kernel install --name sinr_release --user
  3. Launch a notebook on a node -> srun -p gpu --gres "gpu:1" --mem 80G -c15 -w "gpu15" jlaunch jupyter-lab # Use the -w option to choose the node. Avoid nodes with a K20/K40 GPU, as these are no longer supported by CuPy.
  4. Ctrl+click the link displayed in the terminal and select the appropriate kernel (sinr_release)


For additional examples, see the notebooks.


from sinr.cooccurrence import Cooccurrence
from sinr.pmi import pmi_filter

# Load your corpus as list of lists of tokens
sentences = [["sinr", "is", "fun"], ["sinr", "is", "a", "python", "package"]]
# Build cooccurrence matrix
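c = Cooccurrence()
# NOTE: `fit` is an assumed method name for building the matrix from the corpus
c.fit(sentences, window=2)

# Normalise cooccurrence matrix using PPMI
c.matrix = pmi_filter(c.matrix)
# NOTE: `save` is an assumed method name for writing the matrix to disk
c.save("/path_to_output/")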
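
For readers unfamiliar with the PPMI normalisation step above, the following sketch shows what positive pointwise mutual information computes on a small table of pair counts. It is only an illustration of the general formula, max(0, log2(p(a, b) / (p(a) p(b)))), and should not be read as how `pmi_filter` is implemented. The function name `ppmi` and the toy counts are hypothetical.

```python
import math
from collections import Counter

def ppmi(pair_counts):
    """Map each unordered pair to max(0, log2(p(a, b) / (p(a) * p(b))))."""
    # In a symmetric matrix every unordered pair fills two cells.
    total = 2 * sum(pair_counts.values())
    # Marginal count of a word: the sum of its row in the symmetric matrix.
    word_counts = Counter()
    for (a, b), n in pair_counts.items():
        word_counts[a] += n
        word_counts[b] += n
    scores = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total
        p_a = word_counts[a] / total
        p_b = word_counts[b] / total
        # Negative PMI values are clipped to zero.
        scores[(a, b)] = max(0.0, math.log2(p_ab / (p_a * p_b)))
    return scores

# Hypothetical toy counts: "cat" and "dog" rarely cooccur with each other.
pair_counts = {("cat", "the"): 4, ("dog", "the"): 4, ("cat", "dog"): 1}
scores = ppmi(pair_counts)
print(scores)
```

Pairs that cooccur less often than chance would predict end up with a score of zero, which is what removes weak edges from the cooccurrence network.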


Extracting the embeddings is currently memory-intensive. When working with large corpora, request a large amount of RAM (more than 100 GB). This is currently being fixed.

from sinr.graph_embeddings import SINr

model = SINr.sinr("/path_to_output/matrix.pickle", output_path="path_to_output", n_jobs=4)  
# If an output_path is supplied, the model is saved there. The embeddings are returned
# as a Model object made up of a dictionary for the vocabulary and a scipy.sparse.csr_matrix for the vectors
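
As an intuition for how communities in a graph can yield sparse vectors, the sketch below gives each node one dimension per community and stores the share of the node's edge weight that goes to that community. Two assumptions are worth stating loudly: the communities are assigned by hand here rather than detected, and this particular weighting is an illustrative choice, not a confirmed description of what `SINr.sinr` computes. All names (`community_embeddings`, `edges`, `community_of`) are hypothetical.

```python
from collections import defaultdict

def community_embeddings(edges, community_of):
    """Return {node: {community: share of the node's edge weight going there}}."""
    weight_to = defaultdict(lambda: defaultdict(float))
    degree = defaultdict(float)
    for (u, v), w in edges.items():
        # Each undirected edge contributes to both endpoints.
        for node, neighbour in ((u, v), (v, u)):
            weight_to[node][community_of[neighbour]] += w
            degree[node] += w
    # Only communities a node is actually connected to are stored,
    # which is what keeps the vectors sparse.
    return {
        node: {c: w / degree[node] for c, w in per_comm.items()}
        for node, per_comm in weight_to.items()
    }

# Hypothetical weighted graph and hand-assigned communities.
edges = {
    ("cat", "dog"): 3.0,
    ("cat", "pet"): 1.0,
    ("python", "package"): 2.0,
    ("pet", "package"): 1.0,
}
community_of = {"cat": 0, "dog": 0, "pet": 0, "python": 1, "package": 1}
vectors = community_embeddings(edges, community_of)
print(vectors["pet"])
```

Each dimension corresponds to one community, so a non-zero value can be read as "this word is connected to that group of words".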


Pull requests are welcome. For major changes, please open an issue first to discuss the changes to be made.

Compile/Install from source

To compile and install SINr from source, follow the procedure described below:

git clone --branch nfm_sparse
cd sinr
conda env create -f environment.yml
conda activate sinr_release
python cythonize
pip install -e .

Evaluate Word Embeddings

To evaluate the word embeddings on the similarity task, you may use the Word Embedding Benchmarks library developed by Stanislaw Jastrzebski.

⚠️ The embeddings returned by the model are of type scipy.sparse.csr_matrix, so you will need to convert them to a dense matrix first:

matrix = my_sparse_csr_matrix.todense()

Refer to that library's documentation and examples for the input format it expects.
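
For orientation, a similarity benchmark typically scores word pairs with the model and compares those scores with human ratings. The sketch below covers only the model side, using cosine similarity on dense rows. The vocabulary, the vectors and the helper names `cosine` and `similarity` are hypothetical toy values, not output of SINr or part of the benchmarking library.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy data: a vocabulary mapping words to row indices,
# and the corresponding dense rows of the embedding matrix.
vocab = {"cat": 0, "dog": 1, "package": 2}
rows = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

def similarity(word_a, word_b):
    """Look up both words in the vocabulary and compare their rows."""
    return cosine(rows[vocab[word_a]], rows[vocab[word_b]])

print(similarity("cat", "dog"))
print(similarity("cat", "package"))
```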


Project based on the

