Raphencoder / SEM

SEM, a free NLP tool relying on machine learning technologies, especially CRFs.

SEM v3.3.0

SEM (Segmenteur-Étiqueteur Markovien) is a free NLP tool relying on machine learning technologies, especially CRFs. SEM provides configurable preprocessing and postprocessing. SEM also has an online version.

Main SEM features

  1. A GUI for manual annotation (requires TkInter)
    1. from terminal: run python -m sem annotation_gui
    2. fast annotation: keyboard shortcuts and document-wide annotation broadcast
    3. can load pre-annotated files
    4. support for hierarchical tags (dot-separated, eg: "noun.common")
    5. handles multiple input formats
    6. export in different formats
  2. A GUI for easier use (requires TkInter)
    1. on Linux: double-click on sem_gui.sh
    2. on Windows: double-click on sem_gui.bat
    3. from terminal: run python -m sem gui
  3. segmentation
    1. segmentation for: French, English
    2. easy creation and integration of new tokenisers
  4. feature generation
    1. XML file to write features without coding them
    2. single-token and multi-token dictionary features
    3. regular expression features
    4. sequenced features
    5. train/label mode
    6. display option for features that are useful for generation, but not needed in output
  5. exporting output
    1. supported export formats: CoNLL, text, HTML (from plain text), and two XML-TEI variants (one specific to NP-chunks, the other for everything else)
    2. easy creation and integration of new exporters
  6. extension of existing features
    1. automatic integration of new segmenters and exporters
    2. semi automatic integration of new feature functions
    3. easy creation of new CSS formats for HTML exports
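
To make the "single-token and multi-token dictionary features" item above more concrete, here is a minimal sketch of what such features can compute. It is not SEM's implementation: the function names, the B/I/O marking scheme and the greedy longest-match strategy are assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def single_token_feature(tokens: List[str], lexicon: Set[str]) -> List[bool]:
    """Mark each token whose lowercased form appears in a single-token lexicon."""
    return [token.lower() in lexicon for token in tokens]


def multi_token_feature(tokens: List[str], entries: Set[Tuple[str, ...]]) -> List[str]:
    """Mark dictionary entries spanning several tokens with B/I, everything else with O."""
    max_len = max((len(entry) for entry in entries), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so "New York City" wins over "New York".
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in entries:
                matched = length
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


# Hypothetical example data, not taken from SEM's resources.
tokens = ["She", "moved", "to", "New", "York", "City", "in", "May"]
print(single_token_feature(tokens, {"may", "june"}))
print(multi_token_feature(tokens, {("New", "York"), ("New", "York", "City")}))
```

In SEM itself, such features are declared in the XML file mentioned above rather than written in Python; the sketch only shows the kind of per-token output a dictionary feature produces.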

First steps with SEM

  1. install SEM
    1. see install.md
    2. this compiles Wapiti and creates the necessary directories. Currently, SEM data is located in ~/sem_data
  2. run tests
    1. run python -m sem --test in a terminal
  3. run SEM
    1. run GUI (see "main features" above) and annotate "non-regression/fr/in/segmentation.txt"
    2. or run: python -m sem tagger resources/master/fr/NER.xml ./non-regression/fr/in/segmentation.txt -o sem_output

External resources used by SEM

  1. French Treebank by Abeillé et al. (2003): corpus used for POS and chunking.
  2. NER annotated French Treebank by Sagot et al. (2012): corpus used for NER.
  3. Lexique des Formes Fléchies du Français (LeFFF) by Clément et al. (2004): French lexicon of inflected forms with additional information, such as their POS tags and lemmas.
  4. Wapiti by Lavergne et al. (2010): linear-chain CRF library.
  5. setuptools: to install SEM.
  6. Tkinter: for GUI modules (they will not be installed if Tkinter is not present).
  7. Windows only: MinGW64: used to compile Wapiti on Windows.
  8. Windows only: POSIX threads for Windows: if you want to multithread Wapiti on Windows.
  9. GUI-specific: TkInter: if you want to launch SEM's GUI.

Planned changes (for latest changes, see changelog.md)

  1. add a tutorial (partly done in the "retrain SEM" section of the manual)
  2. add a lemmatiser
  3. add more unit tests
  4. improve segmentation
    1. handle URLs starting with a language or country code (ex: "en.wikipedia.org")
    2. handle URLs starting with subdomain (ex: "blog.[...]")

SEM references (with task[s] of interest)

  1. DUPONT, Yoann et PLANCQ, Clément. Un étiqueteur en ligne du Français. session démonstration de TALN-RECITAL, 2017, p. 15.
    1. Online interface
  2. (best RECITAL paper award) DUPONT, Yoann. Exploration de traits pour la reconnaissance d’entités nommées du Français par apprentissage automatique. RECITAL, 2017, p. 42.
    1. Named Entity Recognition (new, please use this one)
  3. TELLIER, Isabelle, DUCHIER, Denys, ESHKOL, Iris, et al. Apprentissage automatique d'un chunker pour le français. In : TALN2012. 2012. p. 431–438.
    1. Chunking
  4. TELLIER, Isabelle, DUPONT, Yoann, et COURMET, Arnaud. Un segmenteur-étiqueteur et un chunker pour le français. JEP-TALN-RECITAL 2012
    1. Part-Of-Speech Tagging
    2. chunking
  5. DUPONT, Yoann et TELLIER, Isabelle. Un reconnaisseur d’entités nommées du Français. session démonstration de TALN, 2014, p. 40.
    1. Named Entity Recognition (old, please do not use)

SEM references (bibtex format)

@inproceedings{dupont2017etiqueteur,
    title={Un {\'e}tiqueteur en ligne du fran{\c{c}}ais},
    author={Dupont, Yoann and Plancq, Cl{\'e}ment},
    booktitle={24e Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN)},
    pages={15--16},
    year={2017}
}
@inproceedings{dupont2018exploration,
  title={Exploration de traits pour la reconnaissance d’entit{\'e}s nomm{\'e}es du Fran{\c{c}}ais par apprentissage automatique},
  author={Dupont, Yoann},
  booktitle={24e Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN)},
  pages={42},
  year={2017}
}
@inproceedings{tellier2012apprentissage,
  title={Apprentissage automatique d'un chunker pour le fran{\c{c}}ais},
  author={Tellier, Isabelle and Duchier, Denys and Eshkol, Iris and Courmet, Arnaud and Martinet, Mathieu},
  booktitle={TALN2012},
  volume={2},
  pages={431--438},
  year={2012}
}
@inproceedings{tellier2012segmenteur,
  title={Un segmenteur-{\'e}tiqueteur et un chunker pour le fran{\c{c}}ais (A Segmenter-POS Labeller and a Chunker for French)[in French]},
  author={Tellier, Isabelle and Dupont, Yoann and Courmet, Arnaud},
  booktitle={Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 5: Software Demonstrations},
  pages={7--8},
  year={2012}
}
@article{dupont2014reconnaisseur,
  title={Un reconnaisseur d’entit{\'e}s nomm{\'e}es du Fran{\c{c}}ais (A Named Entity recognizer for French)[in French]},
  author={Dupont, Yoann and Tellier, Isabelle},
  journal={Proceedings of TALN 2014 (Volume 3: System Demonstrations)},
  volume={3},
  pages={40--41},
  year={2014}
}

License: MIT License
