jd-coderepos / cl-shorttitles-parser

CL-ShortTitles-Parser

About

CL-ShortTitles-Parser parses phrases from the titles of English-language Computational Linguistics scholarly articles and types them as scientific entities. Each entity is assigned one of the following seven semantic concepts: research problem, solution, resource, language, tool, method, and dataset. The parser uses a set of cue words to chunk the phrases in a title. The cues are 'to|of|on|for|from|with|by|via|through|using|in|as|over|against|within|under|at|including|towards|across|involving|representing|between'. It parses only those titles that contain 0 or 1 such cues. It is therefore a simplified and more precise version of the earlier Titles Parser, which parsed titles with any number of cues. Please see the History section for more information on the older parser.
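The cue-based filtering and chunking described above can be illustrated with a short sketch. This is not the repository's code: the function names (count_cues, is_short_title, chunk_title) are hypothetical, and whole-word, case-insensitive matching is an assumption about how cues are detected. Only the cue list itself is taken from the description.

```python
import re

# Cue words copied from the description above.
CUES = ("to|of|on|for|from|with|by|via|through|using|in|as|over|against|"
        "within|under|at|including|towards|across|involving|representing|between")

# Assumption: a cue counts only as a whole word, matched case-insensitively.
CUE_PATTERN = re.compile(r"\b(?:" + CUES + r")\b", re.IGNORECASE)


def count_cues(title):
    """Return the number of cue-word occurrences in a title."""
    return len(CUE_PATTERN.findall(title))


def is_short_title(title):
    """True if the title has 0 or 1 cues, the only case this parser handles."""
    return count_cues(title) <= 1


def chunk_title(title):
    """Split a title into phrases around its cue words (the cues are dropped)."""
    parts = CUE_PATTERN.split(title)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    for t in ["Statistical Parsing",
              "Neural Machine Translation for Low-Resource Languages",
              "Learning to Rank with Neural Networks for Question Answering"]:
        print(count_cues(t), is_short_title(t), chunk_title(t))
```

Note that this simple pattern would also count a cue inside a hyphenated compound such as "state-of-the-art"; the actual parser may handle such cases differently. Typing each chunk as one of the seven concepts is a separate step that is not sketched here.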

This system is developed as part of the Open Research Knowledge Graph Project at TIB.

The code released in this repository is the standalone version of the parser.

History

The parser was originally developed to extract six semantic concepts from titles with any number of cues. That version of the parser is hosted at cl-titles-parser. The six types it extracted were: research problem, solution, resource, language, tool, and method.

Usage

CL-ShortTitles-Parser features a native Python implementation requiring minimal effort to set up. Please see usage instructions below.

  • Requirements

    • Python (3.7 or higher)

Clone this repository locally and run the parser as follows:

python parse_titles.py <input_titles_file> <output_data_dir>

where input_titles_file is a file containing the paper titles to be parsed, one title per line, and output_data_dir is a user-specified local directory to which the parser writes its output. Sample data are provided in the data folder.
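As a rough illustration of preparing these two arguments, the sketch below writes a list of titles to a file with one title per line, creates an output directory, and builds the command shown above. The helper name prepare_run and the file names titles.txt and output are hypothetical and not part of the repository; the command is only constructed, not executed.

```python
from pathlib import Path


def prepare_run(titles, workdir):
    """Write titles one per line and return the parser command as a list.

    The file and directory names used here are illustrative assumptions.
    """
    workdir = Path(workdir)
    output_dir = workdir / "output"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Collapse internal whitespace so that each title occupies exactly one line.
    cleaned = [" ".join(t.split()) for t in titles if t.strip()]

    input_file = workdir / "titles.txt"
    input_file.write_text("\n".join(cleaned) + "\n", encoding="utf-8")

    # Same argument order as the usage line: input file first, output dir second.
    return ["python", "parse_titles.py", str(input_file), str(output_dir)]
```

The returned list can be passed to subprocess.run from the root of the cloned repository, where parse_titles.py is located.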
