korjani / name_extraction

Extracting people's names from text

Name Extraction

Uses NLP methods to extract the names of all people mentioned in text content, along with their roles (when available).

The output is a tuple of the person's name, their role, and the name of the file.

Requirements

Besides installing the packages listed in requirements.txt,

pip install -r requirements.txt

the user should also install the spaCy model:

python -m spacy download en_core_web_lg

How to run the code

python name_extraction.py --directory=[directory name] > [results.txt]

e.g.

python name_extraction.py --directory='data/' >results.txt
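For readers who want to see how a `--directory` option like the one above can be read, here is a minimal sketch using `argparse`. The function name `parse_args` and the help text are assumptions made for illustration; the actual script may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `parse_args` is a hypothetical helper name."""
    parser = argparse.ArgumentParser(
        description="Extract people names and roles from text files."
    )
    # Only the --directory flag is taken from the usage line above.
    parser.add_argument(
        "--directory",
        required=True,
        help="folder containing the input text files",
    )
    return parser.parse_args(argv)
```

Calling `parse_args(["--directory=data/"])` returns a namespace whose `directory` attribute is `"data/"`.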

Methodology

1- Iterate through the folder and collect the documents.
2- Split each document into sentences and tokenize each sentence.
3- Find the entities inside each sentence.
4- If the entity label is "person", check for the role of the person (it should be PROPN).
5- Chunk the sentence where the entity was found and look for an appos (appositive) or a component of the name.
6-
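The steps above can be sketched roughly as follows with spaCy. This is a simplified illustration, not the repository's code: the names `find_people` and `extract_from_directory` are hypothetical, the `*.txt` file pattern is an assumption, and reading the role from an `appos` child of the name is only one possible interpretation of the chunking step.

```python
from pathlib import Path


def find_people(doc, filename):
    """Return (name, role, filename) tuples for PERSON entities in a spaCy Doc."""
    results = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        role = None
        # Assumption: an appositive attached to the head token of the name,
        # as in "Jane Doe, chief engineer, ...", is treated as the role.
        for child in ent.root.children:
            if child.dep_ == "appos":
                role = "".join(tok.text_with_ws for tok in child.subtree).strip()
                break
        results.append((ent.text, role, filename))
    return results


def extract_from_directory(directory, model="en_core_web_lg"):
    """Run find_people over every .txt file in a folder (file pattern is assumed)."""
    import spacy  # imported here so find_people can be used without the model

    nlp = spacy.load(model)
    rows = []
    for path in sorted(Path(directory).glob("*.txt")):
        doc = nlp(path.read_text(encoding="utf-8"))
        rows.extend(find_people(doc, path.name))
    return rows
```

When no appositive is found, the role is left as `None`, which matches the "when available" wording in the description.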

Evaluation and issues

In general, finding the role is less accurate than finding the name. The role is extracted by chunking the part of the text where the name was found. The model can be extended by using POS tags to find the role and by preprocessing the text better.

No preprocessing has been done on the input text. For example, “Don t” is sometimes treated as a name because the apostrophe is missing and the model confuses it with the name “Don”.
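One possible preprocessing step for this case is to restore the missing apostrophe before running the model. The sketch below is an assumption about how that could be done, not part of the repository; the helper name `restore_nt_contractions` is hypothetical, and the pattern only covers contractions ending in "n t".

```python
import re

# Matches a word ending in "n" followed by a lone "t", e.g. "Don t" or "can t".
_NT_PATTERN = re.compile(r"\b([A-Za-z]+n) t\b")


def restore_nt_contractions(text):
    """Rewrite "Don t" as "Don't" so it is less likely to be tagged as a name."""
    return _NT_PATTERN.sub(r"\1't", text)
```

The rule is deliberately narrow, since a broader pattern could join words that were never a contraction.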

Adding hard rules may increase the accuracy of role detection, e.g., treating the words between semicolons after a name as the role.
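A hard rule of this kind could look like the sketch below. It is only an illustration under assumptions: the helper name `role_after_name` is hypothetical, and the rule is read as "the text between the first two delimiters that follow the name", with the delimiter left configurable.

```python
import re


def role_after_name(sentence, name, delimiter=";"):
    """Return the text between the two delimiters following `name`, or None."""
    d = re.escape(delimiter)
    pattern = re.escape(name) + r"\s*" + d + r"\s*([^" + d + r"]+?)\s*" + d
    match = re.search(pattern, sentence)
    return match.group(1) if match else None
```

Passing `delimiter=","` applies the same rule to comma-separated roles.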

The model also treated a couple of brand names, such as “youtube” and “Android”, as people names. This can be solved by using a predefined dictionary to remove them.
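A minimal sketch of such a dictionary-based filter is shown below. The set contents come from the two examples above; the names `BRAND_BLOCKLIST` and `drop_known_brands` are hypothetical, and a real list would need to be much longer.

```python
# Hypothetical blocklist seeded with the two brand names mentioned above.
BRAND_BLOCKLIST = {"youtube", "android"}


def drop_known_brands(names, blocklist=BRAND_BLOCKLIST):
    """Remove extracted names that match a known brand, ignoring case."""
    return [name for name in names if name.casefold() not in blocklist]
```

The comparison is case-insensitive, so both “youtube” and “YouTube” are removed.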

From 7 text files (more than 8,000 words), the model extracted 21 names. One of them was completely wrong (“Don t”), 1 name had an added prefix, and 2 roles were not selected.

Model accuracy ~ 79%

License: MIT License


Languages

Language:Python 100.0%