Use NLP methods to extract the names of all people mentioned in text content, along with their roles (when available).
The output is a tuple of the person's name, their role, and the name of the file.
In addition to installing the packages listed in requirements.txt,
pip install -r requirements.txt
the user should also install the spaCy model:
python -m spacy download en_core_web_lg
python name_extraction.py --directory=[directory name] > [results.txt]
e.g.
python name_extraction.py --directory='data/' >results.txt
1- Iterate through the folder and grab the documents
2- Find sentences and tokenize each sentence
3- Find entities inside the sentences
4- If the entity label is "person", check for the role of the person (it should be PROPN)
5- Chunk the sentence where the entity label is selected and find the appos or a component of the name
6-
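The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of name_extraction.py: the function names person_roles and main, the "*.txt" file pattern, and the choice to read the role from an "appos" child of the name's head token are all assumptions made for the example.

```python
from pathlib import Path


def person_roles(doc, filename):
    """Return (name, role, filename) tuples for PERSON entities in a spaCy Doc.

    Hypothetical helper: the role is taken from an appositive ("appos")
    attached to the head token of the name, or None if there is none.
    """
    results = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        role = None
        for child in ent.root.children:
            if child.dep_ == "appos":
                # Join the whole appositive phrase, e.g. "the CEO"
                role = " ".join(tok.text for tok in child.subtree)
                break
        results.append((ent.text, role, filename))
    return results


def main(directory):
    # Imported here so person_roles can be used without loading the model
    import spacy

    nlp = spacy.load("en_core_web_lg")
    # Assumption: the documents are plain .txt files in one folder
    for path in sorted(Path(directory).glob("*.txt")):
        doc = nlp(path.read_text(encoding="utf-8"))
        for row in person_roles(doc, path.name):
            print(row)
```

Calling main("data/") would print one tuple per detected person; names without an appositive get None as the role.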
In general, finding the role is less accurate than finding a name. The role is extracted by chunking the part of the text where the name is found. The model could be extended by considering POS tags when finding the role and by better preprocessing of the text.
No preprocessing has been done on the input text. For example, “Don t” is sometimes treated as a name: because of the missing punctuation, the model confuses it with the name “Don”.
Adding hard rules may increase the accuracy of role detection, e.g., treating the words between semicolons after a name as a role.
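One possible reading of the semicolon rule is sketched below. The function name role_after_name is hypothetical, and the rule is interpreted here as "the text between the first two semicolons that follow the name", which may not be exactly what was intended.

```python
import re


def role_after_name(text, name):
    """Hypothetical hard rule: return the text enclosed by the first pair
    of semicolons that directly follows `name`, or None if no such span exists."""
    pattern = re.escape(name) + r"\s*;\s*([^;]+?)\s*;"
    match = re.search(pattern, text)
    return match.group(1) if match else None
```

For instance, role_after_name("Jane Doe; Chief Editor; London", "Jane Doe") yields "Chief Editor". Such a rule would only be applied as a fallback when the parser finds no role.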
The model also treated a couple of brand names, such as “youtube” or “Android”, as person names; this can be solved by using a predefined dictionary to remove them.
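A predefined dictionary of this kind could look like the sketch below. The names BRAND_BLOCKLIST and drop_known_brands are made up for illustration, and the list only contains the two brands mentioned above; a real list would need to be maintained by hand.

```python
# Assumption: a small, manually curated set of lowercase brand names
BRAND_BLOCKLIST = {"youtube", "android"}


def drop_known_brands(names, blocklist=BRAND_BLOCKLIST):
    """Remove extracted names that match a known brand, ignoring case."""
    return [name for name in names if name.strip().lower() not in blocklist]
```

The comparison is case-insensitive so that both “youtube” and “YouTube” are filtered out.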
In an analysis of 7 text files (8000+ words), the model extracted 21 names: one of them was completely wrong (“Don t”), 1 name had an added prefix, and 2 roles were not selected.