When using multiprocessing pool.map, word._.negex raises "[E046] Can't retrieve unregistered extension attribute 'negex'"
K-7 opened this issue
Describe the bug
I am processing medical texts written by nurses and doctors, using the spaCy English() model and Negex to find the appropriate negations. The code works fine when I run it in a single process, but when I use multiprocessing to process texts simultaneously it raises the exception below:
```
  File "../code/process_notes.py", line 154, in multiprocessing_finding_negation
    pool_results = pool.map(self.process, split_dfs)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AttributeError: ("[E046] Can't retrieve unregistered extension attribute 'negex'. Did you forget to call the set_extension method?", 'occurred at index 2')
```
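For context, E046 is what spaCy raises when code reads a custom attribute (here `ent._.negex`, which Negex registers via `Span.set_extension`) in a process where `set_extension` was never called. A stdlib-only stand-in for that mechanism (this `Underscore` class is a toy illustration, not spaCy's actual implementation):

```python
class Underscore:
    """Toy stand-in for spaCy's Span._ custom-attribute mechanism."""
    _extensions = {}  # stands in for spaCy's global extension registry

    @classmethod
    def set_extension(cls, name, default=None):
        cls._extensions[name] = default

    def __getattr__(self, name):
        # mirrors the message spaCy raises for unregistered extensions
        if name not in self._extensions:
            raise AttributeError(
                f"[E046] Can't retrieve unregistered extension attribute "
                f"{name!r}. Did you forget to call the set_extension method?")
        return self._extensions[name]

u = Underscore()
try:
    u.negex  # nothing ever called set_extension('negex', ...) in this process
except AttributeError as e:
    msg = str(e)
```

If `set_extension('negex', ...)` never runs in the process doing the lookup, the attribute access fails exactly like this, no matter what happened in the parent process.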
To Reproduce
```python
def load_spacy_model(self):
    nlp = English()
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    ruler = EntityRuler(nlp)
    # adding labels and patterns to the entity ruler
    ruler.add_patterns(self.progressnotes_constant_objects.return_pattern_label_list(self.client))
    # adding the entity ruler to the spaCy pipeline
    nlp.add_pipe(ruler)
    preceding_negations = en_clinical['preceding_negations']
    following_negations = en_clinical['following_negations']
    # adding custom preceding negations to the default preceding negations
    preceding_negations += self.progressnotes_constant_objects.return_custom_preceding_negations()
    # adding custom following negations to the default following negations
    following_negations += self.progressnotes_constant_objects.return_custom_following_negations()
    # negation words used to check whether a noun chunk is negated
    negation_words = self.progressnotes_constant_objects.return_negation_words()
    negex = Negex(nlp, language='en_clinical', chunk_prefix=negation_words,
                  preceding_negations=preceding_negations,
                  following_negations=following_negations)
    # adding negex to the spaCy pipeline
    # input ----->|entityruler|----->|negex|-----> entities
    nlp.add_pipe(negex, last=True)
    self.nlp = nlp

def process(self, split_df):
    # this function runs inside the multiprocessing pool
    return split_df.apply(self.lambda_func, axis=1)

def lambda_func(self, row):
    """
    Runs inside a multiprocessing pool worker.
    Reads a single row of the dataframe,
    applies basic cleanup using a replace dict,
    and finds positive words, their respective start-end indices, and negative words.
    Positive words are the words mentioned in the keyword patterns.
    """
    row['clean_note'] = row['notetext']
    # passing the text through the NLP pipeline
    doc = self.nlp(row['clean_note'])
    neg_list = list()
    pos_list, pos_index_list = list(), list()
    for word in doc.ents:
        # segregating positive and negative words
        if not word._.negex:
            # populating the positive and positive-index lists
            pos_list.append(word.text)
            pos_index_list.append((word.start_char, word.end_char))
        else:
            neg_list.append(word.text)

p = os.cpu_count() - 1
pool = mp.Pool(processes=p)
split_dfs = np.array_split(notes_df, 25)  # notes_df is a pandas dataframe
pool_results = pool.map(self.process, split_dfs)
pool.close()
pool.join()
```
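One detail worth noting about the snippet above: `pool.map(self.process, ...)` pickles the bound method, and with it `self` (including the pipeline), into each worker. Unpickling rebuilds an object's state but does not re-run its constructor, so any registration done as a side effect of construction (the way Negex calls `Span.set_extension` when it is created) never happens in the worker. A toy, spaCy-free illustration of that pickling behavior (all names here are hypothetical):

```python
import pickle

class Pipeline:
    registry = {}  # class-level registry, like spaCy's global extension table

    def __init__(self):
        # registration is a side effect of construction, like Negex
        # calling Span.set_extension(...) when it is instantiated
        type(self).registry['negex'] = True
        self.name = 'en_clinical'

p = Pipeline()
blob = pickle.dumps(p)      # roughly what pool.map does to `self`
Pipeline.registry.clear()   # simulates a fresh worker's memory space
q = pickle.loads(blob)      # instance state arrives, but __init__ is not re-run
missing = 'negex' not in Pipeline.registry  # the registration did not travel
```

The unpickled object looks intact (`q.name` survives), yet the class-level registration made in the parent is absent in the "worker".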
Expected behavior
pos_list and neg_list should be populated without errors.
Desktop (please complete the following information):
- OS: macOS Catalina, 8 GB RAM, 1.6 GHz dual-core
Hey - I haven't run it using pool multiprocessing myself, have you come across this post? https://stackoverflow.com/questions/58294624/multiprocessing-with-textacy-or-spacy/58317741#58317741
My guess is that the negex extension isn't properly showing up in the different memory spaces.
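One common way around this is to build the pipeline inside each worker, via the `initializer` argument of `multiprocessing.Pool`, instead of shipping it from the parent, so the extension gets registered in every worker's own memory space. A stdlib-only sketch of the pattern (`REGISTRY` stands in for spaCy's extension registry; in the real code the initializer would call something like your `load_spacy_model`):

```python
import multiprocessing as mp

REGISTRY = {}  # stands in for spaCy's per-process extension registry

def init_worker():
    # runs once in every worker process, so the "extension" is
    # registered in that worker's own memory space; with spaCy this
    # is where you would build nlp and let Negex call set_extension
    REGISTRY['negex'] = True

def check(_):
    # stands in for reading word._.negex inside lambda_func
    return 'negex' in REGISTRY

def run():
    # 'fork' keeps this sketch self-contained when run as a script;
    # the same initializer pattern applies under 'spawn' (the macOS
    # default on Python 3.8+, which is why the parent's registrations
    # are invisible to the workers in the first place)
    with mp.get_context('fork').Pool(2, initializer=init_worker) as pool:
        return pool.map(check, range(4))

if __name__ == '__main__':
    print(run())
```

An alternative is to call `Span.set_extension('negex', default=False)` at module import time, so it runs in every process that imports the module, whatever the start method.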
Closing due to inactivity
http://sujitpal.blogspot.com/2020/10/entities-from-cord-19-using-dask.html
may be of interest to you