bootphon / phonemizer

Simple text to phones converter for multiple languages

Home Page: https://bootphon.github.io/phonemizer/

pandas apply: bound by invisible resource (not cpu or io)

jabowery opened this issue

Describe the bug
While running a pandas Series.apply of phonemize, CPU utilization is stuck at around 35% (single core), and iotop shows virtually no disk activity.

Phonemizer version
phonemizer.__version__
Out[2]: '2.2.2'

System
cat /etc/debian_version
10.10

To reproduce

(2022) jabowery@ML:~/dev/2022$   $ sudo apt-get install festival espeak-ng mbrola
$: command not found
(2022) jabowery@ML:~/dev/2022$ sudo apt-get install festival espeak-ng mbrola
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  festlex-cmu festlex-poslex festvox-kallpc16k libestools2.5
Suggested packages:
  pidgin-festival festival-freebsoft-utils mbrola-voice cicero
The following NEW packages will be installed:
  espeak-ng festival festlex-cmu festlex-poslex festvox-kallpc16k libestools2.5 mbrola
0 upgraded, 7 newly installed, 0 to remove and 43 not upgraded.
Need to get 6,896 kB of archives.
After this operation, 23.2 MB of additional disk space will be used.
Do you want to continue? [Y/n] 
Get:1 http://us.archive.ubuntu.com/ubuntu focal/universe amd64 espeak-ng amd64 1.50+dfsg-6 [322 kB]
Get:2 http://us.archive.ubuntu.com/ubuntu focal/universe amd64 libestools2.5 amd64 1:2.5.0-8build1 [901 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu focal/universe amd64 festival amd64 1:2.5.0-4build1 [805 kB]
Get:4 http://us.archive.ubuntu.com/ubuntu focal/universe amd64 festlex-cmu all 2.4-2 [895 kB]
Get:5 http://us.archive.ubuntu.com/ubuntu focal/universe amd64 festlex-poslex all 2.4-1 [186 kB]
Get:6 http://us.archive.ubuntu.com/ubuntu focal/multiverse amd64 mbrola amd64 3.02b+dfsg-5 [173 kB]
Get:7 http://us.archive.ubuntu.com/ubuntu focal/universe amd64 festvox-kallpc16k all 2.4-1 [3,614 kB]
Fetched 6,896 kB in 1s (8,323 kB/s)          
Selecting previously unselected package espeak-ng.
(Reading database ... 336101 files and directories currently installed.)
Preparing to unpack .../0-espeak-ng_1.50+dfsg-6_amd64.deb ...
Unpacking espeak-ng (1.50+dfsg-6) ...
Selecting previously unselected package libestools2.5:amd64.
Preparing to unpack .../1-libestools2.5_1%3a2.5.0-8build1_amd64.deb ...
Unpacking libestools2.5:amd64 (1:2.5.0-8build1) ...
Selecting previously unselected package festival.
Preparing to unpack .../2-festival_1%3a2.5.0-4build1_amd64.deb ...
Unpacking festival (1:2.5.0-4build1) ...
Selecting previously unselected package festlex-cmu.
Preparing to unpack .../3-festlex-cmu_2.4-2_all.deb ...
Unpacking festlex-cmu (2.4-2) ...
Selecting previously unselected package festlex-poslex.
Preparing to unpack .../4-festlex-poslex_2.4-1_all.deb ...
Unpacking festlex-poslex (2.4-1) ...
Selecting previously unselected package mbrola.
Preparing to unpack .../5-mbrola_3.02b+dfsg-5_amd64.deb ...
Unpacking mbrola (3.02b+dfsg-5) ...
Selecting previously unselected package festvox-kallpc16k.
Preparing to unpack .../6-festvox-kallpc16k_2.4-1_all.deb ...
Unpacking festvox-kallpc16k (2.4-1) ...
Setting up mbrola (3.02b+dfsg-5) ...
Setting up libestools2.5:amd64 (1:2.5.0-8build1) ...
Setting up festival (1:2.5.0-4build1) ...
Setting up espeak-ng (1.50+dfsg-6) ...
Processing triggers for sgml-base (1.29.1) ...
Processing triggers for install-info (6.7.0.dfsg.2-5) ...
Processing triggers for doc-base (0.10.9) ...
Processing 1 added doc-base file...
Processing triggers for libc-bin (2.31-0ubuntu9.2) ...
Processing triggers for man-db (2.9.1-1) ...
Setting up festlex-poslex (2.4-1) ...
Setting up festlex-cmu (2.4-2) ...
Setting up festvox-kallpc16k (2.4-1) ...
(2022) jabowery@ML:~/dev/2022$ pip install names-generator
Collecting names-generator
  Downloading names_generator-0.1.0-py3-none-any.whl (26 kB)
Collecting cmdkit>=2.1.2
  Downloading cmdkit-2.6.0-py3-none-any.whl (28 kB)
Installing collected packages: cmdkit, names-generator
Successfully installed cmdkit-2.6.0 names-generator-0.1.0
(2022) jabowery@ML:~/dev/2022$ pip install phonemizer
Collecting phonemizer
  Downloading phonemizer-2.2.2-py3-none-any.whl (49 kB)
     |████████████████████████████████| 49 kB 3.2 MB/s 
Collecting segments
  Downloading segments-2.2.0-py2.py3-none-any.whl (15 kB)
Collecting joblib
  Using cached joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting attrs>=18.1
  Using cached attrs-21.2.0-py2.py3-none-any.whl (53 kB)
Collecting regex
  Downloading regex-2021.8.28-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB)
     |████████████████████████████████| 759 kB 8.2 MB/s 
Collecting clldutils>=1.7.3
  Downloading clldutils-3.9.0-py2.py3-none-any.whl (195 kB)
     |████████████████████████████████| 195 kB 23.8 MB/s 
Collecting csvw>=1.5.6
  Downloading csvw-1.11.0-py2.py3-none-any.whl (35 kB)
Collecting colorlog
  Downloading colorlog-6.4.1-py2.py3-none-any.whl (11 kB)
Collecting tabulate>=0.7.7
  Using cached tabulate-0.8.9-py3-none-any.whl (25 kB)
Requirement already satisfied: python-dateutil in /home/jabowery/anaconda3/envs/2022/lib/python3.9/site-packages (from clldutils>=1.7.3->segments->phonemizer) (2.8.2)
Collecting uritemplate>=3.0.0
  Downloading uritemplate-3.0.1-py2.py3-none-any.whl (15 kB)
Collecting rfc3986
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl (31 kB)
Collecting isodate
  Downloading isodate-0.6.0-py2.py3-none-any.whl (45 kB)
     |████████████████████████████████| 45 kB 8.4 MB/s 
Requirement already satisfied: six in /home/jabowery/anaconda3/envs/2022/lib/python3.9/site-packages (from isodate->csvw>=1.5.6->segments->phonemizer) (1.16.0)
Installing collected packages: uritemplate, rfc3986, isodate, attrs, tabulate, csvw, colorlog, regex, clldutils, segments, joblib, phonemizer
Successfully installed attrs-21.2.0 clldutils-3.9.0 colorlog-6.4.1 csvw-1.11.0 isodate-0.6.0 joblib-1.0.1 phonemizer-2.2.2 regex-2021.8.28 rfc3986-1.5.0 segments-2.2.0 tabulate-0.8.9 uritemplate-3.0.1
(2022) jabowery@ML:~/dev/2022$ ipython
Python 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from names_generator import generate_name
   ...: import pandas as pd
   ...: import re
   ...: df = pd.DataFrame(pd.Series(data=[re.sub(r'_','',generate_name()) for x in range(0,500000)],name="FULLNAME"))
   ...: from phonemizer import phonemize
   ...: df['FULLNAME_PHONEMES'] = df.FULLNAME.apply(phonemize)
   ...: 

Then look at both top and iotop

Expected behavior
Not to be bound by an invisible resource.

Hi,

I cannot reproduce your bug. I tested both apply(phonemize), which uses the festival backend, and apply(lambda text: phonemize(text, backend='espeak')), on 500 names (500000 is a bit much for a single test).

The festival backend is slow. Are you sure you don't see festival processes when you run htop?

Finally, it will be much faster if you call phonemize a single time on the whole dataset, with parallelization: df['FULL_PHONEMES'] = phonemize(df.FULLNAME.to_list(), njobs=4)
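
To illustrate the difference between the per-row and whole-dataset approaches, here is a minimal sketch. It does not call phonemizer itself: `CountingBackend` is a hypothetical stand-in that only counts how often it is invoked, and `phonemize_all` is an illustrative helper, not part of the library. The assumption is that each call to phonemize carries a fixed cost regardless of how much text it receives, so fewer calls should mean less overhead. The length check is a defensive assumption, not something reported in this thread.

```python
from typing import Callable, List, Sequence


def phonemize_all(
    texts: Sequence[str],
    phonemize_fn: Callable[..., List[str]],
    **kwargs,
) -> List[str]:
    """Phonemize every text with a single call to phonemize_fn.

    phonemize_fn is expected to behave like phonemizer.phonemize when
    given a list: one output string per input string, in the same order.
    """
    batch = list(texts)
    phones = phonemize_fn(batch, **kwargs)
    # Defensive check (assumption): fail loudly if the backend returns
    # a different number of lines than it was given.
    if len(phones) != len(batch):
        raise ValueError(f"expected {len(batch)} results, got {len(phones)}")
    return phones


class CountingBackend:
    """Hypothetical stand-in for phonemize that records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text, **kwargs):
        self.calls += 1
        # Lowercasing is a placeholder for real phonemization.
        if isinstance(text, str):
            return text.lower()
        return [t.lower() for t in text]


names = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]

# One call per row, which is what Series.apply(phonemize) does.
per_row = CountingBackend()
row_result = [per_row(name) for name in names]

# One call for the whole list.
batched = CountingBackend()
batch_result = phonemize_all(names, batched, njobs=4)

print(per_row.calls, batched.calls)  # prints "3 1"
```

With the real library, the same idea would be `phonemize_all(df.FULLNAME.to_list(), phonemize, njobs=4)`, with the returned list assigned back to a DataFrame column as in the one-liner above.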