r9y9 / nnmnkwii

Library to build speech synthesis systems designed for easy and fast prototyping.

Home Page: https://r9y9.github.io/nnmnkwii/latest/

Frame shift is 100ns, not microseconds

PluieElectrique opened this issue

Throughout io/hts.py and frontend/merlin.py, there are many variables called frame_shift_in_micro_sec. However, the HTK Book (page 87) states:

start denotes the start time of the labelled segment in 100ns units, end denotes the end time in 100ns units

Indeed, the example audio and label files confirm that the units are 100 ns:

>>> from nnmnkwii.io import hts
>>> from nnmnkwii.util import example_label_file, example_audio_file
>>> from scipy.io import wavfile
>>> label = hts.load(example_label_file())
>>> fs, y = wavfile.read(example_audio_file())
>>> len(y) / fs               # Length of file in seconds
3.095
>>> label[-1][1] / 1e6        # Assuming microseconds
30.75
>>> label[-1][1] * 100 / 1e9  # Assuming 100ns
3.075

If the labels were in microseconds, the last label would end at 30.75 s, ten times longer than the 3.095 s audio file. I'm happy to submit a PR to fix the variables and docs. I'm just not sure what the variable should be called: frame_shift_in_hundred_nano_sec is too awkward. Maybe just call it frame_shift and note the unit in the docs?

Hi, I am sorry for the late reply, and thank you for catching this! You are completely right. I was aware of the wrong variable name, but I was too lazy to fix it :(

I think renaming the variable to frame_shift and adding a note about the unit is a reasonable choice. I would appreciate it if you could make a PR!
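For illustration, here is a minimal sketch of what the naming discussed above could look like: a parameter called frame_shift whose unit (100 ns, as in the HTK label times) is stated in the docstring. The helper names, the constant, and the default of 50000 (which would correspond to 5 ms) are hypothetical and are not taken from nnmnkwii's actual API. The end time 30750000 is the value implied by the REPL output in the report.

```python
# Hypothetical helpers for illustration only; not nnmnkwii's API.
HTK_UNITS_PER_SECOND = 10_000_000  # one HTK time unit is 100 ns


def htk_time_to_seconds(t):
    """Convert an HTK label time (in 100 ns units) to seconds."""
    return t / HTK_UNITS_PER_SECOND


def num_frames(start, end, frame_shift=50000):
    """Return the number of whole frames between start and end.

    All arguments are in 100 ns units, the same unit HTK labels use.
    The default frame_shift of 50000 is an assumed value (5 ms).
    """
    return (end - start) // frame_shift


end_time = 30750000  # last end time implied by the example in the report
print(htk_time_to_seconds(end_time))  # 3.075
print(num_frames(0, end_time))        # 615
```

Keeping the unit out of the name and in the docstring means the label times and the frame shift can be divided directly, with no conversion factor in between.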