common-voice / commonvoice-fr

Tooling for producing French dataset for Common Voice

Numbers translated into words and other irreversible preprocessing steps

funboarder13920 opened this issue

Hello,

I have an issue with some of the preprocessing steps applied in the Common Voice FR dataset.

If this is the right place to ask about it:
My issue is mainly about numbers being translated into words. This operation cannot be reverted safely. Handling numbers is not simple; I would rather use an end-to-end approach than preprocess and post-process the data.

People who need this step can always apply it later on.
There are some other irreversible preprocessing steps, but the num2words conversion is the most annoying one for ASR.
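
To illustrate the problem, here is a minimal sketch (assuming the Python num2words package, which I believe is what the preprocessing scripts rely on):

```python
from num2words import num2words

print(num2words(1892, lang="fr"))  # mille huit cent quatre-vingt-douze
print(num2words(2024, lang="fr"))  # deux mille vingt-quatre

# The reverse mapping is not well defined: the same spoken words may come
# from "1892", "1 892", or a sentence that already spelled the number out,
# so the original written form cannot be recovered reliably.
```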

Is converting numbers to words usual in Common Voice datasets?
Is it possible to remove this step, or is there another way to solve my issue? Can I find a version of the text without this preprocessing?

Kind regards,

Unfortunately, it was decided, not just for Common Voice FR but for every Common Voice language, that it was better to perform this step: otherwise numbers end up being ambiguous in how they can be spoken, which leads to a poor contributor experience and degrades the quality of the dataset.

Thank you.

I understand that a few years ago this could have been an issue, but current models can handle the ambiguity, and contributors can also handle it with enough context. Whether it degrades the quality of the dataset depends a lot on the task.
The initial sentence contains more information than the processed one, and there is no way to go back to the initial sentence accurately, whereas there is no problem with applying num2words to the unprocessed sentence. Keeping both the unprocessed sentence and the processed one would have been interesting.

I don't think discussing that here will change any of the choices made for the Common Voice datasets.

Is there a way to rebuild the text part from scratch on my side and find the matching between the segments (with the client_id) from the TSV?

Thank you.

> I understand that a few years ago this could have been an issue, but current models can handle the ambiguity, and contributors can also handle it with enough context. Whether it degrades the quality of the dataset depends a lot on the task.

Maybe some models can. For contributors, I can assure you that this is actually not true: you can't be sure how 1892 will be said, for instance « mille huit cent quatre-vingt-douze » versus « dix-huit cent quatre-vingt-douze ».

> The initial sentence contains more information than the processed one, and there is no way to go back to the initial sentence accurately, whereas there is no problem with applying num2words to the unprocessed sentence. Keeping both the unprocessed sentence and the processed one would have been interesting.

Unfortunately, a choice needed to be made. While I agree that keeping both the raw and the transformed sentences would have helped, that ship sailed a long time ago.

> I don't think discussing that here will change any of the choices made for the Common Voice datasets.

> Is there a way to rebuild the text part from scratch on my side and find the matching between the segments (with the client_id) from the TSV?

This is one of the reasons the early dataset submissions we performed were done via scripts in this repo, so you should be able to reproduce them based on that. I hope this helps you recover some of the data.

For other data, such as sentences submitted directly by contributors on the Sentence Collector or collected via Wikipedia scraping, I can't help there.
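
As a very rough sketch of how the matching could work once you have regenerated the raw sentences (the normalize() function below is just a placeholder for whatever preprocessing the scripts actually apply, and the column names assume the usual Common Voice release TSV):

```python
import pandas as pd

def normalize(text: str) -> str:
    # Hypothetical stand-in for the repo's actual preprocessing
    # (num2words conversion, etc.); replace with the real pipeline.
    return text

# Standard Common Voice release TSV (client_id, path, sentence, ...)
validated = pd.read_csv("validated.tsv", sep="\t")

# Sentences regenerated from the repo's extraction scripts (example value)
raw = pd.DataFrame({"raw_sentence": ["Il est né en 1892 à Paris."]})
raw["sentence"] = raw["raw_sentence"].map(normalize)

# Join on the normalized text to recover client_id / audio path per raw sentence
matched = validated.merge(raw, on="sentence", how="inner")
print(matched[["client_id", "path", "sentence", "raw_sentence"]].head())
```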

As an aside, I'm really curious to know why it is such a big deal in your case. Maybe you should also raise this directly with the Common Voice team; they might be able to make the tool evolve.

You are welcome to join our CommonVoice-FR or 🐸/STT room on Matrix to talk about this issue more easily.

Hello, it is not a big deal, as I have other data sources with untransformed numbers. At least punctuation and casing are preserved in Common Voice.

I'm working on ASR, and it's important for me to keep numbers unprocessed in the training dataset because of the difficulty of converting numbers back from words to digits.
With context, ASR models can understand the meaning of numbers and write/present them accordingly. In my case, the information about how a human says a number is important; the ambiguity is part of the problem the models need to solve.
In the speech-to-text direction, it doesn't really matter that 1892 can be pronounced in different ways, as long as the model understands that all of them yield the same number.
I could handle this with another model, but I prefer to avoid stacking models.

For now, I will filter out segments containing numbers. It seems that only ~6% of the segments contain numbers requiring a conversion.
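
For reference, a rough sketch of that filtering (the word list is just a heuristic I made up for illustration, and it will also drop sentences that legitimately spell numbers out):

```python
import pandas as pd

# Conservative list of French number words; "un"/"une" and "neuf" are left out
# because they are too ambiguous (article / "new").
NUMBER_WORDS = (
    r"\b(deux|trois|quatre|cinq|six|sept|huit|dix|onze|douze|treize|quatorze|"
    r"quinze|seize|vingt|trente|quarante|cinquante|soixante|cent|cents|mille|"
    r"million|millions|milliard|milliards)\b"
)

validated = pd.read_csv("validated.tsv", sep="\t")
has_number = validated["sentence"].str.contains(NUMBER_WORDS, case=False, regex=True, na=False)

kept = validated[~has_number]
print(f"Dropped {has_number.mean():.1%} of segments")
```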

Thank you for your help.