BUG: Short-term analysis window setting

hokiedsp opened this issue 2 years ago

There is a floating-point truncation error in setting up the analysis window in Sampled.cpp and Sound_to_Formant.cpp (there may be others but we came across these two).

The number of analysis frames is computed by (L77 in Sampled.cpp):

*numberOfFrames = Melder_ifloor ((myDuration - windowDuration) / timeStep) + 1;

This produces the correct outcome as long as mod(myDuration - windowDuration, timeStep) != 0. When the analysis windows line up perfectly with the duration of the audio data, the division may yield a value slightly smaller than the theoretical one, and flooring it then gives one frame too few.

The same issue arises in L303 in Sound_to_Formant.cpp

@YannickJadoul

Yannick Jadoul · Answer 1 · Wed Feb 23 2022 23:22:35 GMT+0800 (China Standard Time)

Let me add the concrete example that @hokiedsp and I were debugging:
Starting out with a 44100 Hz fragment, extracting 0.4 seconds from it, results in 17640 samples with sample period 2.2675736961451248e-05. 17640 * 2.2675736961451248e-05 == 0.4, no problem there.
However, when calculating formants with the default formant ceiling (5500 Hz) the audio gets resampled to 11 kHz, and a fragment of 0.4 s will have 4400 samples, and 4400 * 9.09090909090909e-05 == 0.39999999999999997 != 0.4.

I can understand where all of this is coming from, of course. The issue here is that the line that @hokiedsp mentioned floors a calculation using this length, so this rounding error costs us the "predictability" of where the samples will be located.
In our example with 0.04 s (half) window length and 0.02 s steps, this means that there will be 16 formant estimate samples (0.05, 0.07, 0.09, 0.11, 0.13, ..., 0.33, 0.35) instead of the 17 that would perfectly fit based on the parameters (0.04, 0.06, 0.08, 0.10, 0.12, ..., 0.32, 0.34, 0.36).

Yannick Jadoul · Answer 2 · Thu Feb 24 2022 03:29:24 GMT+0800 (China Standard Time)

> the default formant ceiling (5500 Hz) the audio gets resampled to 11 kHz,

Just to clarify, @hokiedsp pointed out to me that this resampling does not inherently cause the rounding error. But it does show that even though the original fragment didn't suffer from this, the resampling can cause this, and kind of hides this from the user aiming for a certain sampling.

Paul Boersma · Answer 3 · Thu Jul 28 2022 03:39:19 GMT+0800 (China Standard Time)

You will have to live with this. The number 0.4 cannot be represented accurately in 64 binary digits. A piece of sound that you think is 0.4 seconds long will contain either 4400 or 4401 samples. This is no mistake. If you worry about reproducibility, don't use a duration that looks like it's an integer number of samples long. You say "perfectly fit", but the first and/or last of your 17 frames may lie outside [0.04, 0.36] seconds. If you think there exists a solution to this problem, please tell me what it is.

Kesh Ikuma · Answer 4 · Thu Jul 28 2022 10:27:18 GMT+0800 (China Standard Time)

@PaulBoersma - Thanks for your response. I was afraid that ("You have to live with this") would be the answer as there is no easy solution so long as Praat processes time variables in the continuous-time domain.

> If you think there exists a solution to this problem, please tell me what it is.

The best solution IMHO is to implement all temporal operations in discrete time (i.e., with integer sample indices instead of time in seconds).

Here is my rough sketch of the Sampled_shortTermAnalysis() function (I'm cutting out assert calls for ease of reading):

void Sampled_shortTermAnalysis (Sampled me, double windowDuration, double timeStep, integer *numberOfFrames, double *firstTime) {

  // convert parameters to discrete-time duration in samples
  int nwin = round(windowDuration / my dx);
  int noffset = round(timeStep / my dx);

  // this is now all integer arithmetic w/out any numerical issue
  *numberOfFrames = (my nx - nwin) / noffset + 1;

  // centering of the analysis overall span in discrete time as well 
  int ntotal = (*numberOfFrames - 1) * noffset + nwin; // roughly equiv to thyDuration
  int n1st = (my nx - ntotal) / 2; // first sample index (zero-based)

  // convert to continuous time
  *firstTime = (n1st + 0.5) * my dx;
}

Paul Boersma · Answer 5 · Thu Jul 28 2022 19:15:40 GMT+0800 (China Standard Time)

Working in discrete time would void the physics. Would you like users to see and compute with sample numbers rather than seconds?

Your solution works with seconds, though, and that is good. But it has the same rounding problem.

Paul Boersma · Answer 6 · Thu Jul 28 2022 19:24:46 GMT+0800 (China Standard Time)

OK, I clicked on "Reopen" while typing... Your solution has several rounding problems. One of them is that the time step is rounded to a whole number of samples, which is very imprecise, as the timing error adds up over the course of the sound. This adds to the smaller rounding problem, which is that you can no longer centre analysis windows in between sample points (this is violated in Praat here and there anyway, namely where it hardly hurts). The typical analysis in Praat would become 10000 times less accurate with your proposal than it is now. On top of that, I don't see what problem your proposal solves.

Paul Boersma · Answer 7 · Thu Jul 28 2022 19:33:11 GMT+0800 (China Standard Time)

There is also an inconsistency in your remarks. You write "Starting out with a 44100 Hz fragment, extracting 0.4 seconds from it, results in 17640 samples with sample period 2.2675736961451248e-05." whereas you also write "the 17 that would perfectly fit based on the parameters (0.04, 0.06, 0.08, 0.10, 0.12, ..., 0.32, 0.34, 0.36)". If you find the first correct, then the second would have (0.36 - 0.04) / 0.02 = 16 frames, and if you find the second correct, then the first would have 17641 samples.

Paul Boersma · Answer 8 · Thu Jul 28 2022 19:45:25 GMT+0800 (China Standard Time)

The cause of your problem seems to be "decimal hallucination", the idea being that in the closed interval [0.3 seconds, 0.7 seconds] we can fit five times that are spaced by 0.1 seconds, i.e. 0.3, 0.4, 0.5, 0.6 and 0.7 seconds. However, in the open interval (0.3 seconds, 0.7 seconds), we can fit only four times, namely 0.35, 0.45, 0.55 and 0.65 seconds. Now, neither 0.3 seconds nor 0.7 seconds can be represented accurately as floating-point numbers, so that the computation 0.3 + 4*0.1 can turn up less than, equal to, or greater than 0.7 seconds, which means that the distinction between open and closed intervals is moot. If a candidate time lies approximately on the edge of the interval, then it has a 50 percent chance of fitting inside or not fitting inside. Such edge cases will continue to exist as long as time is measured in seconds. The decimal hallucination makes this somehow feel worse for numbers that we tend to write like "0.4", a representation that suggests more precision than "3.1415926535897932385". Fortunately, the "short term analysis" computation in Praat is typically used for windowing, and those windows go to zero at the edges, so that it never makes a difference. In fact, we actively employ methods for which these edge decisions don't matter, i.e. where integer consequences of floating-point rounding cannot make differences.
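The edge effect described here can be tried out in a few lines of standard C++. This is only an illustrative sketch, not Praat code: `frames_by_floor` and `show_quotient` are hypothetical helpers, and IEEE 754 double arithmetic is assumed.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical helper (not a Praat function): number of points spaced by
// `step` that fit between `start` and `end`, counted as floor(span / step) + 1.
long frames_by_floor(double start, double end, double step) {
    return static_cast<long>(std::floor((end - start) / step)) + 1;
}

// Prints the quotient with 17 significant digits, so that the value actually
// stored in the double becomes visible next to the resulting count.
void show_quotient(double start, double end, double step) {
    std::printf("(%g - %g) / %g = %.17g -> %ld points\n",
                end, start, step, (end - start) / step,
                frames_by_floor(start, end, step));
}
```

Under that assumption, `(0.7 - 0.3) / 0.1` evaluates to a value just under 4, so `frames_by_floor(0.3, 0.7, 0.1)` counts four points even though five decimal values (0.3, 0.4, 0.5, 0.6, 0.7) appear to fit.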

Kesh Ikuma · Answer 9 · Thu Jul 28 2022 23:36:37 GMT+0800 (China Standard Time)

> One of them is that the time step is rounded to a whole number of samples, which is very imprecise, as the timing error adds up over the course of the sound.

Good point. I forgot that users do provide timeStep (while windowDuration is usually obscured) and I agree staying accurate to their setup is important. Here is the second take:

void Sampled_shortTermAnalysis (Sampled me, double windowDuration, double timeStep, integer *numberOfFrames, double *firstTime) {

  // convert parameters to discrete-time duration in samples
  int nwin = round(windowDuration / my dx);
  double noffset = timeStep / my dx;

  // evaluate the number of frames
  int navail = my nx - nwin + 1; // range of samples to place the first sample of the frames
  int nframes = int(navail / noffset); // conservative estimate
  if (nframes*noffset <= navail-noffset) nframes++; // make sure max number of samples are covered
  *numberOfFrames = nframes;

  // centering of the analysis overall span in discrete time as well 
  int ntotal = nframes * noffset + nwin - 1; // roughly equiv to thyDuration
  int n1st = (my nx - ntotal) / 2; // first sample index (zero-based)

  // convert to continuous time
  *firstTime = (n1st + 0.5) * my dx;
}

The key line to resolve my issue is the if statement. Without it, there is a chance of leaving samples on the table unanalyzed due to the floating-point precision issue. I think the same safeguarding could be implemented in continuous time, but it's easier for me to illustrate the issue with samples rather than with abstract time intervals. The translated version should look something like this:

  double tavail = myDuration - windowDuration + my dx;
  int nframes = int(tavail / timeStep);
  if (nframes * timeStep <= tavail - timeStep) nframes++;
  *numberOfFrames = nframes;

> This adds to the smaller rounding problem, which is that you can no longer centre analysis windows in between sample points (this is violated in Praat here and there anyway, namely where it hardly hurts).

I'm not quite following this. If it is on how firstTime is set, then please ignore that part of the code.

> On top of that, I don't see what problem your proposal solves.

I hope this version better illustrates what the problem is and how to solve it. (I think not maintaining timeStep in the first take unfocused the discussion, my bad.)

Paul Boersma · Answer 10 · Thu Jul 28 2022 23:38:40 GMT+0800 (China Standard Time)

In case none of the above convinces you, let's try to give two examples. Imagine a sound of 1 second length, sampled at 10 kHz, i.e. nx = 10000 and dx = 0.0001. Case 1: suppose we need a window duration of 4 milliseconds and a time step of 1 millisecond, so that your nwin is 40 and your noffset is 10. Your number of frames is then computed as int ((10000 - 40) / 10) + 1, which is 997. Praat computes floor ((1.0 - 0.004) / 0.001) + 1, which is 996 or 997 depending on floating-point rounding. This is the case for which you can claim better reproducibility; it is an edge case for Praat and a very safe rounding example for your proposal.

Now consider Case 2, which, by contrast, comes with safe rounding for Praat and constitutes an edge case for your example: a window duration of 4.05 milliseconds and a time step of 1.05 milliseconds. Praat always computes the number of frames as floor ((1.0 - 0.00405) / 0.00105) + 1, which is 949, independent of rounding. In your proposal, nwin becomes 40 or 41 (depending on rounding), and noffset becomes 10 or 11 (depending on rounding). This leads to a number of frames of ((10000 - 40 or 41) / 10 or 11) + 1, which is 906 in 50 percent of the cases, 996 in 25 percent of the cases, and 997 in the remaining 25 percent of the cases.

The resulting integer number of frames in Praat is therefore never more than 0.5 away from its theoretical floating-point value (996.5 and 949.0[230952]*), whereas in your proposal it can be up to 43.023 away from that value. Hence, Praat's computation in the edge cases of this example is 86 times more accurate than yours. A detailed analysis like this contributed to why we chose to implement Sampled_shortTermAnalysis() the way we did in 1992. We are open to improvements, of course.
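The arithmetic of Case 2 can be checked with a small standalone C++ sketch. Both functions are hypothetical stand-ins written for this comparison: `frames_in_seconds` follows the floor formula quoted at the top of the thread, and `frames_in_samples` follows the integer formula of the first proposal, with the rounded `nwin` and `noffset` passed in directly.

```cpp
#include <cmath>

// Count computed in seconds, as in the formula quoted from Sampled.cpp.
long frames_in_seconds(double duration, double windowDuration, double timeStep) {
    return static_cast<long>(std::floor((duration - windowDuration) / timeStep)) + 1;
}

// First proposal: window and step are already whole numbers of samples,
// so the count uses integer division only.
long frames_in_samples(long nx, long nwin, long noffset) {
    return (nx - nwin) / noffset + 1;
}
```

For `nx = 10000`, the four combinations of `nwin` (40 or 41) and `noffset` (10 or 11) give 997, 996, 906 and 906 frames, while `frames_in_seconds(1.0, 0.00405, 0.00105)` gives 949.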

Paul Boersma · Answer 11 · Thu Jul 28 2022 23:41:18 GMT+0800 (China Standard Time)

My example above applied to your first version. Keeping the time step real will change the story. I will get back about that.

Paul Boersma · Answer 12 · Fri Jul 29 2022 00:14:16 GMT+0800 (China Standard Time)

Consider Case 2 again. Praat always computes 949 frames, because that is how many fit, independent of rounding. Your computation yields an nwin of 40 or 41, an noffset of 10.5apx, an navail of 9960 or 9959, hence int (9959 or 9960 / 10.5apx) = 948 frames, and 948*10.5apx = 9954apx is not smaller than 9960 or 9959 - 10apx. So you are losing a frame, even in a non-edge case?

Paul Boersma · Answer 13 · Fri Jul 29 2022 00:35:47 GMT+0800 (China Standard Time)

To return to an earlier example: suppose you want to know how many frame centres can occur between 0.299999 and 0.700001 seconds, with distances of 0.1 seconds. Surely this has to be 5, namely 0.3, 0.4, 0.5, 0.6 and 0.7 seconds. Your proposal computes only 4: with dx=0.0001, duration=1.0, nx=10000, windowDuration=0.599998, timeStep=0.1, you get nwin=6000, noffset=1000apx, navail=10000-6000+1=4001, nframes=int(4001/1000apx)=4, and 4*1000apx is not less than 4001-1000apx. Now, it could be that you do want 4?

Kesh Ikuma · Answer 14 · Fri Jul 29 2022 00:42:25 GMT+0800 (China Standard Time)

> So you are losing a frame, even in a non-edge case?

It is an edge case that I'm reporting here. I typically set nwin==noffset and provide just the right number of samples, i.e., my nx == nwin*nframes. Praat sometimes configures its analyses with only nframes-1 frames of the given data.

I'm willing to bet Praat works properly as long as the condition my nx != noffset*nframes + nwin - 1 holds, which is likely 99%+ of the use cases, I suspect.

Paul Boersma · Answer 15 · Fri Jul 29 2022 01:12:46 GMT+0800 (China Standard Time)

But your proposal tended to give fewer frames than Praat, not more, in the examples we looked at, at least in your second proposed algorithm. Can you give a precise example (with numbers, please) of a call to Sampled_shortTermAnalysis() where the Praat version yields fewer frames than your version?

Kesh Ikuma · Answer 16 · Fri Jul 29 2022 01:21:09 GMT+0800 (China Standard Time)

> But your proposal tended to give fewer frames than Praat, not more,

Hmmm, if it's fewer, I missed a +1 somewhere in my logic. My intent was to maximize the number of samples to be analyzed. Let me dig up my code and get back to you later with actual numbers.

Kesh Ikuma · Answer 17 · Sat Aug 06 2022 11:03:08 GMT+0800 (China Standard Time)

OK, I got a failing case with autocorrelation pitch analysis:

nsamples = 7720 @ 4000 samples/sec
nwin = 200 => pitch_floor = 60.0 Hz => windowDuration = 0.05 seconds
noffset = 20 => timeStep = 0.005 seconds

I expect 377 windows using all samples: floor((nsamples - nwin) / noffset) + 1 (this should be the correct equation).

But it returns only 376 pitch samples.
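The expected count of 377 follows from laying the windows out in whole samples. The sketch below is a hypothetical helper in standalone C++, not Praat code, and it also reports where the last window ends.

```cpp
// Hypothetical helper: lay out windows of `nwin` samples, advancing by
// `noffset` samples, over a signal of `nsamples` samples (integers only).
struct FrameLayout {
    long numberOfFrames;  // how many whole windows fit
    long lastWindowEnd;   // one past the last sample used by the final window
};

FrameLayout layout_in_samples(long nsamples, long nwin, long noffset) {
    FrameLayout layout;
    layout.numberOfFrames = (nsamples - nwin) / noffset + 1;
    layout.lastWindowEnd = (layout.numberOfFrames - 1) * noffset + nwin;
    return layout;
}
```

For `nsamples = 7720`, `nwin = 200` and `noffset = 20` this gives 377 frames, with the last window ending exactly at sample 7720, which is why this input sits right on the edge for a computation done in seconds.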

Paul Boersma · Answer 18 · Mon Aug 08 2022 00:52:02 GMT+0800 (China Standard Time)

In your version of Sampled_shortTermAnalysis, with a windowDuration of 0.05 and a timeStep of 0.005, you will get nwin=round(0.05apx*4000apx) = 200 samples, but an noffset of 0.005apx * 4000apx = 20apx, then navail=7720-200+1 = 7521, and nframes = int(navail / noffset) = int(7521 / 20apx) = 376, not 377. So your version does the same as Praat here (and goes wrong in the simpler cases discussed above).

Are you proposing a fourth alternative version of Sampled_shortTermAnalysis, though, that does give 377, and doesn't go wrong in the simpler cases I mentioned?

Kesh Ikuma · Answer 19 · Mon Aug 08 2022 04:26:25 GMT+0800 (China Standard Time)

Yes, I expect Sampled_shortTermAnalysis to return 377 in this case, but Praat is currently returning 376 due to the int() truncation.

Sorry, I should've fixed my code as well. This one should be correct:

void Sampled_shortTermAnalysis (Sampled me, double windowDuration, double timeStep, integer *numberOfFrames, double *firstTime) {

  // convert parameters to discrete-time duration in samples
  int nwin = round(windowDuration / my dx);
  double noffset = timeStep / my dx;

  // evaluate the number of frames
  int nframes = int((my nx - nwin)/ noffset) + 1; // conservative estimate
  double ntotal = (nframes-1) * noffset + nwin; // roughly equiv to thyDuration

  // check if there are more samples available
  if (int(ntotal + noffset) <= my nx) {
    nframes++; // make sure max number of samples are covered
    ntotal += noffset; // update with the new nframes
  }
  *numberOfFrames = nframes;

  // centering of the analysis overall span in discrete time as well 
  int n1st = (my nx - ntotal) / 2; // first sample index (zero-based)

  // convert to continuous time (the middle of the first window)
  *firstTime = (n1st + nwin/2) * my dx + my x1;
}

Granted, I have not tested the numerical stability of if (int(ntotal + noffset) <= my nx), but it's worth a try imo.

Again, not sure 100% if this firstTime is exactly the way you have it. numberOfFrames is my primary concern.

> doesn't go wrong in the simpler cases I mentioned?

Let me see...

Assuming my dx = 0.0001, my nx = 10000

Case 1
a window duration of 4 milliseconds
a time step of 1 millisecond
Praat computes floor ((1.0 - 0.004) / 0.001) + 1, which is 996 or 997

nwin = round(40.0) => 40
noffset = 10.0
nframes = int((10000-40)/10.0) + 1 => 997 or 996, depending on rounding (assume 996)
ntotal = (996-1) * 10.0 + 40 = 9990.0
if int(9990.0 + 10.0) <= 10000 = True // as 10000.0 should be converted perfectly to int
nframes++ => 997

Case 2
a window duration of 4.05 milliseconds
a time step of 1.05 milliseconds
Praat always computes the number of frames as floor ((1.0 - 0.00405) / 0.00105) + 1, which is 949, independent of rounding.

nwin = 40.5 => 41
noffset = 10.5
nframes = int((10000-41)/10.5) + 1 = int(948.48) + 1 => 949 // matches Praat
ntotal = (949-1) * 10.5 + 41 = 9995.0
if int(9995.0 + 10.5) <= 10000 = False
// skipped

Case 3: suppose you want to know how many frame centres can occur between 0.299999 and 0.700001 seconds, with distances of 0.1 seconds. Surely this has to be 5, namely 0.3, 0.4, 0.5, 0.6 and 0.7 seconds. Your proposal computes only 4: with dx=0.0001, duration=1.0, nx=10000, windowDuration=0.599998, timeStep=0.1, you get nwin=6000, noffset=1000apx

nwin = 5999.8 => 6000
noffset = 1000.0
nframes = int((10000-6000)/1000) + 1 = 5 (4 if truncated)
// assume nframes = 4
ntotal = (4-1) * 1000.0 + 6000 = 9000.0
if nframes = 4, check: int(9000.0 + 1000.0) <= 10000 = True // as 10000.0 should be converted perfectly to int
  nframes++ => 5
  ntotal += 1000.0 = 10000.0

So all the cases seem to work out fine with the latest version.

Paul Boersma · Answer 20 · Mon Aug 08 2022 20:14:43 GMT+0800 (China Standard Time)

Your assertion in Cases 1 and 3 that "10000.0 should be converted perfectly to int" is not generally true, because it could be 9999.99999999999, in which case it truncates to 9999. But in that case the original ntotal in Case 1 might have been 997 already. So in fact we are hoping that we can beat floating-point rounding issues in Case 1 by having the rounding error in 10.0 (which can be 9.9999999999999 or 10.0000000000001, but can only be 10.0000000000001 if we are to get an initial nframes of 996) compensated by a rounding error in int(995*10.0+40 + 10.0)-10000. That looks reasonable, but let's check. In Case 1, noffset can be either 10-, or 10, or 10+ (in another notation). Those are three cases, and the initial nframes will be 997, 997 and (996 or 997), respectively (assuming that if the double is in fact integer, then integer division will apply). Ntotal will then be 996 * 10- + 40 = (10000- or 10000), 996 * 10 + 40 = 10000, and (995 or 996) * 10+ + 40 = ((9990 or 9990+) or (10000 or 10000+)), respectively. The testing number will be int((10000- or 10000) + 10-) > 10000, int(10000+10) > 10000, and int(((9990 or 9990+) or (10000 or 10000+)) + 10+) = (10000 or 10010). So the condition works, and does so in Case 3 as well. We are left with Case 2, where nwin can be 40 or 41, and noffset=10.5apx. In the 41 case, nframes will be (according to your new formula) int((10000-41)/10.5) + 1, which is 949 (truncated from 949.48), matching Praat. In the 40 case, nframes will initially become int((10000-40)/10.5) + 1, which is 949 as well (truncated from 949.57), so that ntotal becomes 9995apx; the condition becomes "int(9995apx+10.5apx) < 10000", which is false. So my three cases seem to work, but your solution relies on the correct propagation of the direction of rounding error, i.e. the assumption that in two steps the rounding error can go from + to 0 (or the reverse), but not from + to -.

Kesh Ikuma · Answer 21 · Mon Aug 08 2022 22:50:20 GMT+0800 (China Standard Time)

> That looks reasonable, but let's check.
> ...
> So my three cases seem to work, but your solution relies on the correct propagation of the direction of rounding error, i.e. the assumption that in two steps the rounding error can go from + to 0 (or the reverse), but not from + to -.

Good! (and sigh of relief for finally getting the math right) And 100% correct on "not from + to -". This condition need not be accounted for because the preceding code (nframes = int(...)) imposes the "from -" condition.

> Your assertion in Cases 1 and 3 that "10000.0 should be converted perfectly to int" is not generally true, because it could be 9999.99999999999, in which case it truncates to 9999.

My comment was not quite appropriate for the context. It should read something like "10000+ converts to <=10000" (using your +/- notation). This logic works under the assertion that if the sum is 10000.999999999, then you still only need 10000 samples to process the (nframes+1)-st frame (that is, that first sample indices of frames are obtained via floor operations). Only when the sum is >=10001.0 are there not enough samples to add another frame.

What would you like to do from here? If you're in on this mod, I'd be happy to create a PR (although my Praat repo clone hasn't been set up to compile). As I said in the OP, there is another code segment in Praat that uses this same logic.

Paul Boersma · Answer 22 · Tue Aug 09 2022 14:30:23 GMT+0800 (China Standard Time)

You seem to be concluding that your fourth algorithm is correct now. That would be great, but it would mean that you might have solved floating-point rounding in general, which would be a major achievement that could revolutionize the field, which has lived for decades with disbelief that floating-point rounding could be solved somehow (e.g. that 0.7 minus 0.3, divided by 0.1, could yield exactly 4). In reality, while Case 2 and Case 3 now do what we want, the problem just seems to have shifted to a different location, so let's discuss Case 4, which comes with not one, but with two problems. While Case 3 looked at a window length of 0.599998 seconds, Case 4 has a window length 0.600002 seconds, still with a sample rate of 10000.0 Hz and a time step of 0.1 seconds. Praat here computes a number of frames of floor ((10000/10000.0 - 0.600002)/0.1) + 1, which is floor (3.99998) + 1, which is always 4, as it should, because one just cannot fit 5 frame centers between 0.300001 and 0.699999 seconds at distances of 0.1 seconds. Your fourth version computes floor((10000 - round(0.600002*10000.0))/(0.1*10000.0))+1, which is floor(4.0)+1, which can come out as 4 (if 0.1 happens to be 0.1+) or 5. This is the first problem, as the answer should never be 5. We could solve this first problem by not rounding, i.e. by computing floor((10000 - (0.600002*10000.0))/(0.1*10000.0))+1, which yields 4 correctly, which is no surprise, because this formula is equivalent to Praat's. So let's assume that the initial number of frames is 4, and proceed to the condition. The condition computes floor((4-1)*(0.1*10000)+round(0.600002*10000)+0.1*10000), which is floor(10000+) if 0.1 happens to be 0.1+ or even just exactly 0.1. This is <=10000, so one is added to the number of frames, which becomes 5.
The cause of this second problem is the truncation and the rounding: without rounding and truncation, the condition would be false, but with only truncation it will always be true, and with only rounding it will sometimes be true. The bottom line is that while your previous algorithm failed for non-controversial cases 2 and 3, this new algorithm fails in several ways for the equally non-controversial case 4.
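Case 4 can be reproduced with a standalone C++ transcription of the two computations. This is a sketch under assumptions: `frames_in_seconds` follows the floor formula quoted from Sampled.cpp, `fourth_version_frames` transcribes only the frame-count part of the fourth proposal, and `nx` and `dx` are plain parameters standing in for the fields of Praat's Sampled object.

```cpp
#include <cmath>

// Count computed in seconds, as in the formula quoted from Sampled.cpp.
long frames_in_seconds(double duration, double windowDuration, double timeStep) {
    return static_cast<long>(std::floor((duration - windowDuration) / timeStep)) + 1;
}

// Frame-count part of the fourth proposal, transcribed to standalone C++.
long fourth_version_frames(long nx, double dx, double windowDuration, double timeStep) {
    const long nwin = std::lround(windowDuration / dx);  // window rounded to whole samples
    const double noffset = timeStep / dx;                // step kept as a real number of samples
    long nframes = static_cast<long>((nx - nwin) / noffset) + 1;  // truncating estimate
    const double ntotal = (nframes - 1) * noffset + nwin;
    // Add a frame when the truncated extent of one more step still fits in nx.
    if (static_cast<long>(ntotal + noffset) <= nx) {
        ++nframes;
    }
    return nframes;
}
```

For a window of 0.600002 s at 10 kHz, `frames_in_seconds(1.0, 0.600002, 0.1)` returns 4, while the transcription ends up at 5 whichever way `timeStep / dx` rounds: either the initial estimate is already 5, or it is 4 and the follow-up check adds a frame. Case 2 (0.00405 s window, 0.00105 s step) comes out as 949 in both functions.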

Kesh Ikuma · Answer 23 · Wed Aug 10 2022 11:56:22 GMT+0800 (China Standard Time)

Wait, are you saying that Praat's backend actually uses non-integer-sample window lengths?!? As in, are its analyses performed with a variable window size (or is a new windowing function computed for each window)?

For example, a user-specified window length of 0.600002 s @ 10 kS/s yields the actual window length of 6000 samples (0.6 s) or 6001 samples (0.6001 s).

I made the assertion that Praat picks one of these 2 configurations. I chose the one closest to the user's spec by rounding it. This seems to me the most sensible thing to do. (If Praat always uses the larger of the 2, then it's a lot harder to enforce the condition that I'm after.)

Obviously, this assertion goes out the window if Praat does use variable window length. The user-specified window length is then a (loosely enforced) average window length, and windows have either 6000 or 6001 samples depending on their locations. Your last case is an example of an "unenforced" case. The 0.1-s time step and the sample selection scheme always impose a 6001-sample window size; thus, the actual window length is 0.6001 s, not the user-specified 0.600002 s.

I'd love to hear how Praat sets windows after this stage (and I'll certainly peek at the source later when I have a bit more time), and I need to rethink my proposed mods accordingly.

Thanks!

Paul Boersma · Answer 24 · Wed Aug 10 2022 19:16:37 GMT+0800 (China Standard Time)

Ideally, analysis windows are Gaussian or Gaussian-like, and ideally, the mu and stdev parameters of such a Gaussian window are both in seconds, and those seconds are used to compute the shape of the window. Both the width and the centre of such a window are sometimes rounded to a sample if that hardly makes a difference in accuracy and it does make a big difference in speed (e.g. the intensity analysis was changed many years ago to round frame centres to samples in order not to have to recompute a costly Bessel function for each frame any longer). Also, window lengths are usually a constant number of samples if they are used in a loop over frames. But all of these are approximations of the physical frame centres and window lengths in cases where it does not hurt to approximate them. The gold standard is always that samples and frames represent a continuous signal, and that measurements reported for specific time points represent the values at those precise time points, not the values at sample centres. This is important in the waveform itself (sinc interpolation), in pitch analysis, and in the determination of periods. For instance, accurate pitch analysis requires periods to be known with a precision of better than 0.0001 sample duration.

In your example above, the computation in Sampled_shortTermAnalysis determines only where the logical frame centres lie, not how the analysis window is implemented after that. The function only determines the five parameters of the resulting output Sampled: xmin and xmax (the logical domain) are presumably copied from the input Sampled, dx is the required timeStep, and nx and x1 are computed. This computation is independent, and should be independent, from any shortcuts made in the subsequent determination of the contents of the frames.

Paul Boersma · Answer 25 · Wed Aug 10 2022 19:27:40 GMT+0800 (China Standard Time)

Sorry, the 0.0001-sample example applies to harmonicity, not pitch. By not rounding anything to samples, we managed to make noise levels of -50 dB measurable, quite an improvement from the -20 dB levels that could be measured by earlier methods by others (or by some naive later methods). See Boersma (1993) for details (another trick was also important).

Kesh Ikuma · Answer 26 · Thu Aug 11 2022 08:44:35 GMT+0800 (China Standard Time)

Thanks for the detailed explanation.

> Also, window lengths are usually a constant number of samples if they are used in a loop over frames.
> ...
> that measurements reported for specific time points represent the values at those precise time points

Am I right to interpret this comment as saying that Praat indeed resamples for every window it analyzes? (Glancing at the source code, this appears to be the case.) So, with my interpretation, the 0.600002-s window is indeed approximated by a 6001-sample (0.6001-s) window with careful resampling and window function configuration. Very interesting.

> the computation in Sampled_shortTermAnalysis determines only where the logical frame centres lie, not how the analysis window is implemented after that. ... This computation is independent, and should be independent.

Now, I don't know about this (well, maybe true depending on what you mean by "independent"). I see a clear dependency between the Sampled_shortTermAnalysis window size and the length of the resampled windows in the later processes:

nwin = ceil(windowDuration / my dx);

This relationship is implicitly imposed by the Melder_ifloor operation in *numberOfFrames = Melder_ifloor ((myDuration - windowDuration) / timeStep) + 1;.

So, now back to my issue with the number of frames. This actually is the worst case scenario I envisioned (on how to mitigate it) and I need to retract the previous proposals entirely. The only alternative that I can come up with is to set an (arbitrary) threshold. Something like:

const double eps = 1e-6; // a small fraction of samples

*numberOfFrames = Melder_ifloor ((myDuration - windowDuration) / timeStep) + 1;

// if additional frame increases the projected number of samples within (the available number of samples + eps)
// it's safe to add another frame
if ((*numberOfFrames*timeStep + windowDuration)/my dx - eps <= my nx)
  (*numberOfFrames)++;

For my failed case above, the left-hand side is ~1e-12 samples over my nx, so eps = 1e-6 is plenty large enough to address my issue. Meanwhile, this slack is extremely small in the absolute time scale. At 10000 S/s, the slack is 1e-10 seconds, which is absurdly small for audio processing.
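The eps check can be tried out in isolation with a standalone C++ sketch. It is written under assumptions: `frames_with_eps` is a hypothetical name, `nx` and `dx` are plain parameters standing in for the fields of Praat's Sampled object, and eps is measured in samples as in the snippet above.

```cpp
#include <cmath>

// Standalone rendering of the eps proposal (frame count only).
long frames_with_eps(long nx, double dx, double windowDuration, double timeStep,
                     double eps = 1e-6) {
    const double duration = dx * nx;
    long numberOfFrames =
        static_cast<long>(std::floor((duration - windowDuration) / timeStep)) + 1;
    // Add a frame if one more frame overshoots the available samples by less than eps.
    if ((numberOfFrames * timeStep + windowDuration) / dx - eps <= nx) {
        ++numberOfFrames;
    }
    return numberOfFrames;
}
```

In this sketch the failing case (7720 samples at 4000 S/s, 0.05-s window, 0.005-s step) comes out as 377 frames whichever way the floor rounds, while a 0.600002-s window with a 0.1-s step at 10000 S/s stays at 4 frames.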

What do you think?

Paul Boersma · Answer 27 · Thu Aug 11 2022 21:08:45 GMT+0800 (China Standard Time)

No, we don't resample (except for LPC measurements), but we use techniques that minimize the problems that finite sampling yields. This means sinc interpolation for finding peaks (Plan B: parabolic interpolation), and the fact that Gaussian-weighted averaging over discrete positive-valued samples approximates very well a Gaussian-weighted averaging over the underlying continuous curve; as another example, try creating a pulse train as a Sound (e.g. from a PitchTier) and see what it looks like when you zoom in...

You can consult the two manual pages about vector interpolation and peak interpolation to see some issues.

> I see a clear dependency between the Sampled_shortTermAnalysis window size and the length of the resampled windows in the later processes

No, Sampled_shortTermAnalysis doesn't change the duration of the analysis window, and doesn't determine the implementation of the analysis window in terms of the samples of the sound. Sampled_shortTermAnalysis only determines the sampling of the resulting analysis, i.e. the number of frames and the locations of the frame centres. If this is not what you mean, can you be more precise? E.g., what do you mean by "resampled windows"?

Your trick with eps is a possible way to bias Sampled_shortTermAnalysis toward choosing the higher number of frames in our edge cases, although the size of an effective eps would depend strongly on the other parameters. You can achieve the same effect by giving Sampled_shortTermAnalysis a slightly smaller window length. For instance, if you want your frame centres to end up at 0.3, 0.4, 0.5, 0.6 and 0.7 seconds, while the sound runs physically from 0 to 1 seconds, you could supply Sampled_shortTermAnalysis with a window length of 0.59999999 seconds, and you would be fine.
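The window-length workaround can be sketched in standalone C++ as follows. `frames_in_seconds` follows the floor formula quoted from Sampled.cpp, while `frames_with_window_bias` and its `bias` parameter are hypothetical names introduced only for this illustration.

```cpp
#include <cmath>

// Count computed in seconds, as in the formula quoted from Sampled.cpp.
long frames_in_seconds(double duration, double windowDuration, double timeStep) {
    return static_cast<long>(std::floor((duration - windowDuration) / timeStep)) + 1;
}

// Hypothetical wrapper: a negative bias shortens the window slightly (edge
// cases round up), a positive bias lengthens it (edge cases round down).
long frames_with_window_bias(double duration, double windowDuration,
                             double timeStep, double bias) {
    return frames_in_seconds(duration, windowDuration + bias, timeStep);
}
```

For a 1-second sound with a 0.6-s window and a 0.1-s step, a bias of -1e-8 s yields 5 frames and a bias of +1e-8 s yields 4, so the caller decides which way the edge case falls.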

Kesh Ikuma · Answer 28 · Thu Aug 11 2022 22:05:31 GMT+0800 (China Standard Time)

> No, Sampled_shortTermAnalysis doesn't change the duration of the analysis window

I didn't phrase it correctly. You're right, and that's why it could under-report the frame count. What I meant to suggest was making it dependent on what happens downstream and constructing its logic around that (which leads to the proposed eps correction).

> although the size of an effective eps would depend strongly on the other parameters

Well, yeah, but what is the lowest acceptable sampling rate for speech/voice analysis? I sometimes go down to 2000 S/s (working with glottal source signals) and that's already pushing it. And let's be conservative with eps and use 1e-6 to capture all possible FP error. At 2000 S/s this makes the actual eps 5e-10 s, i.e. 0.5 ns (the period of a 2 GHz clock). This resolution should far exceed what's needed for Praat's use cases.

To me, this is a sensible thing to do, and it comes with a negligible price tag to pay to fix these edge cases.

This is about as much a sales pitch as I can give to this fix, and if I've failed, please close the issue.

> you could supply Sampled_shortTermAnalysis with a window length of 0.59999999 seconds, and you would be fine

Obviously, this has always been the workaround on my end as an end user...

> what do you mean by "resampled windows"?

Does "interpolated windows" fit better? I (carelessly) used "resample" to mean "interpolate". It was my summary of my understanding of how Praat windows data samples.

Just FYI, there is no negative connotation to it. My hat's off to your efforts to accommodate end users' continuous-time specs. I would have rounded them to the nearest discrete-time specs from Day 1.

Paul Boersma · Answer 29 · Thu Aug 11 2022 22:16:12 GMT+0800 (China Standard Time)

OK, so the workaround works well for people like you who want to round up in the event of edge cases; other people may want to round down in those cases, and they would add an eps to the window length rather than subtract it.

As for the range of possible sample rates, EEG can have 128 Hz, but Praat makes a point of supporting 0.000000001 Hz as well, or 1000000000000000000 Hz, and the current Sampled_shortTermAnalysis works fine with those.

I will follow your suggestion to close this issue. There doesn't seem to be anything in need of repair.