NLnetLabs / nsd

The NLnet Labs Name Server Daemon (NSD) is an authoritative, RFC compliant DNS nameserver.

Home Page: https://nlnetlabs.nl/nsd

nsd verification processing hangs, activity stopped for 20-30 minutes

ttyS4 opened this issue

hi nsd folks,

We have a setup where nsd is used for zone verification.
(Because of IXFR-related issues it is currently on 4.9.1-1, running on Debian 12. The package was compiled in a Debian 12 chroot using the official Debian packages, so it is basically a backport.)

A new zone is generated every 10 minutes. knot signs the zone, then nsd verifies it and distributes it (notify-out + xfr).

nsd[32438]: notify for xy. from ::1 serial 1718515802
nsd[22942]: xfrd: zone xy committed "received update to serial 1718515802 at 2024-06-16T07:30:28 from ::1@52"
nsd[22943]: zone xy. received update to serial 1718515802 at 2024-06-16T07:30:28 from ::1@52 of 7204 bytes in 7.9e-05 seconds
nsd[22943]: verify: started verifier for zone xy (pid 35409)
...
nsd[22943]: verify: verifier for zone xy (pid 35409) exited with 0
nsd[22942]: zone xy serial 1718515202 is updated to 1718515802
nsd[35663]: ixfr for xy. from IP1
nsd[35663]: ixfr for xy. from IP2
...
nsd[22942]: xfrd: zone xy: received notify response error .... from IP6

Today, however, we saw no follow-up after the verifier exited with 0.
About 20 minutes after the verification finished, we see:

nsd[4819]: handle_child_command: read: Connection reset by peer

Normal activity then resumes, and the following message appears:

nsd[22942]: zone xy serial 1718516403 is updated to 1718517002

While nsd was in this state, notify messages were received (and logged), but no progress was made.

Do you think upgrading to 4.10 could help?
Is this a known issue, or something that needs further investigation?

Regards,
Tamás

Hi Tamas,
I don't think upgrading to 4.10 would make a difference in this case. However, the 20-minute timeout (during which NSD stays in reload mode) could perhaps be reduced by setting verifier-timeout: to something reasonable, such as about 200% of the time it takes the script to verify the zone.

I still want to look into this specific case (by manual code inspection), in which the verifier process has already exited but NSD is still reading what the verifier wrote to stdout and stderr.
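For illustration, a minimal nsd.conf sketch of the suggestion above. The verifier path and the 60-second value are assumptions, not taken from the reporter's setup. The value assumes a verifier that normally finishes in about 30 seconds; measure your own script and adjust. The option is understood to take a number of seconds, which is worth confirming against nsd.conf(5) for the installed version.

```
verify:
    enable: yes
    # Hypothetical verifier command; substitute the script actually in use.
    verifier: /usr/local/bin/verify-zone.sh
    # Assumed value: roughly 200% of a measured ~30 s verification run.
    verifier-timeout: 60
```

With a bound like this, a verifier run that stalls should be given up on after the configured time rather than holding up the reload for many minutes.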

If you need any info from us, just let us know.
(I can also try to collect data for you as long as it is considered safe.)