valeriansaliou / vigil

🚦 Microservices Status Page. Monitors a distributed infrastructure and sends alerts (Slack, SMS, etc.).

Home Page: https://crates.io/crates/vigil-server

`internal error` when pinging self

eKristensen opened this issue · comments

Hi!

Thanks for vigil, it is a great piece of software. However, I have found an issue that is probably related to the ping library used by vigil.

While it may seem odd to ping an IP on the same machine that the uptime monitor is running on, I still think it is misleading that it reports an internal error. If anything, you should just get a useless monitor that is always up (because when the machine is down, the uptime monitor itself is not running).

I have confirmed this issue exists on several machines, some running Rocky Linux and others Debian bookworm, on both ARM and Intel CPU architectures.

To eliminate possible sources of error, I did a clean install of the latest stable version of Debian and installed vigil with this service block (here the local machine has the IPv4 address 192.168.1.222 from DHCP):

[[probe.service.node]]

id = "router"
label = "Core main router"
mode = "poll"

replicas = [
  "icmp://192.168.1.222",
  "icmp://[::1]",
  "icmp://localhost"
]

Which results in errors such as:

Feb 03 18:19:11 vigil-test vigil[4100]: (INFO) - starting 4 workers
Feb 03 18:19:11 vigil-test vigil[4100]: (INFO) - Actix runtime found; starting in Actix runtime
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: 192.168.1.222 (1 targets)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: 192.168.1.222 from host: 192.168.1.222
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: 192.168.1.222 from host: 192.168.1.222 (error: internal error)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - will probe replica: ICMP("192.168.1.222") with retry count: 1
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: localhost (2 targets)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: localhost
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: ::1 from host: localhost (error: internal error)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - will probe replica: ICMP("localhost") with retry count: 1
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: ::1 (1 targets)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: ::1
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: ::1 from host: ::1 (error: internal error)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - will probe replica: ICMP("::1") with retry count: 1
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: 192.168.1.222 (1 targets)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: 192.168.1.222 from host: 192.168.1.222
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: 192.168.1.222 from host: 192.168.1.222 (error: internal error)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - will probe replica: ICMP("192.168.1.222") with retry count: 2
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: ::1 (1 targets)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: ::1
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: ::1 from host: ::1 (error: internal error)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - will probe replica: ICMP("::1") with retry count: 2
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: localhost (2 targets)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: localhost
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: ::1 from host: localhost (error: internal error)
Feb 03 18:19:12 vigil-test vigil[4100]: (DEBUG) - will probe replica: ICMP("localhost") with retry count: 2
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: 192.168.1.222 (1 targets)
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: 192.168.1.222 from host: 192.168.1.222
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: 192.168.1.222 from host: 192.168.1.222 (error: internal error)
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - replica probe result: web:router:icmp://192.168.1.222 => Dead
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: ::1 (1 targets)
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: ::1
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: ::1 from host: ::1 (error: internal error)
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - replica probe result: web:router:icmp://[::1] => Dead
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll will fire for icmp host: localhost (2 targets)
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: localhost
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - prober poll error for icmp target: ::1 from host: localhost (error: internal error)
Feb 03 18:19:13 vigil-test vigil[4100]: (DEBUG) - replica probe result: web:router:icmp://localhost => Dead
Feb 03 18:19:13 vigil-test vigil[4100]: (INFO) - replicas have been probed with 3/4 threads in 1.504252723s
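
The `(2 targets)` note for `localhost` in the log above suggests that the hostname resolved to two addresses, presumably `::1` and `127.0.0.1`. The following is a minimal Rust sketch of how a replica URL could be turned into a list of ping targets using only the standard library. It is not vigil's actual code: `icmp_host` and `resolve_targets` are hypothetical names, and the resolution order depends on the system resolver.

```rust
use std::io;
use std::net::{IpAddr, ToSocketAddrs};

/// Hypothetical helper: strips the `icmp://` scheme and any IPv6 brackets
/// from a replica URL such as `icmp://[::1]`.
fn icmp_host(replica: &str) -> Option<&str> {
    let host = replica.strip_prefix("icmp://")?;
    let host = host
        .strip_prefix('[')
        .and_then(|h| h.strip_suffix(']'))
        .unwrap_or(host);
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Hypothetical helper: resolves a host to its distinct IP addresses,
/// keeping the resolver's order. IP literals are parsed without a lookup.
fn resolve_targets(host: &str) -> io::Result<Vec<IpAddr>> {
    let mut targets: Vec<IpAddr> = Vec::new();
    // Port 0 is a placeholder; ICMP has no ports.
    for addr in (host, 0).to_socket_addrs()? {
        if !targets.contains(&addr.ip()) {
            targets.push(addr.ip());
        }
    }
    Ok(targets)
}
```

Calling `resolve_targets("localhost")` on a typical Debian install would be expected to return both loopback addresses, which would match the two targets reported in the log.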

Environment info:

root@vigil-test:/home/ek# ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 192.168.1.222/24 brd 192.168.1.255 scope global dynamic enp1s0
       valid_lft 6368sec preferred_lft 6368sec
root@vigil-test:/home/ek# uname -a
Linux vigil-test 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
root@vigil-test:/home/ek# vigil --version
vigil-server 1.26.3

I had some issues with vigil and spent quite a lot of time figuring out why it did not work. I used local IPs to rule out any other kind of network issue, not knowing that the system could not ping local IPs.

Thanks a lot in advance.

I hope this report contains enough info to make the next step to solve this issue actionable.

Best regards,
Emil Kristensen

Hi! Did you check if that could be a system-level permission issue when opening ICMP raw sockets? https://github.com/valeriansaliou/vigil?tab=readme-ov-file#children_crossing-troubleshoot-issues

I can ping anything that is up except the IPs of the machine that vigil is running on.

The issue also occurs with the Debian package from your repo, where the setcap is included in the service file.

Just to prove my point, here I added 1.1.1.1 to the config file, which results in a debug log where 1.1.1.1 works just fine:

Feb 04 01:56:24 vigil-test vigil[611]: (INFO) - starting 4 workers
Feb 04 01:56:24 vigil-test vigil[611]: (INFO) - Actix runtime found; starting in Actix runtime
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: 192.168.1.222 (1 targets)
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: 192.168.1.222 from host: 192.168.1.222
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: 192.168.1.222 from host: 192.168.1.222 (error: internal error)
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - will probe replica: ICMP("192.168.1.222") with retry count: 1
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: localhost (2 targets)
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: localhost
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: ::1 from host: localhost (error: internal error)
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - will probe replica: ICMP("localhost") with retry count: 1
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: ::1 (1 targets)
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: ::1
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: ::1 from host: ::1 (error: internal error)
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - will probe replica: ICMP("::1") with retry count: 1
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: 1.1.1.1 (1 targets)
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: 1.1.1.1 from host: 1.1.1.1
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - got prober poll response for icmp target: 1.1.1.1 from host: 1.1.1.1
Feb 04 01:56:24 vigil-test vigil[611]: (DEBUG) - replica probe result: web:router:icmp://1.1.1.1 => Healthy
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: 192.168.1.222 (1 targets)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: 192.168.1.222 from host: 192.168.1.222
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: 192.168.1.222 from host: 192.168.1.222 (error: internal error)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - will probe replica: ICMP("192.168.1.222") with retry count: 2
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: localhost (2 targets)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: localhost
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: ::1 from host: localhost (error: internal error)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - will probe replica: ICMP("localhost") with retry count: 2
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: ::1 (1 targets)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: ::1
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: ::1 from host: ::1 (error: internal error)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - will probe replica: ICMP("::1") with retry count: 2
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: 192.168.1.222 (1 targets)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: 192.168.1.222 from host: 192.168.1.222
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: 192.168.1.222 from host: 192.168.1.222 (error: internal error)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - replica probe result: web:router:icmp://192.168.1.222 => Dead
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: localhost (2 targets)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: localhost
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: ::1 from host: localhost (error: internal error)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - replica probe result: web:router:icmp://localhost => Dead
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will fire for icmp host: ::1 (1 targets)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll will send icmp ping to target: ::1 from host: ::1
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - prober poll error for icmp target: ::1 from host: ::1 (error: internal error)
Feb 04 01:56:25 vigil-test vigil[611]: (DEBUG) - replica probe result: web:router:icmp://[::1] => Dead
Feb 04 01:56:25 vigil-test vigil[611]: (INFO) - replicas have been probed with 4/4 threads in 1.504180746s
Feb 04 01:56:25 vigil-test vigil[611]: (INFO) - ran poll probe operation

I think this is more an OS level routing issue than a Vigil issue.

No, ping works:

ek@vigil-test:~$ ping 192.168.1.222
PING 192.168.1.222 (192.168.1.222) 56(84) bytes of data.
64 bytes from 192.168.1.222: icmp_seq=1 ttl=64 time=0.078 ms
^C
--- 192.168.1.222 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.078/0.078/0.078/0.000 ms
ek@vigil-test:~$ ping ::1
PING ::1(::1) 56 data bytes
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.010 ms
^C
--- ::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.010/0.010/0.010/0.000 ms
ek@vigil-test:~$ ping localhost
PING localhost(localhost (::1)) 56 data bytes
64 bytes from localhost (::1): icmp_seq=1 ttl=64 time=0.011 ms
^C
--- localhost ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.011/0.011/0.011/0.000 ms

vigil is doing something to make ping not work. My guess is that it is most likely because of https://crates.io/crates/ping but there is not enough information in the error message that vigil gives to say for certain where the issue lies. Potentially related to this issue: aisk/rust-ping#15

Is it possible to use normal ping instead of rust ping?
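
For context, "normal ping" here would mean shelling out to the system `ping` binary. A minimal Rust sketch of that approach follows, assuming the Linux iputils `ping` flags (`-c` for packet count, `-W` for the reply timeout in seconds). The names `system_ping_command` and `system_ping` are hypothetical and not part of vigil.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Hypothetical helper: builds a one-shot `ping` invocation.
/// Assumes Linux iputils flags: `-c 1` sends one packet,
/// `-W <secs>` sets how long to wait for a reply.
fn system_ping_command(host: &str, timeout_secs: u32) -> Command {
    let mut cmd = Command::new("ping");
    cmd.args(["-c", "1", "-W"])
        .arg(timeout_secs.to_string())
        .arg(host)
        // Discard the command's output; only the exit status is used.
        .stdout(Stdio::null())
        .stderr(Stdio::null());
    cmd
}

/// Runs the command and reports whether `ping` exited successfully,
/// which iputils does when at least one reply was received.
fn system_ping(host: &str, timeout_secs: u32) -> io::Result<bool> {
    Ok(system_ping_command(host, timeout_secs).status()?.success())
}
```

This trades an in-process library for a dependency on an external binary and its platform-specific flags, so it is a sketch of the idea rather than a drop-in replacement.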

Could you report this to upstream to the library maybe?

I cannot make a proper bug report. Currently, I have no idea how vigil uses rust ping, and therefore I cannot make an actionable/usable bug report for them.

If you don't recognize the problem, I'll try to dig further at some point and figure out where the issue lies.

Again, is it possible to use normal ping rather than the rust ping?

Also, I need this library since other methods I've tried in the past do not support ICMPv6, and I need them to monitor Crisp in production in a reliable way since everything is dual-stack there. Won't change, but if there's a bug it needs to be fixed upstream.

Also note that vigil-local uses the same library, w/ no issue pinging localhost and LAN hosts in my production environments: https://github.com/valeriansaliou/vigil-local/blob/master/src/probe/poll.rs#L147

> Also, I need this library since other methods I've tried in the past do not support ICMPv6, and I need them to monitor Crisp in production in a reliable way since everything is dual-stack there. Won't change, but if there's a bug it needs to be fixed upstream.

Dual stack, which of course includes IPv6 support, is important for me as well. I have vigil running in a test setup with dual stack fully working, but I have faced issues moving forward with deployment.

> Also note that vigil-local uses the same library, w/ no issue pinging localhost and LAN hosts in my production environments: https://github.com/valeriansaliou/vigil-local/blob/master/src/probe/poll.rs#L147

I'll test and see what I find.

I am doing all this on the side, monitoring IT systems as part of some volunteer work, so it might take some time between work before I can take a proper look, but that is life.

I did not expect you to respond so quickly in here 💪, keep up the good work :)

Preliminary findings:

As far as I can tell, the issue is contained within the ping crate. I'll keep this issue open, as I think the solution will be to update the version of ping that vigil depends on once the issue has been fixed upstream.

My testing with an IPv4 ping to 127.0.0.1 shows that ping reads the ICMP echo request as if it were the reply. The ICMPv4 type is therefore 8 (echo request) instead of 0 (echo reply), which makes this check fail: https://github.com/aisk/rust-ping/blob/e4b4432a1e488da6f94d05437d6aa0efe197d13b/src/packet/icmp.rs#L77

I'll need to investigate further to find out why, and to be able to report the issue upstream.
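
To illustrate the finding above: if the receiving socket sees the outgoing echo request when the target is local, a receiver that treats the first packet as the reply will read type 8 and fail. The following is a minimal Rust sketch of a receiver that skips such packets. It is not the ping crate's actual code: `decode_icmpv4` and `saw_matching_reply` are hypothetical names, and each buffer is assumed to start at the ICMP header (IP header already stripped).

```rust
// ICMPv4 message types (RFC 792).
const ECHO_REPLY: u8 = 0;
const ECHO_REQUEST: u8 = 8;

#[derive(Debug, PartialEq)]
enum Decoded {
    EchoReply { ident: u16, seq: u16 },
    // A copy of an echo request; on a ping-to-self this can be our own.
    EchoRequest,
    Other(u8),
}

/// Decodes the fixed 8-byte ICMPv4 echo header:
/// type (1), code (1), checksum (2), identifier (2), sequence (2).
fn decode_icmpv4(buf: &[u8]) -> Option<Decoded> {
    if buf.len() < 8 {
        return None;
    }
    let ident = u16::from_be_bytes([buf[4], buf[5]]);
    let seq = u16::from_be_bytes([buf[6], buf[7]]);
    Some(match buf[0] {
        ECHO_REPLY => Decoded::EchoReply { ident, seq },
        ECHO_REQUEST => Decoded::EchoRequest,
        other => Decoded::Other(other),
    })
}

/// Scans received packets and returns true once an echo reply with the
/// expected identifier and sequence number is seen. Anything else,
/// including echo requests, is skipped instead of treated as an error.
fn saw_matching_reply<'a, I>(packets: I, ident: u16, seq: u16) -> bool
where
    I: IntoIterator<Item = &'a [u8]>,
{
    packets.into_iter().any(|p| {
        matches!(
            decode_icmpv4(p),
            Some(Decoded::EchoReply { ident: i, seq: s }) if i == ident && s == seq
        )
    })
}
```

With a request followed by a reply in the input, `saw_matching_reply` skips the type 8 packet and matches on the type 0 packet, whereas checking only the first packet would fail.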

I have reported the issue upstream: aisk/rust-ping#16 :)

@valeriansaliou Have you considered https://crates.io/crates/surge-ping ?

It looks like that crate is a bit more active and better maintained than https://crates.io/crates/ping. They already have an implementation without the ping-to-self issue that I'm seeing.

The API for calling ping is different. If you don't have time, I would be happy to try making a pull request to change the ping library; I think it would be doable. But I will not invest the time in it if you would reject it outright.

Though it is probably much less work to just fix the ping crate...

Thank you, leaving this open so that I can review it & make the change when I batch-process tasks on Vigil (probably not in the immediate future to be fully transparent, as I have other priorities atm, so this might take quite some time).

FYI: The issue has been fixed upstream.