greearb / ath10k-ct

Stand-alone ath10k driver based on Candela Technologies Linux kernel.

qca9980 GTK rekey fails on 4-addr bridges

CapitalF opened this issue · comments

I have a long-standing problem where a wireless bridged station fails its daily rekey event almost every time. This problem only occurs with the ath10k CT firmware. The standard/OEM/QCA firmware does not produce the same problem.

My environment is a wireless bridge between an OpenWRT AP and the Linux station. The AP and station placement is static, -54 dBm average signal, and very few quality problems.

My AP is OpenWRT on an ipq8064-based system with qca9980 radios. The bridge gets dedicated use of the AP's 5GHz radio, so typically the Linux station is the only thing associated to it. The OpenWRT configuration is almost entirely default, except that "option wds 1" is set to enable 4-address bridging on the wifi-iface.

The station is Debian Linux unstable on a small NUC-like box. I've tried Intel 3160, 7260, and ax200 802.11 radios installed and all three exhibit the same problem. I have no kernel module options and use iw to set "4addr on" and "power_save off". A bridge interface is configured with members eth0 and wlan0. It's a very simple bridging setup.

I first discovered this problem a few months ago and have been slowly performing various experiments and reading through old logs on the weekends. I've had this same setup since 2016 and have logs from both systems over that entire time.

The problem first appeared two years ago in January 2020 when I upgraded OpenWRT on the AP from 18.06.5 to 19.07.0. That was the first version that switched the default ath10k firmware to the CT variant. During those two years, all but two rekey events failed. I have no idea why those two succeeded. Prior years saw a tiny number of rekey failures.

During those two years many things have been changed on both the AP and station: The station OS/kernel and related files get upgraded every 90 days, three different station half-mini PCIE wifi cards have been tried (all Intel, models 3160, 7260, and ax200), I've gone through various configuration parameters on both the station and AP, and many other things have changed, but the problem remained constant during that time.

Changing out the firmware to the non-CT variant on the OpenWRT AP makes the problem immediately go away. Re-installing it makes it come back.

So far the only factor I've found that influences the problem is hostapd's wpa_group_rekey setting. I have noticed the failure rate increases as wpa_group_rekey on hostapd is increased: 3% failure at 60, 25% at 600, 87% at 3600, and 99% at 86400. This data is mostly from the Intel 7260 card, gathered from experiments I've been running over several weeks.

I suspect a certain amount of traffic is needed to trigger the intermittent problem, which would explain why a larger wpa_group_rekey period increases the likelihood of a rekey failure. I've been successful in reproducing the rekey problem on a second station and AP, but I couldn't get the rekey to fail unless I ran iperf3 in a while-loop at a sufficient level. Increasing the wpa_group_rekey time alone on a link with idle traffic won't reproduce the issue.

I think this problem also affects 2.4GHz. At some point in the past I moved the bridge to the AP's 2.4GHz radio and my logs show I still experienced the rekey problems there too.

I believe that 4-addr frames are required to induce this problem. I've run multiple day-long tests but have not yet been able to reproduce this issue without 4-addr mode enabled.

Also worthy of note: For each rekey event, either it succeeds on the first rekey attempt or fails after 4 tries. There have been no successes where the first rekey attempt failed but one of the later three succeeded. It's always total success or total failure.

What can I do to help get this fixed?

Detailed facts below:

A typical rekey failure looks like this:
Fri Dec 24 21:42:38 2021 daemon.debug hostapd: wlan0: WPA rekeying GTK
Fri Dec 24 21:42:38 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: sending 1/2 msg of Group Key Handshake
Fri Dec 24 21:42:38 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 IEEE 802.1X: did not Ack EAPOL-Key frame (broadcast index=0)
Fri Dec 24 21:42:38 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: EAPOL-Key timeout
Fri Dec 24 21:42:38 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: sending 1/2 msg of Group Key Handshake
Fri Dec 24 21:42:38 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 IEEE 802.1X: did not Ack EAPOL-Key frame (broadcast index=0)
Fri Dec 24 21:42:39 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: EAPOL-Key timeout
Fri Dec 24 21:42:39 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: sending 1/2 msg of Group Key Handshake
Fri Dec 24 21:42:39 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 IEEE 802.1X: did not Ack EAPOL-Key frame (broadcast index=0)
Fri Dec 24 21:42:40 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: EAPOL-Key timeout
Fri Dec 24 21:42:40 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: sending 1/2 msg of Group Key Handshake
Fri Dec 24 21:42:40 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 IEEE 802.1X: did not Ack EAPOL-Key frame (broadcast index=0)
Fri Dec 24 21:42:41 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: EAPOL-Key timeout
Fri Dec 24 21:42:41 2021 daemon.info hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: group key handshake failed (RSN) after 4 tries
Fri Dec 24 21:42:41 2021 daemon.debug hostapd: wlan0: STA aa:bb:cc:11:22:33 WPA: WPA_PTK: sm->Disconnect
Fri Dec 24 21:42:41 2021 daemon.notice hostapd: wlan0: AP-STA-DISCONNECTED aa:bb:cc:11:22:33
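
For anyone wanting to measure their own failure rate from logs like the one above, here is a minimal sketch. The function name rekey_stats is made up for illustration. It assumes hostapd debug logging is enabled and that only one station is associated, so that each "WPA rekeying GTK" line corresponds to exactly one group key handshake; with several stations the percentage would not be meaningful.

```shell
# Summarize GTK rekey outcomes from hostapd log lines read on stdin.
# Assumption: a single associated station, so one "WPA rekeying GTK" line
# means one group key handshake. rekey_stats is a hypothetical helper name.
rekey_stats() {
    awk '
        /WPA rekeying GTK/           { rekeys++ }
        /group key handshake failed/ { failed++ }
        END {
            # Avoid dividing by zero when the log holds no rekey events.
            pct = rekeys ? 100 * failed / rekeys : 0
            printf "rekeys=%d failed=%d failure_pct=%.0f\n", rekeys, failed, pct
        }
    '
}
```

On the AP this could be fed from the system log, for example with "logread | rekey_stats", or from a saved log file with "rekey_stats < saved.log".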

Just to be clear, I have two entirely separate stations and APs demonstrating the problem, so it's not likely to be failed hardware.

To reproduce it, I run iperf in a loop from the station to the AP:
while true ; do iperf3 -c $MY_AP --bidir -b 50M ; iperf3 -c $MY_AP --bidir -b 50M -u ; sleep 30 ; done

Station info:
linux 5.15.0-2-amd64
The 802.11 adapter is an Intel 3160, 7260, or ax200 (all three tested)
Intel firmware is Debian package 20210315-2. For the 3160 and 7260 it's -17 and for the ax200 it's 62.49eeb572.0
/etc/network/interfaces:
auto eth0
iface eth0 inet manual

auto wlan0
iface wlan0 inet manual

auto br0
iface br0 inet static
    pre-up iw dev wlan0 set 4addr on
    pre-up iw dev wlan0 set power_save off
    post-down iw dev wlan0 set 4addr off
    address 192.168.0.10/24
    broadcast 192.168.0.255
    gateway 192.168.0.1
    dns-nameserver 192.168.0.1
    bridge_ports wlan0 eth0
    bridge_stp off
    bridge_waitport 5
    bridge_fd 0
    wpa-ssid whatever
    wpa-psk some-key-goes-here
    wpa-iface wlan0
    wpa-bridge br0

AP info:
OpenWRT 21.02.1 official and custom builds
OpenWRT firmware packages:
ath10k-board-qca99x0 - 20201118-3
ath10k-firmware-qca99x0-ct - 2020-11-08-1
ethtool -i wlan0
driver: ath10k_pci
version: 5.4.154
firmware-version: 10.4b-ct-9980-fW-13-5ae337bb1
Tested two different ipq806x devices with 5GHz qca9980 radios (Linksys EA8500 and some Trendnet)
The only config change from stock needed is to set "option wds 1" on the wifi-iface.

I have not tried any custom firmware builds yet. It's been two years, and it can't be fixed if nobody knows about it. Let me know if you want me to try anyway.

If I build you a set of firmware images, would you be interested in trying to bisect the problem? The procedure is to copy the firmware blob onto your OpenWrt system, reboot, and test. You should not actually need to build a new OpenWrt image. If it is a bug I introduced, then the bisect should find it and I can probably fix it.
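
The bisect idea can be sketched as a small binary search in shell. This is only an illustration under stated assumptions: bisect_first_bad and the is_good callback are hypothetical names, the images are passed oldest to newest, and the last image is assumed to show the bug. In practice the is_good step would mean installing one image, rebooting, running the iperf3 loop through a few rekeys, and reporting the result by hand.

```shell
# Find the first "bad" item in an ordered list, assuming every item before it
# is good and every item from it onward is bad (the usual bisect assumption).
# $1: name of a command or function that returns 0 when the given item is good.
# Remaining arguments: the ordered items, e.g. firmware image names.
# bisect_first_bad is a hypothetical helper, not part of any existing tool.
bisect_first_bad() {
    _is_good=$1
    shift
    _lo=1      # lowest position that could still be the first bad item
    _hi=$#     # highest position; the last item is assumed to be bad
    while [ "$_lo" -lt "$_hi" ]; do
        _mid=$(( (_lo + _hi) / 2 ))
        eval "_item=\${$_mid}"
        if "$_is_good" "$_item"; then
            _lo=$(( _mid + 1 ))   # bug appears somewhere after this item
        else
            _hi=$_mid             # this item is bad; the first bad one is here or earlier
        fi
    done
    eval "_item=\${$_lo}"
    printf '%s\n' "$_item"
}
```

With N images this needs roughly log2(N) test rounds. Since rekey failures here are intermittent, each "good" verdict should rest on enough rekey events to be trustworthy.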

I am willing to do any testing you prescribe. I have my test environment ready to go.

My skill level is "overachieving bash hacker", which isn't a compliment. I can build my own images, apply patches, and that sort of thing, but anything beyond a hello-world in C is a no-go and I don't know anything about device firmware.

I believe I am suffering from this same problem. R7800 as master (WDS, platform is ipq806x with qca9880), MR8300 as client (ipq40xx, has a third radio so I don't run AP+STA), WPA2 with 802.11w optional. My periodic 1s ping tests between the two routers show 100% packet loss every 6000 seconds, which exactly matches my configured GTK rekey period. ct drivers on both ends, ath10k on both ends. I believe (but need to retest) I can replicate this on WPA3 as well.

Will try lowering the rekey period.

Noting as well https://patchwork.kernel.org/project/ath10k/patch/f8d6d13943ccac22f363d7d4f53d645c@codeaurora.org/ seems to be related?

Note my comment above:

I have noticed the failure rate increases as wpa_group_rekey on hostapd is increased: 3% failure at 60, 25% at 600, 87% at 3600, and 99% at 86400

You will need to change your wpa_group_rekey to sub-1-hour to be successful even some of the time. It might just be better to set it really high, though I don't know what the maximum is.

I am still willing to test this out. I just got busy and never really followed up on this.