facebookincubator / katran

A high performance layer 4 load balancer


Katran didn't route packets to the default gateway

bienkma opened this issue

Hi there,
I ran into an issue when using Katran for my services: some packets (about 1%) are not routed to the default gateway. Sometimes clients get the error message "timeout connecting to VIP, the timeout > 700ms". The service works fine below 10k pkts; the problem only appears once the LB goes above 10k pkts (clients send requests with a UUID debug token, and for the timed-out requests I can't find the request on the real server). How can I track down the problem? Can I use katran_server_grpc to set up an LB in a production environment?

  • More information:
    • katran version: commit 4cdc88b
    • Topology: client -----> katran [VIP:443] -----> 6 x haproxy (L7 LB)
    • katran configuration:
katran_server_grpc -balancer_prog /opt/katran/balancer_kern.o -default_mac=00:00:0c:9f:ff:51 -forwarding_cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71 -numa_nodes=0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1 -hc_forwarding=false -intf=ens1f0 -ipip_intf=ipip0 -lru_size=8000000 -map_path=/sys/fs/bpf/jmp_ens1f0 -server=0.0.0.0:50051
  • sysctl
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 0
net.core.bpf_jit_limit = 264241152
  • katran status:
summary: 6496 pkts/sec. lru hit: 99.03% lru miss: 0.97% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6419 pkts/sec. lru hit: 99.25% lru miss: 0.75% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6400 pkts/sec. lru hit: 99.34% lru miss: 0.66% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6549 pkts/sec. lru hit: 98.81% lru miss: 1.19% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6634 pkts/sec. lru hit: 99.22% lru miss: 0.78% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
  • Numa
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 31778 MB
node 0 free: 28580 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 32246 MB
node 1 free: 7838 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
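
A quick sanity check for a setup like the one above (not part of the original report; the interface and path names are taken from the katran_server_grpc command line shown earlier) could look like this:

    # is an XDP program attached to the ingress interface? (look for a "prog/xdp id ..." line)
    ip link show dev ens1f0

    # does the IPIP encapsulation device exist, and what is its MTU?
    ip link show dev ipip0

    # is katran's pinned jump map present?
    ls -l /sys/fs/bpf/jmp_ens1f0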

I set up a SPAN port on the switch and captured the traffic on the katran interface. It looks like some packets are very large and have to be retransmitted. Could the problem be a misconfigured MTU?
[screenshot: packet capture showing the large segments and retransmissions]
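
One detail worth keeping in mind here (an observation, not something stated in the thread): katran forwards traffic to the reals over IPIP, so every packet grows by an outer IP header (20 bytes for an IPv4 outer header, 40 bytes for IPv6). A client segment that already fills a 1500-byte MTU therefore no longer fits after encapsulation. A rough way to look for such packets on the katran interface:

    # MTU of the ingress and IPIP interfaces
    ip link show dev ens1f0 | grep -o 'mtu [0-9]*'
    ip link show dev ipip0  | grep -o 'mtu [0-9]*'

    # capture only packets of 1500 bytes or more heading for the VIP port
    tcpdump -i ens1f0 -nn 'greater 1500 and port 443'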

@bienkma: what is the packet size used in the test?
Also do you see the xdp drop counter going up on the katran host using:

ethtool -S <interface> | grep rx_xdp_drop

@nikhildl12
This is not a testing environment; these are real requests from my clients. Sometimes we capture TCP segments with a length of 8000 as well as jumbo frames on the switch.
There is no rx_xdp_drop counter. This is the output of the command when I tried to find anything dropped on the katran interface:

ethtool -S eno1 | grep drop
     rx_dropped: 0
     tx_dropped: 0
     port.rx_dropped: 0
     port.tx_dropped_link_down: 0
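
As an aside (not from the thread): the XDP counters exposed by ethtool -S are driver-specific, and some drivers export no per-XDP-action counters at all, so a broader grep plus the generic interface statistics may be the closest available signal:

    ethtool -S eno1 | grep -iE 'xdp|drop'
    ip -s link show dev eno1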

Looks like this issue: #82

The problem is solved. I added the mss 1400 option on haproxy (the layer 7 LB).
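
For anyone hitting the same symptom: the clamp presumably works because katran does direct server return, so the SYN-ACK (and its MSS option) comes from haproxy; advertising mss 1400 keeps client segments small enough that they still fit in a 1500-byte MTU after katran adds the IPIP encapsulation headers. A minimal haproxy sketch of that option (frontend and backend names are placeholders, not from the thread):

    frontend fe_vip_443
        # advertise a smaller MSS so client segments, plus katran's IPIP
        # overhead, stay within a 1500-byte path MTU
        bind :443 mss 1400
        default_backend be_app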