Katran didn't route packets to default gateway
bienkma opened this issue
Hi there,
I ran into an issue when using Katran for my services: some packets (about 1%) are not routed to the default gateway. Sometimes clients get the error message "timeout connection to VIP, the timeout > 700ms". The service works fine under 10k pkts; the problem only appears once the LB goes over 10k pkts (clients send requests with a UUID debug token, and for the timed-out requests I cannot find the request on the real server). How can I investigate this? Can I use katran_server_grpc to set up an LB in a production environment?
- More information:
- katran version: commit 4cdc88b
- Model: client -----> katran[VIP:443] -----> 6 × [haproxy-LB-Layer7]
- katran configuration:
katran_server_grpc -balancer_prog /opt/katran/balancer_kern.o -default_mac=00:00:0c:9f:ff:51 -forwarding_cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71 -numa_nodes=0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1 -hc_forwarding=false -intf=ens1f0 -ipip_intf=ipip0 -lru_size=8000000 -map_path=/sys/fs/bpf/jmp_ens1f0 -server=0.0.0.0:50051
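A quick way to confirm the balancer program is actually attached to the interface named in -intf (the commands below assume iproute2 and bpftool are installed; they are not taken from this issue):
# the interface should report an attached XDP program id
ip link show dev ens1f0
# list loaded BPF programs and look for the balancer's XDP program
bpftool prog list | grep -i xdp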
- sysctl
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 0
net.core.bpf_jit_limit = 264241152
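These settings can be applied at runtime with sysctl -w and persisted in a drop-in file; a minimal sketch (the file name is just an example):
sysctl -w net.core.bpf_jit_enable=1
sysctl -w net.core.bpf_jit_harden=0
# persist across reboots
cat <<'EOF' > /etc/sysctl.d/90-bpf-jit.conf
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 0
net.core.bpf_jit_limit = 264241152
EOF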
- katran status:
summary: 6496 pkts/sec. lru hit: 99.03% lru miss: 0.97% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6419 pkts/sec. lru hit: 99.25% lru miss: 0.75% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6400 pkts/sec. lru hit: 99.34% lru miss: 0.66% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6549 pkts/sec. lru hit: 98.81% lru miss: 1.19% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6634 pkts/sec. lru hit: 99.22% lru miss: 0.78% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
- NUMA topology:
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 31778 MB
node 0 free: 28580 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 32246 MB
node 1 free: 7838 MB
node distances:
node 0 1
0: 10 21
1: 21 10
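This looks like numactl --hardware output; the -forwarding_cores/-numa_nodes flags above have to match it, since the -numa_nodes list gives, for each forwarding core, the NUMA node that core belongs to. The mapping can be cross-checked with (commands assumed available on the host):
# CPU -> NUMA node pairs, to cross-check the -numa_nodes list
lscpu -p=CPU,NODE | grep -v '^#'
# full topology, same information as above
numactl --hardware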
@bienkma: what is the packet size used in the test?
Also, do you see the XDP drop counter going up on the katran host, using:
ethtool -S <interface> | grep rx_xdp_drop
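Not every NIC driver exposes an rx_xdp_drop statistic; as a fallback, generic drop and XDP-related counters can be checked with (interface name taken from the configuration above):
# any driver statistic mentioning xdp or drop
ethtool -S ens1f0 | grep -iE 'xdp|drop'
# kernel-level per-interface RX/TX drop counters
ip -s link show dev ens1f0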
@nikhildl12
This was not a testing environment; these are real requests from my clients. Sometimes we captured TCP segments of length 8000 and jumbo frames on the switch. This NIC does not expose an rx_xdp_drop counter. This is the output of the command when I tried to find anything dropped on the katran interface:
ethtool -S eno1 | grep drop
rx_dropped: 0
tx_dropped: 0
port.rx_dropped: 0
port.tx_dropped_link_down: 0
The problem is solved. I added the mss 1400 option on haproxy (the layer-7 LB). Clamping the MSS presumably keeps segments small enough to still fit within the path MTU once Katran adds its IPIP encapsulation overhead.
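For anyone hitting the same symptom, the fix corresponds to the mss parameter on an haproxy bind line; a minimal sketch, with frontend/backend names, bind address, and server address as placeholders rather than the poster's actual configuration:
frontend fe_vip_https
    # advertise a reduced MSS to clients so inbound segments, which Katran
    # encapsulates in IPIP, still fit within the path MTU
    bind :443 mss 1400
    default_backend be_app

backend be_app
    server app1 10.0.0.10:443 check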