facebookincubator / katran

A high performance layer 4 load balancer


Katran didn't route packets to the default gateway

bienkma opened this issue

Hi there,
I ran into an issue when using Katran for my services: some packets (about 1%) are not routed to the default gateway. Sometimes clients get the error message "timeout connecting to VIP, the timeout > 700ms". The service works fine below 10k pkts; the problem only appears once the LB goes above 10k pkts (clients send requests with a UUID debug token, and for the timed-out requests I can't find the request on the real server). How can I track down the problem? Can I use katran_server_grpc to set up an LB in a production environment?

  • More information:
    • katran version: commit 4cdc88b
    • Topology: client -----> katran [VIP:443] -----> 6 x haproxy (L7 LB)
    • katran configuration:
katran_server_grpc -balancer_prog /opt/katran/balancer_kern.o -default_mac=00:00:0c:9f:ff:51 -forwarding_cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71 -numa_nodes=0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1 -hc_forwarding=false -intf=ens1f0 -ipip_intf=ipip0 -lru_size=8000000 -map_path=/sys/fs/bpf/jmp_ens1f0 -server=0.0.0.0:50051
  • sysctl
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 0
net.core.bpf_jit_limit = 264241152
  • katran status:
summary: 6496 pkts/sec. lru hit: 99.03% lru miss: 0.97% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6419 pkts/sec. lru hit: 99.25% lru miss: 0.75% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6400 pkts/sec. lru hit: 99.34% lru miss: 0.66% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6549 pkts/sec. lru hit: 98.81% lru miss: 1.19% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
summary: 6634 pkts/sec. lru hit: 99.22% lru miss: 0.78% (tcp syn: 0.01% tcp non-syn: 0.00% udp: 0.99%) fallback lru hit: 0 pkts/sec
  • Numa
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 31778 MB
node 0 free: 28580 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 32246 MB
node 1 free: 7838 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
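
A quick sanity check for a setup like the one above (not part of the original report; the interface and path names are taken from the katran_server_grpc command line shown earlier) could look like this:

    # is an XDP program attached to the ingress interface? (look for a "prog/xdp id ..." line)
    ip link show dev ens1f0

    # does the IPIP encapsulation device exist, and what is its MTU?
    ip link show dev ipip0

    # is katran's pinned jump map present?
    ls -l /sys/fs/bpf/jmp_ens1f0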

I set up a SPAN port on the switch and captured the traffic on the katran interface. It looks like some packets are very large and have to be retransmitted. Could the problem be a misconfigured MTU?
[screenshot: packet capture showing the large segments and retransmissions]
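
One detail worth keeping in mind here (an observation, not something stated in the thread): katran forwards traffic to the reals over IPIP, so every packet grows by an outer IP header (20 bytes for an IPv4 outer header, 40 bytes for IPv6). A client segment that already fills a 1500-byte MTU therefore no longer fits after encapsulation. A rough way to look for such packets on the katran interface:

    # MTU of the ingress and IPIP interfaces
    ip link show dev ens1f0 | grep -o 'mtu [0-9]*'
    ip link show dev ipip0  | grep -o 'mtu [0-9]*'

    # capture only packets of 1500 bytes or more heading for the VIP port
    tcpdump -i ens1f0 -nn 'greater 1500 and port 443'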

@bienkma: what is the packet size used in the test?
Also do you see the xdp drop counter going up on the katran host using:

ethtool -S <interface> | grep rx_xdp_drop

@nikhildl12
This is not a testing environment; these are real requests from my clients. Sometimes we capture TCP segments with a length of 8000 as well as jumbo frames on the switch.
There is no rx_xdp_drop counter. This is the output of the command when I tried to find anything dropped on the katran interface:

ethtool -S eno1 | grep drop
     rx_dropped: 0
     tx_dropped: 0
     port.rx_dropped: 0
     port.tx_dropped_link_down: 0
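
As an aside (not from the thread): the XDP counters exposed by ethtool -S are driver-specific, and some drivers export no per-XDP-action counters at all, so a broader grep plus the generic interface statistics may be the closest available signal:

    ethtool -S eno1 | grep -iE 'xdp|drop'
    ip -s link show dev eno1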

Looks like this issue: #82

The problem is solved. I added the mss 1400 option on haproxy (the layer 7 LB).
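
For anyone hitting the same symptom: the clamp presumably works because katran does direct server return, so the SYN-ACK (and its MSS option) comes from haproxy; advertising mss 1400 keeps client segments small enough that they still fit in a 1500-byte MTU after katran adds the IPIP encapsulation headers. A minimal haproxy sketch of that option (frontend and backend names are placeholders, not from the thread):

    frontend fe_vip_443
        # advertise a smaller MSS so client segments, plus katran's IPIP
        # overhead, stay within a 1500-byte path MTU
        bind :443 mss 1400
        default_backend be_app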