SNAS / openbmp

OpenBMP Server Collector

Home Page: www.openbmp.org

Handling peer down/up events seems much slower than parsing route updates in scaling tests bouncing BGP peers

3fr61n opened this issue

Hi @TimEvens

First of all, excellent project! Quite interesting.

I was doing some scaling tests and noticed that, for a router handling ~100 peers with ~3000 routes per peer, when I bounced all BGP sessions (or restarted the BMP collector) it took a long time (~40-50 min) for the collector to dump all information to Kafka (BGP peer down/up events and BGP route updates).

Checking the logs on the OpenBMP collector, it seems that BGP peer down/up events take much longer to process than BGP route updates. (Is this expected?)

For instance, the following log lines show that it takes 10 seconds to process each peer down event:

2018-07-20T11:07:24.445246 | NOTICE | parsePeerDownEventHdr | sock=16 : 10.0.0.5: BGP peer down notification with reason code: 1
2018-07-20T11:07:34.456889 | NOTICE | parsePeerDownEventHdr | sock=16 : 10.0.0.7: BGP peer down notification with reason code: 1
2018-07-20T11:07:44.464163 | NOTICE | parsePeerDownEventHdr | sock=16 : 10.0.0.37: BGP peer down notification with reason code: 1
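The ~10-second spacing can be read directly off the collector timestamps above. A quick sanity check (plain Python, timestamps copied verbatim from the log):

```python
from datetime import datetime

# Timestamps taken from the three parsePeerDownEventHdr log lines above
stamps = [
    datetime.fromisoformat("2018-07-20T11:07:24.445246"),
    datetime.fromisoformat("2018-07-20T11:07:34.456889"),
    datetime.fromisoformat("2018-07-20T11:07:44.464163"),
]

# Inter-event gaps in seconds
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print(gaps)  # each gap is ~10 seconds
```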

Meanwhile, route updates go quite fast...

Checking the router side using logs and counters, the router dumps all BMP events in just ~3-4 min; however, until all peer down/up events are processed, the collector does not begin any route update processing. (Is this also expected?)

Thanks in advance and regards

The 10-second gap between peer down events must be on the router side. The collector does not cache or store anything (e.g. maintain a RIB); it is just a real-time pass-through of BMP/BGP messages. Any delay we would introduce would show up on the consumer side (e.g. a DB such as Postgres or MySQL). The OpenBMP log messages indicating a 10-second gap must be the router/sender causing that. Which router/version are you using? Can you send me a pcap trace at tim@openbmp.org?