SNAS / openbmp

OpenBMP Server Collector

Home Page: www.openbmp.org

Handling peer down/up events seems much slower than parsing route updates in scaling tests bouncing BGP peers

3fr61n opened this issue

Hi @TimEvens

First of all, excellent project! Quite interesting.

I was doing some scaling tests and noticed that, for a router handling ~100 peers with ~3000 routes per peer, when I bounced all BGP sessions (or restarted the BMP collector) it took a long time (~40-50 min) for the collector to dump all information to Kafka (BGP peer down/up events and BGP route updates).

Checking the logs on the OpenBMP collector, it seems that BGP peer down/up events take much longer to process than BGP route updates. (Is this expected?)

For instance, the following log lines show that it takes 10 seconds to process each peer down event:

2018-07-20T11:07:24.445246 | NOTICE | parsePeerDownEventHdr | sock=16 : 10.0.0.5: BGP peer down notification with reason code: 1
2018-07-20T11:07:34.456889 | NOTICE | parsePeerDownEventHdr | sock=16 : 10.0.0.7: BGP peer down notification with reason code: 1
2018-07-20T11:07:44.464163 | NOTICE | parsePeerDownEventHdr | sock=16 : 10.0.0.37: BGP peer down notification with reason code: 1
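The ~10-second spacing can be read directly off the collector timestamps above. A quick sanity check (plain Python, timestamps copied verbatim from the log):

```python
from datetime import datetime

# Timestamps taken from the three parsePeerDownEventHdr log lines above
stamps = [
    datetime.fromisoformat("2018-07-20T11:07:24.445246"),
    datetime.fromisoformat("2018-07-20T11:07:34.456889"),
    datetime.fromisoformat("2018-07-20T11:07:44.464163"),
]

# Inter-event gaps in seconds
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print(gaps)  # each gap is ~10 seconds
```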

Meanwhile, route updates go quite fast...

Checking the router side using logs and counters, the router dumps all BMP events in just ~3-4 min; however, until all peer down/up events are processed, the collector does not begin any route update processing. (Is this also expected?)

Thanks in advance and regards

The 10-second gap between peer down events must be on the router side. The collector does not cache or store anything (e.g. maintain a RIB); it is just a real-time pass-through of BMP/BGP messages. Any delay we would introduce would show up on the consumer side (e.g. a DB such as Postgres or MySQL). The OpenBMP log messages indicating a 10-second gap must be the router/sender causing that. Which router/version are you using? Can you send me a pcap trace at tim@openbmp.org?