A while back, before I started Ameliorate Labs in 2018 (ok, a long while back), I had begun writing a series on monitoring BGP. Once I launched the company I took down the other blog in the hopes of starting another one here. Now that we have one up and running again, it’s time to re-publish and continue the series. I’ve learned a lot since I originally wrote part one and have some exciting changes planned. However, since part one was mostly a general overview of the project, I will post the original version here and make changes as needed in future posts. Without further ado, I present:
Logging the Core: Part 1, An experiment in BGP monitoring.
We’re excited, this is our first blog post! Not only that, but also the first in a series on BGP monitoring! Isn’t that a great start? Throughout this series we will talk about BGP, take a look at important indicators, and research monitoring techniques with the end goal of building a scalable and powerful BGP monitoring and analysis platform.
The internet functions because of Border Gateway Protocol, aka BGP. We won’t go into detail on the uses, configuration, or concepts of BGP in this post, but if you are not familiar with it I suggest you read some of the following articles before continuing.
Why Monitor BGP?
BGP is an extremely important part of the internet. Companies may peer with their internet service provider (ISP) to better control inbound or outbound traffic, and ISPs almost always use BGP to communicate with each other. If your BGP session gets torn down for any reason, or you stop advertising/receiving prefixes, your connectivity can suffer immediately. Because of this, you really need to know when something abnormal happens. It could make or break your business, career, and reputation.
In addition to simple announcement monitoring, you also need to watch for what’s known as BGP hijacking. In short, that’s when another (unauthorized) entity advertises your routes with a shorter AS path or more specific prefixes. This causes traffic to flow through them rather than your organization. You can see why this would be bad.
Picking what to Monitor
There are a lot of features within BGP, so to properly plan our project we need to pick what exactly we want to monitor. Based on the previous paragraphs, we can set an absolute minimum of prefixes and origin ASNs. This would allow us to ensure advertisements are making it to the global routing table, and check that they are being announced by the proper ASN. While that might be fine for basic monitoring, our policy when it comes to data is to log as much as possible. You can trim data later if required, but you won’t have another chance to capture it.
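As a rough illustration of that minimum, here is a small Python sketch that compares an observed announcement against the prefixes and origin ASN we expect. The `EXPECTED_ORIGINS` table, the `check_announcement` function, and the result labels are hypothetical names made up for this example, and the prefix and ASNs come from the ranges reserved for documentation. It only looks at the prefix and the origin ASN, so it would not catch a hijack that relies on a shorter AS path.

```python
import ipaddress

# Hypothetical expectations: each prefix we originate and the ASN that
# should be announcing it (documentation prefix and private-use ASN).
EXPECTED_ORIGINS = {
    ipaddress.ip_network("203.0.113.0/24"): 64500,
}


def check_announcement(prefix, origin_asn, expected=EXPECTED_ORIGINS):
    """Classify an observed (prefix, origin ASN) pair against our expectations."""
    observed = ipaddress.ip_network(prefix)
    for ours, asn in expected.items():
        if observed.version != ours.version:
            continue  # never compare IPv4 against IPv6
        if observed == ours:
            return "ok" if origin_asn == asn else "origin-mismatch"
        if observed.subnet_of(ours) and origin_asn != asn:
            # Someone else is announcing a more specific slice of our space.
            return "foreign-more-specific"
    return "unrelated"
```

In a real system the expected table would come from configuration or a registry rather than a hard-coded dictionary.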
BGP Message Types
As defined in section 4 of the BGP4 RFC (RFC 4271), there are four message types: OPEN, UPDATE, NOTIFICATION, and KEEPALIVE. RFC 2918 adds another type, but we’ll touch on that later.
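To make the message types concrete, here is a minimal Python sketch of reading the fixed BGP header, which RFC 4271 describes as a 16-byte marker, a 2-byte length, and a 1-byte type. The names `BGPMessageType`, `HEADER`, and `parse_header` are our own for illustration and not part of any library, and a real collector would need more validation than this.

```python
import struct
from enum import IntEnum


class BGPMessageType(IntEnum):
    OPEN = 1
    UPDATE = 2
    NOTIFICATION = 3
    KEEPALIVE = 4
    ROUTE_REFRESH = 5  # the extra type added by RFC 2918


# Network byte order: 16-byte marker, 2-byte length, 1-byte type (19 bytes).
HEADER = struct.Struct("!16sHB")


def parse_header(data):
    """Return (total message length, message type) from the first 19 bytes."""
    marker, length, msg_type = HEADER.unpack_from(data)
    if marker != b"\xff" * 16:
        raise ValueError("marker is not all ones")
    if length < HEADER.size:
        raise ValueError("length is shorter than the header itself")
    # Raises ValueError for type codes this sketch does not know about.
    return length, BGPMessageType(msg_type)
```

A KEEPALIVE is just this header with nothing after it, which makes it a handy test case.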
The OPEN message is sent after the TCP connection is opened, and includes the BGP version (generally v4), ASN of the sender, hold time (how long before the connection times out after the last keepalive or update), BGP ID, and optional parameters. The most common optional parameter, Capabilities, was defined in RFC 3392, which was obsoleted by RFC 5492. This information is really only of use to the maintainer of the monitoring system. It would be good to log, but either at a verbose level or in a separate data store.
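Here is a sketch of pulling those fields out of an OPEN message body (the bytes that follow the 19-byte header). The field order and sizes follow RFC 4271; the `OpenMessage` tuple and `parse_open` function are hypothetical names for this example. The optional parameters are left as raw bytes rather than decoded.

```python
import ipaddress
import struct
from typing import NamedTuple


class OpenMessage(NamedTuple):
    version: int
    my_as: int
    hold_time: int
    bgp_id: str
    optional_parameters: bytes


# version (1), My AS (2), hold time (2), BGP ID (4), optional parameter length (1)
OPEN_FIXED = struct.Struct("!BHH4sB")


def parse_open(body):
    """Parse the body of an OPEN message, i.e. everything after the header."""
    version, my_as, hold_time, bgp_id, opt_len = OPEN_FIXED.unpack_from(body)
    params = body[OPEN_FIXED.size:OPEN_FIXED.size + opt_len]
    if len(params) != opt_len:
        raise ValueError("optional parameters are truncated")
    # The BGP ID is a 4-byte value conventionally written like an IPv4 address.
    return OpenMessage(version, my_as, hold_time,
                       str(ipaddress.IPv4Address(bgp_id)), params)
```

Note that the 2-byte My AS field cannot hold a 4-byte ASN, so this sketch reports whatever value is on the wire.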
Arguably the most important message type, the UPDATE message advertises routes along with their path attributes, and withdraws routes that are no longer available. Suffice it to say, most of our logging will be done on this message.
We won’t go into more detail on this now. Part 2 of this series will be dedicated to the UPDATE message and the information we can gather from it.
NOTIFICATION messages are sent to peers when an error occurs. The defined errors are: Message Header Error, OPEN Message Error, UPDATE Message Error, Hold Timer Expired, Finite State Machine Error, and Cease. Again, this information would only be of value to the monitoring system maintainer, so we treat it similarly to the OPEN messages.
The exception here is the Cease error code. Cease is a general-purpose message to close the peering connection when a non-fatal condition is met. The RFC allows peers to close connections if a locally configured prefix limit is reached. This could indicate either an outdated configuration or a route leak scenario. Human intervention is often required in these cases, so we’ve decided to log Cease events in the general data store for easier analysis.
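The error codes above map to the numbers 1 through 6 in RFC 4271, and a NOTIFICATION body is simply an error code byte, a subcode byte, and optional data. The sketch below decodes that and flags the prefix-limit case. The names are ours, and the subcode value for "Maximum Number of Prefixes Reached" comes from RFC 4486 rather than from the base BGP RFC.

```python
from enum import IntEnum


class NotificationCode(IntEnum):
    MESSAGE_HEADER_ERROR = 1
    OPEN_MESSAGE_ERROR = 2
    UPDATE_MESSAGE_ERROR = 3
    HOLD_TIMER_EXPIRED = 4
    FSM_ERROR = 5
    CEASE = 6


# Cease subcode defined in RFC 4486: Maximum Number of Prefixes Reached.
CEASE_MAX_PREFIXES = 1


def parse_notification(body):
    """Split a NOTIFICATION body into (error code, subcode, data)."""
    if len(body) < 2:
        raise ValueError("NOTIFICATION body needs at least two bytes")
    # Codes are kept as plain ints so unknown values do not raise here.
    return body[0], body[1], body[2:]


def is_prefix_limit_cease(code, subcode):
    """True when a peer closed the session because a prefix limit was hit."""
    return code == NotificationCode.CEASE and subcode == CEASE_MAX_PREFIXES
```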
The KEEPALIVE message completes a simple but important task. It is sent within the Hold Time window to prevent BGP time-outs. The RFC suggests sending one at intervals of about one third of the Hold Time seen in the OPEN message. As stated in the RFC, BGP doesn’t use TCP-based keep-alive mechanisms. This is likely to ensure the BGP service itself is still operable, not just the TCP connection. Since there is a specific NOTIFICATION error code for Keepalive time-outs (Hold Timer Expired), we will not log KEEPALIVE messages at all.
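The timing arithmetic is simple enough to sketch. Per RFC 4271, the session uses the smaller of the two peers' hold times, a hold time of zero means no keepalives are sent, and any other value must be at least three seconds. The function names below are hypothetical.

```python
def negotiated_hold_time(local_hold_time, peer_hold_time):
    """The session uses the smaller of the two advertised hold times (seconds)."""
    return min(local_hold_time, peer_hold_time)


def keepalive_interval(hold_time):
    """Suggested seconds between KEEPALIVEs: one third of the hold time."""
    if hold_time == 0:
        return None  # a zero hold time means keepalives are not sent
    if hold_time < 3:
        raise ValueError("hold time must be 0 or at least 3 seconds")
    return hold_time / 3
```

For example, two peers advertising 180 and 90 seconds would settle on 90, giving a keepalive roughly every 30 seconds.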
So far we’ve discussed BGP session metadata. We wouldn’t really have any useful BGP logs at this point, but that will come in part 2. Right now we’ve decided to log the OPEN messages to a separate data store that we will refer to as the Verbose Data Store (VDS) from now on. We will also log all NOTIFICATION messages to the VDS, and duplicate the Cease code messages to the General Data Store (GDS). At this point, there is no practical use for the KEEPALIVE messages, so we are not going to log them at all.
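Put together, the logging decisions from this post fit in a few lines. This is only a sketch of the routing rule, with `destinations` and the store labels as placeholder names; it deliberately refuses to decide anything about UPDATE messages, since those are left for part 2.

```python
VDS = "VDS"  # Verbose Data Store
GDS = "GDS"  # General Data Store
CEASE = 6    # NOTIFICATION error code for Cease


def destinations(msg_type, error_code=None):
    """Return the set of data stores a message should be logged to."""
    if msg_type == "OPEN":
        return {VDS}
    if msg_type == "NOTIFICATION":
        # Everything goes to the VDS; Cease is duplicated to the GDS.
        return {VDS, GDS} if error_code == CEASE else {VDS}
    if msg_type == "KEEPALIVE":
        return set()  # not logged at all
    if msg_type == "UPDATE":
        raise NotImplementedError("UPDATE logging is covered in part 2")
    raise ValueError(f"unknown message type: {msg_type!r}")
```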