In BGP mode, each node in your cluster establishes a BGP peering session with your network routers, and uses that peering session to advertise the IPs of external cluster services.
Assuming your routers are configured to support multipath, this enables true load-balancing: the routes published by MetalLB are equivalent to each other, except for their nexthop. This means that the routers will use all nexthops together, and load-balance between them.
Once the packets arrive at the node, kube-proxy is responsible for the final hop of traffic routing, to get the packets to one specific pod in the service.
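As a rough mental model (not MetalLB's actual data model, and with made-up addresses), the sketch below shows what the upstream router ends up with: one route per node for the service's external IP, identical except for the nexthop.

```go
// A toy picture, not MetalLB's actual data model: the router learns one
// route per node for the service's external IP, identical apart from the
// nexthop. All addresses below are made up for illustration.
package main

import "fmt"

type route struct {
	prefix  string // the service's external IP, advertised as a /32
	nextHop string // the cluster node that advertised it
}

func main() {
	serviceIP := "203.0.113.5/32"
	advertised := []route{
		{prefix: serviceIP, nextHop: "10.0.0.1"},
		{prefix: serviceIP, nextHop: "10.0.0.2"},
		{prefix: serviceIP, nextHop: "10.0.0.3"},
	}

	// With multipath enabled, the router treats these as one equal-cost
	// group and spreads connections across all three nexthops. kube-proxy
	// on the chosen node then forwards each packet to one of the
	// service's pods.
	for _, r := range advertised {
		fmt.Printf("%s via %s\n", r.prefix, r.nextHop)
	}
}
```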
The exact behavior of the load-balancing depends on your specific router model and configuration, but the common behavior is to balance per-connection, based on a packet hash. What does this mean?
Per-connection means that all the packets for a single TCP or UDP session will be directed to a single machine in your cluster. The traffic spreading only happens between different connections, not for packets within one connection.
This is a good thing, because spreading packets across multiple cluster nodes would result in poor behavior on several levels: packets from a single connection would arrive reordered, which badly hurts end-host performance, and different nodes could route packets from the same connection to different pods, breaking the connection entirely.
Packet hashing is how high-performance routers can statelessly spread connections across multiple backends. For each packet, they extract some of the fields and hash them, using the result to deterministically pick one of the possible backends. If all the fields are the same, the same backend will be chosen.
The exact hashing methods available depend on the router hardware and software. Two typical options are 3-tuple and 5-tuple hashing. 3-tuple uses (protocol, source-ip, dest-ip) as the key, meaning that all packets between two unique IPs will go to the same backend. 5-tuple hashing adds the source and destination ports to the mix, which allows different connections from the same clients to be spread around the cluster.
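To make this concrete, here is a minimal sketch of per-connection hashing, assuming a simple FNV hash over the tuple fields and three made-up node addresses; real routers use their own hash functions, but the principle is the same.

```go
// A toy model of per-connection ECMP hashing. The FNV hash, field order,
// and node addresses are illustrative assumptions, not any particular
// router's implementation.
package main

import (
	"fmt"
	"hash/fnv"
)

// nodes stands in for the equal-cost nexthops the router learned from
// MetalLB (one per cluster node advertising the service IP).
var nodes = []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}

// pick hashes the given header fields and uses the result as an index
// into the nexthop list: identical fields always yield the same node.
func pick(fields ...string) string {
	h := fnv.New32a()
	for _, f := range fields {
		h.Write([]byte(f))
	}
	return nodes[h.Sum32()%uint32(len(nodes))]
}

func main() {
	// Two TCP connections from the same client to the same service IP,
	// differing only in the client's source port.
	connA := []string{"tcp", "192.0.2.10", "203.0.113.5", "40001", "443"}
	connB := []string{"tcp", "192.0.2.10", "203.0.113.5", "40002", "443"}

	// 3-tuple key (protocol, source IP, destination IP): both connections
	// hash identically, so they share a node.
	fmt.Println("3-tuple A:", pick(connA[0], connA[1], connA[2]))
	fmt.Println("3-tuple B:", pick(connB[0], connB[1], connB[2]))

	// 5-tuple key adds the ports, so the two connections can land on
	// different nodes and spread the load further.
	fmt.Println("5-tuple A:", pick(connA...))
	fmt.Println("5-tuple B:", pick(connB...))
}
```

Because every packet of a given connection carries the same tuple fields, it always hashes to the same node; only the choice of fields controls how different connections spread across nodes.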
In general, it’s preferable to put as much entropy as possible into the packet hash, which means that using more fields is better. Increased entropy brings us closer to the “ideal” load-balancing state, where every node receives exactly the same number of packets. We can never achieve that ideal state because of the problems we listed above, but what we can do is spread connections as evenly as possible, to prevent hotspots from forming.
Using BGP as a load-balancing mechanism has the advantage that you can use standard router hardware, rather than bespoke load-balancers. However, this comes with downsides as well.
The biggest is that BGP-based load balancing does not react gracefully to changes in the backend set for an address. What this means is that when a cluster node goes down, you should expect all active connections to your service to be broken (users will see “Connection reset by peer”).
BGP-based routers implement stateless load-balancing. They assign a given packet to a specific next hop by hashing some fields in the packet header, and using that hash as an index into the array of available backends.
The problem is that the hashes used in routers are usually not stable, so whenever the size of the backend set changes (for example when a node’s BGP session goes down), existing connections are rehashed effectively randomly. As a result, the majority of existing connections will suddenly be forwarded to a different backend, one that has no knowledge of the connection in question.
The consequence of this is that any time the IP→Node mapping changes for your service, you should expect to see a one-time hit where most active connections to the service break. There’s no ongoing packet loss or blackholing, just a one-time clean break.
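The effect is easy to reproduce with a toy sketch, assuming made-up flow keys and node addresses and a naive hash-modulo mapping rather than any real router's implementation: when one of three advertising nodes disappears, roughly two thirds of the simulated flows map to a different node, even though only a third of them were on the node that failed.

```go
// A toy demonstration of why stateless rehashing breaks connections when
// the backend set changes. The flow keys, node addresses, and hash-modulo
// mapping are illustrative only, not how any specific router works.
package main

import (
	"fmt"
	"hash/fnv"
)

// backend maps a flow key onto one of the given nexthops by hashing it.
func backend(flow string, nodes []string) string {
	h := fnv.New32a()
	h.Write([]byte(flow))
	return nodes[h.Sum32()%uint32(len(nodes))]
}

func main() {
	before := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	after := []string{"10.0.0.1", "10.0.0.2"} // 10.0.0.3's BGP session went down

	const flows = 10000
	moved := 0
	for i := 0; i < flows; i++ {
		// A synthetic connection: same client and service, varying source port.
		flow := fmt.Sprintf("tcp|192.0.2.10|203.0.113.5|%d|443", 30000+i)
		if backend(flow, before) != backend(flow, after) {
			moved++
		}
	}

	// Only the ~1/3 of flows that were on the failed node had to move, but
	// with naive hash-modulo rehashing roughly 2/3 end up on a different
	// node, and those connections are reset.
	fmt.Printf("%d of %d flows changed nexthop\n", moved, flows)
}
```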
Depending on what your services do, there are a couple of mitigation strategies you can employ: