.. _fastconvergence:

Fast Convergence
------------------------------------

This is an excellent description of the topic:

`BGP PIC <https://tools.ietf.org/html/draft-ietf-rtgwg-bgp-pic-12>`_

but if you're interested in my take, keep reading...

First some definitions:

- Convergence: When a FIB is forwarding all packets correctly based
  on the network topology (i.e. doing what the routing control plane
  has instructed it to do), then it is said to be 'converged'.
  Not being in a converged state is [hopefully] a transient state,
  when either the topology change (e.g. a link failure) has not been
  observed or processed by the routing control plane, or the FIB
  is still processing routing updates. Convergence is the act of
  getting to the converged state.
- Fast: In the shortest time possible. There are no absolute limits
  placed on how short this must be, although there is one number often
  mentioned. Apparently the human ear can detect loss/delay/jitter in
  VOIP of 50ms, therefore network failures should last no longer than
  this, and some technologies (notably loop-free alternate fast
  reroute) are designed to converge in this time. However, it is
  generally accepted that it is not possible to converge a FIB with
  tens of millions of routes in this time scale; the industry
  'standard' is sub-second.

Converging the FIB quickly is thus a matter of:

- discovering something is down
- updating as few objects as possible
- determining which objects to update as efficiently as possible
- updating each object as quickly as possible

we'll discuss each in turn.

All output came from VPP version 21.01rc0. In what follows I use IPv4
prefixes, addresses and IPv4 host-length masks; however, exactly the
same applies to IPv6.

Failure Detection
^^^^^^^^^^^^^^^^^

The two common forms (we'll see others later on) of failure detection
are:

- link down
- BFD

The FIB needs to hook into these notifications to trigger
convergence.

Whenever an interface goes down, VPP issues a callback to all
registered clients. The adjacency code is such a client. The adjacency
is a leaf node in the FIB control-plane graph (containing fib_path_t,
fib_entry_t etc.). A back-walk from the adjacency will trigger a
re-resolution of the paths.

FIB is a client of BFD in order to receive BFD notifications. BFD
comes in two flavours: single and multi hop. Single hop is to protect
a specific peer on an interface; such peers are modelled by an
adjacency. Multi hop is to protect a peer on an unspecified interface
(i.e. a remote peer); this peer is represented by a host-prefix
**fib_entry_t**. In both cases FIB will add a delegate to the
**ip_adjacency_t** or **fib_entry_t** that represents the association
to the BFD session. If the BFD session signals up/down then a back-walk
can be triggered from the object to trigger re-resolution and hence
convergence.
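
For example, a single-hop (UDP) BFD session protecting the peer
10.0.0.2 on GigEthernet0/0/0 could be configured as below; the
addresses and timer values here are illustrative only:

.. code-block:: console

  DBGvpp# bfd udp session add interface GigEthernet0/0/0 local-addr 10.0.0.1 peer-addr 10.0.0.2 desired-min-tx 100000 required-min-rx 100000 detect-mult 3

Once the session is up, FIB's delegate on the adjacency for 10.0.0.2
ties the BFD state to the forwarding state; a session-down event
triggers the back-walk described above.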

Few Updates
^^^^^^^^^^^

In order to talk about what 'a few' is, we have to leave the realm of
the FIB as an abstract graph-based object DB and move into the
concrete representation of forwarding in a large network. Large
networks are built in layers; it's how you scale them. We'll take
here a hypothetical service provider (SP) network, but the concepts
apply equally to data center leaf-spines. This is a rudimentary
description, but it should serve our purpose.

An SP manages a BGP autonomous system (AS). The SP's goal is both to
attract traffic into its network to serve its customers, and to
serve transit traffic passing through it; we'll consider the latter here.
The SP's network is all the devices in that AS. These
devices are split into those at the edge (provider edge (PE) routers),
which peer with routers in other SP networks,
and those in the core (termed provider (P) routers). Both the PE and P
routers run the IGP (usually OSPF or ISIS). Only the reachability of the devices
in the AS is advertised in the IGP - thus the scale (i.e. the number
of routes) in the IGP is 'small' - only the number of
devices that the SP has (typically not more than a few 10k).
PE routers run BGP; they have external BGP sessions to devices in
other ASs and internal BGP sessions to devices in the same AS. BGP is
used to advertise the routes to *all* networks on the internet - at
the time of writing this number is approaching 900k IPv4 routes; hopefully by
the time you are reading this the number of IPv6 routes has caught up ...
If we include the additional routes the SP carries to offer VPN services to its
customers, the number of BGP routes can grow to the tens of millions.

BGP scale thus exceeds IGP scale by two orders of magnitude... pause for
a moment and let that sink in...

A comparison of BGP and an IGP is way, way beyond the scope of this
documentation (and frankly beyond me), so we'll note only the
difference in the form of the routes they present to FIB. A routing
protocol will produce routes that specify the prefixes that are
reachable through its peers. A good IGP
is link-state based; it forms peerings to other devices over these
links, hence its routes specify links/interfaces. In
FIB nomenclature this means an IGP produces routes that are
attached-nexthop, e.g.:

.. code-block:: console

  ip route add 1.1.1.1/32 via 10.0.0.1 GigEthernet0/0/0

BGP on the other hand forms peerings only to neighbours; it does not
know, nor care, what interface is used to reach the peer. In FIB
nomenclature therefore BGP produces recursive routes, e.g.:

.. code-block:: console

  ip route 8.0.0.0/16 via 1.1.1.1

where 1.1.1.1 is the BGP peer. It's no accident in this example that
1.1.1.1/32 happens to be the route the IGP advertised... BGP installs
routes for prefixes reachable via other BGP peers, and the IGP installs
the routes to those BGP peers.
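
Putting the two together, the minimal configuration that models this
layering is just the two routes shown above, repeated here for
clarity:

.. code-block:: console

  ip route add 1.1.1.1/32 via 10.0.0.1 GigEthernet0/0/0
  ip route add 8.0.0.0/16 via 1.1.1.1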

This has been a very long-winded way of describing why the scale of
recursive routes is two orders of magnitude greater than that of
non-recursive/attached-nexthop routes.

If we step back for a moment and recall why we've crawled down this
rabbit hole: we're trying to determine what 'a few' updates means.
Does it include all those recursive routes? Probably not... let's
keep crawling.

We started this chapter with an abstract description of convergence;
let's now make that more real. In the event of a network failure an SP
is interested in moving to an alternate forwarding path as quickly as
possible. If there is no alternate path, and a converged FIB will drop
the packet, then who cares how fast it converges. In other words, the
interesting convergence scenarios are the scenarios where the network has
alternate paths.

PIC Core
^^^^^^^^

First let's consider alternate paths in the IGP, e.g.:

.. code-block:: console

  ip route add 1.1.1.1/32 via 10.0.0.2 GigEthernet0/0/0
  ip route add 1.1.1.1/32 via 10.0.1.2 GigEthernet0/0/1

this gives us in the FIB:

.. code-block:: console

  DBGvpp# sh ip fib 1.1.1.1/32
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ]
  1.1.1.1/32 fib:0 index:15 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[23] locks:2 flags:shared, uPRF-list:22 len:2 itfs:[1, 2, ]
        path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
          10.0.0.2 GigEthernet0/0/0
        [@0]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
        path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
          10.0.1.2 GigEthernet0/0/1
        [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
        [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
        [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

There is ECMP across the two paths. Note that the instance/index of the
load-balance present in the forwarding graph is 17.

Let's add a BGP route via this peer:

.. code-block:: console

  ip route add 8.0.0.0/16 via 1.1.1.1

in the FIB we see:

.. code-block:: console

  DBGvpp# sh ip fib 8.0.0.0/16
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
  8.0.0.0/16 fib:0 index:18 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[24] locks:2 flags:shared, uPRF-list:21 len:2 itfs:[1, 2, ]
        path:[29] pl-index:24 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:21 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

The load-balance object used by this route is index 20, but note that
the next load-balance in the chain is index 17, i.e. it is exactly
the same instance that appears in the forwarding chain for the IGP
route. So in the forwarding plane the packet first encounters
load-balance object 20 (which it will use in ip4-lookup) and then
number 17 (in ip4-load-balance).

What's the significance? Let's shut down one of those IGP paths:

.. code-block:: console

  DBGvpp# set in state GigEthernet0/0/0 down

the resulting update to the IGP route is:

.. code-block:: console

  DBGvpp# sh ip fib 1.1.1.1/32
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
  1.1.1.1/32 fib:0 index:15 locks:4
    API refs:1 src-flags:added,contributing,active,
      path-list:[23] locks:2 flags:shared, uPRF-list:25 len:2 itfs:[1, 2, ]
        path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop:
          10.0.0.2 GigEthernet0/0/0
        [@0]: arp-ipv4: via 10.0.0.2 GigEthernet0/0/0
        path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
          10.0.1.2 GigEthernet0/0/1
        [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

    recursive-resolution refs:1 src-flags:added, cover:-1

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
        [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

notice that the path via 10.0.0.2 is no longer flagged as resolved,
and the forwarding chain does not contain this path as a
choice. However, the key thing to note is that the load-balance
instance is still index 17, i.e. it has been modified, not
exchanged. In the FIB vernacular we say it has been 'in-place
modified', a somewhat linguistically redundant expression, but one that serves
to emphasise that it was changed whilst still being part of the graph; it
was never at any point removed from the graph and re-added, and it was
modified without the worker barrier lock held.

Still don't see the significance? In order to converge around the
failure of the IGP link it was not necessary to update load-balance
object number 20! It was not necessary to update the recursive
route. I.e. convergence is achieved without updating any recursive
routes; it is only necessary to update the affected IGP routes. This is
the definition of 'a few'. We call this 'prefix independent
convergence' (PIC), which should really be called 'recursive prefix
independent convergence', but it isn't...

How was the trick done? As with all problems in computer science, it
was solved by a layer of misdirection, I mean indirection. The
indirection is the load-balance that belongs to the IGP route. By
keeping this object in the forwarding graph and updating it in place,
we get PIC. The alternative design would be to collapse the two layers of
load-balancing into one, which would improve forwarding performance
but would come at the cost of prefix dependent convergence. No doubt
there are situations where the VPP deployment would favour forwarding
performance over convergence; you know the drill, contributions welcome.

This failure scenario is known as PIC core, since it's one of the IGP's
core links that has failed.

iBGP PIC Edge
^^^^^^^^^^^^^

Next, let's consider alternate paths in BGP, e.g.:

.. code-block:: console

  ip route add 8.0.0.0/16 via 1.1.1.1
  ip route add 8.0.0.0/16 via 1.1.1.2

the 8.0.0.0/16 prefix is reachable via two BGP next-hops (two PEs).

Our FIB now also contains:

.. code-block:: console

  DBGvpp# sh ip fib 8.0.0.0/16
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
  8.0.0.0/16 fib:0 index:18 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:2 flags:shared, uPRF-list:11 len:2 itfs:[1, 2, ]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:2 uRPF:11 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
        [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:13 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

The first load-balance (LB) in the forwarding graph is index 20 (the astute
reader will note this is the same index as in the previous
section; I am adding paths to the same route, and the load-balance is
in-place modified again). Each choice in LB 20 is another LB
contributed by the IGP route through which the route's paths recurse.

So what's the equivalent in BGP to a link down in the IGP? An IGP link
down means it loses its peering out of that link, so the equivalent in
BGP is the loss of the peering and thus the loss of reachability to
the peer. This is signaled by the IGP withdrawing the route to the
peer. But "Wait wait wait", I hear you say ... "just because the IGP
withdraws 1.1.1.1/32 doesn't mean I can't reach 1.1.1.1, perhaps there
is a less specific route that gives reachability to 1.1.1.1". Indeed
there may be. So a little more on BGP network design. I know it's like
a bad detective novel where the author drip feeds you the plot... When
describing iBGP peerings one 'always' describes the peer using one of
its loopback addresses. Why? A loopback interface
never goes down (unless you admin down it yourself), and some muppet can't
accidentally cut through the loopback cable whilst digging up the
street. And what subnet mask length does a prefix have on a loopback
interface? It's 'always' a /32. Why? Because there's no cable to connect
any other devices. This choice justifies there 'always' being a /32
route for the BGP peer. But what prevents there being a less
specific? Nothing.
Now clearly if the BGP peer crashes then the /32 for its loopback is
going to be removed from the IGP, but what will withdraw the less
specific? Nothing.

So in order to make use of this trick of relying on the withdrawal of
the /32 for the peer to signal that the peer is down, and thus the
signal to converge the FIB, we need to force FIB to recurse only via
the /32 and not via a less specific. This is called a 'recursion
constraint'. In this case the constraint is 'recurse via host',
i.e. for IPv4 use a /32.
So we need to update our route additions from before:

.. code-block:: console

  ip route add 8.0.0.0/16 via 1.1.1.1 resolve-via-host
  ip route add 8.0.0.0/16 via 1.1.1.2 resolve-via-host

checking the FIB output is left as an exercise for the reader. I hope
you're doing these configs as you read. There's little change in the
output; you'll see some extra flags on the paths.
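
For reference, an excerpt (the rest of the output is as before): the
path lines now carry the resolve-host cfg-flag, as we'll also see in
the full outputs later in this chapter:

.. code-block:: console

  path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
    via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]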

Now let's add the less specific, just for fun:

.. code-block:: console

  ip route add 1.1.1.0/28 via 10.0.0.2 GigEthernet0/0/0

nothing changes in the resolution of 8.0.0.0/16.

Now withdraw the route to 1.1.1.2/32:

.. code-block:: console

  ip route del 1.1.1.2/32 via 10.0.0.2 GigEthernet0/0/0

In the FIB we see:

.. code-block:: console

  DBGvpp# sh ip fib 8.0.0.0/32
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
  8.0.0.0/16 fib:0 index:18 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:2 flags:shared, uPRF-list:13 len:2 itfs:[1, 2, ]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:13 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

the path via 1.1.1.2 is unresolved, because the recursion constraints
are preventing the path resolving via 1.1.1.0/28. The LB index 20
has been updated to remove the unresolved path.

Job done? Not quite! Why not?

Let's re-examine the goals of this chapter. We wanted to update 'a
few' objects, which we have defined as not all the millions of
recursive routes. Did we do that here? We sure did, when we
modified LB index 20. So WTF?? Where's the indirection object that can
be modified so that the LBs for the recursive routes are not
modified? It's not there.... WTF?

OK, so the great detective has assembled all the suspects in the
drawing room and only now does he drop the bomb: the FIB knows the
scale. We talked above about what the scale **can** be, worst case
scenario, but that's not necessarily what it is in this hypothetical
(your) deployment. It knows how many recursive routes there are that
depend on a /32; it can thus make its own determination of the
definition of 'a few'. In other words, if there are only 'a few'
recursive prefixes that depend on a /32 then it will update them
synchronously (and we'll discuss what synchronously means a bit more later).

So what does FIB consider to be 'a few'? Let's add more routes and
find out.

.. code-block:: console

  DBGvpp# ip route add 8.1.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host
  ...
  DBGvpp# ip route add 8.63.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host

and we see:

.. code-block:: console

  DBGvpp# sh ip fib 8.8.0.0
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
  8.8.0.0/16 fib:0 index:77 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:2 uRPF:28 flags:[uses-map] to:[0:0]]
        load-balance-map: index:0 buckets:2
           index:    0    1
             map:    0    1
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
        [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:18 to:[0:0]]
          [0] [@3]: arp-ipv4: via 10.0.1.2 GigEthernet0/0/0

Two elements to note here: the path-list has the 'popular' flag and
there is a load-balance map in the forwarding path.

'popular' in this case means that the path-list has passed the limit
of 'a few' in the number of children it has.

here are the children:

.. code-block:: console

  DBGvpp# sh fib path-list 15
  path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
    path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
      via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
    path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
      via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
  children:{entry:18}{entry:21}{entry:22}{entry:23}{entry:25}{entry:26}{entry:27}{entry:28}{entry:29}{entry:30}{entry:31}{entry:32}{entry:33}{entry:34}{entry:35}{entry:36}{entry:37}{entry:38}{entry:39}{entry:40}{entry:41}{entry:42}{entry:43}{entry:44}{entry:45}{entry:46}{entry:47}{entry:48}{entry:49}{entry:50}{entry:51}{entry:52}{entry:53}{entry:54}{entry:55}{entry:56}{entry:57}{entry:58}{entry:59}{entry:60}{entry:61}{entry:62}{entry:63}{entry:64}{entry:65}{entry:66}{entry:67}{entry:68}{entry:69}{entry:70}{entry:71}{entry:72}{entry:73}{entry:74}{entry:75}{entry:76}{entry:77}{entry:78}{entry:79}{entry:80}{entry:81}{entry:82}{entry:83}{entry:84}

64 children makes it popular. The number is fixed (there is no API to
change it). The choice of 64 is an attempt to balance the forwarding
performance cost of the indirection against the convergence
gain.

Popular path-lists contribute the load-balance map; this is the
missing indirection object. Its indirection happens when choosing the
bucket in the LB. The packet's flow-hash is taken 'mod number of
buckets' to give the 'candidate bucket', then the map translates this
candidate index into the actual bucket. You can see in the example above
that no change occurs, i.e. if the flow-hash mod n chooses bucket 1
then it gets bucket 1.
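
As a worked example (the hash value here is made up): a packet whose
flow-hash is 13 yields candidate bucket 13 mod 2 = 1; with the
identity map above, map[1] = 1, so bucket 1, and hence the LB
contributed by the route via 1.1.1.2, is used. If 1.1.1.2 becomes
unreachable, only the shared map need be rewritten to map[1] = 0, and
the same flow is steered to bucket 0 without touching the
load-balance of any recursive route.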

Why is this useful? The path-list is shared (you can convince
yourself of this if you look at each of the 8.x.0.0/16 routes we
added) and all of these routes use the same load-balance map; therefore, to
converge all the recursive routes, we need only change the map and
we're good; we again get PIC.

OK, who's still awake... if you're thinking there's more to this story,
you're right. Keep reading.

This failure scenario is called iBGP PIC edge. It's 'edge' because it
refers to the loss of an edge device, and iBGP because the device was
an iBGP peer (we learn iBGP peers in the IGP). There is a similar eBGP
PIC edge scenario, but this is left as an exercise for the reader (hint:
there are other recursion constraints - see the RFC).
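
As a pointer for that exercise: the route CLI also accepts a
'resolve-via-attached' constraint, which forces a path to resolve
only via an attached prefix; this is the natural building block for
the eBGP flavour, where the primary path is via an eBGP peer on a
connected subnet. A sketch, with illustrative addresses:

.. code-block:: console

  ip route add 8.0.0.0/16 via 10.0.0.1 resolve-via-attached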

Which Objects
^^^^^^^^^^^^^

The next topic on our list of how to converge quickly was to
efficiently find the objects that need to be updated when a convergence
event happens. If you haven't realised by now that the FIB is an
object graph, then can I politely suggest you go back and start from
the beginning ...

Finding the objects affected by a change is simply a matter of walking
from the parent (the object affected) to its children. These
dependencies are maintained precisely for this reason.

So is fast convergence just a matter of walking the graph? Yes and
no. The question to ask yourself is this: "in the case of iBGP PIC edge,
when the /32 is withdrawn, what is the list of objects that need to be
updated, and particularly in what order should they be updated
to obtain the best convergence time?" Think breadth v. depth first.

... ponder for a while ...

For iBGP PIC edge we said it's the path-list that provides the
indirection through the load-balance map. Hence once all path-lists
are updated we are converged; thereafter, at our leisure, we can
update the child recursive prefixes. Is that breadth or depth first?

It's breadth first.

Breadth first walks are achieved by spawning an async walk of the
branch of the graph that we don't want to traverse. Withdrawing the /32
triggers a synchronous walk of the children of the /32 route; we want
a synchronous walk because we want to converge ASAP. This synchronous
walk will encounter path-lists in the /32 route's child dependent list.
These path-lists (and their LB maps) will be updated. If a path-list is
popular, then it will spawn an async walk of the path-list's child
dependent routes; if not, it will walk those routes itself. So the walk
effectively proceeds breadth first across the path-lists, then returns
to the start to do the affected routes.
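
If you want to watch this happen, the queued asynchronous walks can be
inspected from the CLI; I'm assuming here the 'show fib walk' debug
command from the FIB walk module, and catching a walk in flight on an
otherwise idle system takes some timing luck:

.. code-block:: console

  DBGvpp# show fib walk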

Now the story is complete. The murderer is revealed.

Let's withdraw one of the IGP routes:

.. code-block:: console

  DBGvpp# ip route del 1.1.1.2/32 via 10.0.1.2 GigEthernet0/0/1

  DBGvpp# sh ip fib 8.8.0.0
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
  8.8.0.0/16 fib:0 index:77 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:128 flags:shared,popular, uPRF-list:18 len:2 itfs:[1, 2, ]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:1 uRPF:18 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

the LB map has gone, since the prefix now only has one path. You'll
need to be a CLI ninja if you want to catch the output showing the LB
map in its transient state of:

.. code-block:: console

  load-balance-map: index:0 buckets:2
     index:    0    1
       map:    0    0

but it happens. Trust me. I've got tests and everything.

On the final topic of how to converge quickly, 'make each update
fast', there are no tricks.