docs: better docs, mv doxygen to sphinx
This patch refactors the VPP sphinx docs in order to make it easier to consume for external readers as well as VPP developers. It also makes sphinx the single source of documentation, which simplifies maintenance and operation. Most important updates are: - reformat the existing documentation as rst - split RELEASE.md and move it into separate rst files - remove section 'events' - remove section 'archive' - remove section 'related projects' - remove section 'feature by release' - remove section 'Various links' - make (Configuration reference, CLI docs, developer docs) top level items in the list - move 'Use Cases' as part of 'About VPP' - move 'Troubleshooting' as part of 'Getting Started' - move test framework docs into 'Developer Documentation' - add a 'Contributing' section for gerrit, docs and other contributer related infos - deprecate doxygen and test-docs targets - redirect the "make doxygen" target to "make docs" Type: refactor Change-Id: I552a5645d5b7964d547f99b1336e2ac24e7c209f Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com> Signed-off-by: Andrew Yourtchenko <ayourtch@gmail.com>
This commit is contained in:

committed by
Dave Wallace

parent
f47122e07e
commit
9ad39c026c
50
docs/developer/corefeatures/fib/attachedexport.rst
Normal file
50
docs/developer/corefeatures/fib/attachedexport.rst
Normal file
@ -0,0 +1,50 @@
|
||||
.. _attachedexport:
|
||||
|
||||
Attached Export
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
Extranets make prefixes in table A also reachable from table B. Table A is the export table,
|
||||
B the import. Consider this route in the export table;
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
# ip route add table 2 1.1.1.0/24 via 10.10.10.0 GigabitEthernet0/8/0
|
||||
|
||||
there are two ways one might consider representing this route in the import VRF:
|
||||
|
||||
#. ip route add table 3 1.1.1.0/24 via 10.10.10.0 GigabitEthernet0/8/0
|
||||
#. ip route add table 3 1.1.1.0/24 via lookup-in-table 2
|
||||
|
||||
where option 2) is an example of a de-aggregate route where a second lookup is
|
||||
performed in table 2, the export VRF. Option 2) is clearly less efficient, since
|
||||
the cost of the second lookup is high. Option 1) is therefore preferred. However,
|
||||
connected and attached prefixes, and specifically the adj-fibs that they cover,
|
||||
require special attention. The control plane is aware of the connected and
|
||||
attached prefixes that are required to be exported, but it is unaware of the
|
||||
adj-fibs. It is therefore the responsibility of FIB to ensure that whenever an
|
||||
attached prefix is exported, so are the adj-fibs and local prefixes that it
|
||||
covers, and only the adj-fibs and locals, not any covered more specific
|
||||
(sourced e.g. by API). The imported FIB entries are sourced as *attached-export*
|
||||
this is a low priority source, so if those prefixes already exist in the import
|
||||
table, sourced by the API, then they will continue to forward with that information.
|
||||
|
||||
.. figure:: /_images/fib20fig6.png
|
||||
|
||||
Figure 6: Attached Export Class diagram.
|
||||
|
||||
Figure 6 shows the data structures used to perform attached export.
|
||||
|
||||
- *fib_import_t*. A representation of the need to import covered prefixes. An instance is associated with the FIB entry in the import VRF. The need to import prefixes is recognised when an attached route is added to a table that is different to the table of the interface to which it t is attached. The creation of a *fib_import_t* will trigger the creation of a *fib_export_t*.
|
||||
- *fib_export_t*. A representation of the need to export prefixes. An instance is associated with the attached entry in the export VRF. A *fib_export_t* can have many associated *fib_import_t* objects representing multiple VRFs into which the prefix is exported.
|
||||
|
||||
.. figure:: /_images/fib20fig6.png
|
||||
|
||||
Figure 7: Attached Export object diagram
|
||||
|
||||
Figure 7 shows an object instance diagram for the export of a connected from table
|
||||
1 to two other tables. The /32 adj-fib and local prefix in the export VRF are
|
||||
exported into the import VRFs, where they are sourced as *attached-export* and
|
||||
inherit the forwarding information from the exported entry. The attached prefix
|
||||
in the import VRF also performs cover tracking with the connected prefix in the
|
||||
export VRF so that it can react to updates to that prefix that will require the
|
||||
removal the imported covered prefixes.
|
78
docs/developer/corefeatures/fib/barnacles.rst
Normal file
78
docs/developer/corefeatures/fib/barnacles.rst
Normal file
@ -0,0 +1,78 @@
|
||||
.. _barnacles:
|
||||
|
||||
Barnacles
|
||||
---------
|
||||
|
||||
Features that are stuck on the side of the FIB. Those that directly use
|
||||
the services that the FIB provides.
|
||||
|
||||
In the section on FIB fundamentals it was mentioned that there is a
|
||||
separation between what to match and how to forward. In an IP FIB what
|
||||
to match is the packet's destination address against a table of IP
|
||||
prefixes, and how to forward is described by a list of paths (the
|
||||
**fib_path_list_t**).
|
||||
|
||||
ACL Based Forwarding
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
ACL Based Forwarding (ABF) is also know as policy based routing
|
||||
(PBR). In ABF what to match is described by an ACL.
|
||||
|
||||
ABF uses two VPP services; ACL as a service, as provided by the ACL
|
||||
plugin and FIB path-lists. It just glues them together.
|
||||
|
||||
An ABF policy is the combination of an ACL with the forwarding
|
||||
description of a FIB path-list. An ABF attachment is the association
|
||||
of [an ordered set of] ABF policies to an interface. The attachment is
|
||||
consulted on the ingress path of the IP DP (as an input
|
||||
feature). If the ACL matches then the associated forwarding is
|
||||
followed, if not, the packet continues along the DP. Simple.
|
||||
|
||||
Layer 3 Cross Connect
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
An L3 cross-connect (L3XC) matches all packets
|
||||
that ingress the interface and then forwards using the supplied FIB
|
||||
path-list. Naturally it runs as an input feature in the IP
|
||||
path. Super simple.
|
||||
|
||||
IP Punt
|
||||
^^^^^^^
|
||||
|
||||
Matches all IP packets that VPP has punted. Why they are punted is not
|
||||
relevant. All IP punted packets are sent by VPP to the punt feature
|
||||
arc. This feature 'matches' all packets that it receives and forwards
|
||||
using the FIB path-list.
|
||||
|
||||
|
||||
Unicast Reverse Path Forwarding
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Unicast Reverse Path Forwarding (uRPF) is the process of ensuring that
|
||||
a packet has a conforming source address. It comes in two
|
||||
flavours:
|
||||
|
||||
- loose: The source address must be reachable, i.e. FIB must have a
|
||||
route that will forward to the source address. The default route
|
||||
counts as long as it does not drop.
|
||||
- strict: The source address is reachable via the interface on which
|
||||
the packet arrived, i.e. the FIB's route for the source address must
|
||||
include the input interface as an output interface.
|
||||
|
||||
The uRPF feature can run on either the input or output IP feature
|
||||
arc. In both cases it serves as an anti-spoofing check, though the
|
||||
semantics are slightly different. On the input arc it enforces that
|
||||
peers on that link are only using source addresses that they should -
|
||||
a network admin should employ at the access edge. On the output
|
||||
arc it enforces that a packet is sourced from a prefix that belongs to
|
||||
the network, i.e. that is has originated from within an SP's
|
||||
network, a network admin could use at its peering points.
|
||||
|
||||
To perform a uRPF check, the DP performs an IP FIB lookup on the
|
||||
source address, this always results in a load-balance (LB) object. If
|
||||
the LB has only 1 bucket and that bucket stacks on a drop DPO, then
|
||||
both a loose and strict check will fail, otherwise a loose check
|
||||
will pass. Each LB object has an associated uRPF list object. This
|
||||
object holds the list of interfaces through which the prefix is
|
||||
reachable. To pass the strict check, the input/output interface must
|
||||
be in this list.
|
23
docs/developer/corefeatures/fib/controlplane.rst
Normal file
23
docs/developer/corefeatures/fib/controlplane.rst
Normal file
@ -0,0 +1,23 @@
|
||||
.. _controlplane:
|
||||
|
||||
The Control Plane
|
||||
-----------------
|
||||
|
||||
The control plane follows a layered data representation. This document describes the
|
||||
model starting from the lowest layer. The description uses IPv4 addresses and
|
||||
protocols, but all concepts apply equally to the IPv6 equivalents. The diagrams
|
||||
all portray the CLI command to install the information in VPP and an
|
||||
[approximation of] a UML diagram [#f1]_ of the data structures used to represent that
|
||||
information.
|
||||
|
||||
.. toctree::
|
||||
|
||||
neighbors
|
||||
routes
|
||||
attachedexport
|
||||
graphwalks
|
||||
marknsweep
|
||||
|
||||
.. rubric:: Footnotes:
|
||||
|
||||
.. [#f1] The arrow indicates a ‘has-a’ relationship. The object attached to the arrow head ‘has-a’ instance of the other. The numbers next to the arrows indicate the multiplicity, i.e. object A has n to m instances of object B. The difference between a UML association and aggregation is not conveyed in any diagrams. To UML aficionados, I apologize. Word is not the best drawing tool.
|
100
docs/developer/corefeatures/fib/dataplane.rst
Normal file
100
docs/developer/corefeatures/fib/dataplane.rst
Normal file
@ -0,0 +1,100 @@
|
||||
.. _dataplane:
|
||||
|
||||
The Data Plane
|
||||
---------------
|
||||
|
||||
The data-plane data model is a directed, acyclic [#f16]_ graph of heterogeneous objects.
|
||||
A packet will forward walk the graph as it is switched. Each object describes
|
||||
the actions to perform on the packet. Each object type has an associated VLIB
|
||||
graph node. For a packet to forward walk the graph is therefore to move from one
|
||||
VLIB node to the next, with each performing the required actions. This is the
|
||||
heart of the VPP model.
|
||||
|
||||
The data-plane graph is composed of generic data-path objects (DPOs). A parent
|
||||
DPO is identified by the tuple:{type,index,next_node}. The *next_node* parameter
|
||||
is the index of the VLIB node to which the packets should be sent next, this is
|
||||
present to maximise performance - it is important to ensure that the parent does
|
||||
not need to be read [#f17]_ whilst processing the child. Specialisations [#f18]_ of the DPO
|
||||
perform distinct actions. The most common DPOs and briefly what they represent are:
|
||||
|
||||
- Load-balance: a choice in an ECMP set.
|
||||
- Adjacency: apply a rewrite and forward through an interface
|
||||
- MPLS-label: impose an MPLS label.
|
||||
- Lookup: perform another lookup in a different table.
|
||||
|
||||
The data-plane graph is derived from the control-plane graph by the objects
|
||||
therein 'contributing' a DPO to the data-plane graph. Objects in the data-plane
|
||||
contain only the information needed to switch a packet, they are therefore
|
||||
simpler, and in memory terms smaller, with the aim to fit one DPO on a single
|
||||
cache-line. The derivation from the control plane means that the data-plane
|
||||
graph contains only object whose current state can forward packets. For example,
|
||||
the difference between a *fib_path_list_t* and a *load_balance_t* is that the former
|
||||
expresses the control-plane's desired state, the latter the data-plane available
|
||||
state. If some paths in the path-list are unresolved or down, then the
|
||||
load-balance will not include them in the forwarding choice.
|
||||
|
||||
.. figure:: /_images/fib20fig8.png
|
||||
|
||||
Figure 8: DPO contributions for a non-recursive route
|
||||
|
||||
Figure 8 shows a simplified view of the control-plane graph indicating those
|
||||
objects that contribute DPOs. Also shown are the VLIB node graphs at which the DPO is used.
|
||||
|
||||
Each *fib_entry_t* contributes it own *load_balance_t*, for three reasons;
|
||||
|
||||
- The result of a lookup in a IPv[46] table is a single 32 bit unsigned integer. This is an index into a memory pool. Consequently the object type must be the same for each result. Some routes will need a load-balance and some will not, but to insert another object in the graph to represent this choice is a waste of cycles, so the load-balance object is always the result. If the route does not have ECMP, then the load-balance has only one choice.
|
||||
|
||||
- In order to collect per-route counters, the lookup result must in some way uniquely identify the *fib_entry_t*. A shared load-balance (contributed by the path-list) would not allow this.
|
||||
- In the case the *fib_entry_t* has MPLS out labels, and hence a *fib_path_ext_t*, then the load-balance must be per-prefix, since the MPLS labels that are its parents are themselves per-fib_entry_t.
|
||||
|
||||
.. figure:: /_images/fib20fig9.png
|
||||
|
||||
Figure 9: DPO contribution for a recursive route.
|
||||
|
||||
Figure 9 shows the load-balance objects contributed for a recursive route.
|
||||
|
||||
.. figure:: /_images/fib20fig10.png
|
||||
|
||||
Figure 10: DPO Contributions from labelled recursive routes.
|
||||
|
||||
Figure 10 shows the derived data-plane graph for a labelled recursive route.
|
||||
There can be as many MPLS-label DPO instances as there are routes multiplied by
|
||||
the number of paths per-route. For this reason the mpls-label DPO should be as
|
||||
small as possible [#f19]_.
|
||||
|
||||
The data-plane graph is constructed by 'stacking' one
|
||||
instance of a DPO on another to form the child-parent relationship. When this
|
||||
stacking occurs, the necessary VLIB graph arcs are automatically constructed
|
||||
from the respected DPO type's registered graph nodes.
|
||||
|
||||
The diagrams above show that for any given route the full data-plane graph is
|
||||
known before any packet arrives. If that graph is composed of n objects, then the
|
||||
packet will visit n nodes and thus incur a forwarding cost of approximately n
|
||||
times the graph node cost. This could be reduced if the graph were *collapsed*
|
||||
into fewer DPOs and nodes. There are two ways we might consider doing
|
||||
this:
|
||||
|
||||
- write custom DPOs/nodes for combined functions, e.g. pop MPLS label
|
||||
and lookup in v4 table. This has the disadvantage that the number of
|
||||
such nodes would be, well, combinatorial, and resolving a path via
|
||||
a combined DPO would be more difficult as it would involve a
|
||||
forward walk of the graph to determine what the combination
|
||||
is. However, VPP power users might consider this option for a
|
||||
limited set of their use cases where performance is truly king.
|
||||
- collapse multiple levels of load-balancing into one. For example,
|
||||
if there were two levels of load-balancing each with two choices,
|
||||
this could equally be represented by one level with 4 choices.
|
||||
|
||||
In either case a disadvantage to collapsing the graph is that it
|
||||
removes the indirection objects that provide fast convergence (see
|
||||
section Fast Convergence). To collapse is then a trade-off between
|
||||
faster forwarding and fast convergence; VPP favours the latter.
|
||||
|
||||
|
||||
.. rubric:: Footnotes:
|
||||
|
||||
.. [#f16] Directed implies it cannot be back-walked. It is acyclic even in the presence of a recursion loop.
|
||||
.. [#f17] Loaded into cache, and hence potentially incurring a d-cache miss.
|
||||
.. [#f18] The engaged reader is directed to vnet/vnet/dpo/*
|
||||
.. [#f19] i.e. we should not re-use the adjacency structure.
|
||||
|
106
docs/developer/corefeatures/fib/debugging.rst
Normal file
106
docs/developer/corefeatures/fib/debugging.rst
Normal file
@ -0,0 +1,106 @@
|
||||
.. _debugging:
|
||||
|
||||
Debugging
|
||||
---------
|
||||
|
||||
the anatomy of a route:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
BGvpp# sh ip fib 1.1.1.3/32
|
||||
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
|
||||
1.1.1.0/24 fib:0 index:9 locks:2
|
||||
CLI refs:1 src-flags:added,contributing,active,
|
||||
path-list:[24] locks:4 flags:shared, uPRF-list:11 len:1 itfs:[1, ]
|
||||
path:[26] pl-index:24 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
|
||||
10.0.0.1 loop0
|
||||
[@0]: arp-ipv4: via 10.0.0.1 loop0
|
||||
|
||||
forwarding: unicast-ip4-chain
|
||||
[@0]: dpo-load-balance: [proto:ip4 index:11 buckets:1 uRPF:11 to:[0:0]]
|
||||
[0] [@3]: arp-ipv4: via 10.0.0.1 loop0
|
||||
|
||||
let's go line by line.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
|
||||
|
||||
Each field in turn:
|
||||
|
||||
- ipv4-VRF:0: the name of the table (as given by the user, or
|
||||
automatically generated by VPP).
|
||||
- fib-index:0; in the VPP pool of FIB objects, this is index 0
|
||||
- flow hash:[src dst sport dport proto ]: When calculating the flow
|
||||
hash to use for load-balancing, these are the fields in the packet
|
||||
that are used. There is an API to change this per-table.
|
||||
- epoch:0; Used during mark-n-sweep.
|
||||
- flags:none; use the force, to find the per-table flags.
|
||||
- locks: per-source reference counting, a table can only be deleted
|
||||
when all sources no longer reference it.
|
||||
|
||||
next line:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
1.1.1.0/24 fib:0 index:9 locks:2
|
||||
|
||||
this shows the route that matched the show request. note that it is not
|
||||
an exact match, it's an LPM. The route is in FIB index 0, its index
|
||||
(in the VPP pool of fib_entry_t objects) is nine and there are two
|
||||
references to the entry.
|
||||
You'll get the same output if you type "sh fib entry 9"
|
||||
|
||||
next line:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
CLI refs:1 src-flags:added,contributing,active,
|
||||
|
||||
the 'CLI' has sourced this route (it was added via CLI). This source
|
||||
has been added (well duh) it is 'active', meaning it is the best
|
||||
source, and it is contributing a forwarding object. There are some
|
||||
scenarios where sources other than the active source contribute,
|
||||
namely interpose sources.
|
||||
|
||||
next line:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
path-list:[24] locks:4 flags:shared, uPRF-list:11 len:1 itfs:[1, ]
|
||||
|
||||
This is path-list inex 24 (see "sh fib path-list 24" this will also
|
||||
show the children), it is 'shared',
|
||||
meaning that if other prefixes were to use the same set of paths,
|
||||
then they would also use this path-list object. It has uRPF list 11 of
|
||||
length 1 containing interface index 1 (which is loop0, see "sh int").
|
||||
|
||||
next line:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
path:[26] pl-index:24 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
|
||||
10.0.0.1 loop0
|
||||
[@0]: arp-ipv4: via 10.0.0.1 loop0
|
||||
|
||||
This is path 26 (see "sh fib path 26"). It's a member of
|
||||
path-list 24. It's ip4 has a weight of 1 and a preference of 0. It's
|
||||
of type 'attached-nexthop' and currently resolved - woohoo.
|
||||
It is a path 'via 10.0.0.1 loop0'. It is contributing an incomplete adjacency.
|
||||
|
||||
next line:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
forwarding: unicast-ip4-chain
|
||||
[@0]: dpo-load-balance: [proto:ip4 index:11 buckets:1 uRPF:11 to:[0:0]]
|
||||
[0] [@3]: arp-ipv4: via 10.0.0.1 loop0
|
||||
|
||||
This section describes how packets of type 'unicast-ip4' will be
|
||||
forwarded. It is the result of processing the path information from
|
||||
above.
|
||||
Here we see load-balance object 11, which has 1 bucket/choice. It is
|
||||
also linked to uRPF instance 11 (which it got from path-list 24).
|
||||
In bucket 0 there is the incomplete adjacency that was contributed by
|
||||
path 26.
|
||||
|
576
docs/developer/corefeatures/fib/fastconvergence.rst
Normal file
576
docs/developer/corefeatures/fib/fastconvergence.rst
Normal file
File diff suppressed because it is too large
Load Diff
34
docs/developer/corefeatures/fib/graphs.rst
Normal file
34
docs/developer/corefeatures/fib/graphs.rst
Normal file
@ -0,0 +1,34 @@
|
||||
.. _graphs:
|
||||
|
||||
Graphs
|
||||
^^^^^^
|
||||
|
||||
The FIB is essentially a collection of related graphs. Terminology from graph theory
|
||||
is often used in the sections that follow. From Wikipedia:
|
||||
|
||||
*... a graph is a representation of a set of objects where some pairs of objects are
|
||||
connected by links. The interconnected objects are represented by mathematical
|
||||
abstractions called vertices (also called nodes or points), and the links that
|
||||
connect some pairs of vertices are called edges (also called arcs or lines) ...
|
||||
edges may be directed or undirected.*
|
||||
|
||||
In a directed graph the edges can only be traversed in one direction - from child to
|
||||
parent. The names are chosen to represent the many to one relationship. A child has
|
||||
one parent, but a parent many children. In undirected graphs the edge traversal
|
||||
can be in either direction, but in FIB the parent child nomenclature remains to
|
||||
represent the many to one relationship. Children of the same parent are termed
|
||||
siblings. When the traversal is from child to parent it is considered to be a
|
||||
forward traversal, or walk, and from parent to the many children a back walk.
|
||||
Forward walks are cheap since they start from the many and move toward the few.
|
||||
Back walks are expensive as the start from the few and visit the many.
|
||||
|
||||
The many to one relationship between child and parent means that the lifetime of a
|
||||
parent object must extend to the lifetime of its children. If the control plane
|
||||
removes a parent object before its children, then the parent must remain, in an
|
||||
**incomplete** state, until the children are themselves removed. Likewise if a child
|
||||
is created before its parent, the parent is created in an *incomplete* state. These
|
||||
incomplete objects are needed to maintain the graph dependencies. Without them when
|
||||
the parent is added finding the affected children would require a search through many
|
||||
databases for those children. To extend the lifetime of parents all children thereof
|
||||
hold a **lock** on the parent. This is a simple reference count. Children then follow
|
||||
the add-or-lock/unlock semantics for finding a parent, as opposed to a malloc/free.
|
80
docs/developer/corefeatures/fib/graphwalks.rst
Normal file
80
docs/developer/corefeatures/fib/graphwalks.rst
Normal file
@ -0,0 +1,80 @@
|
||||
.. _graphwalks:
|
||||
|
||||
Graph Walks
|
||||
^^^^^^^^^^^^
|
||||
|
||||
All FIB object types are allocated from a VPP memory pool [#f13]_. The objects are thus
|
||||
susceptible to memory re-allocation, therefore the use of a bare "C" pointer to refer
|
||||
to a child or parent is not possible. Instead there is the concept of a *fib_node_ptr_t*
|
||||
which is a tuple of type,index. The type indicates what type of object it is
|
||||
(and hence which pool to use) and the index is the index in that pool. This allows
|
||||
for the safe retrieval of any object type.
|
||||
|
||||
When a child resolves via a parent it does so knowing the type of that parent. The
|
||||
child to parent relationship is thus fully known to the child, and hence a forward
|
||||
walk of the graph (from child to parent) is trivial. However, a parent does not choose
|
||||
its children, it does not even choose the type. All object types that form part of the
|
||||
FIB control plane graph all inherit from a single base class; *fib_node_t*. A *fib_node_t*
|
||||
identifies the object's index and its associated virtual function table provides the
|
||||
parent a mechanism to visit that object during the walk. The reason for a back-walk
|
||||
is to inform all children that the state of the parent has changed in some way, and
|
||||
that the child may itself need to update.
|
||||
|
||||
To support the many to one, child to parent, relationship a parent must maintain a
|
||||
list of its children. The requirements of this list are;
|
||||
|
||||
- O(1) insertion and delete time. Several child-parent relationships are made/broken during route addition/deletion.
|
||||
- Ordering. High priority children are at the front, low priority at the back (see section Fast Convergence)
|
||||
- Insertion at arbitrary locations.
|
||||
|
||||
To realise these requirements the child-list is a doubly linked-list, where each element
|
||||
contains a *fib_node_ptr_t*. The VPP pool memory model applies to the list elements, so
|
||||
they are also identified by an index. When a child is added to a list it is returned the
|
||||
index of the element. Using this index the element can be removed in constant time.
|
||||
The list supports 'push-front' and 'push-back' semantics for ordering. To walk the children
|
||||
of a parent is then to iterate this list.
|
||||
|
||||
A back-walk of the graph is a depth first search where all children in all levels of the
|
||||
hierarchy are visited. Such walks can therefore encounter all object instances in the
|
||||
FIB control plane graph, numbering in the millions. A FIB control-plane graph is cyclic
|
||||
in the presence of a recursion loop, so the walk implementation has mechanisms to detect
|
||||
this and exit early.
|
||||
|
||||
A back-walk can be either synchronous or asynchronous. A synchronous walk will visit the
|
||||
entire section of the graph before control is returned to the caller, an asynchronous
|
||||
walk will queue the walk to a background process, to run at a later time, and immediately
|
||||
return to the caller. To implement asynchronous walks a *fib_walk_t* object it added to
|
||||
the front of the parent's child list. As children are visited the *fib_walk_t* object
|
||||
advances through the list. Since it is inserted in the list, when the walk suspends
|
||||
and resumes, it can continue at the correct location. It is also safe with respect to
|
||||
the deletion of children from the list. New children are added to the head of the list,
|
||||
and so will not encounter the walk, but since they are new, they already have the up to
|
||||
date state of the parent.
|
||||
|
||||
A VLIB process 'fib-walk' runs to perform the asynchronous walks. VLIB has no priority
|
||||
scheduling between respective processes, so the fib-walk process does work in small
|
||||
increments so it does not block the main route download process. Since the main download
|
||||
process effectively has priority numerous asynchronous back-walks can be started on the
|
||||
same parent instance before the fib-walk process can run. FIB is a 'final state' application.
|
||||
If a parent changes n times, it is not necessary for the children to also update n
|
||||
times, instead it is only necessary that this child updates to the latest, or final,
|
||||
state. Consequently when multiple walks on a parent (and hence potential updates to a
|
||||
child) are queued, these walks can be merged into a single walk. This
|
||||
is the main reason the walks are designed this way, to eliminate (as
|
||||
much as possible) redundant work and thus converge the system as fast
|
||||
as possible.
|
||||
|
||||
Choosing between a synchronous and an asynchronous walk is therefore a trade-off between
|
||||
time it takes to propagate a change in the parent to all of its children, versus the
|
||||
time it takes to act on a single route update. For example, if a route update were to
|
||||
affect millions of child recursive routes, then the rate at which such updates could be
|
||||
processed would be dependent on the number of child recursive route which would not be
|
||||
good. At the time of writing FIB2.0 uses synchronous walk in all locations except when
|
||||
walking the children of a path-list, and it has more than 32 [#f15]_ children. This avoids the
|
||||
case mentioned above.
|
||||
|
||||
.. rubric:: Footnotes:
|
||||
|
||||
.. [#f13] Fast memory allocation is crucial to fast route update times.
|
||||
.. [#f14] VPP may be written in C and not C++ but inheritance is still possible.
|
||||
.. [#f15] The value is arbitrary and yet to be tuned.
|
68
docs/developer/corefeatures/fib/hacking.rst
Normal file
68
docs/developer/corefeatures/fib/hacking.rst
Normal file
@ -0,0 +1,68 @@
|
||||
.. _hacking:
|
||||
|
||||
Get Hacking
|
||||
-----------
|
||||
|
||||
The code's directory structure is trivial, FIB, mFIB, adj have their
|
||||
own directories.
|
||||
|
||||
for the most part, for all the FIB object types mentioned in this
|
||||
documentation there is a corresponding .h and .c file. As with any VPP
|
||||
component/sub-system a 'public' header file is any file that can be
|
||||
included by another sub-system and/or plugin. These must be specified
|
||||
in the build-system, so go look there. Public header files are always
|
||||
a good entry point to start reading.
|
||||
|
||||
FIB
|
||||
^^^
|
||||
|
||||
There is no direct [VPP's binary] API access to FIB, but FIB does
|
||||
expose types that can be used on the API by FIB and by other
|
||||
subsystems (e.g. :ref:`barnacles`). These types are specified in
|
||||
fib.api and the encoding and decoding thereof in fib_api.[ch].
|
||||
|
||||
Most operations on a FIB entry happen as a result of an operation on a
|
||||
FIB table; an entry does not exist in isolation. The APIs in
|
||||
fib_table.h are well doxygen documented you should be able to figure
|
||||
out what they do. Use this as a starting point to explore how entries
|
||||
are created and deleted and how the source priority scheme works.
|
||||
|
||||
FIB sources are defined in fib_source.h. Each source behaviour has its
|
||||
own file fib_entry_src_*.c These define the virtual functions that
|
||||
determine how the source behaves when actions on the FIB occur. For
|
||||
example, what the entry must do when its covering prefix's forwarding
|
||||
is updated.
|
||||
|
||||
When creating new paths/path-lists the main action required is to
|
||||
resolve them; see fib_path*_resolve, and once resolved to have them
|
||||
contribute a DPO for forwarding or for the uRPF list; see
|
||||
fib_*_contribute_forwarding and fib_*_contribute_urpf respectively.
|
||||
|
||||
The data-structures that used for entry lookup are protocol
|
||||
specific, they are implemented in separate files; ip4_fib.[ch],
|
||||
ip6_fib.[ch] and mpls_fib.[ch].
|
||||
|
||||
FIB extranet support is implemented in fib_attached_export.[ch].
|
||||
FIB tracking is implemented in fib_entry_track.[ch].
|
||||
FIB [back]walk is implemented in fib_walk.[ch].
|
||||
|
||||
Adjacency
|
||||
^^^^^^^^^
|
||||
|
||||
Not much to say here, each adjacency type has it own file; use the
|
||||
force, read the source.
|
||||
|
||||
|
||||
Testing
|
||||
^^^^^^^
|
||||
|
||||
the majority of FIB coverage comes from the C Unit tests in
|
||||
fib_test.c. I strongly encourage you to add code here. It's a much
|
||||
easier development cycle to fire up GDB, run VPP and iterate with
|
||||
'test fib', than it is work in the python UT. You still need to write
|
||||
python UT, don't get me wrong, it's just easier to do the FIB dev
|
||||
using C UT.
|
||||
|
||||
|
||||
|
||||
Enjoy!
|
21
docs/developer/corefeatures/fib/index.rst
Normal file
21
docs/developer/corefeatures/fib/index.rst
Normal file
@ -0,0 +1,21 @@
|
||||
.. _fib20:
|
||||
|
||||
The FIB
|
||||
===========================================
|
||||
|
||||
This describe the FIB (Forwarding information base) implementation :
|
||||
Hierarchical, Protocol, Independent
|
||||
|
||||
.. toctree::
|
||||
|
||||
prerequisites
|
||||
thedatamodel
|
||||
tunnels
|
||||
mplsfib
|
||||
multicast
|
||||
debugging
|
||||
fastconvergence
|
||||
scale
|
||||
barnacles
|
||||
hacking
|
||||
missing
|
68
docs/developer/corefeatures/fib/marknsweep.rst
Normal file
68
docs/developer/corefeatures/fib/marknsweep.rst
Normal file
@ -0,0 +1,68 @@
|
||||
.. _marknsweep:
|
||||
|
||||
Mark and Sweep
|
||||
--------------
|
||||
|
||||
The mark and sweep procedures, in FIB and in other subsystems, are
|
||||
built for the purpose of recovering from a control plane crash.
|
||||
|
||||
In routing if the control plane (CP) crashes, when it restarts, the network
|
||||
topology may have changed. This means that some of the routes that
|
||||
were programmed in the FIB may no longer be needed, and perhaps some
|
||||
new ones are. If the CP were simply to insert all the new routes it
|
||||
learned after it restarts, then FIB could be left with old routes that
|
||||
never get removed, this would be bigly bad.
|
||||
|
||||
At a high level the requirement is to delete routes from the old set
|
||||
that are not present in the new set; 'delete the diff' as it might
|
||||
be colloquially known.
|
||||
|
||||
How should the control plane determine the old set? It could
|
||||
conceivably read back the FIB from VPP. But this presents two
|
||||
problems, firstly, it could be a large set of routes, numbering in the
|
||||
millions, this is not an efficient mechanism and not one one wants to
|
||||
perform at a point when the router is trying to converge
|
||||
ASAP. Secondly it represents a 'source of truth' inversion. The
|
||||
routing plane is the source of truth, not forwarding. Routing should
|
||||
not receive its 'input' from the layers below. Thirdly, on a practical
|
||||
note, the reading of VPP data structures to glean this sort of
|
||||
accurate information, would only happen in this scenario, i.e. it's
|
||||
not well tested and therefore not particularly reliable (see point 2).
|
||||
|
||||
Enter 'mark and sweep' or m-n-s (not to be confused with the retail
|
||||
giant) as it's affectionately known.
|
||||
|
||||
The Mark and Sweep algorithm proceeds in three steps:
|
||||
|
||||
- Step 1; the CP declares to VPP that it wants to begin the process
|
||||
(i.e. it has just restarted). At this point VPP will iterate through
|
||||
all the objects that the CP owns and 'mark' then as being
|
||||
stale. This process effectively declares a new 'epoch', a barrier in
|
||||
time that separates the old objects from the new.
|
||||
- Step 2; The CP downloads all of its new objects. If one of these new
|
||||
CP objects matches (has the same key as) an existing object, then
|
||||
the CP add is considered an update, and the object's stale state is
|
||||
removed.
|
||||
- Step 3: The CP declares it has 'converged'; it has no more updates
|
||||
to give (at this time). VPP will then again iterate through all the
|
||||
CP's objects and remove those that do not belong to the new epoch,
|
||||
i.e. those that are still marked stale.
|
||||
|
||||
After step 3, the CP and VPP databases are in sync.
|
||||
|
||||
The cost of the process was to download all the new routes again. This
|
||||
is a highly-tuned and well-tested scenario.
|
||||
|
||||
In VPP we use the synonym 'replace' to describe the mark-n-sweep
|
||||
action in the API. We use this term because it refers to the goals of
|
||||
the algorithm at a high level - the CP wants to replace the old DB
|
||||
with a new one - but it does not specify the algorithm by which that
|
||||
is achieved. One could equally perform this task by constructing a
|
||||
brand new DB in VPP, and then swapping them when the CP
|
||||
converges. Other subsystems may employ that approach, but FIB does
|
||||
not. Updates are typically faster than adds, since the update is
|
||||
likely a no-op, whereas a separate add would require the memory
|
||||
allocator, which is the long pole in FIB additions. Additionally, it requires
|
||||
twice the memory for a moment in time, which could be prohibitive when
|
||||
the FIB is large.
|
||||
|
110
docs/developer/corefeatures/fib/missing.rst
Normal file
110
docs/developer/corefeatures/fib/missing.rst
Normal file
@ -0,0 +1,110 @@
|
||||
.. _missing:
|
||||
|
||||
Missing Functionality
|
||||
---------------------
|
||||
|
||||
A list of functionality that the FIB does not currently provide.
|
||||
|
||||
|
||||
PIC Edge Backup Paths
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
FIB supports the concept of path 'preference'. Only paths that have
|
||||
the best preference contribute to forwarding. Only once all the paths with
|
||||
the best preference go down do the paths with the next best preference
|
||||
contribute.
|
||||
|
||||
In BGP PIC edge, BGP would install the primary paths and the backup
|
||||
paths. With expectation that backups are only used once all primaries
|
||||
fail; this is the same behaviour that FIB's preference sets provide.
|
||||
|
||||
However, in order to get prefix independent convergence, one must be
|
||||
able to only modify the path-list's load-balance map (LBM) to choose the
|
||||
paths to use. Hence the paths must already be in the map, and
|
||||
conversely must be in the fib_entry's load-balance (LB). In other
|
||||
words, to use backup paths with PIC, the fib_entry's LB must include
|
||||
the backup paths, and the path-lists LBM must map from the backups to
|
||||
the primaries.
|
||||
|
||||
This is change that is reasonably easy w.r.t. to knowing what to
|
||||
change, but hard to get right and hard to test.
|
||||
|
||||
|
||||
Loop Free Alternate Paths
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Contrary to the BGP approach for path backups, an IGP could install a
|
||||
loop free alternate (LFA) path to achieve fast re-route (FRR).
|
||||
|
||||
Because of the way the LFA paths are calculated by the IGP an LFA backup
|
||||
path is always paired with a primary. VPP FIB does not support this
|
||||
primary-backup pair relationship.
|
||||
|
||||
In intent of LFA FRR is/was to get below the magic 50ms mark. To do
|
||||
this the expectation is/was that one would need in the forwarding
|
||||
graph an object that represents a path's state. This object would be
|
||||
checked for each packet being sent. If the path is up, the graph (an
|
||||
adjacency since it's the IGP) for the primary path is taken, if it's
|
||||
down the graph for the backup is taken. When a path goes down only
|
||||
this indirection object needs to be updated to affect all
|
||||
routes. Naturally, the indirection would incur a performance cost, but
|
||||
we know that there are many performance-convergence trade-offs in a
|
||||
FIB design.
|
||||
|
||||
Should VPP's FIB support this feature? It all depends on the
|
||||
50ms. LFA FRR comes from the era when routers ran on lower performance
|
||||
CPUs and interface down was an interrupt. VPP typically has plenty of
|
||||
gas but runs as a user space process. So, can it update all routes in
|
||||
under 50ms on a meaty CPU and can the OS deliver the interface down
|
||||
within the time requirements? I don't have the answers to either
|
||||
question.
|
||||
|
||||
|
||||
Extranets for Multicast
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
When a unicast prefix is present in two different tables, then it
|
||||
refers to a different set of devices. When the prefix is imported it
|
||||
refers to the same set of devices. If the set of paths to reach the
|
||||
prefix is different in the import and export table, it doesn't matter,
|
||||
since they both refer to the same devices, so either set can be
|
||||
used. Therefore, FIB's usual source preference rules can apply. The
|
||||
'import' source is lower priority.
|
||||
|
||||
When a multicast prefix is present in two different tables, then it's
|
||||
two different flows referring to two different set of receivers. When
|
||||
the prefix is imported, then it refers to the same flow and two
|
||||
different sets of receivers. In other words, the receiver set in the
|
||||
import table needs to be the super set of receivers.
|
||||
|
||||
There are two ways one might consider doing this; merging the
|
||||
path-lists or replicating the packet first into each table.
|
||||
|
||||
|
||||
Collapsing
|
||||
^^^^^^^^^^
|
||||
|
||||
Read :ref:`fastconvergence`
|
||||
|
||||
Collapsing the DPO graph for recursive routes doesn't have to be an
|
||||
all or nothing. Easy cases:
|
||||
|
||||
|
||||
- A recursive prefix with only one path and a path-list that is not
|
||||
popular, could stack directly on the LB of the via entry.
|
||||
- A recursive prefix with only multiple paths and a path-list that is not
|
||||
popular, could construct a new load balance using the choices
|
||||
present in each bucket of its via entries. The choices in the new LB
|
||||
though would need to reflect the relative weighting.
|
||||
|
||||
|
||||
The condition of an non-popular path-list means that the LB doesn't
|
||||
have an LB map and hence it needs to be updated for convergence to
|
||||
occur.
|
||||
|
||||
The more difficult cases come when the recursive prefix has labels
|
||||
which need to be stack on the via entries' choices.
|
||||
|
||||
You might also envision a global configuration that always collapses all
|
||||
chains, which could be used in deployments where convergence is not a
|
||||
priority.
|
220
docs/developer/corefeatures/fib/mplsfib.rst
Normal file
220
docs/developer/corefeatures/fib/mplsfib.rst
Normal file
@ -0,0 +1,220 @@
|
||||
.. _mplsfib:
|
||||
|
||||
MPLS FIB
|
||||
--------
|
||||
|
||||
Implementation
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
The MPLS FIB is implemented using exactly the same data structures as
|
||||
the IP FIB. The only difference is the implementation of the
|
||||
table. Whereas for IPv4 this is an mtrie and for IPv6 a hash table,
|
||||
for MPLS it is a flat array indexed by a 21 bit key (label & EOS
|
||||
bit). This implementation is chosen to favour packet forwarding speed.
|
||||
|
||||
It can be the case in MPLS forwarding that packets received with the
|
||||
EOS bit set in the MPLS label need to be forwarded differently from
|
||||
those without. The most common example of this is if the path set
|
||||
contains a path that does not have an output label. In this case the
|
||||
non-EOS packets cannot take this path, because to do so would expose
|
||||
the neighbouring router to a label that it did not allocate.
|
||||
|
||||
The design choice to make with an MPLS FIB table is therefore:
|
||||
- 20 bit key: label only. When the EOS and non-EOS actions differ the result is a 'EOS-choice' object.
|
||||
- 21 bit key: label and EOS-bit. The result is then the specific action based on EOS-bit.
|
||||
|
||||
20 bit key
|
||||
- Advantages:lower memory overhead, since there are few DB entries.
|
||||
- Disadvantages: slower DP performance in the case the path-lists
|
||||
differ, as more objects are encountered in the switch path
|
||||
|
||||
21 bit key
|
||||
- Advantages: faster DP performance
|
||||
Disadvantages: increased memory footprint.
|
||||
|
||||
Switching between schemes based on observed/measured action similarity
|
||||
is not considered on the grounds of complexity and flip-flopping.
|
||||
|
||||
VPP mantra - favour performance over memory. We choose a 21 bit key.
|
||||
|
||||
Basics
|
||||
^^^^^^
|
||||
|
||||
MPLS is not enabled by default. There are two steps to get
|
||||
started. First, create the default MPLS FIB:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls table add 0
|
||||
|
||||
With '0' being the magic number for the 'default' table (just like it
|
||||
is for IPv[46]). One can create other MPLS tables, but, unlike IP
|
||||
tables, one cannot 'bind' non-default MPLS tables to interfaces, in
|
||||
other words all MPLS packets received on an interface will always
|
||||
result in a lookup in the default table. One has to be more inventive
|
||||
to use the non-default tables...
|
||||
|
||||
Secondly, for *each* interface on which you wish to *receive* MPLS
|
||||
packets, that interface must be MPLS 'enabled'
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ set interface mpls GigEthernet0/0/0 enable
|
||||
|
||||
there is no equivalent enable for transmit, all that is required is to
|
||||
use an interface as an egress path.
|
||||
|
||||
Entries in the MPLS FIB can be displayed with:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ sh mpls fib [table X] [label]
|
||||
|
||||
There is a tight coupling between IP and MPLS forwarding. MPLS
|
||||
forwarding equivalence classes (FECs) are often an IP prefix – that is
|
||||
to say that traffic matching a given IP prefix is routed into a MPLS
|
||||
label switch path (LSP). It is thus necessary to be able to associate
|
||||
a given prefix/route with an [out-going] MPLS label that will be
|
||||
imposed when the packet is forwarded. This is configured as:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip route add 1.1.1.1/32 via 10.10.10.10 GigEthernet0/0/0 out-labels 33
|
||||
|
||||
packets matching 1.1.1.1/32 will be forwarded out GigEthernet0/0/0 and have
|
||||
MPLS label 33 imposed. More than one out-going label can be
|
||||
specified. Out-going MPLS labels can be applied to recursive and
|
||||
non-recursive routes, e.g;
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip route add 2.2.2.0/24 via 1.1.1.1 out-labels 34
|
||||
|
||||
packets matching 2.2.2.0/24 will thus have two MPLS labels imposed; 34
|
||||
and 33. This is the realisation of, e,g, an MPLS BGP VPNv4.
|
||||
|
||||
To associate/allocate a local-label for a prefix, and thus have
|
||||
packets to that local-label forwarded equivalently to the prefix do;
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls local-label 99 2.2.2.0/24
|
||||
|
||||
In the API this action is called a ‘bind’.
|
||||
The router receiving the MPLS encapsulated packets needs to be
|
||||
programmed with actions associated which each label value – this is
|
||||
the role of the MPLS FIB. The MPLS FIB is a table, whose key is the
|
||||
MPLS label value and end-of-stack (EOS) bit, which stores the action
|
||||
to perform on packets with matching encapsulation. Currently supported
|
||||
actions are:
|
||||
|
||||
#. Pop the label and perform an IPv[46] lookup in a specified table
|
||||
#. Pop the label and forward via a specified next-hop (this is penultimate-hop-pop, PHP)
|
||||
#. Swap the label and forward via a specified next-hop.
|
||||
|
||||
These can be programmed respectively by:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls local-label 33 eos ip4-lookup-in-table X
|
||||
$ mpls local-label 33 [eos] via 10.10.10.10 GigEthernet0/0/0
|
||||
$ mpls local-label 33 [eos] via 10.10.10.10 GigEthernet0/0/0 out-labels 66
|
||||
|
||||
the latter is an example of an MPLS cross connect. Any description of
|
||||
a next-hop, recursive, non-recursive, labelled, non-labelled, etc,
|
||||
that is valid for an IP prefix, is also valid for an MPLS
|
||||
local-label. Note the use of the 'eos' keyword which indicates the
|
||||
programming is for the case when the label is end-of-stack. The last
|
||||
two operations can apply to both eos and non-eos packets, but the pop
|
||||
and IP lookup only to an eos packet.
|
||||
|
||||
|
||||
MPLS VPN
|
||||
^^^^^^^^
|
||||
|
||||
To configure an MPLS VPN for a PE the following example can be used.
|
||||
|
||||
Step 1; Configure routes to the iBGP peers - note these route MUST
|
||||
have out-going labels;
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip route add 10.0.0.1/32 via 192.168.1.2 Eth0 out-labels 33
|
||||
$ ip route add 10.0.0.2/32 via 192.168.2.2 Eth0 out-labels 34
|
||||
|
||||
Step 2; Configure the customer 'VRF'
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip table add 2
|
||||
|
||||
Step 3; add a route via the iBGP peer[s] with the MPLS label
|
||||
advertised by that peer
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip route add table 2 10.10.10.0/24 via 10.0.0.2 next-hop-table 0 out-label 122
|
||||
$ ip route add table 2 10.10.10.0/24 via 10.0.0.1 next-hop-table 0 out-label 121
|
||||
|
||||
Step 4; add a route via the eBGP peer
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip route add table 2 10.10.20.0/24 via 172.16.0.1 next-hop-table 2
|
||||
|
||||
Step 5; depending on the label allocation scheme used, add routes to
|
||||
the MPLS FIB to accept incoming labelled packets:
|
||||
|
||||
#. per-prefix label scheme - this command 'binds' the label to the same
|
||||
forwarding as the IP route
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls local-label 99 10.10.20.0/24
|
||||
|
||||
#. per-CE label scheme - this pops the incoming label and forwards via
|
||||
the next-hop provided. Append config for 'out-labels' if so desired.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls local-label 99 via 172.16.0.1 next-hop-table 2
|
||||
|
||||
#. per-VRF label scheme
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls local-label 99 via ip4-lookup-in-table 2
|
||||
|
||||
MPLS Tunnels
|
||||
^^^^^^^^^^^^
|
||||
|
||||
MPLS tunnels are unidirectional and can impose a stack of labels. They
|
||||
are 'normal' interfaces and thus can be used, for example, as the
|
||||
target for IP routes and L2 cross-connects. To construct a tunnel:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls tunnel add via 10.10.10.10 GigEthernet0/0/0 out-labels 33 44 55
|
||||
|
||||
and to then have that created tunnel to perform ECMP:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mpls tunnel add mpls-tunnel0 via 10.10.10.11 GigEthernet0/0/0 out-labels 66 77 88
|
||||
|
||||
use
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ sh mpls tunnel [X]
|
||||
|
||||
to see the monster you have created.
|
||||
|
||||
An MPLS tunnel interface is an interface like any other and now ready
|
||||
for use with the usual set of interface commands, e.g.:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ set interface state mpls-tunnel0 up
|
||||
$ set interface ip address mpls-tunnel0 192.168.1.1/30
|
||||
$ ip route 1.1.1.1/32 via mpls-tunnel0
|
106
docs/developer/corefeatures/fib/multicast.rst
Normal file
106
docs/developer/corefeatures/fib/multicast.rst
Normal file
@ -0,0 +1,106 @@
|
||||
.. _mfib:
|
||||
|
||||
IP Multicast FIB
|
||||
----------------
|
||||
|
||||
The two principal differences between multicast and unicast forwarding
|
||||
are:
|
||||
|
||||
* there is no load-balancing among paths, there is only replication
|
||||
across paths.
|
||||
* multicast forwarding has an explicit reverse path forwarding (RPF)
|
||||
check. It will only forward a packet if it arrives from a peer for
|
||||
which it has been explicitly configured to accept.
|
||||
|
||||
The other factor that influences the design of the mFIB is that the
|
||||
match criteria (the prefix) is different. For multicast it is
|
||||
necessary to be able to match on source and destination/group
|
||||
addresses (termed an (S,G)) and only on a destination prefix (a (\*,
|
||||
G/m)). This prefix is much bigger than a unicast prefix, and since
|
||||
unicast scale is almost always greater than multicast scale, it is not
|
||||
a good idea to have a single definition of a prefix. Therefore,
|
||||
there is a fib_prefix_t (and hence a fib_entry_t) and an
|
||||
mfib_prefix_t (and hence a mfib_entry_t).
|
||||
|
||||
The fib_path_t and fib_path_list_t are reused. A path can represent
|
||||
either a peer from which to accept packets or a peer to which to send
|
||||
packets. A path-extension is added to the fib_path_t/mfib_entry_t to
|
||||
describe the role the path plays. Logically the path-list is split
|
||||
into two sets; an accepting set and a forwarding set. The forwarding set
|
||||
contributes a replicate DPO for forwarding and the accepting set
|
||||
contributes a list of interfaces (an mfib_itf_t) for the RPF check.
|
||||
|
||||
An IP multicast FIB (mFIB) is a data-structure that holds entries that
|
||||
represent a (S,G) or a (\*,G/m) multicast group. There is one IPv4 and
|
||||
one IPv6 mFIB per IP table, i.e. each time the user calls 'ip[6] table
|
||||
add X' an mFIB is created.
|
||||
|
||||
Usage
|
||||
^^^^^
|
||||
|
||||
To add an entry to the default mFIB for the group (1.1.1.1, 239.1.1.1)
|
||||
that will replicate packets to GigEthernet0/0/0 and GigEthernet0/0/1, do:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip mroute add 1.1.1.1 239.1.1.1 via GigEthernet0/0/0 Forward
|
||||
$ ip mroute add 1.1.1.1 239.1.1.1 via GigEthernet0/0/1 Forward
|
||||
|
||||
the flag 'Forward' passed with the path specifies this path to be part of the replication set.
|
||||
To add a path from GigEthernet0/0/2 to the accepting (RPF) set do:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip mroute add 1.1.1.1 239.1.1.1 via GigEthernet0/0/2 Accept
|
||||
|
||||
A (\*,G) entry is added by not specifying a source address:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip mroute add 232.2.2.2 via GigEthernet0/0/2 Forward
|
||||
|
||||
A (\*,G/m) entry is added by not specifying a source address and giving
|
||||
the group address a mask:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip mroute add 232.2.2.0/24 via GigEthernet0/0/2 Forward
|
||||
|
||||
Entries are deleted when all paths have been removed and all entry flags (see below) are also removed.
|
||||
|
||||
Advanced
|
||||
^^^^^^^^
|
||||
|
||||
There are a set of flags associated only with an entry, see:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ show mfib route flags
|
||||
|
||||
only some of these are relevant over the API/CLI:
|
||||
|
||||
#. Signal - packets that match this entry will generate an event that
|
||||
is sent to the control plane (which can be retrieved via the signal
|
||||
dump API)
|
||||
#. Connected - indicates that the control plane should be informed of
|
||||
connected sources (also retrieved via the signal dump API)
|
||||
#. Accept-all-itf - the entry shall accept packets from all
|
||||
interfaces, thus eliminating the RPF check
|
||||
#. Drop - Drop all packet matching this entry.
|
||||
|
||||
flags on an entry can be changed with:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip mroute <PREFIX> <FLAG>
|
||||
|
||||
An alternative approach to the RPF check, that does check the
|
||||
accepting path set, is to give the entry and RPF-ID:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip mroute <PREFIX> rpf-id X
|
||||
|
||||
the RPF-ID is an attribute of a received packet's meta-data and is
|
||||
added to the packet when it ingresses on a given entity such as an
|
||||
MPLS-tunnel or a BIER table disposition entry.
|
88
docs/developer/corefeatures/fib/neighbors.rst
Normal file
88
docs/developer/corefeatures/fib/neighbors.rst
Normal file
@ -0,0 +1,88 @@
|
||||
.. _neighbors:
|
||||
|
||||
Neighbours
|
||||
^^^^^^^^^^^
|
||||
|
||||
.. figure:: /_images/ip-neighbor.png
|
||||
|
||||
Figure 1: Neighbour data model
|
||||
|
||||
Figure 1 shows the data model for IP neighbours. An IP neighbour contains the mapping
|
||||
between a peer, identified by an IPv4 or IPv6 address, and its MAC address on a given
|
||||
interface. An IP-table (VRF) is not part of the neighbour's
|
||||
data/identity. This is because the virtualization of a router into
|
||||
different tables (VRFs) is performed at the interface level, i.e. an
|
||||
IP-table is bound to a particular interface. A neighbour, which is
|
||||
attached to an interface, is thus implicitly in that table, and
|
||||
only in that table. It is also worth noting that IP neighbours
|
||||
contribute forwarding for the egress direction, whereas an IP-table
|
||||
is an ingress only function.
|
||||
|
||||
The *ip_neighbor_t* represents the control-plane addition of the
|
||||
neighbour. The *ip_adjacency_t* contains the data derived from the *ip_neighbor_t* that is needed to
|
||||
forward packets to the peer. The additional data in the adjacency are the *rewrite*
|
||||
and the *link_type*. The *link_type* is a description of the protocol of the packets
|
||||
that will be forwarded with this adjacency; e.g. IPv4, IPv6 or MPLS. The *link_type*
|
||||
maps directly to the ether-type in an Ethernet header, or the protocol filed in a
|
||||
GRE header. The rewrite is a byte string representation of the header that will be
|
||||
prepended to the packet when it is sent to that peer. For Ethernet interfaces this
|
||||
is be the src,dst MAC and the ether-type. For LISP tunnels, the IP src,dst pair
|
||||
and the LISP header.
|
||||
|
||||
The *ip_neighbor_t* for an IPv4 peer (learned e.g. over ARP) will
|
||||
install a *link_type=IPv4* when the entry is created and a
|
||||
link_type=MPLS on demand (i.e. when a route with output labels resolves via the peer).
|
||||
|
||||
Adjacency
|
||||
---------
|
||||
|
||||
There are three sub-types of adjacencies. Purists would argue that some
|
||||
of these sub-types are not really adjacencies but are instead other
|
||||
forms of DPOs, and it would be hard to argue against that, but
|
||||
historically (not just in VPP, but in the FIB implementations from
|
||||
which VPP draws on for some of its concepts), these have been modelled
|
||||
as adjacency types, the one thing they have in common is that they
|
||||
have an associated interface and are terminal. The [sub] sub-types are:
|
||||
|
||||
* A Neighbour Adjacency (key={interface, next-hop, link-type}). A
|
||||
representation of a peer on a link (as described above). A neighbour adjacency itself has
|
||||
two sub-types; terminal and mid-chain. When one speak of 'an
|
||||
adjacency' one is usually referring to a terminal neighbour
|
||||
sub-type. A mid-chain adjacency represents a neighbor on a virtual
|
||||
interface which relies on the FIB to perform further forwarding. This
|
||||
adjacency is thus not terminal for the FIB object graph but instead
|
||||
appears in the 'middle' (the term chain is a synonym for graph in
|
||||
some contexts).
|
||||
A neighbour adjacency can be in one of two states; complete and
|
||||
incomplete. A complete adjacency knows the rewrite string that
|
||||
should be used to reach the peer, an incomplete adjacency does
|
||||
not. If the adjacency was added as a result of the addition of an
|
||||
*ip_neighbor_t* then the adjacency will be complete (because the
|
||||
*ip_neighbor_t* knows the peer's MAC address). An incomplete
|
||||
adjacency is created on demand by the FIB when a route's path
|
||||
requires to resolve through such an adjacency. It is thus created in
|
||||
order to resolve the missing dependency, it will become complete
|
||||
once the *ip_neighbor_t* is discovered.
|
||||
In the forwarding path a complete adjacency will prepend the rewrite
|
||||
string and transmit on the egress interface, an incomplete adjacency
|
||||
will construct a ARP/ND request to resolve the peer's IP address.
|
||||
|
||||
* A Glean Adjacency (key={interface}). This is a representation of the need to discover
|
||||
a peer on the given interface. It is used when it is known that the
|
||||
packet is destined to an undiscovered peer on that interface. The
|
||||
difference between the glean adjacency and an
|
||||
incomplete neighbour adjacency is that in the forwarding path the
|
||||
glean adjacency will construct an ARP/ND request for the peer as
|
||||
determined from the packet's destination address. The glean
|
||||
adjacency is used to resolve connected prefixes on multi-access
|
||||
interfaces.
|
||||
|
||||
* A Multicast Adjacency (key={interface}). This represents the need to send an IP
|
||||
multicast packet out of the adjacency's associated interface. Since
|
||||
IP multicast constructs the destination MAC address from the IP
|
||||
packet's destination/group address, the rewrite is always known and
|
||||
hence the adjacency is always complete.
|
||||
|
||||
|
||||
All adjacency types can be shared between routes, hence each type is
|
||||
stored in a DB whose key is appropriate for the type.
|
17
docs/developer/corefeatures/fib/prefixes.rst
Normal file
17
docs/developer/corefeatures/fib/prefixes.rst
Normal file
@ -0,0 +1,17 @@
|
||||
.. _prefixes:
|
||||
|
||||
Prefixes
|
||||
^^^^^^^^
|
||||
|
||||
Some nomenclature used to describe prefixes:
|
||||
|
||||
* 1.1.1.1 This is an address since it has no associated mask
|
||||
* 1.1.1.0/24 This is a prefix.
|
||||
* 1.1.1.1/32 This is a host prefix (the mask length is the size of the address).
|
||||
|
||||
Prefix A is more specific than B if its mask length is longer, and less specific if
|
||||
the mask is shorter. For example, 1.1.1.0/28 is more specific than 1.1.1.0/24. A
|
||||
less specific prefix that overlaps with a more specific is the **covering** prefix.
|
||||
For example, 1.1.1.0/24 is the covering prefix for 1.1.1.0/28 and 1.1.1.0/28 is termed
|
||||
the **covered** prefix. A covering prefix is therefore always less specific than its
|
||||
covered prefixes.
|
12
docs/developer/corefeatures/fib/prerequisites.rst
Normal file
12
docs/developer/corefeatures/fib/prerequisites.rst
Normal file
@ -0,0 +1,12 @@
|
||||
.. _prerequisites:
|
||||
|
||||
Prerequisites
|
||||
-------------
|
||||
|
||||
This section describes some prerequisite topics and nomenclature that are
|
||||
foundational to understanding the FIB architecture.
|
||||
|
||||
.. toctree::
|
||||
|
||||
graphs
|
||||
prefixes
|
353
docs/developer/corefeatures/fib/routes.rst
Normal file
353
docs/developer/corefeatures/fib/routes.rst
Normal file
File diff suppressed because it is too large
Load Diff
247
docs/developer/corefeatures/fib/scale.rst
Normal file
247
docs/developer/corefeatures/fib/scale.rst
Normal file
@ -0,0 +1,247 @@
|
||||
.. _scale:
|
||||
|
||||
Scale
|
||||
-----
|
||||
|
||||
The only limiting factor on FIB scale is the amount of memory
|
||||
allocated to each heap the FIB uses, and there are 2:
|
||||
|
||||
* The main heap
|
||||
* The stats heap
|
||||
|
||||
|
||||
Main Heap
|
||||
^^^^^^^^^
|
||||
|
||||
The main heap is used to allocate all memory needed for the FIB
|
||||
data-structures. Each table, created by the user, i.e. with;
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ ip table add 1
|
||||
|
||||
or the default table, comprises 2 *ip4_fib_t* objects.
|
||||
The 'non-forwarding' *ip4_fib_t* contains all the entries in the table
|
||||
and, the 'forwarding' contains the entries that are matched against in
|
||||
the data-plane. The difference between the two sets are the entries
|
||||
that should not be matched in the data-plane.
|
||||
Each *ip4_fib_t* comprises an mtrie (for fast lookup in the data-plane)
|
||||
and a hash table per-prefix length (for lookup in the control plane).
|
||||
|
||||
To see the amount of memory consumed by the IPv4 tables use:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip fib mem
|
||||
ipv4-VRF:0 mtrie:335744 hash:4663
|
||||
ipv4-VRF:1 mtrie:333056 hash:3499
|
||||
totals: mtrie:668800 hash:8162 all:676962
|
||||
|
||||
this output shows two 'empty' (i.e. no added routes) tables. Each
|
||||
mtrie uses about 150k of memory, so each table about 300k.
|
||||
|
||||
|
||||
Below the output having added 1M, 2M and 4M routes respectively:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip fib mem
|
||||
ipv4-VRF:0 mtrie:335744 hash:4695
|
||||
totals: mtrie:335744 hash:4695 all:340439
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip fib mem
|
||||
ipv4-VRF:0 mtrie:5414720 hash:41177579
|
||||
totals: mtrie:5414720 hash:41177579 all:46592299
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip fib mem
|
||||
ipv4-VRF:0 mtrie:22452608 hash:168544508
|
||||
totals: mtrie:22452608 hash:168544508 all:190997116
|
||||
|
||||
|
||||
IPv6 also has the concept of forwarding and non-forwarding entries,
|
||||
however for IPv6 all the forwarding entries are stored in a single
|
||||
hash table (same goes for the non-forwarding). The key to the hash
|
||||
table includes the IPv6 table-id.
|
||||
|
||||
To see the amount of memory consumed by the IPv4 tables use:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip6 fib mem
|
||||
IPv6 Non-Forwarding Hash Table:
|
||||
Hash table ip6 FIB non-fwding table
|
||||
7 active elements 7 active buckets
|
||||
1 free lists
|
||||
0 linear search buckets
|
||||
arena: base 7f2fe28bf000, next 803c0
|
||||
used 525248 b (0 Mbytes) of 33554432 b (32 Mbytes)
|
||||
|
||||
IPv6 Forwarding Hash Table:
|
||||
Hash table ip6 FIB fwding table
|
||||
7 active elements 7 active buckets
|
||||
1 free lists
|
||||
0 linear search buckets
|
||||
arena: base 7f2fe48bf000, next 803c0
|
||||
used 525248 b (0 Mbytes) of 33554432 b (32 Mbytes)
|
||||
|
||||
as we scale to 128k IPv6 entries:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip6 fib mem
|
||||
IPv6 Non-Forwarding Hash Table:
|
||||
Hash table ip6 FIB non-fwding table
|
||||
131079 active elements 32773 active buckets
|
||||
2 free lists
|
||||
[len 1] 2 free elts
|
||||
0 linear search buckets
|
||||
arena: base 7fed7a514000, next 4805c0
|
||||
used 4720064 b (4 Mbytes) of 1073741824 b (1024 Mbytes)
|
||||
|
||||
IPv6 Forwarding Hash Table:
|
||||
Hash table ip6 FIB fwding table
|
||||
131079 active elements 32773 active buckets
|
||||
2 free lists
|
||||
[len 1] 2 free elts
|
||||
0 linear search buckets
|
||||
arena: base 7fedba514000, next 4805c0
|
||||
used 4720064 b (4 Mbytes) of 1073741824 b (1024 Mbytes)
|
||||
|
||||
and 256k:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip6 fib mem
|
||||
IPv6 Non-Forwarding Hash Table:
|
||||
Hash table ip6 FIB non-fwding table
|
||||
262151 active elements 65536 active buckets
|
||||
2 free lists
|
||||
[len 1] 6 free elts
|
||||
0 linear search buckets
|
||||
arena: base 7fed7a514000, next 880840
|
||||
used 8915008 b (8 Mbytes) of 1073741824 b (1024 Mbytes)
|
||||
|
||||
IPv6 Forwarding Hash Table:
|
||||
Hash table ip6 FIB fwding table
|
||||
262151 active elements 65536 active buckets
|
||||
2 free lists
|
||||
[len 1] 6 free elts
|
||||
0 linear search buckets
|
||||
arena: base 7fedba514000, next 880840
|
||||
used 8915008 b (8 Mbytes) of 1073741824 b (1024 Mbytes)
|
||||
|
||||
and 1M:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh ip6 fib mem
|
||||
IPv6 Non-Forwarding Hash Table:
|
||||
Hash table ip6 FIB non-fwding table
|
||||
1048583 active elements 65536 active buckets
|
||||
4 free lists
|
||||
[len 1] 65533 free elts
|
||||
[len 2] 65531 free elts
|
||||
[len 4] 9 free elts
|
||||
0 linear search buckets
|
||||
arena: base 7fed7a514000, next 3882740
|
||||
used 59254592 b (56 Mbytes) of 1073741824 b (1024 Mbytes)
|
||||
|
||||
IPv6 Forwarding Hash Table:
|
||||
Hash table ip6 FIB fwding table
|
||||
1048583 active elements 65536 active buckets
|
||||
4 free lists
|
||||
[len 1] 65533 free elts
|
||||
[len 2] 65531 free elts
|
||||
[len 4] 9 free elts
|
||||
0 linear search buckets
|
||||
arena: base 7fedba514000, next 3882740
|
||||
used 59254592 b (56 Mbytes) of 1073741824 b (1024 Mbytes)
|
||||
|
||||
as can be seen from the output the IPv6 hash-table in this case was scaled
|
||||
to 1GB and 1million prefixes has used 56MB of it.
|
||||
|
||||
The main heap is also used to allocate objects that represent the FIB
|
||||
entries in the control and data plane (see :ref:`controlplane` and
|
||||
:ref:`dataplane`) such as *fib_entry_t* and *load_balance_t*. These come
|
||||
from the main heap because they are not protocol specific
|
||||
(i.e. they are used to represent either IPv4, IPv6 or MPLS
|
||||
entries).
|
||||
|
||||
With 1M prefixes allocated the memory usage is:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh fib mem
|
||||
FIB memory
|
||||
Tables:
|
||||
SAFI Number Bytes
|
||||
IPv4 unicast 1 33619968
|
||||
IPv6 unicast 2 118502784
|
||||
MPLS 0 0
|
||||
IPv4 multicast 1 1175
|
||||
IPv6 multicast 1 525312
|
||||
Nodes:
|
||||
Name Size in-use /allocated totals
|
||||
Entry 72 1048589/ 1048589 75498408/75498408
|
||||
Entry Source 40 1048589/ 1048589 41943560/41943560
|
||||
Entry Path-Extensions 76 0 / 0 0/0
|
||||
multicast-Entry 192 6 / 6 1152/1152
|
||||
Path-list 40 18 / 18 720/720
|
||||
uRPF-list 16 14 / 14 224/224
|
||||
Path 72 22 / 22 1584/1584
|
||||
Node-list elements 20 1048602/ 1048602 20972040/20972040
|
||||
Node-list heads 8 24 / 24 192/192
|
||||
|
||||
and with 2M
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
vpp# sh fib mem
|
||||
FIB memory
|
||||
Tables:
|
||||
SAFI Number Bytes
|
||||
IPv4 unicast 1 33619968
|
||||
IPv6 unicast 2 252743040
|
||||
MPLS 0 0
|
||||
IPv4 multicast 1 1175
|
||||
IPv6 multicast 1 525312
|
||||
Nodes:
|
||||
Name Size in-use /allocated totals
|
||||
Entry 72 2097165/ 2097165 150995880/150995880
|
||||
Entry Source 40 2097165/ 2097165 83886600/83886600
|
||||
Entry Path-Extensions 76 0 / 0 0/0
|
||||
multicast-Entry 192 6 / 6 1152/1152
|
||||
Path-list 40 18 / 19 720/760
|
||||
uRPF-list 16 18 / 18 288/288
|
||||
Path 72 22 / 23 1584/1656
|
||||
Node-list elements 20 2097178/ 2097178 41943560/41943560
|
||||
Node-list heads 8 24 / 24 192/192
|
||||
|
||||
However, the situation is not a simple as that. All of the 1M prefixes
|
||||
added above were reachable via the same next-hop, so the path-list
|
||||
(and path) they use is shared. As prefixes are added that use
|
||||
different (sets of) next-hops, the number of path-lists and paths
|
||||
requires will increase.
|
||||
|
||||
|
||||
Stats Heap
|
||||
^^^^^^^^^^
|
||||
|
||||
VPP collects statistics for each route. For each route VPP collects
|
||||
byte and packet counters for packets sent to the prefix (i.e. the
|
||||
route was matched in the data-plane) and packets sent via the prefix (i.e. the
|
||||
matching prefix is reachable through it - like a BGP peer). This
|
||||
requires 4 counters per route in the stats segment.
|
||||
|
||||
Below shows the size of the stats segment with 1M, 2M and 4M routes.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
total: 1023.99M, used: 127.89M, free: 896.10M, trimmable: 830.94M
|
||||
total: 1023.99M, used: 234.14M, free: 789.85M, trimmable: 668.15M
|
||||
total: 1023.99M, used: 456.83M, free: 567.17M, trimmable: 388.91M
|
||||
|
15
docs/developer/corefeatures/fib/thedatamodel.rst
Normal file
15
docs/developer/corefeatures/fib/thedatamodel.rst
Normal file
@ -0,0 +1,15 @@
|
||||
.. _thedatamodel:
|
||||
|
||||
The Data Model
|
||||
--------------
|
||||
|
||||
The FIB data model comprises two parts; the control-plane (CP) and the data-plane
|
||||
(DP). The CP data model represents the data that is programmed into VPP by the
|
||||
upper layers. The DP model represents how VPP derives actions to be performed on
|
||||
packets as they are switched.
|
||||
|
||||
.. toctree::
|
||||
|
||||
controlplane
|
||||
dataplane
|
||||
|
62
docs/developer/corefeatures/fib/tunnels.rst
Normal file
62
docs/developer/corefeatures/fib/tunnels.rst
Normal file
@ -0,0 +1,62 @@
|
||||
.. _tunnels:
|
||||
|
||||
Tunnels
|
||||
-------
|
||||
|
||||
Tunnels share a similar property to recursive routes in that after applying the
|
||||
tunnel encapsulation, a new packet must be forwarded, i.e. forwarding is
|
||||
recursive. However, as with recursive routes the tunnel's destination is known
|
||||
beforehand, so the second lookup can be avoided if the packet can follow the
|
||||
already constructed data-plane graph for the tunnel's destination. This process
|
||||
of joining to DP graphs together is termed *stacking*.
|
||||
|
||||
.. figure:: /_images/fib20fig11.png
|
||||
|
||||
Figure 11: Tunnel control plane object diagram
|
||||
|
||||
Figure 11 shows the control plane object graph for a route via a tunnel. The two
|
||||
sub-graphs for the route via the tunnel and the route for the tunnel's
|
||||
destination are shown to the right and left respectively. The red line shows the
|
||||
relationship form by stacking the two sub-graphs. The adjacency on the tunnel
|
||||
interface is termed a 'mid-chain' since it is now present in the middle of the
|
||||
graph/chain rather than its usual terminal location.
|
||||
|
||||
The mid-chain adjacency is contributed by the gre_tunnel_t , which also becomes
|
||||
part of the FIB control-plane graph. Consequently it will be visited by a
|
||||
back-walk when the forwarding information for the tunnel's destination changes.
|
||||
This will trigger it to restack the mid-chain adjacency on the new
|
||||
*load_balance_t* contributed by the parent *fib_entry_t*.
|
||||
|
||||
If the back-walk indicates that there is no route to the tunnel's
|
||||
destination, or that the resolving route does not meet resolution
|
||||
constraints, then the tunnel can be marked as down, and fast
|
||||
convergence can be triggered in the same way as for physical interfaces (see section ...).
|
||||
|
||||
|
||||
Multi-Point Tunnels
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Multi-point tunnels are an example of a non-broadcast multi-access
|
||||
interface. In simple terms this means there are many peers on the link
|
||||
but it is not possible to broadcast a single message to all of them at
|
||||
once, and hence the usual peer discovery mechanism (as employed,
|
||||
e.g. by ARP) is not available. Although an *ip_neighbor_t* is a
|
||||
representation of an IP peer on a link, it is not valid in this
|
||||
context as it maps the peer's identity to its MAC address. For a
|
||||
tunnel peer it is required to map the peer's overlay address (the
|
||||
attached address, the one in the same subnet as the device) with the
|
||||
peer's underlay address (probably on the other side of the
|
||||
internet). In the P2P case where there is only one peer on the link,
|
||||
the peer's underlay address is the same as the tunnel's destination
|
||||
address.
|
||||
The data structure that represents the mapping of the peer's overlay
|
||||
with underlay address is an entry in the Tunnel Endpoint Information
|
||||
Base (TEIB); the *tieb_entry_t*. TEIB entries are created by the
|
||||
control plane (e.g. NHRP (RFC2332)).
|
||||
|
||||
Each mid-chain adjacency on a multi-point tunnel is stacked on the
|
||||
*fib_entry_t* object that resolves the peer's underlay address. The
|
||||
glean adjacency on the tunnel resolves via a drop, since broadcasts
|
||||
are not possible. A multicast adjacency on a multi-point tunnel is
|
||||
currently a work in progress.
|
||||
|
Reference in New Issue
Block a user