Contiv/VPP Network Operation
============================

This document describes the network operation of the Contiv/VPP k8s
network plugin. It elaborates on the operation and configuration options
of the Contiv IPAM, as well as the details of how VPP gets programmed by
the Contiv/VPP control plane.

The following figure shows a 2-node k8s deployment of Contiv/VPP, with a
VXLAN tunnel established between the nodes to forward inter-node POD
traffic. The IPAM options are depicted on Node 1, whereas the VPP
programming is depicted on Node 2.

.. figure:: /_images/contiv-networking.png
   :alt: contiv-networking.png

   Contiv/VPP Architecture

Contiv/VPP IPAM (IP Address Management)
---------------------------------------

IPAM in Contiv/VPP is based on the concept of a **Node ID**. The Node ID
is a number that uniquely identifies a node in the k8s cluster. The
first node is assigned the ID of 1, the second node 2, etc. If a node
leaves the cluster, its ID is released back to the pool and will be
re-used by the next node.

The Node ID is used to calculate per-node IP subnets for PODs and other
internal subnets that need to be unique on each node. Apart from the
Node ID, the input for the IPAM calculations is a set of configuration
knobs, which can be specified in the ``IPAMConfig`` section of the
`Contiv/VPP deployment YAML <../../../k8s/contiv-vpp.yaml>`__ (a worked
example of these calculations follows the list below):

- **PodSubnetCIDR** (default ``10.1.0.0/16``): each POD gets an IP
  address assigned from this range. The size of this range (default
  ``/16``) dictates the upper limit of the POD count for the entire k8s
  cluster (default 65536 PODs).

- **PodNetworkPrefixLen** (default ``24``): prefix length of the
  per-node dedicated POD subnet. This value dictates how large a slice
  of the allocatable range defined in ``PodSubnetCIDR`` each node gets.
  With the default value (``24``), each node gets a ``/24`` slice of the
  ``PodSubnetCIDR``, and the Node ID selects which slice. With
  ``PodSubnetCIDR = 10.1.0.0/16``, ``PodNetworkPrefixLen = 24`` and
  ``NodeID = 5``, the resulting POD subnet for the node is
  ``10.1.5.0/24``.

- **PodIfIPCIDR** (default ``10.2.1.0/24``): VPP-internal addresses that
  put the VPP interfaces facing the PODs into L3 mode. This IP range is
  reused on each node and is therefore never addressable from outside of
  the node itself. The only requirement is that this subnet must not
  collide with any other IPAM subnet.

- **VPPHostSubnetCIDR** (default ``172.30.0.0/16``): used for addressing
  the interconnect between VPP and the Linux network stack within the
  same node. Since this subnet needs to be unique on each node, the Node
  ID is used to determine the actual subnet used on the node, in
  combination with ``VPPHostNetworkPrefixLen``, ``PodSubnetCIDR`` and
  ``PodNetworkPrefixLen``.

- **VPPHostNetworkPrefixLen** (default ``24``): used to calculate the
  per-node subnet for addressing the interconnect between VPP and the
  Linux network stack within the same node. With
  ``VPPHostSubnetCIDR = 172.30.0.0/16``,
  ``VPPHostNetworkPrefixLen = 24`` and ``NodeID = 5``, the resulting
  subnet for the node is ``172.30.5.0/24``.

- **NodeInterconnectCIDR** (default ``192.168.16.0/24``): range for the
  addresses assigned to the data plane interfaces managed by VPP. Unless
  DHCP is used (``NodeInterconnectDHCP = True``), the Contiv/VPP control
  plane automatically assigns an IP address from this range to the
  DPDK-managed ethernet interface bound to VPP on each node. The actual
  IP address is calculated from the Node ID (e.g., with
  ``NodeInterconnectCIDR = 192.168.16.0/24`` and ``NodeID = 5``, the
  resulting IP address assigned to the ethernet interface on VPP is
  ``192.168.16.5``).

- **NodeInterconnectDHCP** (default ``False``): if set to ``True``, the
  IP addresses of the VPP-managed data plane interfaces are assigned by
  DHCP instead of by the Contiv/VPP control plane from
  ``NodeInterconnectCIDR``. A DHCP server must be running in the network
  to which the data plane interface is connected; when
  ``NodeInterconnectDHCP = True``, ``NodeInterconnectCIDR`` is ignored.

- **VxlanCIDR** (default ``192.168.30.0/24``): in order to provide
  inter-node POD-to-POD connectivity via any underlay network (not
  necessarily an L2 network), Contiv/VPP sets up a VXLAN tunnel overlay
  between every pair of nodes within the cluster. Each node needs a
  unique IP address for its VXLAN BVI interface. This IP address is
  automatically calculated from the Node ID (e.g., with
  ``VxlanCIDR = 192.168.30.0/24`` and ``NodeID = 5``, the resulting IP
  address assigned to the VXLAN BVI interface is ``192.168.30.5``).
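
As a worked example of the calculations described above, the following
minimal Python sketch derives the per-node subnets and addresses from the
default values and ``NodeID = 5``. It only illustrates the arithmetic; it
is not the actual Contiv/VPP implementation.

.. code-block:: python

   from ipaddress import ip_network, ip_address

   node_id = 5

   # Per-node POD subnet: the NodeID-th /24 slice of PodSubnetCIDR.
   pod_subnet = list(ip_network("10.1.0.0/16").subnets(new_prefix=24))[node_id]
   print(pod_subnet)                             # 10.1.5.0/24

   # Per-node VPP <-> Linux host interconnect subnet, derived the same way
   # from VPPHostSubnetCIDR and VPPHostNetworkPrefixLen.
   host_subnet = list(ip_network("172.30.0.0/16").subnets(new_prefix=24))[node_id]
   print(host_subnet)                            # 172.30.5.0/24

   # Address of the DPDK-managed data plane interface (NodeInterconnectCIDR).
   print(ip_address("192.168.16.0") + node_id)   # 192.168.16.5

   # Address of the VXLAN BVI interface (VxlanCIDR).
   print(ip_address("192.168.30.0") + node_id)   # 192.168.30.5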

VPP Programming
---------------

This section describes how the Contiv/VPP control plane programs VPP,
based on the events it receives from k8s. This section is not necessary
for understanding basic Contiv/VPP operation, but it is very useful for
debugging purposes.

Contiv/VPP currently uses a single VRF to forward the traffic between
PODs on a node, PODs on different nodes, the host network stack, and the
DPDK-managed data plane interface. The forwarding between each of them
is purely L3-based, even for the case of communication between 2 PODs
on the same node.

DPDK-Managed Data Interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to allow inter-node communication between PODs on different
nodes and between PODs and the outside world, Contiv/VPP uses data plane
interfaces bound to VPP using DPDK. Each node should have one "main" VPP
interface, which is unbound from the host network stack and bound to
VPP. The Contiv/VPP control plane automatically configures the interface
either via DHCP, or with a statically assigned address (see the
``NodeInterconnectCIDR`` and ``NodeInterconnectDHCP`` YAML settings).

PODs on the Same Node
~~~~~~~~~~~~~~~~~~~~~

PODs are connected to VPP using virtio-based TAP interfaces created by
VPP, with the POD-end of the interface placed into the POD container
network namespace. Each POD is assigned an IP address from the per-node
slice of ``PodSubnetCIDR``. The allocated IP is configured with the
prefix length ``/32``. Additionally, a static route pointing towards VPP
is configured in the POD network namespace. The prefix length ``/32``
means that all IP traffic is forwarded to the default route - VPP. To
get rid of unnecessary broadcasts between the POD and VPP, a static ARP
entry is configured for the gateway IP in the POD namespace, as well as
for the POD IP on VPP. Both ends of the TAP interface have a static
(non-default) MAC address applied.
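
For debugging it can help to know what this per-POD configuration roughly
corresponds to in plain Linux terms. The sketch below prints iproute2
commands approximately equivalent to the state described above; the
interface name, addresses and MAC address are illustrative placeholders,
not the values Contiv/VPP actually programs.

.. code-block:: python

   # Illustrative only: iproute2 commands roughly equivalent to the per-POD
   # state described above (all values are placeholders).
   def pod_netns_commands(pod_ip, gw_ip, gw_mac, ifname="eth0"):
       return [
           # POD IP configured with a /32 prefix length.
           f"ip addr add {pod_ip}/32 dev {ifname}",
           # On-link route to the gateway (needed because of the /32 prefix),
           # followed by a default route pointing towards VPP.
           f"ip route add {gw_ip} dev {ifname} scope link",
           f"ip route add default via {gw_ip} dev {ifname}",
           # Static ARP entry for the gateway, to avoid ARP broadcasts.
           f"ip neigh replace {gw_ip} lladdr {gw_mac} dev {ifname} nud permanent",
       ]

   for cmd in pod_netns_commands("10.1.5.3", "10.2.1.1", "00:00:00:00:00:02"):
       print(cmd)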

PODs with hostNetwork=true
~~~~~~~~~~~~~~~~~~~~~~~~~~

PODs with the ``hostNetwork=true`` attribute are not placed into a
separate network namespace; they instead use the main host Linux network
namespace and are therefore not directly connected to VPP. They rely on
the interconnection between VPP and the host Linux network stack, which
is described in the next paragraph. Note that when these PODs access a
service IP, their network communication is NATed in Linux (by iptables
rules programmed by kube-proxy), as opposed to in VPP, which is the case
for the PODs connected to VPP directly.

Linux Host Network Stack
~~~~~~~~~~~~~~~~~~~~~~~~

In order to interconnect the Linux host network stack with VPP (to allow
access to the cluster resources from the host itself, as well as for the
PODs with ``hostNetwork=true``), VPP creates a TAP interface between VPP
and the main network namespace. The TAP interface is configured with IP
addresses from the ``VPPHostSubnetCIDR`` range, with ``.1`` in the last
octet on the VPP side, and ``.2`` on the host side. The name of the
host-side interface is ``vpp1``. The host has the following static
routes pointing towards VPP configured:

- A route to the whole ``PodSubnetCIDR``, to route traffic targeting
  PODs towards VPP.
- A route to ``ServiceCIDR`` (default ``10.96.0.0/12``), to route
  service-IP-targeted traffic that has not been translated by kube-proxy
  for some reason towards VPP.

The host also has a static ARP entry configured for the IP of the
VPP-end of the TAP interface, to get rid of unnecessary broadcasts
between the main network namespace and VPP.
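
A minimal sketch of the address derivation for this interconnect,
assuming the default values and ``NodeID = 5``; the printed routes simply
mirror the description above and are illustrative only.

.. code-block:: python

   from ipaddress import ip_network

   node_id = 5

   # Per-node VPP <-> host interconnect subnet (slice of VPPHostSubnetCIDR).
   host_subnet = list(ip_network("172.30.0.0/16").subnets(new_prefix=24))[node_id]
   vpp_side = host_subnet.network_address + 1    # 172.30.5.1, VPP end of the TAP
   host_side = host_subnet.network_address + 2   # 172.30.5.2, Linux end ("vpp1")
   print(f"VPP side: {vpp_side}, host side: {host_side}")

   # Host-side static routes described above, both via the VPP end of the TAP.
   for prefix in ("10.1.0.0/16",    # PodSubnetCIDR
                  "10.96.0.0/12"):  # ServiceCIDR (default)
       print(f"{prefix} via {vpp_side} dev vpp1")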

VXLANs to Other Nodes
~~~~~~~~~~~~~~~~~~~~~

In order to provide inter-node POD-to-POD connectivity via any underlay
network (not necessarily an L2 network), Contiv/VPP sets up a VXLAN
tunnel overlay between every pair of nodes within the cluster (full
mesh).

All VXLAN tunnels are terminated in one bridge domain on each VPP. The
bridge domain has learning and flooding disabled; the L2 FIB of the
bridge domain contains a static entry for each VXLAN tunnel. The bridge
domain has a BVI interface, which interconnects the bridge domain with
the main VRF (L3 forwarding). This interface needs a unique IP address,
which is assigned from the ``VxlanCIDR`` as described above.

The main VRF contains several static routes that point to the BVI IP
addresses of the other nodes. For each remote node, there is a route to
its POD subnet and its VPP-host subnet, as well as a route to its
management IP address. For each of these routes, the next hop is the
BVI interface IP of the remote node, reachable via the BVI interface of
the local node.
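
To illustrate, the sketch below lists the routes one would expect in the
main VRF of node 1 in a 3-node cluster with the default IPAM values; the
management IP addresses are hypothetical placeholders and the output only
mirrors the description above.

.. code-block:: python

   from ipaddress import ip_network, ip_address

   remote_nodes = [2, 3]                        # nodes other than local node 1
   mgmt_ips = {2: "10.20.0.2", 3: "10.20.0.3"}  # hypothetical management IPs

   def nth_slice(cidr, prefix_len, node_id):
       return list(ip_network(cidr).subnets(new_prefix=prefix_len))[node_id]

   for n in remote_nodes:
       bvi = ip_address("192.168.30.0") + n      # remote VXLAN BVI IP (VxlanCIDR)
       pod = nth_slice("10.1.0.0/16", 24, n)     # remote POD subnet
       host = nth_slice("172.30.0.0/16", 24, n)  # remote VPP-host subnet
       for dst in (pod, host, f"{mgmt_ips[n]}/32"):
           print(f"{dst} via {bvi}")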

The VXLAN tunnels and the static routes pointing to them are
added/deleted on each VPP whenever a node is added to/deleted from the
k8s cluster.

More Info
~~~~~~~~~

Please refer to the `Packet Flow Dev Guide
<../dev-guide/PACKET_FLOW.html>`__ for a more detailed description of
the paths traversed by request and response packets inside a Contiv/VPP
Kubernetes cluster under different situations.