Debugging and Reporting Bugs in Contiv-VPP
==========================================

Bug Report Structure
--------------------

- `Deployment description <#describe-deployment>`__: Briefly describes
  the deployment where the issue was spotted: the number of k8s nodes,
  and whether DHCP, STN and TAP interfaces are used.

- `Logs <#collecting-the-logs>`__: Attach the corresponding logs, at
  least from the vswitch pods.

- `VPP config <#inspect-vpp-config>`__: Attach the output of the show
  commands.

- `Basic Collection Example <#basic-example>`__

Describe Deployment
~~~~~~~~~~~~~~~~~~~

Since Contiv-VPP can be used with different configurations, it is
helpful to attach the config that was applied. Either attach the
``values.yaml`` passed to the helm chart, or attach the `corresponding
part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
from the deployment yaml file.

.. code:: yaml

   contiv.yaml: |-
     TCPstackDisabled: true
     UseTAPInterfaces: true
     TAPInterfaceVersion: 2
     NatExternalTraffic: true
     MTUSize: 1500
     IPAMConfig:
       PodSubnetCIDR: 10.1.0.0/16
       PodNetworkPrefixLen: 24
       PodIfIPCIDR: 10.2.1.0/24
       VPPHostSubnetCIDR: 172.30.0.0/16
       VPPHostNetworkPrefixLen: 24
       NodeInterconnectCIDR: 192.168.16.0/24
       VxlanCIDR: 192.168.30.0/24
       NodeInterconnectDHCP: False

Information that might be helpful:

- whether node IPs are statically assigned, or DHCP is used,
- whether STN is enabled,
- the version of TAP interfaces used,
- the output of ``kubectl get pods -o wide --all-namespaces``.

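Much of this information can be captured in one pass; a minimal sketch
(the output file name is arbitrary):

::

   # Sketch: gather basic cluster facts into a single file.
   {
     kubectl get nodes -o wide
     kubectl get pods -o wide --all-namespaces
   } > deployment-info.txt
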
Collecting the Logs
~~~~~~~~~~~~~~~~~~~

The most essential thing that needs to be done when debugging and
**reporting an issue** in Contiv-VPP is **collecting the logs from the
contiv-vpp vswitch containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to collect the logs from the individual vswitches in the cluster,
connect to the master node and then find the pod names of the individual
vswitch containers:

::

   $ kubectl get pods --all-namespaces | grep vswitch
   kube-system   contiv-vswitch-lqxfp   2/2       Running   0          1h
   kube-system   contiv-vswitch-q6kwt   2/2       Running   0          1h

Then run the following command, with *pod name* replaced by the actual
pod name:

::

   $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example:

::

   kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt

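To save the logs from all vswitch pods in one pass, a small loop can be
used; this is a sketch, assuming the default kubectl context points at
the cluster being debugged:

::

   # Sketch: save the log of every vswitch pod into its own file.
   for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch); do
     kubectl logs "${pod##*/}" -n kube-system -c contiv-vswitch > "logs-${pod##*/}.txt"
   done
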
b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, then you can still collect the same logs
using the plain docker command. For that, you need to connect to each
individual node in the k8s cluster and find the container ID of the
vswitch container:

::

   $ docker ps | grep contivvpp/vswitch
   b682b5837e52        contivvpp/vswitch        "/usr/bin/supervisor…"   2 hours ago         Up 2 hours          k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file:

::

   $ docker logs b682b5837e52 > logs-master.txt

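The two steps can also be combined into one command; a sketch, assuming
exactly one container created from the ``contivvpp/vswitch`` image is
running on the node:

::

   # Sketch: look up the vswitch container ID and dump its logs in one step.
   docker logs $(docker ps -q --filter ancestor=contivvpp/vswitch) > logs-master.txt
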
Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to debug an issue, it is good to start by grepping the logs for
the ``level=error`` string, for example:

::

   $ cat logs-master.txt | grep level=error

Also, VPP or the contiv-agent may crash as a result of some bugs. To check
whether a process has crashed, grep for the string ``exit``, for example:

::

   $ cat logs-master.txt | grep exit
   2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
   2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request

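If logs from several nodes were collected using the file naming shown
above, both checks can be run over all of them at once; a minimal sketch:

::

   # Sketch: scan every collected log file for errors and process exits.
   grep -nE 'level=error|exited' logs-*.txt
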
Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, you often need to collect and
review the logs from the STN daemon. This needs to be done on each node:

::

   $ docker logs contiv-stn > logs-stn-master.txt

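If passwordless SSH access to the nodes is already set up (see the
prerequisites described below), the STN daemon logs can be pulled from
all nodes in one place; a sketch with placeholder node names and user:

::

   # Sketch: collect the STN daemon log from each node over SSH.
   for node in <node-1> <node-2>; do
     ssh <user-id>@"$node" 'docker logs contiv-stn 2>&1' > "logs-stn-$node.txt"
   done
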
Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (which can be recognized by an
increasing number in the ``RESTARTS`` column of the
``kubectl get pods --all-namespaces`` output), ``kubectl logs`` or
``docker logs`` would only give us the logs of the latest incarnation of
the vswitch. These might not contain the root cause of the very first
crash, so in order to debug that, we need to disable the k8s health check
probes so that they do not restart the vswitch after the very first crash.
This can be done by commenting out the ``readinessProbe`` and
``livenessProbe`` in the contiv-vpp deployment YAML:

.. code:: diff

   diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
   index 3676047..ffa4473 100644
   --- a/k8s/contiv-vpp.yaml
   +++ b/k8s/contiv-vpp.yaml
   @@ -224,18 +224,18 @@ spec:
            ports:
              # readiness + liveness probe
              - containerPort: 9999
   -        readinessProbe:
   -          httpGet:
   -            path: /readiness
   -            port: 9999
   -          periodSeconds: 1
   -          initialDelaySeconds: 15
   -        livenessProbe:
   -          httpGet:
   -            path: /liveness
   -            port: 9999
   -          periodSeconds: 1
   -          initialDelaySeconds: 60
   +        # readinessProbe:
   +        #   httpGet:
   +        #     path: /readiness
   +        #     port: 9999
   +        #   periodSeconds: 1
   +        #   initialDelaySeconds: 15
   +        # livenessProbe:
   +        #   httpGet:
   +        #     path: /liveness
   +        #     port: 9999
   +        #   periodSeconds: 1
   +        #   initialDelaySeconds: 60
            env:
              - name: MICROSERVICE_LABEL
                valueFrom:

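Before modifying the deployment, it may also be worth checking whether
the log of the previous (crashed) container instance is still available
on the node; a minimal sketch:

::

   # Sketch: fetch the log of the previously terminated vswitch container, if kept.
   kubectl logs <pod name> -n kube-system -c contiv-vswitch --previous > logs-previous.txt
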
If VPP is the crashing process, please follow the
`CORE_FILES <CORE_FILES.html>`__ guide and provide the coredump file.

Inspect VPP Config
~~~~~~~~~~~~~~~~~~

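The show commands below are entered on the VPP CLI. One way to reach it
is to exec into the vswitch container; a sketch, assuming the ``vppctl``
binary is available inside the container:

::

   # Sketch: open the VPP CLI inside the vswitch pod on a given node.
   kubectl exec -it <pod name> -n kube-system -c contiv-vswitch -- vppctl

   # Or run a single show command non-interactively:
   kubectl exec <pod name> -n kube-system -c contiv-vswitch -- vppctl sh int addr
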
Inspect the following areas:

- Configured interfaces (issues related to basic node/pod connectivity):

::

   vpp# sh int addr
   GigabitEthernet0/9/0 (up):
     192.168.16.1/24
   local0 (dn):
   loop0 (up):
     l2 bridge bd_id 1 bvi shg 0
     192.168.30.1/24
   tapcli-0 (up):
     172.30.1.1/24

- IP forwarding table:

::

   vpp# sh ip fib
   ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
   0.0.0.0/0
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
       [0] [@0]: dpo-drop ip4
   0.0.0.0/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
       [0] [@0]: dpo-drop ip4

   ...
   ...

   255.255.255.255/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
       [0] [@0]: dpo-drop ip4

- ARP Table:

::

   vpp# sh ip arp
    Time           IP4       Flags      Ethernet              Interface
    728.6616  192.168.16.2      D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
    542.7045  192.168.30.2      S    1a:2b:3c:4d:5e:02 loop0
      1.4241    172.30.1.2      D    86:41:d5:92:fd:24 tapcli-0
     15.2485      10.1.1.2     SN    00:00:00:00:00:02 tapcli-1
    739.2339      10.1.1.3     SN    00:00:00:00:00:02 tapcli-2
    739.4119      10.1.1.4     SN    00:00:00:00:00:02 tapcli-3

- NAT configuration (issues related to services):

::

   DBGvpp# sh nat44 addresses
   NAT44 pool addresses:
   192.168.16.10
     tenant VRF independent
     0 busy udp ports
     0 busy tcp ports
     0 busy icmp ports
   NAT44 twice-nat pool addresses:

::

   vpp# sh nat44 static mappings
   NAT44 static mappings:
    tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
    udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
    tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only

::

   vpp# sh nat44 interfaces
   NAT44 interfaces:
    loop0 in out
    GigabitEthernet0/9/0 out
    tapcli-0 in out

::

   vpp# sh nat44 sessions
   NAT44 sessions:
     192.168.20.2: 0 dynamic translations, 3 static translations
     10.1.1.3: 0 dynamic translations, 0 static translations
     10.1.1.4: 0 dynamic translations, 0 static translations
     10.1.1.2: 0 dynamic translations, 6 static translations
     10.1.2.18: 0 dynamic translations, 2 static translations

- ACL config (issues related to policies):

::

   vpp# sh acl-plugin acl

- “Steal the NIC (STN)” config (issues related to host connectivity
  when STN is active):

::

   vpp# sh stn rules
   - rule_index: 0
     address: 10.1.10.47
     iface: tapcli-0 (2)
     next_node: tapcli-0-output (410)

- Errors:

::

   vpp# sh errors

- Vxlan tunnels:

::

   vpp# sh vxlan tunnels

- Hardware interface information:

::

   vpp# sh hardware-interfaces

Basic Example
~~~~~~~~~~~~~

`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

- The script does not include the STN daemon logs, nor does it handle
  the special case of a crash loop.

Prerequisites:

- The user specified in the script must have passwordless access to all
  nodes in the cluster; on each node in the cluster the user must have
  passwordless access to sudo.

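A typical invocation on a regular (non-Vagrant) cluster might look like
the following; this is only a sketch, assuming the ``-u`` and ``-m``
options shown in the Vagrant example at the end of this page apply to
regular clusters as well:

::

   ./contiv-vpp-bug-report.sh -u <user-id> -m <master-node-name-or-ip>
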
Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key
to the node:

::

   ssh-copy-id <user-id>@<node-name-or-ip-address>

To enable running sudo without a password for a given user, enter:

::

   $ sudo visudo

Append the following entry to run all commands without a password for a
given user:

::

   <userid> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to group ``sudo`` and edit the
``sudo`` entry as follows:

::

   # Allow members of group sudo to execute any command
   %sudo ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` as follows:

::

   sudo adduser <user-id> <group-id>

or as follows:

::

   usermod -a -G <group-id> <user-id>

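Before running the bug report script, it may be worth verifying that
both prerequisites are in place; a minimal sketch with a placeholder
user and node name:

::

   # Sketch: check passwordless SSH and non-interactive sudo on a node.
   ssh <user-id>@<node-name-or-ip-address> 'sudo -n true && echo "passwordless sudo OK"'
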
Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with
Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
To collect debug information from this Contiv-VPP test bed, do the
following steps:

- In the directory where you created your vagrant test bed, do:

::

   vagrant ssh-config > vagrant-ssh.conf

- To collect the debug information, do:

::

   ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf