Failover for networking devices
Current status (as of January 2019): The guest driver model is described in the Linux kernel tree in Documentation/networking/net_failover.rst, or online at https://www.kernel.org/doc/html/latest/networking/net_failover.html
The hypervisor part, especially how to unplug and re-plug the SR-IOV device, still has a lot of open questions.
Two ways of orchestrating datapath switching, for example in case of a migration, are currently being explored:
1. A way that tries to encapsulate the mechanism in QEMU/host and avoids involving the management layer. The virtio device is marked as hidden in QEMU if this is specified on the QEMU command line, and the virtio-net device's command-line parameters are saved for later. The virtio device is 'unhidden' when the VIRTIO_NET_F_STANDBY feature is negotiated. This is for automatic hot-plugging and datapath switching (in the host kernel).
2. Another way is to involve the management layer. Oracle has posted patches for QEMU: https://patchwork.kernel.org/cover/10721503/ This patch set introduces two new events: the first is emitted when the VIRTIO_NET_F_STANDBY feature is negotiated, the second when the virtio_net driver is removed (manually or during a reboot). It also removes vfio devices that are marked with the x-failover-primary flag from the guest during reboot.
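The unhide step of the first approach can be sketched as a small model. This is only an illustration under stated assumptions, not QEMU code: the class HiddenDevice and its method names are hypothetical. The feature bit number 62 is the value of VIRTIO_NET_F_STANDBY in the virtio-net specification.

```python
# Illustrative model only; HiddenDevice and its methods are hypothetical names.
VIRTIO_NET_F_STANDBY = 62  # feature bit number from the virtio-net spec


class HiddenDevice:
    """A device that stays hidden until the guest negotiates STANDBY."""

    def __init__(self, saved_params):
        # Command-line parameters are kept so the device can be set up later.
        self.saved_params = dict(saved_params)
        self.hidden = True

    def on_features_negotiated(self, guest_features):
        """Unhide the device if the STANDBY bit is set; return visibility."""
        if guest_features & (1 << VIRTIO_NET_F_STANDBY):
            self.hidden = False
        return not self.hidden


dev = HiddenDevice({"mac": "52:54:00:12:34:56"})
dev.on_features_negotiated(0)                          # stays hidden
dev.on_features_negotiated(1 << VIRTIO_NET_F_STANDBY)  # becomes visible
```

A guest that never negotiates the feature never sees the device, which is what keeps guests without failover support unaffected.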
How does netvsc do failover and migration
The netvsc driver doesn't use the net_failover module. In fact it was written before net_failover existed. It does have very similar features though, built into the netvsc driver itself. It uses a 2-device model, meaning it creates a logical device, the netvsc device, and manages VF devices as they are hot-plugged. When a new VF device is detected, it is grabbed and made a slave of the corresponding netvsc device. The VF device is usually renamed by udev and gets a persistent name. It is also marked with the IFF_SLAVE flag so that networking tools that are smart enough can recognize the connection between the netvsc and VF devices.
Matching netvsc and VF device: Here a different approach from net_failover is taken. When a new VF device is hotplugged, the corresponding netvsc device is found by looking for one with the same serial number. The VMBus API provides a serial number that is used for matching the devices. The Hyper-V PCI controller saves the ID as the PCI slot name.
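The serial-number matching described above can be illustrated with a short sketch. It is a simplified model, not the driver code: the NetvscDevice type, its vf_serial field, and the function find_netvsc_by_slot are hypothetical names. The sketch assumes the slot name is the decimal serial number.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NetvscDevice:
    """Hypothetical stand-in for a netvsc device and its VF serial number."""
    name: str
    vf_serial: int


def find_netvsc_by_slot(slot_name: str,
                        devices: Iterable[NetvscDevice]) -> Optional[NetvscDevice]:
    """Return the netvsc device whose serial matches the VF's PCI slot name."""
    try:
        serial = int(slot_name, 10)  # assumption: slot name holds a decimal serial
    except ValueError:
        return None  # slot name is not a serial number, so nothing can match
    for dev in devices:
        if dev.vf_serial == serial:
            return dev
    return None


netvscs = [NetvscDevice("eth0", 1), NetvscDevice("eth1", 2)]
match = find_netvsc_by_slot("2", netvscs)  # pairs the new VF with eth1
```

Unlike MAC-based matching, the lookup does not depend on anything the guest can reconfigure.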
During creation/initialization of the netvsc device the slower VMBus channel is set up as the data-path, called the synthetic data-path. When the netvsc driver enslaves the VF device, the data-path is switched over to it. To switch the data-path, a VMBus message is sent to the hypervisor, which takes care of re-programming the NIC's internal switch.
The failover data-path is a VMBus channel. It is switched to when the VF device is unplugged or malfunctions.
For the user the bond between the two devices is transparent. The netvsc device is created and becomes the interface 1. over which all traffic is routed and 2. on which all configuration is done. When a VF device is hotplugged the data-path automatically becomes the fast one, and on failure/initialization/shutdown/migration the driver switches to the slower VMBus channel.
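The data-path switching described above amounts to a two-state machine. The following sketch models it under stated assumptions: NetvscModel, vf_hotplugged, vf_lost and the switch_requests list are hypothetical names. The list only stands in for the messages sent to the hypervisor and does not reflect the real message format.

```python
from enum import Enum


class DataPath(Enum):
    SYNTHETIC = "synthetic"  # slower VMBus channel, always available
    VF = "vf"                # fast path through the hot-plugged VF device


class NetvscModel:
    """Hypothetical model of the netvsc data-path state."""

    def __init__(self):
        # The synthetic data-path is set up when the device is created.
        self.datapath = DataPath.SYNTHETIC
        self.switch_requests = []  # stand-in for messages to the hypervisor

    def _switch(self, target):
        # Only ask for a switch when the data-path actually changes.
        if self.datapath is not target:
            self.switch_requests.append(target)
            self.datapath = target

    def vf_hotplugged(self):
        """A VF was enslaved: move traffic to the fast path."""
        self._switch(DataPath.VF)

    def vf_lost(self):
        """VF unplugged, failed, or guest migrating: fall back to VMBus."""
        self._switch(DataPath.SYNTHETIC)


nic = NetvscModel()
nic.vf_hotplugged()  # data-path is now DataPath.VF
nic.vf_lost()        # back to DataPath.SYNTHETIC
```

The user-visible interface stays the same object throughout; only the data-path underneath it changes.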
(The change from device matching via MAC to serial ID was made in commit 00d7ddba1143623b31bc2c15d18216e2da031b14, "hv_netvsc: pair VF based on serial number", Author: Stephen Hemminger <firstname.lastname@example.org>, Date: Fri Sep 14 12:54:57 2018 -0700.)
- Packet loss due to early MAC filter update
- Some NIC drivers update the MAC filter as soon as a VF is created, but before the VF driver is loaded in the guest and the VF device is ready. As a result, packets go to the PF instead of the standby (virtio) device until the guest is up and the VF driver is loaded.
- Todos: create a tool to test whether a NIC driver behaves like this. Idea: test without a VM involved. On the host, check where packets go when a MAC address is added to a VF on a VLAN and the same MAC address is added to the PF. The PF should not be in promiscuous mode(?). Packets can be generated and sent with the 'mausezahn' tool from netsniff-ng, and running 'netsniff-ng' on the PF and VF devices shows where the packets end up.
Status (2019-01-16): set up tools and environment to test. Status (2019-04-02): the code of the tools is at https://github.com/jensfr/netfailover_driver_detect
- What to do with the results of this test? Can we add a flag to the device to mark it as not usable for failover? Where should the flag go?
- How to support hotplug of a primary/SR-IOV device for failover? The guest is already up and the SR-IOV device is hotplugged in the hypervisor. The device shows up in the guest. How can we make it the primary device in a failover/standby/primary trio?
- How to involve the management layer in the migration process? Oracle sent patches with the rationale of making it work with old NICs as well. The idea is to send events for busmaster enable/disable.
- Can it be made race-free? Probably yes, but we'd need to stop the vCPU.
- Can we use switchdev to program FDBs on the NIC and redirect traffic from the PF to the VF? Can we avoid the need to stop the vCPU with this? Which commands should be used to program FDB entries offloaded to the NIC (bridge?)
- Mechanisms for PCI device removal
- PCI surprise removal (as defined in the PCI(e) spec). It might be buggy in some Linux drivers, but fixes are welcome; surprise removal is expected to work in general. What about EMI (Electromechanical Interlock) support in Linux/QEMU?
- Ordered removal with guest cooperation. The hardware has an 'attention' button: an interrupt is sent to the guest and the guest ejects the device. What needs to be done for that? Probably only possible with the q35 chipset. Pressing the attention button in QEMU: 'pcie_abp 0'
- PCI Overview - Qemu: https://wiki.qemu.org/images/f/f6/PCIvsPCIe.pdf
- Other ways of marking two devices as standby/primary. Assigning identical MAC addresses could be problematic or confusing: for example, if more than two devices with identical MAC addresses show up, which one should be chosen? Ideas for other ways to recognize devices that belong together:
- Assign an ID to both devices. Where to store the ID? PCI config space? -> no way to know what part of config space is unused.
- Put devices behind a PCI(e) bridge with a special device ID. The two devices on this bridge are supposed to form a failover/standby/primary device set.
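The ambiguity of MAC-based pairing mentioned above can be made concrete with a small sketch. It is purely illustrative: the Nic type, its role field, and pair_by_mac are hypothetical names and not part of any driver. A group is treated as a valid pair only if it contains exactly one standby and one primary device.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    """Hypothetical description of a guest-visible network device."""
    name: str
    mac: str
    role: str  # "standby" (virtio) or "primary" (VF)


def pair_by_mac(nics):
    """Group devices by MAC; return (unambiguous pairs, ambiguous groups)."""
    groups = defaultdict(list)
    for nic in nics:
        groups[nic.mac].append(nic)

    pairs, ambiguous = {}, {}
    for mac, members in groups.items():
        standby = [n for n in members if n.role == "standby"]
        primary = [n for n in members if n.role == "primary"]
        if len(standby) == 1 and len(primary) == 1:
            pairs[mac] = (standby[0], primary[0])
        else:
            # More (or fewer) than two candidates: no safe choice exists.
            ambiguous[mac] = members
    return pairs, ambiguous


virtio = Nic("virtio0", "52:54:00:00:00:01", "standby")
vf0 = Nic("vf0", "52:54:00:00:00:01", "primary")
vf1 = Nic("vf1", "52:54:00:00:00:01", "primary")
pairs, ambiguous = pair_by_mac([virtio, vf0, vf1])  # third device makes it ambiguous
```

Grouping by an explicit ID or by a shared bridge would use the same logic with a different key, but the key would then be assigned by the hypervisor rather than derived from a configurable address.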