Failover


Failover for networking devices

Current status (as of January 2019): The guest driver model is described in the Linux kernel tree in Documentation/networking/net_failover.rst, also available at https://www.kernel.org/doc/html/latest/networking/net_failover.html

The hypervisor part, especially how to unplug and re-plug the SR-IOV device, still has a lot of open questions.

Two ways of orchestrating datapath switching, for example during a migration, are currently being explored.

1. The first way tries to encapsulate the mechanism in QEMU/host and avoids involving the management layer. If specified on the QEMU command line, the virtio device is marked as hidden in QEMU, and the virtio-net device command-line parameters are saved for later. The virtio device is 'unhidden' when the VIRTIO_NET_F_STANDBY feature is negotiated. This is for automatic hot-plugging and datapath switching (in the host kernel).

2. The second way is to involve the management layer. Oracle has posted patches for QEMU: https://patchwork.kernel.org/cover/10721503/ This patch set introduces two new events: the first is emitted when the VIRTIO_NET_F_STANDBY feature is negotiated, the second when the virtio_net driver is removed (manually or during a reboot). It also removes VFIO devices that are marked with the x-failover-primary flag from the guest during reboot.
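
The two approaches can be illustrated with a minimal Python sketch. This is not QEMU code: the class `HiddenDevice`, the event names `standby-negotiated` and `virtio-net-removed`, and the actions attached to them are hypothetical and only model the behaviour described above. The bit number 62 is the value of VIRTIO_NET_F_STANDBY in the Linux virtio_net UAPI header.

```python
from dataclasses import dataclass

# Bit number of VIRTIO_NET_F_STANDBY in include/uapi/linux/virtio_net.h.
VIRTIO_NET_F_STANDBY = 62


def standby_negotiated(features: int) -> bool:
    """Return True if the negotiated feature bitmap contains the STANDBY bit."""
    return bool(features & (1 << VIRTIO_NET_F_STANDBY))


@dataclass
class HiddenDevice:
    """Hypothetical model of a device that stays hidden until STANDBY is negotiated."""
    saved_params: dict   # command-line parameters kept for later use
    hidden: bool = True

    def on_features_negotiated(self, features: int) -> None:
        # Approach 1: the host unhides the device without outside help.
        if standby_negotiated(features):
            self.hidden = False


def management_action(event: str) -> str:
    """Approach 2: map a (hypothetical) event name to a management-layer action.

    The actions are an assumption about how a management layer might react.
    """
    actions = {
        "standby-negotiated": "hotplug-primary",
        "virtio-net-removed": "unplug-primary",
    }
    return actions.get(event, "ignore")
```

In the first approach the decision stays inside the host (`on_features_negotiated`), while in the second the host only reports what happened and the management layer chooses the reaction (`management_action`).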


Problems/Questions

  1. Packet loss due to early MAC filter update
    1. Some NIC drivers update the MAC filter as soon as a VF is created, before the VF driver is loaded in the guest and the VF device is ready. Packets therefore go to the PF instead of the standby (virtio) device until the guest is up and the VF driver is loaded.
    2. Todo: create a tool to test whether a NIC driver behaves like this. Idea: test without a VM involved. On the host, check where packets go when a MAC address is added to a VF on a VLAN and the same MAC address is added to the PF. The PF should not be in promiscuous mode(?). Packets can be generated and sent with the 'mausezahn' tool from netsniff-ng, and running 'netsniff-ng' on the PF and VF devices shows where the packets end up. Status: set up tools and environment to test (2019-01-16)
    3. What to do with the results of this test? Can we add a flag to the device to mark it as not usable for failover? Where should the flag be stored?
  2. How to support hotplug of a primary/SR-IOV device for failover? The guest is already up, and the SR-IOV device is hotplugged in the hypervisor and shows up in the guest. How can we make it the primary device in a failover/standby/primary trio?
  3. How to involve the management layer in the migration process? Oracle sent patches, with the rationale of making it work with old NICs as well. The idea is to send events for busmaster enable/disable.
    1. Can it be made race free? Probably yes, but we'd need to stop the vCPU.
    2. Can we use switchdev to program FDBs on the NIC and redirect traffic from the PF to the VF? Can we avoid the need to stop the vCPU this way? Which commands should be used to program FDB entries offloaded to the NIC (bridge?)
  4. Mechanisms for PCI device removal
    1. PCI surprise removal (as defined in the PCI(e) spec). It might be buggy in some Linux drivers, but fixes are welcome; surprise removal is expected to work in general. What about EMI (electromechanical interlock) support in Linux/QEMU?
    2. Ordered removal with guest cooperation. The hardware has an 'attention' button: an interrupt is sent to the guest, and the guest ejects the device. What needs to be done for that? Probably only possible with the q35 chipset. To press the attention button in QEMU: 'pcie_abp 0'
    3. PCI Overview - QEMU: https://wiki.qemu.org/images/f/f6/PCIvsPCIe.pdf
  5. Other ways of marking two devices as standby/primary. Assigning identical MAC addresses could be problematic or confusing: for example, if more than two devices with identical MAC addresses show up, which one should be chosen? Ideas for other ways to recognize devices that belong together:
    1. Assign an ID to both devices. Where to store the ID? PCI config space? There is no way to know which part of config space is unused.
    2. Put devices behind a PCI(e) bridge with a special device ID. The two devices on this bridge are supposed to form a failover/standby/primary device set.
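
For the test tool proposed under problem 1, one possible starting point is to compare the receive counters of the PF and the VF before and after sending test packets. The Python sketch below is only an illustration under stated assumptions: it reads the standard sysfs counter `/sys/class/net/<dev>/statistics/rx_packets`, the interface names `pf0` and `vf0` are hypothetical, and the rule for interpreting the counters is an assumption, not an established test method. It does not generate traffic itself; a tool such as mausezahn would do that.

```python
from pathlib import Path


def read_rx_packets(dev: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the receive packet counter of a network interface from sysfs."""
    return int(Path(sysfs_root, dev, "statistics", "rx_packets").read_text())


def classify(before: dict, after: dict, sent: int) -> str:
    """Name the interface whose rx counter grew by at least `sent` packets.

    `before` and `after` map interface names to rx_packets values.
    Returns "ambiguous" if no interface or more than one interface matches,
    since unrelated background traffic can also move the counters.
    """
    hits = [dev for dev in before if after[dev] - before[dev] >= sent]
    return hits[0] if len(hits) == 1 else "ambiguous"


# Usage on a test host (hypothetical interface names):
#   devs = ["pf0", "vf0"]
#   before = {d: read_rx_packets(d) for d in devs}
#   ... send 1000 packets to the shared MAC address ...
#   after = {d: read_rx_packets(d) for d in devs}
#   print(classify(before, after, sent=1000))
```

A result of `pf0` would suggest that the driver keeps delivering to the PF, while `vf0` would suggest that the MAC filter was already switched to the VF. A packet capture with netsniff-ng on both devices remains the more reliable check, because counters cannot distinguish test packets from other traffic.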