NetworkingTodo

This page should cover all networking-related activity in KVM; currently most of the information relates to virtio-net.

TODO: add bugzilla entry links.

projects in progress. Contributions are still very welcome!

large-order allocations

  see 28d6427109d13b0f447cba5761f88d3548e83605
  Developer: MST

vhost-net scalability tuning: threading for many VMs

     Plan: switch to workqueue shared by many VMs
     http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html

http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument

     Developer: Bandan Das
     Testing: netperf guest to guest

support more queues

    We limit TUN to 8 queues, but we really want
    1 queue per guest CPU. The limit comes from net
    core; we need to teach it to allocate an array of
    pointers rather than an array of queues.
    Jason has a draft patch to use flex array.
    Another task is to move the flow caches out of tun_struct.
    Developer: Jason Wang

enable multiqueue by default

      Multiqueue causes regressions in some workloads because
      GSO tends to batch less when mq is enabled, so it is off
      by default. Detect such workloads and enable/disable mq
      automatically so that we can turn it on by default.
      https://patchwork.kernel.org/patch/2235191/
      Developer: Jason Wang

rework flow caches

      The current hlist implementation of flow caches has several limitations:
      1) in the worst case, lookup degrades to a slow linear search
      2) it does not scale
      https://patchwork.kernel.org/patch/2025121/
      Developer: Jason Wang

eliminate the extra copy in virtio-net driver

      We need to do an extra copy of 128 bytes for every packet.
      This could be eliminated for small packets by:
      1) using build_skb() and head frag
      2) a bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )
      Or use a dedicated queue for receiving small packets? (reordering)
      Developer: Jason Wang

orphan packets less aggressively (was: make pktgen work for virtio-net (or partially orphan))

      virtio-net orphans all skbs during tx, this used to be optimal.
      Recent changes in guest networking stack and hardware advances
      such as APICv changed optimal behaviour for drivers.
      We need to revisit optimizations such as orphaning all packets early
      to have optimal behaviour.

      This should also fix pktgen, which is currently broken with virtio-net:
      orphaning all skbs makes pktgen wait forever on the refcnt.
      Jason's idea: bring back tx interrupt (partially)
      Jason's idea: introduce a flag to tell pktgen not to wait
      Discussion here: https://patchwork.kernel.org/patch/1800711/
      MST's idea: add a .ndo_tx_polling not only for pktgen
      Developers: Jason Wang, MST

Head of line blocking issue with zerocopy

      zerocopy has several defects that cause a head of line blocking problem:
      - it limits the number of pending DMAs
      - DMAs complete in order
      This means that if one or more of the DMAs are delayed, all the others are delayed too. This can be reproduced as follows:
      - boot two VMs, VM1 (tap1) and VM2 (tap2), on host1 (which has eth0)
      - set up tbf to limit the tap2 bandwidth to 10Mbit/s
      - start two netperf instances, one from VM1 to VM2, another from VM1 to an external host whose traffic goes through eth0 on the host
      Then you can see that not only VM1 to VM2 is throttled, but VM1 to the external host is throttled as well.
      For this issue, one solution is to orphan the frags when enqueuing to a non-work-conserving qdisc.
      But we have similar issues in other cases:
      - The card has its own priority queues
      - The host has two interfaces, one 1G and the other 10G, so throttling the 1G one may cause traffic over the 10G one to be throttled.
      The final solution is to remove receive buffering at tun, and convert it to use NAPI
      Developer: Developers are welcome! (Jason Wang)
      Reference: https://lkml.org/lkml/2014/1/17/105
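
The effect of in-order completion can be illustrated with a small model. This is only a sketch, not vhost code: the function name, the time units and the sample finish times are all made up for illustration. It shows how one slow DMA (the packet queued to the throttled tap2) holds back the completion of every later buffer, even though those DMAs finished quickly.

```python
def completion_times(finish_times, in_order):
    """Return when each buffer is reported complete to the guest.

    finish_times[i] is when DMA i really finishes (arbitrary time units,
    hypothetical values). With in_order=True, buffer i cannot be reported
    before all earlier buffers, which models in-order completion.
    """
    if not in_order:
        return list(finish_times)
    reported = []
    latest = 0
    for t in finish_times:
        # A buffer is only reported once every earlier one is done.
        latest = max(latest, t)
        reported.append(latest)
    return reported


# Packet 0 goes to the throttled tap2; packets 1-3 leave through eth0.
finish = [100, 1, 2, 3]
in_order = completion_times(finish, in_order=True)
out_of_order = completion_times(finish, in_order=False)
print(in_order)
print(out_of_order)
```

In this model the three fast packets are reported at time 100 when completion is in order, and at times 1, 2 and 3 otherwise. The limit on pending DMAs makes it worse in practice, since the sender stalls once that many buffers are waiting behind the slow one.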

Write an ethtool selftest for virtio-net

       Implement the selftest ethtool method for virtio-net for regression testing, e.g. for the CVEs found in tun/macvtap, qemu and vhost.
       Developer: Jason Wang

Dev watchdog for virtio-net:

       Implement a watchdog for virtio-net. This will be useful for hunting host bugs early.
       Developer: Jason Wang

guest programmable mac/vlan filtering with macvtap

       Developer: Amos Kong
       qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203 (applied by upstream)
       libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199
       http://git.qemu.org/?p=qemu.git;a=commit;h=b1be42803b31a913bab65bab563a8760ad2e7f7f
       Status: GuestProgrammableMacVlanFiltering

bridge without promisc mode in NIC

 given hardware support, teach bridge
 to program mac/vlan filtering in NIC
 Helps performance and security on noisy LANs
 http://comments.gmane.org/gmane.linux.network/266546
 Developer: Vlad Yasevich

reduce networking latency:

 allow handling short packets from softirq or VCPU context
 Plan:
   We are going through the scheduler 3 times
   (could be up to 5 if softirqd is involved)
   Consider RX: host irq -> io thread -> VCPU thread ->
   guest irq -> guest thread.
   This adds a lot of latency.
   We can cut it by some 1.5x if we do a bit of work
   either in the VCPU or softirq context.
 Testing: netperf TCP RR - should be improved drastically
          netperf TCP STREAM guest to host - no regression
 Developer: MST

Flexible buffers: put virtio header inline with packet data

 https://patchwork.kernel.org/patch/1540471/
 Developer: MST

device failover to allow migration with assigned devices

 https://fedoraproject.org/wiki/Features/Virt_Device_Failover
 Developer: Gal Hammer, Cole Robinson, Laine Stump, MST

Reuse vringh code for better maintainability

 Developer: Rusty Russell

Improve stats, make them more helpful for perf analysis

 Developer: Sriram Narasimhan

Enable GRO for packets coming to bridge from a tap interface

 Developer: Dmitry Fleytman

Better support for windows LRO

 Extend virtio-header with statistics for GRO packets:
 number of packets coalesced and number of duplicate ACKs coalesced
 Developer: Dmitry Fleytman

IPoIB infiniband bridging

 Plan: implement macvtap for ipoib and virtio-ipoib
 Developer: MST

netdev polling for virtio.

 There are two kinds of netdev polling:
 - netpoll - used for debugging
 - rx busy polling for virtio-net [DONE]
   see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9. 1 byte netperf TCP_RR shows 127% improvement.
   Future work is to cooperate with the host, and only do the busy polling when there is no other process on the host CPU.
 Developer: Jason Wang

interrupt coalescing

 Reduce the number of interrupts.
 Rx interrupt coalescing should be good for rx stream throughput.
 Tx interrupt coalescing will help the optimization of enabling tx interrupt conditionally.
 Developer: Jason Wang

enable tx interrupt conditionally

 Small packet TCP stream performance is not good. This is because virtio-net orphans the packet during ndo_start_xmit(), which disables TCP small packet optimizations such as TCP Small Queues and autocorking. The idea is to enable the tx interrupt for small TCP packets.
 Jason's idea: switch between poll and tx interrupt mode based on recent statistics.
 MST's idea: use a per descriptor flag for virtio to force interrupt for a specific packet.
 Developer: Jason Wang, MST

use kvm eventfd support for injecting level-triggered interrupts

 aim: enable vhost by default for level interrupts.
 The benefit is security: we want to avoid using userspace
 virtio net so that vhost-net is always used.

 Alex emulated level-triggered interrupts (post & re-enable) in KVM in
 order to skip userspace. VFIO already enjoys the performance benefit;
 let's do the same for virtio-pci. Current virtio-pci devices still
 handle level interrupts in userspace.

kernel:
 7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts
qemu:
 68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers
          (virtio-pci didn't use the wrappers)
 e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration

 Developer: Amos Kong

sharing config interrupts

 Support more devices by sharing a single msi vector
 between multiple virtio devices.
 (Applies to virtio-blk too).
 Developer: Amos Kong

network traffic throttling

 the block layer implemented a "continuous leaky bucket" for throttling;
 we can apply the continuous leaky bucket to networking:
 IOPS/BPS * RX/TX/TOTAL
 Developer: Amos Kong
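
As an illustration of the idea, here is a minimal leaky bucket sketch. It is not the block layer's throttling code: the class name, the parameters and the numbers are hypothetical. The bucket drains continuously at a fixed rate; each packet adds to the level, and anything above the capacity turns into a delay.

```python
class LeakyBucket:
    """Hypothetical continuous leaky bucket (not QEMU's implementation)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # units leaked per second (bytes or packets)
        self.capacity = capacity  # burst the bucket absorbs without delay
        self.level = 0.0          # current fill level
        self.last = 0.0           # time of the last update, in seconds

    def _leak(self, now):
        # Drain the bucket for the time elapsed since the last update.
        elapsed = now - self.last
        self.level = max(0.0, self.level - elapsed * self.rate)
        self.last = now

    def delay_for(self, now, amount):
        """Account for `amount` units at time `now`; return seconds to wait."""
        self._leak(now)
        self.level += amount
        overflow = self.level - self.capacity
        return max(0.0, overflow / self.rate)


bucket = LeakyBucket(rate=1000, capacity=1500)
first = bucket.delay_for(0.0, 1500)   # fits in the burst allowance
second = bucket.delay_for(0.0, 1000)  # 1000 units over capacity
third = bucket.delay_for(2.0, 0)      # bucket has drained in the meantime
print(first, second, third)
```

Under this sketch, one bucket would be kept per combination that is limited, for example one for BPS on RX and another for IOPS on TX, and a packet would wait for the largest delay returned by the buckets that apply to it.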

Allocate mac_table dynamically

 In the future, maybe we can allocate the mac_table dynamically instead
 of embedding it in VirtIONet. Then we can just do a pointer swap and
 g_free(), and save a memcpy() here.
 Developer: Amos Kong

reduce conflict with VCPU thread

   if VCPU and networking run on the same CPU,
   they conflict, resulting in bad performance.
   Fix that, push vhost thread out to another CPU
   more aggressively.
   Developer: Amos Kong

rx mac filtering in tun

       the need for this is still not understood as we have filtering in bridge
       we have a small table of addresses, need to make it larger
       if we only need filtering for unicast (multicast is handled by IMP filtering)
       Developer: Amos Kong

vlan filtering in tun

       the need for this is still not understood as we have filtering in bridge
       Developer: Amos Kong

projects that are not started yet - no owner

add documentation for macvlan and macvtap

  recent docs here:
  http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/
  need to integrate in iproute and kernel docs.

receive side zero copy

 The ideal is a NIC with accelerated RFS support,
 So we can feed the virtio rx buffers into the correct NIC queue.
 Depends on non promisc NIC support in bridge.
 Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"
 for a very old prototype

RDMA bridging

DMA engine (IOAT) use in tun

 Old patch here: [PATCH RFC] tun: dma engine support
 It does not speed things up. Need to see why and
 what can be done.

virtio API extension: improve small packet/large buffer performance:

 support "reposting" buffers for mergeable buffers,
 support pool for indirect buffers

more GSO type support:

      Kernel now supports more types of GSO: FCOE, GRE, UDP_TUNNEL

ring aliasing:

 using vhost-net as a networking backend with virtio-net in QEMU
 being what's guest facing.
 This gives you the best of both worlds: QEMU acts as a first
 line of defense against a malicious guest while still getting the
 performance advantages of vhost-net (zero-copy).
 In fact a bit of complexity in vhost was put there in the vague hope to
 support something like this: virtio rings are not translated through
 regular memory tables, instead, vhost gets a pointer to ring address.
 This allows qemu acting as a man in the middle,
 verifying the descriptors but not touching the packet data.

non-virtio device support with vhost

 Use vhost interface for guests that don't use virtio-net

Extend sndbuf scope to int64

 Current sndbuf limit is INT_MAX in tap_set_sndbuf(),
 Large values (like 8388607T) are converted correctly by qapi from the qemu command line.
 If we want to support such large values, we should extend the sndbuf limit from 'int' to 'int64'.

 Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html
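
The arithmetic behind the example value can be checked quickly. This sketch assumes that the "T" suffix means 2**40 bytes and that 'int' is 32 bits wide; the constant names are only for illustration.

```python
INT_MAX = 2**31 - 1      # limit of a 32-bit 'int' (assumed width)
INT64_MAX = 2**63 - 1    # limit of 'int64'
TIB = 2**40              # the "T" suffix, assumed to mean 2**40 bytes

sndbuf = 8388607 * TIB   # 8388607 is 2**23 - 1

fits_int = sndbuf <= INT_MAX
fits_int64 = sndbuf <= INT64_MAX
next_fits_int64 = 8388608 * TIB <= INT64_MAX

print(fits_int, fits_int64, next_fits_int64)
```

Under these assumptions 8388607T equals 2**63 - 2**40, so it is the largest whole number of T that still fits in an int64, while it is far beyond INT_MAX.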

vague ideas: path to implementation not clear

change tcp_tso_should_defer for kvm: batch more aggressively

 in particular, see below

tcp: increase gso buffering for cubic,reno

   At the moment we push out an skb whenever the limit becomes
   large enough to send a full-sized TSO skb even if the skb,
   in fact, is not full-sized.
   The reason for this seems to be that some congestion avoidance
   protocols rely on the number of packets in flight to calculate
   CWND, so if we underuse the available CWND it shrinks
   which degrades performance:
   http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html

   However, there seems to be no reason to do this for
   protocols such as reno and cubic which don't rely on packets in flight,
   and so will simply increase CWND a bit more to compensate for the
   underuse.

ring redesign:

     find a way to test raw ring performance 
     fix cacheline bounces 
     reduce interrupts

irq/numa affinity:

    networking goes much faster with irq pinning:
    both with and without numa.
    what can be done to make the non-pinned setup go faster?

vlan filtering in bridge

       kernel part is done (Vlad Yasevich)
       teach qemu to notify libvirt to enable the filter (still to do) (the existing NIC_RX_FILTER_CHANGED event contains vlan-tables)

tx coalescing

       Delay several packets before kicking the device.
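
A toy model of the idea, to show the saving in kicks. This is a sketch only: the class, the batch size and the counters are hypothetical and do not correspond to any virtio-net code.

```python
class CoalescingTxQueue:
    """Hypothetical tx queue that kicks the device once per batch."""

    def __init__(self, batch):
        self.batch = batch   # packets to accumulate before a kick
        self.pending = 0     # packets queued since the last kick
        self.kicks = 0       # number of kicks sent to the device

    def xmit(self):
        # Queue one packet; kick only when the batch is full.
        self.pending += 1
        if self.pending >= self.batch:
            self.kick()

    def kick(self):
        # Notify the device about everything queued so far.
        if self.pending:
            self.kicks += 1
            self.pending = 0


queue = CoalescingTxQueue(batch=4)
for _ in range(10):
    queue.xmit()
kicks_after_burst = queue.kicks
left_over = queue.pending
queue.kick()  # flush, e.g. from a timer, so the tail is not stuck
print(kicks_after_burst, left_over, queue.kicks)
```

Ten packets cost two kicks plus one flush instead of ten kicks. The two packets left over after the burst show why some flush mechanism (the explicit kick() call here stands in for a timer) would be needed to bound the added latency.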

bridging on top of macvlan

 add code to forward LRO status from macvlan (not macvtap)
 back to the lowerdev, so that setting up forwarding
 from macvlan disables LRO on the lowerdev

virtio: preserve packets exactly with LRO

 LRO is not normally compatible with forwarding.
 With virtio we are getting packets from a linux host,
 so we could conceivably preserve packets exactly
 even with LRO. I am guessing other hardware could be
 doing this as well.

vxlan

 What could we do here?

bridging without promisc mode with OVS

high level issues: not clear what the project is, yet

security: iptables

At the moment most people disable iptables to get good performance on 10Gb/s networking. Any way to improve the experience?

performance

Going through the scheduler and the full networking stack twice (host+guest) adds a lot of overhead. Any way to allow bypassing some layers?

manageability

It is still hard to figure out VM networking: VM networking is configured through libvirt, host networking through NM. Any way to integrate?

testing projects

Keeping networking stable is highest priority.

Write some unit tests for vhost-net/vhost-scsi
Run weekly test on upstream HEAD covering test matrix with autotest
Measure the effect of each of the above-mentioned optimizations

 - Use autotest network performance regression testing (that runs netperf)
 - Also test any wild idea that works. Some may be useful.

Migrate some of the performance regression autotest functionality into Netperf

 - Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...
 - Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)
 - Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.
 - Make the scripts more visible

non-virtio-net devices

e1000: stabilize

test matrix

DOA test matrix (all combinations should work):

       vhost: test both on and off, obviously
       test: hotplug/unplug, vlan/mac filtering, netperf,
            file copy both ways: scp, NFS, NTFS
       guests: linux: release and debug kernels, windows
       conditions: plain run, run while under migration,
               vhost on/off migration
       networking setup: simple, qos with cgroups
       host configuration: host-guest, external-guest
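
Since all combinations should work, the matrix can be enumerated mechanically. The sketch below is one possible reading of the list above: the dictionary keys and the way each line is split into values are assumptions, and no test framework is implied.

```python
from itertools import product

# One reading of the matrix above; the value lists are an interpretation.
MATRIX = {
    "vhost": ["on", "off"],
    "test": ["hotplug/unplug", "vlan/mac filtering", "netperf",
             "scp", "NFS", "NTFS"],
    "guest": ["linux release", "linux debug", "windows"],
    "condition": ["plain run", "under migration", "vhost on/off migration"],
    "networking": ["simple", "qos with cgroups"],
    "host": ["host-guest", "external-guest"],
}


def combinations(matrix):
    """Yield one dict per combination of the matrix dimensions."""
    keys = list(matrix)
    for values in product(*(matrix[k] for k in keys)):
        yield dict(zip(keys, values))


cases = list(combinations(MATRIX))
print(len(cases))
print(cases[0])
```

With this reading there are 2 * 6 * 3 * 3 * 2 * 2 = 432 combinations, which gives a sense of why running the full matrix needs automation.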