Latest revision as of 12:56, 15 March 2017

This page should cover all networking-related activity in KVM. Currently most of the information relates to virtio-net.

TODO: add bugzilla entry links.

projects in progress. contributions are still very welcome!

  • virtio 1.0 support for linux guests
       required for maintainability
       mid.gmane.org/1414081380-14623-1-git-send-email-mst@redhat.com
       Developer: MST, Cornelia Huck
  • virtio 1.0 support in qemu
       required for maintainability
       mid.gmane.org/20141024103839.7162b93f.cornelia.huck@de.ibm.com
       Developer: Cornelia Huck, MST
  • improve net polling for cpu overcommit
       exit busy loop when another process is runnable
       mid.gmane.org/20140822073653.GA7372@gmail.com
       mid.gmane.org/1408608310-13579-2-git-send-email-jasowang@redhat.com
       Another idea is to make busy_read/busy_poll dynamic, like the dynamic PLE window.
       Developer: Jason Wang, MST
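
The "exit busy loop when another process is runnable" idea can be sketched as below. This is only a toy model in Python, not the kernel change from the threads above: `has_data`, `other_task_runnable` and `budget` are hypothetical stand-ins for whatever the real poll loop would check.

```python
def busy_poll(has_data, other_task_runnable, budget):
    """Spin until data arrives, another task wants the CPU, or the budget ends.

    has_data and other_task_runnable are caller-supplied predicates
    (hypothetical names); budget is the maximum number of spins.
    Returns (reason, spins_used).
    """
    for spins in range(budget):
        if has_data():
            return "data", spins
        # On an overcommitted host, spinning here steals time from
        # another runnable task, so give up the CPU early.
        if other_task_runnable():
            return "yield", spins
    return "timeout", budget
```

Making busy_read/busy_poll dynamic would then amount to adjusting `budget` between calls, for example growing it after a "data" result and shrinking it after a "yield" or "timeout".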
  • vhost-net/tun/macvtap cross endian support
       mid.gmane.org/1414572130-17014-2-git-send-email-clg@fr.ibm.com
       Developer: Greg Kurz, MST
  • BQL/aggregation for virtio net
       dependencies: orphan packets less aggressively, enable tx interrupt
       Developers: MST, Jason
  • orphan packets less aggressively (was: make pktgen work for virtio-net (or partially orphan))
      virtio-net orphans all skbs during tx; this used to be optimal.
      Recent changes in the guest networking stack and hardware advances
      such as APICv changed the optimal behaviour for drivers.
      We need to revisit optimizations such as orphaning all packets early
      to have optimal behaviour.
      This should also fix pktgen, which is currently broken with virtio-net:
      orphaning all skbs makes pktgen wait forever on the refcnt.
      Jason's idea: bring back tx interrupt (partially)
      Jason's idea: introduce a flag to tell pktgen not to wait
      Discussion here: https://patchwork.kernel.org/patch/1800711/
      MST's idea: add a .ndo_tx_polling not only for pktgen
      Developers: Jason Wang, MST
  • enable tx interrupt (conditionally?)
      Small packet TCP stream performance is not good. This is because
      virtio-net orphans the packet during ndo_start_xmit(), which disables
      TCP small packet optimizations such as TCP Small Queues and autocorking.
      The idea is to enable the tx interrupt for small TCP packets.
      Jason's idea: switch between poll and tx interrupt mode based on recent statistics.
      MST's idea: use a per descriptor flag for virtio to force interrupt for a specific packet.
      Developer: Jason Wang, MST
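
A minimal sketch of the "switch between poll and tx interrupt mode based on recent statistics" idea, as a toy model in Python. The class name, the sliding-window approach and all three thresholds are assumptions made for illustration; the item above does not say which statistics would be used.

```python
from collections import deque


class TxModeSelector:
    """Pick a tx completion mode from a sliding window of packet sizes."""

    def __init__(self, window=64, small_bytes=256, small_ratio=0.5):
        # All three thresholds are illustrative assumptions, not tuned values.
        self.sizes = deque(maxlen=window)
        self.small_bytes = small_bytes
        self.small_ratio = small_ratio

    def record(self, packet_len):
        """Remember the length of a packet that was just transmitted."""
        self.sizes.append(packet_len)

    def mode(self):
        """Return "tx_interrupt" when recent traffic is mostly small packets."""
        # With no history, keep the existing behaviour (orphan early, poll).
        if not self.sizes:
            return "poll"
        small = sum(1 for n in self.sizes if n < self.small_bytes)
        if small / len(self.sizes) >= self.small_ratio:
            return "tx_interrupt"
        return "poll"
```

A driver-side version would call something like `record()` from the transmit path and consult `mode()` before deciding whether to orphan the skb or leave tx interrupts enabled.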

  • vhost-net polling
      mid.gmane.org/20141029123831.A80F338002D@moren.haifa.ibm.com
      Developer: Razya Ladelsky
  • support more queues in tun and macvtap
      We limit TUN to 8 queues, but we really want 1 queue per guest CPU. The
      limit comes from net core; we need to teach it to allocate an array of
      pointers and not an array of queues. Jason has a draft patch to use flex
      array. Another thing is to move the flow caches out of tun_struct.
      http://mid.gmane.org/1408369040-1216-1-git-send-email-pagupta@redhat.com
      tun part is done.
      Developers: Pankaj Gupta, Jason Wang
  • enable multiqueue by default
      Multiqueue causes regressions in some workloads, so
      it is off by default. This is because GSO tends to batch
      less when mq is enabled.
      See Documentation/networking/scaling.txt
      Detect and enable/disable
      automatically so we can make it on by default?
      depends on: BQL
      https://patchwork.kernel.org/patch/2235191/
      Developer: Jason Wang
  • rework on flow caches
      The current hlist implementation of flow caches has several limitations:
      1) in the worst case, the linear search is slow
      2) it does not scale
      https://patchwork.kernel.org/patch/2025121/
      Developer: Jason Wang
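
The worst-case linear search can be illustrated with a toy chained hash table in Python. This is a sketch of the general technique only: the class, the `rxhash` and `queue_index` names and the bucket count are illustrative assumptions, not a copy of the tun code.

```python
class ChainedFlowCache:
    """Toy flow cache: fixed bucket array, one list (chain) per bucket."""

    def __init__(self, num_buckets=1024):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, rxhash):
        return self.buckets[rxhash % len(self.buckets)]

    def insert(self, rxhash, queue_index):
        self._bucket(rxhash).append((rxhash, queue_index))

    def lookup(self, rxhash):
        """Return (queue_index, entries_examined); queue_index is None on a miss."""
        steps = 0
        for key, queue_index in self._bucket(rxhash):
            steps += 1
            if key == rxhash:
                return queue_index, steps
        return None, steps
```

If every flow hashes into the same bucket, a lookup has to walk the whole chain, so the cost grows with the number of flows rather than staying constant:

```python
cache = ChainedFlowCache(num_buckets=8)
for i in range(100):
    cache.insert(i * 8, i % 4)  # every hash lands in bucket 0
queue, steps = cache.lookup(99 * 8)  # walks all 100 entries
```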
      
  • bridge without promisc/allmulti mode in NIC
      given hardware support, teach bridge to program mac/vlan filtering in NIC
      Helps performance and security on noisy LANs
      http://comments.gmane.org/gmane.linux.network/266546
      Done for unicast, but not for multicast.
      Developer: Vlad Yasevich
  • Improve stats, make them more helpful for performance analysis
      Developer: Sriram Narasimhan?
  • Enable LRO with bridging
      Enable GRO for packets coming to bridge from a tap interface
      Better support for Windows LRO
      Extend virtio-header with statistics for GRO packets:
      number of packets coalesced and number of duplicate ACKs coalesced
      Developer: Dmitry Fleytman?
  • IPoIB infiniband bridging
      Plan: implement macvtap for ipoib and virtio-ipoib
      Developer: Marcel Apfelbaum
  • interrupt coalescing
      Reduce the number of interrupts
      Rx interrupt coalescing should be good for rx stream throughput.
      Tx interrupt coalescing will help the optimization of enabling tx
      interrupt conditionally.
      Developer: Jason Wang
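
A minimal sketch of interrupt coalescing, as a toy model in Python: raise one interrupt after a number of frames or after a delay, whichever comes first. The class and the `max_frames` / `max_usecs` parameters are assumptions for illustration and do not describe an existing virtio-net interface.

```python
class Coalescer:
    """Raise one interrupt per max_frames packets or max_usecs of delay."""

    def __init__(self, max_frames=8, max_usecs=50):
        self.max_frames = max_frames
        self.max_usecs = max_usecs
        self.pending = 0              # packets not yet signalled
        self.first_pending_at = None  # arrival time of the oldest pending packet
        self.interrupts = 0           # interrupts raised so far

    def on_packet(self, now_usecs):
        if self.pending == 0:
            self.first_pending_at = now_usecs
        self.pending += 1
        self._maybe_fire(now_usecs)

    def on_timer(self, now_usecs):
        self._maybe_fire(now_usecs)

    def _maybe_fire(self, now_usecs):
        if self.pending == 0:
            return
        waited = now_usecs - self.first_pending_at
        if self.pending >= self.max_frames or waited >= self.max_usecs:
            self.interrupts += 1
            self.pending = 0
            self.first_pending_at = None
```

The frame limit bounds the interrupt rate under load, while the time limit bounds the extra latency a lone packet can see.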
  • sharing config interrupts
      Support more devices by sharing a single MSI vector between multiple
      virtio devices.
      (Applies to virtio-blk too).
      Developer: Amos Kong
  • Multi-queue macvtap with real multiple queues
      Macvtap only provides multiple queues to the user in the form of multiple
      sockets. As each socket will perform dev_queue_xmit() and we don't
      really have multiple real queues on the device, we now have lock
      contention. This contention needs to be addressed.
      Developer: Vlad Yasevich
  • better xmit queueing for tun
      when guest is slower than host, tun drops packets aggressively. This is
      because keeping packets on the internal queue does not work well.
      Re-enable functionality to stop queue, probably with some watchdog to
      help with buggy guests.
      Developer: MST
  • Dev watchdog for virtio-net:
      Implement a watchdog for virtio-net. This will be useful for hunting host bugs early.
      Developer: Julio Faracco <jcfaracco@gmail.com>
  • Extend virtio_net header for future offloads
      The virtio_net header is currently fixed-size and only supports
      segmentation offloading. It would be useful if we could
      attach other data to the virtio_net header to support things like
      vlan acceleration, IPv6 fragment_id pass-through, rx and tx-hash
      pass-through and some other ideas.
      Developer: Vlad Yasevich <vyasevic@redhat.com>

projects in need of an owner

  • improve netdev polling for virtio.
 There are two kinds of netdev polling:
 - netpoll - used for debugging
 - rx busy polling for virtio-net [DONE]
   see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9
   1 byte netperf TCP_RR shows 127% improvement.
   Future work is to cooperate with the host and only busy-poll when no other process is runnable on the host CPU.
 contact: Jason Wang
  • drop vhostforce
 it's an optimization, probably not worth it anymore
  • avoid userspace virtio-net when vhost is enabled.
 ATM we run in userspace until DRIVER_OK
 this doubles our security attack surface,
 so it's best avoided.
  • feature negotiation for dpdk/vhost user
 feature negotiation seems to be broken
  • switch dpdk to qemu vhost user
 this seems like a better interface than
 character device in userspace,
 designed for out of process networking
  • netmap-like approach to zero copy networking
 is anything like this feasible on linux?
  • vhost-user: clean up protocol
 address multiple issues in vhost user protocol:
  missing VHOST_NET_SET_BACKEND
  make more messages synchronous (with a reply)
  VHOST_SET_MEM_TABLE, VHOST_SET_VRING_CALL
   mid.gmane.org/541956B8.1070203@huawei.com
   mid.gmane.org/54192136.2010409@huawei.com
  Contact: MST
  • ethtool selftest support for virtio-net
       Implement the ethtool selftest method for virtio-net for regression testing, e.g. for the CVEs found in tun/macvtap, qemu and vhost.
       http://mid.gmane.org/1409881866-14780-1-git-send-email-hjxiaohust@gmail.com
       Contact: Jason Wang, Pankaj Gupta
  • vhost-net scalability tuning: threading for many VMs
     Plan: switch to workqueue shared by many VMs
     http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
     http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument
     Contact: Razya Ladelsky, Bandan Das
     Testing: netperf guest to guest
  • DPDK with vhost-user
 Support vhost-user in addition to vhost net cuse device
 Contact: Linhaifeng, MST
  • DPDK with vhost-net/user: fix offloads
 DPDK requires disabling offloads ATM,
 need to fix this.
 Contact: MST
  • reduce per-device memory allocations
 vhost device is very large due to need to
 keep large arrays of iovecs around.
 we do need large arrays for correctness,
 but we could move them out of line,
 and add short inline arrays for typical use-cases.
 contact: MST
  • batch tx completions in vhost
 vhost already batches up to 64 tx completions for zero copy
 batch non zero copy as well
 contact: Jason Wang
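
The batching idea can be sketched as a toy model in Python: collect completed descriptor heads and signal the guest once per batch instead of once per packet. The class name and the `signal` callback are hypothetical. The default of 64 mirrors the zero copy figure quoted above, and whether the same number suits the non zero copy path is an open question.

```python
class CompletionBatcher:
    """Collect completed descriptor heads and signal once per batch."""

    def __init__(self, signal, batch_size=64):
        self.signal = signal          # callback taking a list of completed heads
        self.batch_size = batch_size
        self.done = []

    def complete(self, head):
        """Record one completion; signal when the batch is full."""
        self.done.append(head)
        if len(self.done) >= self.batch_size:
            self.flush()

    def flush(self):
        """Signal whatever is pending, e.g. when the tx queue goes idle."""
        if self.done:
            batch, self.done = self.done, []
            self.signal(batch)
```

An explicit `flush()` is needed so that a partial batch is not held back indefinitely when traffic stops.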
  • better parallelize small queues
 don't wait for ring full to kick.
 add api to detect ring almost full (e.g. 3/4) and kick
 depends on: BQL
 contact: MST
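
A minimal sketch of the "almost full" check, as a toy model in Python. The class, the `kick` callback and the single-kick-per-crossing policy are assumptions for illustration; only the 3/4 threshold comes from the item above.

```python
class TxRing:
    """Toy ring that kicks the consumer once it is almost full."""

    def __init__(self, size, kick, threshold=0.75):
        self.size = size
        self.kick = kick                          # callback notifying the consumer
        self.kick_at = max(1, int(size * threshold))
        self.in_flight = 0
        self.kicked = False                       # avoid repeated kicks per crossing

    def almost_full(self):
        return self.in_flight >= self.kick_at

    def add(self):
        """Queue one buffer; return False if the ring is already full."""
        if self.in_flight == self.size:
            return False
        self.in_flight += 1
        if self.almost_full() and not self.kicked:
            self.kicked = True
            self.kick()
        return True

    def consume(self, n):
        """The consumer has processed up to n buffers."""
        self.in_flight -= min(n, self.in_flight)
        if not self.almost_full():
            self.kicked = False
```

Kicking at 3/4 lets the consumer start draining while the producer still has room to queue, instead of both sides taking turns around a full ring.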
  • improve vhost-user unit test
 support running on machines without hugetlbfs
 support running with more vm memory layouts
 Contact: MST
  • tun: fix RX livelock
       it's easy for a guest to starve out host networking
       one way to fix this is to use napi
       Contact: MST
  • large-order allocations
  see 28d6427109d13b0f447cba5761f88d3548e83605
  contact: MST
  • reduce networking latency:
 allow handling short packets from softirq or VCPU context
 Plan:
   We are going through the scheduler 3 times
   (could be up to 5 if softirqd is involved)
   Consider RX: host irq -> io thread -> VCPU thread ->
   guest irq -> guest thread.
   This adds a lot of latency.
   We can cut it by some 1.5x if we do a bit of work
   either in the VCPU or softirq context.
 Testing: netperf TCP RR - should be improved drastically
          netperf TCP STREAM guest to host - no regression
 Contact: MST
  • device failover to allow migration with assigned devices
 https://fedoraproject.org/wiki/Features/Virt_Device_Failover
 Contact: Gal Hammer, Cole Robinson, Laine Stump, MST
  • Reuse vringh code for better maintainability
 This project seems abandoned?
 Contact: Rusty Russell
  • use kvm eventfd support for injecting level-triggered interrupts
 aim: enable vhost by default for level interrupts.
 The benefit is security: we want to avoid using userspace
 virtio net so that vhost-net is always used.
 Alex emulated (post & re-enable) level-triggered interrupts in KVM to
 skip userspace. VFIO already enjoys the performance benefit,
 let's do it for virtio-pci. Current virtio-pci devices still handle
 level interrupts in userspace.
 see: kernel:
 7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts
 qemu:
 68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers
          (virtio-pci didn't use the wrappers)
 e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration
 Contact: Amos Kong, MST       
  • Head of line blocking issue with zerocopy
      zerocopy has several properties that cause head of line blocking:
      - the number of pending DMAs is limited
      - DMAs complete in order
      This means that if some of the DMAs are delayed, all others are delayed as well. This can be reproduced as follows:
      - boot two VMs, VM1 (tap1) and VM2 (tap2), on host1 (which has eth0)
      - set up tbf to limit the tap2 bandwidth to 10Mbit/s
      - start two netperf instances, one from VM1 to VM2, another from VM1 to an external host whose traffic goes through eth0 on the host
      You can then see that not only is VM1 to VM2 throttled, but VM1 to the external host is throttled as well.
      One solution for this issue is to orphan the frags when enqueuing to a non-work-conserving qdisc.
      But we have similar issues in other cases:
      - The card has its own priority queues
      - The host has two interfaces, one 1G and another 10G, so throttling the 1G one may cause traffic over the 10G one to be throttled.
      The final solution is to remove receive buffering at tun, and convert it to use NAPI
      Contact: Jason Wang, MST
      Reference: https://lkml.org/lkml/2014/1/17/105
  • network traffic throttling
 the block layer implemented a "continuous leaky bucket" for throttling
 we can apply the continuous leaky bucket to networking
 IOPS/BPS * RX/TX/TOTAL
 Developer: Amos Kong
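
As a rough illustration of the idea, a continuous leaky bucket can be written in a few lines of C. This is a sketch under assumptions, not the block layer's throttling code: `struct leaky_bucket` and the function names are hypothetical. One such bucket would be needed per combination of IOPS/BPS and RX/TX/TOTAL, with "units" meaning packets or bytes respectively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bucket; units are packets (IOPS) or bytes (BPS). */
struct leaky_bucket {
    double avg;   /* drain rate in units per second */
    double max;   /* burst capacity in units */
    double level; /* current fill level in units */
};

/* Drain the bucket for delta_ns nanoseconds of elapsed time. */
static void bucket_leak(struct leaky_bucket *b, int64_t delta_ns)
{
    double leak = b->avg * (double)delta_ns / 1e9;

    assert(delta_ns >= 0);
    b->level = b->level > leak ? b->level - leak : 0.0;
}

/* Try to account `units`; returns false if that would overflow
 * the bucket, in which case the traffic must be delayed. */
static bool bucket_account(struct leaky_bucket *b, double units)
{
    if (b->level + units > b->max)
        return false;
    b->level += units;
    return true;
}

/* Nanoseconds to wait until `units` would fit, 0 if it fits now. */
static int64_t bucket_wait_ns(const struct leaky_bucket *b, double units)
{
    double excess = b->level + units - b->max;

    assert(b->avg > 0.0);
    if (excess <= 0.0)
        return 0;
    return (int64_t)(excess / b->avg * 1e9);
}
```

A caller would leak the bucket by the time elapsed since the last packet, try to account the new packet, and arm a timer for `bucket_wait_ns()` when accounting fails.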
  • Allocate mac_table dynamically
 In the future, maybe we can allocate the mac_table dynamically instead
 of embedding it in VirtIONet. Then we can just do a pointer swap and
 g_free(), and can save a memcpy() here.
 Contact: Amos Kong
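
The pointer swap could look roughly like the sketch below. It is not QEMU code: `struct nic_state` stands in for VirtIONet, the helper names are hypothetical, and plain `malloc()`/`free()` are used in place of the glib allocators so that the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical dynamically allocated MAC filter table. */
struct mac_table {
    size_t in_use;  /* number of addresses stored */
    uint8_t *macs;  /* in_use * ETH_ALEN bytes */
};

/* Stand-in for the device state: holds a pointer, not the table. */
struct nic_state {
    struct mac_table *mac_table;
};

/* Build a complete new table off to the side. */
static struct mac_table *mac_table_new(const uint8_t *macs, size_t count)
{
    struct mac_table *t = calloc(1, sizeof(*t));

    if (!t)
        return NULL;
    if (count) {
        t->macs = malloc(count * ETH_ALEN);
        if (!t->macs) {
            free(t);
            return NULL;
        }
        memcpy(t->macs, macs, count * ETH_ALEN);
    }
    t->in_use = count;
    return t;
}

static void mac_table_free(struct mac_table *t)
{
    if (t) {
        free(t->macs);
        free(t);
    }
}

/* Install the new table: swap the pointer and free the old table,
 * instead of copying the new contents into an embedded array. */
static void nic_install_mac_table(struct nic_state *n, struct mac_table *fresh)
{
    struct mac_table *old = n->mac_table;

    assert(fresh != NULL);
    n->mac_table = fresh;
    mac_table_free(old);
}
```

Because the new table is fully built before the swap, a failed allocation leaves the old table in place and untouched.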
  • reduce conflict with VCPU thread
   if VCPU and networking run on same CPU,
   they conflict resulting in bad performance.
   Fix that, push vhost thread out to another CPU
   more aggressively.
   Contact: Amos Kong
  • rx mac filtering in tun
       the need for this is still not understood as we have filtering in bridge
       we have a small table of addresses, need to make it larger
       if we only need filtering for unicast (multicast is handled by IMP filtering)
       Contact: Amos Kong
  • vlan filtering in tun
       the need for this is still not understood as we have filtering in bridge
       Contact: Amos Kong


  • add documentation for macvlan and macvtap
  recent docs here:
  http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/
  need to integrate in iproute and kernel docs.
  • receive side zero copy
 The ideal is a NIC with accelerated RFS support,
 So we can feed the virtio rx buffers into the correct NIC queue.
 Depends on non promisc NIC support in bridge.
 Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"
 for a very old prototype
  • RDMA bridging
  • DMA engine (IOAT) use in tun
 Old patch here: [PATCH RFC] tun: dma engine support
 It does not speed things up. Need to see why and
 what can be done.
  • virtio API extension: improve small packet/large buffer performance:
 support "reposting" buffers for mergeable buffers,
 support pool for indirect buffers
  • more GSO type support:
      The kernel does not yet support more GSO types: FCoE, GRE, UDP_TUNNEL
  • ring aliasing:
 using vhost-net as a networking backend with virtio-net in QEMU
 being what's guest facing.
 This gives you the best of both worlds: QEMU acts as a first
 line of defense against a malicious guest while still getting the
 performance advantages of vhost-net (zero-copy).
 In fact a bit of complexity in vhost was put there in the vague hope to
 support something like this: virtio rings are not translated through
 regular memory tables, instead, vhost gets a pointer to ring address.
 This allows qemu acting as a man in the middle,
 verifying the descriptors but not touching the packet data.
  • non-virtio device support with vhost
 Use vhost interface for guests that don't use virtio-net
  • Extend sndbuf scope to int64
 Current sndbuf limit is INT_MAX in tap_set_sndbuf(),
 large values (like 8388607T) are converted correctly by qapi from the qemu command line.
 If we want to support such large values, we should extend the sndbuf limit from 'int' to 'int64'
 Why is this useful?
 Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html
  • unit test for vhost-user
  We don't have a unit test for vhost-user.
  The idea is to implement a simple vhost-user backend over a userspace stack,
  and load pxe in the guest.
  Contact: MST and Jason Wang
  • better qtest for virtio-net
  We test only boot and hotplug for virtio-net.
  Need to test more.
  Contact: MST and Jason Wang

vague ideas: path to implementation not clear

  • change tcp_tso_should_defer for kvm: batch more
 aggressively.
 in particular, see below
  • tcp: increase gso buffering for cubic,reno
   At the moment we push out an skb whenever the limit becomes
   large enough to send a full-sized TSO skb even if the skb,
   in fact, is not full-sized.
   The reason for this seems to be that some congestion avoidance
   protocols rely on the number of packets in flight to calculate
   CWND, so if we underuse the available CWND it shrinks
   which degrades performance:
   http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html
   However, there seems to be no reason to do this for
   protocols such as reno and cubic which don't rely on packets in flight,
   and so will simply increase CWND a bit more to compensate for the
   underuse.
  • ring redesign:
     find a way to test raw ring performance 
     fix cacheline bounces 
     reduce interrupts


  • irq/numa affinity:
    networking goes much faster with irq pinning:
    both with and without numa.
    what can be done to make the non-pinned setup go faster?
  • vlan filtering in bridge
       kernel part is done (Vlad Yasevich)
       teach qemu to notify libvirt to enable the filter (still to do) (the existing NIC_RX_FILTER_CHANGED event contains vlan-tables)
  • tx coalescing
       Delay several packets before kicking the device.
  • bridging on top of macvlan
 add code to forward LRO status from macvlan (not macvtap)
 back to the lowerdev, so that setting up forwarding
 from macvlan disables LRO on the lowerdev
  • virtio: preserve packets exactly with LRO
 LRO is not normally compatible with forwarding.
 With virtio we are getting packets from a linux host,
 so we could conceivably preserve packets exactly
 even with LRO. I am guessing other hardware could be
 doing this as well.
  • vxlan
 What could we do here?
  • bridging without promisc mode with OVS

high level issues: not clear what the project is, yet

  • security: iptables

At the moment most people disable iptables to get good performance on 10Gb/s networking. Is there any way to improve the experience?

  • performance

Going through the scheduler and the full networking stack twice (host+guest) adds a lot of overhead. Any way to allow bypassing some layers?

  • manageability

It is still hard to figure out VM networking: VM networking is configured through libvirt, host networking through NM. Any way to integrate?

testing projects

Keeping networking stable is highest priority.

  • Write some unit tests for vhost-net/vhost-scsi
  • Run weekly test on upstream HEAD covering test matrix with autotest
  • Measure the effect of each of the above-mentioned optimizations
 - Use autotest network performance regression testing (that runs netperf)
 - Also test any wild idea that works. Some may be useful.
  • Migrate some of the performance regression autotest functionality into Netperf
 - Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...
 - Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)
 - Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.
 - Make the scripts more visible

non-virtio-net devices

  • e1000: stabilize

test matrix

DOA test matrix (all combinations should work):

       vhost: test both on and off, obviously
       test: hotplug/unplug, vlan/mac filtering, netperf,
            file copy both ways: scp, NFS, NTFS
       guests: linux: release and debug kernels, windows
       conditions: plain run, run while under migration,
               vhost on/off migration
       networking setup: simple, qos with cgroups
       host configuration: host-guest, external-guest