This page should cover all networking-related activity in KVM; currently most of the information relates to virtio-net.
TODO: add bugzilla entry links.
=== projects in progress. contributions are still very welcome!
- vhost-net scalability tuning: threading for many VMs
Plan: switch to workqueue shared by many VMs http://email@example.com/msg69868.html
Developer: Bandan Das
Testing: netperf guest to guest
- multiqueue support in macvtap
multiqueue is only supported for tun. Add support for macvtap. Developer: Jason Wang
- support more queues
We limit TUN to 8 queues, but we really want 1 queue per guest CPU. The limit comes from net core; we need to teach it to allocate an array of pointers rather than an array of queues. Jason has a draft patch to use flex array. Another thing is to move the flow caches out of tun_struct. Developer: Jason Wang
- enable multiqueue by default
Multiqueue causes regressions in some workloads because GSO tends to batch less when mq is enabled, so it is off by default. Detect this and enable/disable automatically so that we can turn it on by default. https://patchwork.kernel.org/patch/2235191/ Developer: Jason Wang
- rework on flow caches
The current hlist implementation of flow caches has several limitations: 1) in the worst case, lookup degrades to a linear search 2) it does not scale https://patchwork.kernel.org/patch/2025121/ Developer: Jason Wang
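To make the worst case concrete, here is a toy model of a chained hash flow cache. It is only a sketch: the class, the bucket count and the field names are made up for illustration and are not the tun code. When every flow hash lands in the same bucket, a lookup has to walk the whole chain.

```python
class ChainedFlowCache:
    """Toy flow cache: a fixed number of buckets, each holding a chain of entries."""

    def __init__(self, num_buckets=1024):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, rxhash):
        return self.buckets[rxhash % self.num_buckets]

    def update(self, rxhash, queue_index):
        """Record which queue a flow was last seen on."""
        chain = self._bucket(rxhash)
        for entry in chain:
            if entry[0] == rxhash:
                entry[1] = queue_index
                return
        chain.append([rxhash, queue_index])

    def lookup(self, rxhash):
        """Return (queue_index or None, number of entries compared)."""
        steps = 0
        for entry in self._bucket(rxhash):
            steps += 1
            if entry[0] == rxhash:
                return entry[1], steps
        return None, steps


def worst_case_steps(num_flows, num_buckets):
    """Insert flows that all collide in bucket 0 and look up the last one."""
    cache = ChainedFlowCache(num_buckets)
    for i in range(num_flows):
        # Every hash is a multiple of num_buckets, so all flows share bucket 0.
        cache.update(i * num_buckets, queue_index=i % 8)
    _, steps = cache.lookup((num_flows - 1) * num_buckets)
    return steps
```

With well-spread hashes each lookup compares one entry; with colliding hashes the number of comparisons grows with the number of flows.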
- eliminate the extra copy in virtio-net driver
We need to do an extra copy of 128 bytes for every packet. This could be eliminated for small packets by: 1) using build_skb() and head frag 2) a bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN ). Or use a dedicated queue for small packet receiving? (reordering) Developer: Jason Wang
- make pktgen work for virtio-net ( or partially orphan )
virtio-net orphans the skb during tx, which makes pktgen wait forever on the refcnt.
Jason's idea: introduce a flag to tell pktgen not to wait. Discussion here: https://patchwork.kernel.org/patch/1800711/
MST's idea: add a .ndo_tx_polling, not only for pktgen
Developer: Jason Wang
- Add HW_VLAN_TX support for tap
Eliminate the extra data moving for tagged packets Developer: Jason Wang
- Announce self by guest driver
Send gARP by guest driver. Guest part is finished. Qemu is ongoing. V7 patches are here: http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html Developer: Jason Wang
- guest programmable mac/vlan filtering with macvtap
Developer: Amos Kong
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c
Status: GuestProgrammableMacVlanFiltering
- bridge without promisc mode in NIC
given hardware support, teach bridge to program mac/vlan filtering in NIC. Helps performance and security on noisy LANs.
http://comments.gmane.org/gmane.linux.network/266546
Developer: Vlad Yasevich
- reduce networking latency:
allow handling short packets from softirq or VCPU context
Plan: We are going through the scheduler 3 times (could be up to 5 if ksoftirqd is involved). Consider RX: host irq -> io thread -> VCPU thread -> guest irq -> guest thread. This adds a lot of latency. We can cut it by some 1.5x if we do a bit of work either in the VCPU or softirq context.
Testing: netperf TCP RR - should be improved drastically; netperf TCP STREAM guest to host - no regression
Developer: MST
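A back-of-envelope reading of the 1.5x figure, as a sketch only. It assumes that scheduler passes dominate latency, that each pass costs about the same, and that only the thread stages of the RX path described above go through a scheduler. Under those assumptions, dropping the io thread hop takes the path from 3 passes to 2.

```python
# RX path as listed in the plan.
RX_PATH = ["host irq", "io thread", "VCPU thread", "guest irq", "guest thread"]

# Assumption: waking a thread costs one scheduler pass, interrupts cost none.
THREAD_STAGES = {"io thread", "VCPU thread", "guest thread"}


def scheduler_passes(path):
    """Count how many stages in the path require a scheduler wakeup."""
    return sum(1 for stage in path if stage in THREAD_STAGES)


def speedup(before, after):
    """Latency ratio if every scheduler pass costs the same."""
    return scheduler_passes(before) / scheduler_passes(after)


# Hypothetical shortened path: short packets handled without the io thread.
SHORT_PATH = [stage for stage in RX_PATH if stage != "io thread"]
```

Here `speedup(RX_PATH, SHORT_PATH)` is 3 / 2; real gains depend on what each wakeup actually costs.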
- Flexible buffers: put virtio header inline with packet data
https://patchwork.kernel.org/patch/1540471/ Developer: MST
- device failover to allow migration with assigned devices
https://fedoraproject.org/wiki/Features/Virt_Device_Failover Developer: Gal Hammer, Cole Robinson, Laine Stump, MST
- Reuse vringh code for better maintainability
Developer: Rusty Russell
- Improve stats, make them more helpful for perf analysis
Developer: Sriram Narasimhan
- Bug: e1000 & rtl8139: a macaddr change in the guest is not propagated to qemu (info network)
Developer: Amos Kong https://bugzilla.redhat.com/show_bug.cgi?id=922589
- Enable GRO for packets coming to bridge from a tap interface
Developer: Dmitry Fleytman
- Better support for windows LRO
Extend virtio-header with statistics for GRO packets: number of packets coalesced and number of duplicate ACKs coalesced Developer: Dmitry Fleytman
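One possible shape for such an extension, as a sketch. The base layout follows the virtio-net header with mergeable buffers (little-endian is assumed here); the two trailing 16-bit counters, their names and their position are hypothetical and are not taken from any patch.

```python
import struct

# virtio-net header with mergeable buffers:
# flags(u8) gso_type(u8) hdr_len(u16) gso_size(u16)
# csum_start(u16) csum_offset(u16) num_buffers(u16)
BASE_FMT = "<BBHHHHH"

# Hypothetical extension: coalesced_pkts(u16) dup_acks(u16) appended at the end.
EXT_FMT = BASE_FMT + "HH"


def pack_hdr_with_gro_stats(base_fields, coalesced_pkts, dup_acks):
    """Serialize the 7 base fields followed by the two hypothetical counters."""
    return struct.pack(EXT_FMT, *base_fields, coalesced_pkts, dup_acks)


def unpack_gro_stats(buf):
    """Read back only the two hypothetical counters from an extended header."""
    fields = struct.unpack_from(EXT_FMT, buf)
    return {"coalesced_pkts": fields[-2], "dup_acks": fields[-1]}
```

Appending the counters keeps the existing fields at their current offsets, which is one reason to prefer it in a sketch; a real design would also need a feature bit to negotiate the longer header.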
- IPoIB infiniband bridging
Plan: implement macvtap for ipoib and virtio-ipoib Developer: MST
- netdev polling for virtio.
There are two kinds of netdev polling:
  - netpoll - used for debugging
  - proposed low latency net polling
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html
Developer: Jason Wang
- sharing config interrupts
Support more devices by sharing a single msi vector between multiple virtio devices. (Applies to virtio-blk too). Developer: Amos Kong
- use kvm eventfd support for injecting level interrupts,
enable vhost by default for level interrupts Developer: Amos Kong
=== projects that are not started yet - no owner
- receive side zero copy
The ideal is a NIC with accelerated RFS support, so we can feed the virtio rx buffers into the correct NIC queue. Depends on non promisc NIC support in bridge. Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net" for a very old prototype
- RDMA bridging
- DMA engine (IOAT) use in tun
Old patch here: [PATCH RFC] tun: dma engine support. It does not speed things up. Need to see why and what can be done.
- virtio API extension: improve small packet/large buffer performance:
support "reposting" buffers for mergeable buffers, support pool for indirect buffers
- more GSO type support:
The kernel does not yet support more types of GSO: FCOE, GRE, UDP_TUNNEL
- ring aliasing:
using vhost-net as a networking backend with virtio-net in QEMU being what's guest facing. This gives you the best of both worlds: QEMU acts as a first line of defense against a malicious guest while still getting the performance advantages of vhost-net (zero-copy). In fact a bit of complexity in vhost was put there in the vague hope to support something like this: virtio rings are not translated through regular memory tables, instead, vhost gets a pointer to ring address. This allows qemu acting as a man in the middle, verifying the descriptors but not touching the packet data.
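A minimal sketch of the "verify the descriptors but do not touch the data" step. The descriptor fields mirror the virtio ring descriptor (addr, len, flags, next) and `VRING_DESC_F_NEXT` is the real flag value; the function name, the flat guest memory model and the error handling are assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple

VRING_DESC_F_NEXT = 1  # descriptor chain continues at next_idx


class Desc(NamedTuple):
    addr: int      # guest-physical address of the buffer
    length: int    # buffer length in bytes
    flags: int
    next_idx: int  # only meaningful when VRING_DESC_F_NEXT is set


def validate_chain(table: List[Desc], head: int, guest_mem_size: int) -> List[Tuple[int, int]]:
    """Walk one descriptor chain and return its (addr, length) buffers.

    Only descriptor metadata is read; the packet data itself is never touched.
    Raises ValueError on out-of-range indexes, buffers outside guest memory
    (modelled here as one flat range), or chains that loop.
    """
    buffers = []
    idx = head
    visited = 0
    while True:
        if not 0 <= idx < len(table):
            raise ValueError("descriptor index out of range")
        visited += 1
        if visited > len(table):
            raise ValueError("descriptor chain loops")
        desc = table[idx]
        if desc.addr + desc.length > guest_mem_size:
            raise ValueError("buffer outside guest memory")
        buffers.append((desc.addr, desc.length))
        if not desc.flags & VRING_DESC_F_NEXT:
            return buffers
        idx = desc.next_idx
```

A chain that passes these checks could then be handed to vhost unchanged, which is what keeps the data path zero-copy.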
- non-virtio device support with vhost
Use vhost interface for guests that don't use virtio-net
=== vague ideas: path to implementation not clear
- ring redesign:
find a way to test raw ring performance
fix cacheline bounces
reduce interrupts
- irq/numa affinity:
networking goes much faster with irq pinning: both with and without numa. what can be done to make the non-pinned setup go faster?
- reduce conflict with VCPU thread
if VCPU and networking run on same CPU, they conflict resulting in bad performance. Fix that, push vhost thread out to another CPU more aggressively.
- rx mac filtering in tun
the need for this is still not understood as we have filtering in bridge. we have a small table of addresses, need to make it larger if we only need filtering for unicast (multicast is handled by IGMP filtering)
- vlan filtering in tun
the need for this is still not understood as we have filtering in bridge
- vlan filtering in bridge
kernel part is done (Vlad Yasevich). teach qemu to notify libvirt to enable the filter (still to do) (the existing NIC_RX_FILTER_CHANGED event contains vlan-tables)
- tx coalescing
Delay several packets before kicking the device.
- interrupt coalescing
Reduce the number of interrupts
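Both coalescing ideas have the same shape: queue up work and notify once per batch instead of once per packet. A toy sketch, with a hypothetical class and a plain callback standing in for the kick (or the interrupt):

```python
class KickCoalescer:
    """Queue packets and notify ("kick") the other side once per batch."""

    def __init__(self, kick, max_batch=4):
        self.kick = kick            # callback that receives the batched packets
        self.max_batch = max_batch  # notify after this many queued packets
        self.pending = []
        self.kicks = 0

    def send(self, packet):
        """Queue a packet; kick only when the batch is full."""
        self.pending.append(packet)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        """Kick now. A real driver would also need to call this from a timer
        so that a partial batch is not delayed indefinitely."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.kicks += 1
            self.kick(batch)
```

The trade-off is the usual one: a larger `max_batch` means fewer notifications but more added latency for the first packet in each batch.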
- bridging on top of macvlan
add code to forward LRO status from macvlan (not macvtap) back to the lowerdev, so that setting up forwarding from macvlan disables LRO on the lowerdev
- virtio: preserve packets exactly with LRO
LRO is not normally compatible with forwarding. With virtio we are getting packets from a linux host, so we could conceivably preserve packets exactly even with LRO. I am guessing other hardware could be doing this as well.
What could we do here?
- bridging without promisc mode with OVS
Keeping networking stable is the highest priority.
- Write some unit tests for vhost-net/vhost-scsi
- Run weekly test on upstream HEAD covering test matrix with autotest
- Measure the effect of each of the above-mentioned optimizations
- Use autotest network performance regression testing (that runs netperf)
- Also test any wild idea that works. Some may be useful.
- Migrate some of the performance regression autotest functionality into Netperf
  - Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...
  - Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)
  - Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.
  - Make the scripts more visible
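The "measure only when all sessions are active" step could look roughly like this. It is a sketch that assumes each session's start/end times and its interim results are already available as plain tuples; the function names and the data layout are made up for illustration.

```python
def common_window(sessions):
    """sessions: list of (start, end) times, one per netperf session.

    Return the (start, end) interval during which every session is active,
    or None if there is no such interval.
    """
    start = max(s for s, _ in sessions)
    end = min(e for _, e in sessions)
    return (start, end) if start < end else None


def throughput_in_window(samples, window):
    """samples: list of (timestamp, nbytes) interim results, where nbytes is
    the amount transferred in the interval ending at timestamp.

    Return bytes per second, counting only samples that end inside the window.
    """
    start, end = window
    total = sum(nbytes for ts, nbytes in samples if start < ts <= end)
    return total / (end - start)
```

For example, sessions running over (0, 60), (5, 62) and (12, 58) seconds are all active only between 12 and 58, so the ramp-up and tail are excluded from the result.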
- e1000: stabilize
DOA test matrix (all combinations should work):
vhost: test both on and off, obviously
test: hotplug/unplug, vlan/mac filtering, netperf, file copy both ways: scp, NFS, NTFS
guests: linux: release and debug kernels, windows
conditions: plain run, run while under migration, vhost on/off migration
networking setup: simple, qos with cgroups
host configuration: host-guest, external-guest
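Since all combinations should work, the matrix can be enumerated mechanically. A sketch: the dictionary keys and the way each dimension is flattened into values (for example, treating each file copy method as its own test and ignoring copy direction) are assumptions, not an existing autotest config.

```python
from itertools import product

# One entry per dimension of the matrix above.
MATRIX = {
    "vhost": ["on", "off"],
    "test": ["hotplug/unplug", "vlan/mac filtering", "netperf", "scp", "NFS", "NTFS"],
    "guest": ["linux release", "linux debug", "windows"],
    "condition": ["plain run", "under migration", "vhost on/off migration"],
    "networking setup": ["simple", "qos with cgroups"],
    "host configuration": ["host-guest", "external-guest"],
}


def combinations(matrix):
    """Yield one dict per combination, with one value chosen per dimension."""
    keys = list(matrix)
    for values in product(*(matrix[k] for k in keys)):
        yield dict(zip(keys, values))
```

With the values above this yields 2 * 6 * 3 * 3 * 2 * 2 = 432 combinations, which is a useful number to have in mind when planning a weekly run.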