Revision as of 02:00, 14 November 2014
This page should cover all networking related activity in KVM, currently most info is related to virtio-net.
TODO: add bugzilla entry links.
projects in progress. contributions are still very welcome!
- virtio 1.0 support for linux guests
required for maintainability email@example.com Developer: MST, Cornelia Huck
- virtio 1.0 support in qemu
required for maintainability firstname.lastname@example.org Developer: Cornelia Huck, MST
- improve net polling for cpu overcommit
exit busy loop when another process is runnable mid.gmane.org/20140822073653.GA7372@gmail.com email@example.com Developer: Jason Wang, MST
- vhost-net/tun/macvtap cross endian support
firstname.lastname@example.org Developer: Cédric Le Goater, MST
- BQL/aggregation for virtio net
dependencies: orphan packets less aggressively, enable tx interrupt Developers: MST, Jason
- orphan packets less aggressively (was: make pktgen work for virtio-net (or partially orphan))
virtio-net orphans all skbs during tx, this used to be optimal. Recent changes in guest networking stack and hardware advances such as APICv changed optimal behaviour for drivers. We need to revisit optimizations such as orphaning all packets early to have optimal behaviour.
This should also fix pktgen, which is currently broken with virtio net: orphaning all skbs makes pktgen wait forever on the refcnt. Jason's idea: bring back tx interrupt (partially). Jason's idea: introduce a flag to tell pktgen not to wait. Discussion here: https://patchwork.kernel.org/patch/1800711/ MST's idea: add a .ndo_tx_polling, not only for pktgen. Developers: Jason Wang, MST
- enable tx interrupt (conditionally?)
Small packet TCP stream performance is not good. This is because virtio-net orphans the packet during ndo_start_xmit(), which disables TCP small packet optimizations such as TCP Small Queues and autocorking. The idea is to enable the tx interrupt for small TCP packets. Jason's idea: switch between poll and tx interrupt mode based on recent statistics. MST's idea: use a per descriptor flag for virtio to force an interrupt for a specific packet. Developers: Jason Wang, MST
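The "switch based on recent statistics" idea could look roughly like the sketch below. It is only an illustration, not the proposed patch: the class name, the window size, the 256-byte "small packet" cutoff and the two hysteresis thresholds are all made-up assumptions.

```python
from collections import deque


class TxModeSelector:
    """Hypothetical heuristic: pick tx-interrupt mode when most of the
    recent packets are small, fall back to polling otherwise."""

    def __init__(self, window=64, small_bytes=256, enter=0.75, leave=0.25):
        self.sizes = deque(maxlen=window)  # lengths of the last `window` packets
        self.small_bytes = small_bytes     # assumed cutoff for a "small" packet
        self.enter = enter                 # small-packet ratio to enter interrupt mode
        self.leave = leave                 # small-packet ratio to return to poll mode
        self.mode = "poll"

    def record(self, pkt_len):
        """Account for one transmitted packet and return the current mode."""
        self.sizes.append(pkt_len)
        if len(self.sizes) < self.sizes.maxlen:
            return self.mode               # not enough history yet
        small = sum(1 for s in self.sizes if s < self.small_bytes)
        ratio = small / len(self.sizes)
        if self.mode == "poll" and ratio >= self.enter:
            self.mode = "interrupt"
        elif self.mode == "interrupt" and ratio <= self.leave:
            self.mode = "poll"
        return self.mode
```

The two separate thresholds are there so that a mixed workload does not flip modes on every packet.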
- vhost-net polling
mid.gmane.org/20141029123831.A80F338002D@moren.haifa.ibm.com Developer: Razya Ladelsky
- support more queues in tun
We limit TUN to 8 queues, but we really want 1 queue per guest CPU. The limit comes from net core; we need to teach it to allocate an array of pointers and not an array of queues. Jason has a draft patch to use flex array. Another thing is to move the flow caches out of tun_struct. http://email@example.com Developers: Pankaj Gupta, Jason Wang
- enable multiqueue by default
Multiqueue causes regression in some workloads, thus it is off by default. Documentation/networking/scaling.txt. Detect and enable/disable automatically so we can make it on by default? depends on: BQL. This is because GSO tends to batch less when mq is enabled. https://patchwork.kernel.org/patch/2235191/ Developer: Jason Wang
- rework on flow caches
Current hlist implementation of flow caches has several limitations: 1) in the worst case, linear search will be slow 2) it does not scale https://patchwork.kernel.org/patch/2025121/ Developer: Jason Wang
- ethtool selftest support for virtio-net
Implement the selftest ethtool method for virtio-net for regression testing, e.g. for the CVEs found for tun/macvtap, qemu and vhost. http://firstname.lastname@example.org Developers: Hengjinxiao, Jason Wang
- bridge without promisc/allmulti mode in NIC
Given hardware support, teach bridge to program mac/vlan filtering in NIC. Helps performance and security on noisy LANs. http://comments.gmane.org/gmane.linux.network/266546 Done for unicast, but not for multicast. Developer: Vlad Yasevich
- Improve stats, make them more helpful for perf analysis
Developer: Sriram Narasimhan?
- Enable LRO with bridging
Enable GRO for packets coming to bridge from a tap interface. Better support for Windows LRO. Extend virtio-header with statistics for GRO packets: number of packets coalesced and number of duplicate ACKs coalesced. Developer: Dmitry Fleytman?
- IPoIB infiniband bridging
Plan: implement macvtap for ipoib and virtio-ipoib Developer: Marcel Apfelbaum
- interrupt coalescing
Reduce the number of interrupts. Rx interrupt coalescing should be good for rx stream throughput. Tx interrupt coalescing will help the optimization of enabling tx interrupt conditionally. Developer: Jason Wang
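As a rough illustration of what coalescing means here, the sketch below raises one interrupt per batch, either when enough packets are pending or when the oldest pending packet has waited long enough. The class and parameter names (max_frames, max_usecs) are hypothetical and do not describe an existing virtio-net interface.

```python
class Coalescer:
    """Hypothetical interrupt coalescer: one interrupt per batch of packets."""

    def __init__(self, max_frames=8, max_usecs=50):
        self.max_frames = max_frames  # fire once this many packets are pending
        self.max_usecs = max_usecs    # or once the oldest packet waited this long
        self.pending = 0
        self.first_ts = None          # arrival time (us) of the oldest pending packet

    def on_packet(self, now_us):
        """Account for one packet; return True if an interrupt should be raised."""
        if self.pending == 0:
            self.first_ts = now_us
        self.pending += 1
        return self._maybe_fire(now_us)

    def on_timer(self, now_us):
        """Periodic check so a partial batch is not delayed forever."""
        return self._maybe_fire(now_us)

    def _maybe_fire(self, now_us):
        if self.pending == 0:
            return False
        waited = now_us - self.first_ts
        if self.pending >= self.max_frames or waited >= self.max_usecs:
            self.pending = 0
            self.first_ts = None
            return True
        return False
```

Larger thresholds mean fewer interrupts but more latency for the first packet of each batch, which is the trade-off any coalescing scheme has to tune.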
- sharing config interrupts
Support more devices by sharing a single msi vector between multiple virtio devices. (Applies to virtio-blk too). Developer: Amos Kong
- Multi-queue macvtap with real multiple queues
Macvtap only provides multiple queues to user in the form of multiple sockets. As each socket will perform dev_queue_xmit() and we don't really have multiple real queues on the device, we now have a lock contention. This contention needs to be addressed. Developer: Vlad Yasevich
- better xmit queueing for tun
when guest is slower than host, tun drops packets aggressively. This is because keeping packets on the internal queue does not work well. re-enable functionality to stop queue, probably with some watchdog to help with buggy guests. Developer: MST
- Dev watchdog for virtio-net:
Implement a watchdog for virtio-net. This will be useful for hunting host bugs early. Developer: Julio Faracco <email@example.com>
projects in need of an owner
- improve netdev polling for virtio.
There are two kinds of netdev polling: - netpoll - used for debugging - rx busy polling for virtio-net [DONE] see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9. 1 byte netperf TCP_RR shows 127% improvement. Future work is to co-operate with the host, and only do the busy polling when there is no other process runnable on the host cpu. contact: Jason Wang
- drop vhostforce
it's an optimization, probably not worth it anymore
- feature negotiation for dpdk/vhost user
feature negotiation seems to be broken
- switch dpdk to qemu vhost user
this seems like a better interface than character device in userspace, designed for out of process networking
- netmap - like approach to zero copy networking
is anything like this feasible on linux?
- vhost-user: clean up protocol
address multiple issues in vhost user protocol: missing VHOST_NET_SET_BACKEND; make more messages synchronous (with a reply): VHOST_SET_MEM_TABLE, VHOST_SET_VRING_CALL. mid.gmane.org/541956B8.firstname.lastname@example.org email@example.com Contact: MST
- vhost-net scalability tuning: threading for many VMs
Plan: switch to workqueue shared by many VMs http://firstname.lastname@example.org/msg69868.html
Contact: Razya Ladelsky, Bandan Das Testing: netperf guest to guest
- DPDK with vhost-user
Support vhost-user in addition to vhost net cuse device Contact: Linhaifeng, MST
- DPDK with vhost-net/user: fix offloads
DPDK requires disabling offloads ATM, need to fix this. Contact: MST
- reduce per-device memory allocations
vhost device is very large due to need to keep large arrays of iovecs around. we do need large arrays for correctness, but we could move them out of line, and add short inline arrays for typical use-cases. contact: MST
- batch tx completions in vhost
vhost already batches up to 64 tx completions for zero copy; batch non zero copy as well. contact: Jason Wang
- better parallelize small queues
don't wait for ring full to kick. add api to detect ring almost full (e.g. 3/4) and kick. depends on: BQL contact: MST
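The "almost full" check itself is a one-line comparison. The sketch below uses a hypothetical helper name and takes the 3/4 threshold from the example above, with integer arithmetic to avoid rounding.

```python
def should_kick(pending, ring_size, num=3, den=4):
    """Hypothetical helper: True once at least num/den of the ring is in use.

    pending   -- descriptors queued but not yet kicked
    ring_size -- total number of descriptors in the ring
    """
    return pending * den >= ring_size * num
```

With a 256-entry ring this starts kicking at 192 pending descriptors instead of waiting for all 256.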
- improve vhost-user unit test
support running on machines without hugetlbfs; support running with more vm memory layouts. Contact: MST
- tun: fix RX livelock
it's easy for guest to starve out host networking. one way to fix this is to use napi. Contact: MST
- large-order allocations
see 28d6427109d13b0f447cba5761f88d3548e83605 contact: MST
- reduce networking latency:
allow handling short packets from softirq or VCPU context Plan: We are going through the scheduler 3 times (could be up to 5 if softirqd is involved) Consider RX: host irq -> io thread -> VCPU thread -> guest irq -> guest thread. This adds a lot of latency. We can cut it by some 1.5x if we do a bit of work either in the VCPU or softirq context. Testing: netperf TCP RR - should be improved drastically netperf TCP STREAM guest to host - no regression Contact: MST
- device failover to allow migration with assigned devices
https://fedoraproject.org/wiki/Features/Virt_Device_Failover Contact: Gal Hammer, Cole Robinson, Laine Stump, MST
- Reuse vringh code for better maintainability
This project seems abandoned? Contact: Rusty Russell
- use kvm eventfd support for injecting level-triggered interrupts
aim: enable vhost by default for level interrupts. The benefit is security: we want to avoid using userspace virtio net so that vhost-net is always used.
Alex emulated (post & re-enable) level-triggered interrupts in KVM to skip userspace. VFIO already enjoys the performance benefit, let's do it for virtio-pci. Current virtio-pci devices still use level interrupts in userspace. see: kernel: 7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts qemu: 68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers (virtio-pci didn't use the wrappers) e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration
Contact: Amos Kong, MST
- Head of line blocking issue with zerocopy
zerocopy has several defects that cause head-of-line blocking: - it limits the number of pending DMAs - it completes them in order. This means that if some of the DMAs are delayed, all the others are delayed as well. This can be reproduced as follows: - boot two VMs, VM1 (tap1) and VM2 (tap2), on host1 (which has eth0) - set up tbf to limit the tap2 bandwidth to 10Mbit/s - start two netperf instances, one from VM1 to VM2, another from VM1 to an external host whose traffic goes through eth0 on the host. You will then see that not only is VM1 to VM2 throttled, but VM1 to the external host is throttled too. For this issue, a solution is to orphan the frags when enqueuing to a non-work-conserving qdisc. But we have similar issues in other cases: - The card has its own priority queues - Host has two interfaces, one 1G and another 10G, so throttling the 1G one may cause traffic over the 10G one to be throttled. The final solution is to remove receive buffering at tun, and convert it to use NAPI. Contact: Jason Wang, MST Reference: https://lkml.org/lkml/2014/1/17/105
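The effect of in-order completion can be shown with a toy model. This is not vhost code; the function name and the numbers are invented for illustration. Each DMA has a time at which it is actually done, but it can only be reported once every earlier DMA has been reported.

```python
def in_order_completion(finish_times):
    """Return when each DMA can be completed if completions must be in order.

    finish_times -- time at which each DMA is really done, in submission order
    """
    reported = []
    latest = 0.0
    for t in finish_times:
        latest = max(latest, t)  # cannot complete before any earlier DMA
        reported.append(latest)
    return reported
```

For example, with finish times `[1.0, 50.0, 2.0, 3.0]`, where the second packet sits behind the throttled tap2, the result is `[1.0, 50.0, 50.0, 50.0]`: the two fast packets headed for the external host are held back by the slow one.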
- network traffic throttling
The block layer implemented a "continuous leaky bucket" for throttling; we can apply a continuous leaky bucket to network IOPS/BPS * RX/TX/TOTAL. Developer: Amos Kong
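For readers unfamiliar with the technique, a generic leaky bucket can be sketched as below. This is a textbook version written for illustration, not the block layer's implementation; the class name, the units and the explicit `now` argument are assumptions. The same object could be instantiated once per limit (packets or bytes, for RX, TX or total).

```python
class LeakyBucket:
    """Generic leaky bucket: the level drains continuously at `rate`
    units per second and may hold up to `burst` units without delay."""

    def __init__(self, rate, burst):
        self.rate = float(rate)    # units (bytes or packets) drained per second
        self.burst = float(burst)  # level that can be absorbed without waiting
        self.level = 0.0
        self.last = 0.0            # time of the last update, in seconds

    def _leak(self, now):
        elapsed = max(0.0, now - self.last)
        self.level = max(0.0, self.level - elapsed * self.rate)
        self.last = now

    def delay_for(self, now, amount):
        """Account for `amount` units at time `now` and return how many
        seconds the caller should wait before sending them."""
        self._leak(now)
        self.level += amount
        overflow = self.level - self.burst
        return max(0.0, overflow / self.rate)
```

Passing the clock in explicitly keeps the sketch deterministic; a real implementation would read a monotonic clock and arm a timer for the returned delay.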
- Allocate mac_table dynamically
In the future, maybe we can allocate the mac_table dynamically instead of embedding it in VirtIONet. Then we can just do a pointer swap and g_free() and can save a memcpy() here. Contact: Amos Kong
- reduce conflict with VCPU thread
if VCPU and networking run on same CPU, they conflict resulting in bad performance. Fix that, push vhost thread out to another CPU more aggressively. Contact: Amos Kong
- rx mac filtering in tun
the need for this is still not understood as we have filtering in bridge we have a small table of addresses, need to make it larger if we only need filtering for unicast (multicast is handled by IMP filtering) Contact: Amos Kong
- vlan filtering in tun
the need for this is still not understood as we have filtering in bridge Contact: Amos Kong
- add documentation for macvlan and macvtap
recent docs here: http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/ need to integrate in iproute and kernel docs.
- receive side zero copy
The ideal is a NIC with accelerated RFS support, so we can feed the virtio rx buffers into the correct NIC queue. Depends on non promisc NIC support in bridge. Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net" for a very old prototype
- RDMA bridging
- DMA engine (IOAT) use in tun
Old patch here: [PATCH RFC] tun: dma engine support It does not speed things up. Need to see why and what can be done.
- virtio API extension: improve small packet/large buffer performance:
support "reposting" buffers for mergeable buffers, support pool for indirect buffers
- more GSO type support:
The kernel does not yet support more types of GSO: FCOE, GRE, UDP_TUNNEL
- ring aliasing:
using vhost-net as a networking backend with virtio-net in QEMU being what's guest facing. This gives you the best of both worlds: QEMU acts as a first line of defense against a malicious guest while still getting the performance advantages of vhost-net (zero-copy). In fact a bit of complexity in vhost was put there in the vague hope to support something like this: virtio rings are not translated through regular memory tables, instead, vhost gets a pointer to ring address. This allows qemu acting as a man in the middle, verifying the descriptors but not touching the packet data.
- non-virtio device support with vhost
Use vhost interface for guests that don't use virtio-net
- Extend sndbuf scope to int64
The current sndbuf limit is INT_MAX in tap_set_sndbuf(). Large values (like 8388607T) are converted correctly by qapi from the qemu command line; if we want to support such large values, we should extend the sndbuf limit from 'int' to 'int64'. Why is this useful? Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html
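The arithmetic behind the example value can be checked quickly. The sketch assumes that the "T" suffix means 2**40 bytes; the helper name is made up.

```python
INT_MAX = 2**31 - 1    # limit of a 32-bit signed 'int'
INT64_MAX = 2**63 - 1  # limit of a signed 64-bit integer
TIB = 2**40            # assumption: the "T" suffix means 2**40 bytes


def fits(value, limit):
    """True if value is a valid non-negative size not exceeding limit."""
    return 0 <= value <= limit


sndbuf = 8388607 * TIB  # the "8388607T" example

# 8388607 is 2**23 - 1, so the value is 2**63 - 2**40: far above INT_MAX,
# but still just below INT64_MAX.
print(fits(sndbuf, INT_MAX), fits(sndbuf, INT64_MAX))
```

Under that assumption, 8388607T is the largest whole number of T units that still fits in a signed 64-bit integer.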
vague ideas: path to implementation not clear
- change tcp_tso_should_defer for kvm: batch more aggressively
in particular, see below
- tcp: increase gso buffering for cubic,reno
At the moment we push out an skb whenever the limit becomes large enough to send a full-sized TSO skb even if the skb, in fact, is not full-sized. The reason for this seems to be that some congestion avoidance protocols rely on the number of packets in flight to calculate CWND, so if we underuse the available CWND it shrinks which degrades performance: http://email@example.com/msg08738.html
However, there seems to be no reason to do this for protocols such as reno and cubic which don't rely on packets in flight, and so will simply increase CWND a bit more to compensate for the underuse.
- ring redesign:
find a way to test raw ring performance; fix cacheline bounces; reduce interrupts
- irq/numa affinity:
networking goes much faster with irq pinning: both with and without numa. what can be done to make the non-pinned setup go faster?
- vlan filtering in bridge
kernel part is done (Vlad Yasevich). teach qemu to notify libvirt to enable the filter (still to do) (the existing NIC_RX_FILTER_CHANGED event contains vlan-tables)
- tx coalescing
Delay several packets before kicking the device.
- bridging on top of macvlan
add code to forward LRO status from macvlan (not macvtap) back to the lowerdev, so that setting up forwarding from macvlan disables LRO on the lowerdev
- virtio: preserve packets exactly with LRO
LRO is not normally compatible with forwarding. With virtio we are getting packets from a Linux host, so we could conceivably preserve packets exactly even with LRO. I am guessing other hardware could be doing this as well.
What could we do here?
- bridging without promisc mode with OVS
high level issues: not clear what the project is, yet
- security: iptables
At the moment most people disable iptables to get good performance on 10G/s networking. Any way to improve experience?
Going through scheduler and full networking stack twice (host+guest) adds a lot of overhead. Any way to allow bypassing some layers?
Still hard to figure out VM networking: VM networking is through libvirt, host networking through NM. Any way to integrate?
Keeping networking stable is highest priority.
- Write some unit tests for vhost-net/vhost-scsi
- Run weekly test on upstream HEAD covering test matrix with autotest
- Measure the effect of each of the above-mentioned optimizations
- Use autotest network performance regression testing (that runs netperf) - Also test any wild idea that works. Some may be useful.
- Migrate some of the performance regression autotest functionality into Netperf
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ... - Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests) - Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue. - Make the scripts more visible
- e1000: stabilize
DOA test matrix (all combinations should work):
vhost: test both on and off, obviously
test: hotplug/unplug, vlan/mac filtering, netperf, file copy both ways: scp, NFS, NTFS
guests: linux: release and debug kernels, windows
conditions: plain run, run while under migration, vhost on/off migration
networking setup: simple, qos with cgroups
host configuration: host-guest, external-guest