NetworkingTodo
This page should cover all networking-related activity in KVM;
currently most info is related to virtio-net.

TODO: add bugzilla entry links.

=== projects in progress. contributions are still very welcome! ===

* virtio 1.0 support for linux guests
required for maintainability
mid.gmane.org/1414081380-14623-1-git-send-email-mst@redhat.com
Developer: MST, Cornelia Huck

* virtio 1.0 support in qemu
required for maintainability
mid.gmane.org/20141024103839.7162b93f.cornelia.huck@de.ibm.com
Developer: Cornelia Huck, MST

* improve net polling for cpu overcommit
exit the busy loop when another process is runnable (see the sketch below)
mid.gmane.org/20140822073653.GA7372@gmail.com
mid.gmane.org/1408608310-13579-2-git-send-email-jasowang@redhat.com
Another idea is to make busy_read/busy_poll dynamic, like the dynamic PLE window.
Developer: Jason Wang, MST
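
A minimal sketch of the "stop spinning when someone else wants the CPU" idea, in kernel-style C. The data-ready helper vhost_rx_ready() and the microsecond budget are illustrative assumptions, not existing vhost interfaces; need_resched(), signal_pending() and the jiffies helpers are the real kernel primitives such a check would rely on.

 #include <linux/errno.h>
 #include <linux/jiffies.h>
 #include <linux/sched.h>

 static bool vhost_rx_ready(void);   /* hypothetical "is there work?" check */

 static int busy_poll_rx(unsigned long budget_us)
 {
         unsigned long end = jiffies + usecs_to_jiffies(budget_us);

         while (!vhost_rx_ready()) {
                 /* Under CPU overcommit, spinning while another task is
                  * runnable only wastes the host CPU: give it up. */
                 if (need_resched() || signal_pending(current))
                         return -EAGAIN;
                 if (time_after(jiffies, end))
                         return -ETIME;   /* budget exhausted, fall back to notifications */
                 cpu_relax();
         }
         return 0;
 }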

* vhost-net/tun/macvtap cross endian support
mid.gmane.org/1414572130-17014-2-git-send-email-clg@fr.ibm.com
Developer: Cédric Le Goater, MST

* BQL/aggregation for virtio net
dependencies: orphan packets less aggressively, enable tx interrupt
Developers: MST, Jason
* orphan packets less aggressively (was: make pktgen work for virtio-net, or orphan partially)
virtio-net orphans all skbs during tx; this used to be optimal.
Recent changes in the guest networking stack and hardware advances
such as APICv have changed the optimal behaviour for drivers.
We need to revisit optimizations such as orphaning all packets early
to get optimal behaviour.

this should also fix pktgen, which is currently broken with virtio-net:
orphaning all skbs makes pktgen wait forever on the refcnt.
Jason's idea: bring back tx interrupt (partially)
Jason's idea: introduce a flag to tell pktgen not to wait
Discussion here: https://patchwork.kernel.org/patch/1800711/
MST's idea: add a .ndo_tx_polling, not only for pktgen
Developers: Jason Wang, MST

* enable tx interrupt (conditionally?)<br />
Small packet TCP stream performance is not good. This is because virtio-net orphan the packet during ndo_start_xmit() which disable the TCP small packet optimizations like TCP small Queue and AutoCork. The idea is enable the tx interrupt to TCP small packets.<br />
Jason's idea: switch between poll and tx interrupt mode based on recent statistics.<br />
MST's idea: use a per descriptor flag for virtio to force interrupt for a specific packet.<br />
Developer: Jason Wang, MST<br />
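
A hedged sketch of what the per-packet decision could look like inside a virtio-net style .ndo_start_xmit path. The send_queue layout, the helper and the 128-byte threshold are illustrative assumptions; skb_orphan(), virtqueue_enable_cb() and virtqueue_disable_cb() are the real kernel/virtio APIs involved.

 #include <linux/skbuff.h>
 #include <linux/virtio.h>

 #define SMALL_PKT_THRESHOLD 128                /* illustrative cut-off */

 struct send_queue { struct virtqueue *vq; };   /* simplified */

 static void decide_tx_completion(struct send_queue *sq, struct sk_buff *skb)
 {
         if (skb->len < SMALL_PKT_THRESHOLD) {
                 /* Keep ownership of the skb and request a completion
                  * interrupt so TSQ/autocork see when it really left. */
                 virtqueue_enable_cb(sq->vq);
         } else {
                 /* Large packets: orphan early as today and suppress the
                  * interrupt; completions are reaped on the next xmit. */
                 skb_orphan(skb);
                 virtqueue_disable_cb(sq->vq);
         }
 }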

* vhost-net polling
mid.gmane.org/20141029123831.A80F338002D@moren.haifa.ibm.com
Developer: Razya Ladelsky

* support more queues in tun
We limit TUN to 8 queues, but we really want
1 queue per guest CPU. The limit comes from net
core; we need to teach it to allocate an array of
pointers and not an array of queues.
Jason has a draft patch to use flex array.
Another thing is to move the flow caches out of tun_struct.
http://mid.gmane.org/1408369040-1216-1-git-send-email-pagupta@redhat.com
Developers: Pankaj Gupta, Jason Wang (the current per-queue userspace API is sketched below)
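
For context, this is how a userspace backend already attaches multiple queues to one tun/tap device: one file descriptor per queue, each opened with IFF_MULTI_QUEUE; the kernel currently refuses attaches beyond the 8-queue cap discussed above. A minimal sketch with the helper name ours and error handling trimmed.

 #include <fcntl.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/ioctl.h>
 #include <linux/if.h>
 #include <linux/if_tun.h>

 /* Call once per desired queue, with the same device name each time. */
 static int open_tap_queue(const char *name)
 {
         struct ifreq ifr;
         int fd = open("/dev/net/tun", O_RDWR);

         if (fd < 0)
                 return -1;
         memset(&ifr, 0, sizeof(ifr));
         ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
         strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
         if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                 close(fd);
                 return -1;
         }
         return fd;
 }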

* enable multiqueue by default
Multiqueue causes a regression in some workloads, thus
it is off by default. Documentation/networking/scaling.txt
Can we detect this and enable/disable
automatically, so we can make it on by default?
depends on: BQL
This is because GSO tends to batch less when mq is enabled.
https://patchwork.kernel.org/patch/2235191/
Developer: Jason Wang

* rework on flow caches
The current hlist implementation of flow caches has several limitations:
1) in the worst case, linear search will be bad
2) it does not scale
https://patchwork.kernel.org/patch/2025121/
Developer: Jason Wang


* bridge without promisc/allmulti mode in NIC
given hardware support, teach the bridge
to program mac/vlan filtering in the NIC
Helps performance and security on noisy LANs
http://comments.gmane.org/gmane.linux.network/266546
Done for unicast, but not for multicast.
Developer: Vlad Yasevich

* Improve stats, make them more helpful for performance analysis
Developer: Sriram Narasimhan?

* Enable LRO with bridging
Enable GRO for packets coming to the bridge from a tap interface
Better support for Windows LRO
Extend the virtio header with statistics for GRO packets:
number of packets coalesced and number of duplicate ACKs coalesced
Developer: Dmitry Fleytman?

* IPoIB infiniband bridging
Plan: implement macvtap for ipoib and virtio-ipoib
Developer: Marcel Apfelbaum


* interrupt coalescing
Reduce the number of interrupts.
Rx interrupt coalescing should be good for rx stream throughput.
Tx interrupt coalescing will help the "enable tx interrupt conditionally" optimization above.
Developer: Jason Wang

* sharing config interrupts
Support more devices by sharing a single msi vector
between multiple virtio devices.
(Applies to virtio-blk too).
Developer: Amos Kong


* Multi-queue macvtap with real multiple queues
Macvtap only provides multiple queues to userspace in the form of multiple
sockets. As each socket will perform dev_queue_xmit() and we don't
really have multiple real queues on the device, we now have lock
contention. This contention needs to be addressed.
Developer: Vlad Yasevich

* better xmit queueing for tun
when the guest is slower than the host, tun drops packets
aggressively. This is because keeping packets on
the internal queue does not work well.
re-enable the functionality to stop the queue,
probably with some watchdog to help with buggy guests.
Developer: MST

* Dev watchdog for virtio-net:
Implement a watchdog for virtio-net. This will be useful for hunting host bugs early (see the sketch below).
Developer: Julio Faracco <jcfaracco@gmail.com>
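
A hedged sketch of how such a watchdog could hang off the standard netdev timeout hook. The recovery helper is a hypothetical placeholder; .ndo_tx_timeout, dev->watchdog_timeo and netdev_err() are the real mechanisms a driver uses to detect and report a stuck tx queue.

 #include <linux/netdevice.h>

 static void virtnet_try_reset(struct net_device *dev);   /* hypothetical recovery path */

 /* Called by the stack when a tx queue has been stopped for longer than
  * dev->watchdog_timeo: log it loudly so host-side bugs surface early,
  * then attempt recovery. */
 static void virtnet_tx_timeout(struct net_device *dev)
 {
         netdev_err(dev, "tx watchdog: host did not consume tx buffers in time\n");
         virtnet_try_reset(dev);
 }

 /* Wired up via .ndo_tx_timeout = virtnet_tx_timeout in net_device_ops,
  * plus dev->watchdog_timeo = 5 * HZ at probe time. */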

=== projects in need of an owner ===

* improve netdev polling for virtio.
There are two kinds of netdev polling:
- netpoll - used for debugging
- rx busy polling for virtio-net [DONE]
see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9. 1 byte netperf TCP_RR shows a 127% improvement.
Future work is to cooperate with the host, and only do the busy polling when there is no other runnable process on the host CPU.
contact: Jason Wang

* drop vhostforce
it's an optimization, probably not worth it anymore

* avoid userspace virtio-net when vhost is enabled.
ATM we run in userspace until DRIVER_OK;
this doubles our security attack surface,
so it's best avoided.

* feature negotiation for dpdk/vhost user
feature negotiation seems to be broken

* switch dpdk to qemu vhost user
this seems like a better interface than
a character device in userspace,
designed for out-of-process networking

* netmap-like approach to zero copy networking
is anything like this feasible on linux?

* vhost-user: clean up protocol
address multiple issues in vhost user protocol:
missing VHOST_NET_SET_BACKEND
make more messages synchronous (with a reply)
VHOST_SET_MEM_TABLE, VHOST_SET_VRING_CALL
mid.gmane.org/541956B8.1070203@huawei.com
mid.gmane.org/54192136.2010409@huawei.com
Contact: MST

* ethtool selftest support for virtio-net
Implement the selftest ethtool method for virtio-net for regression testing, e.g. the CVEs found for tun/macvtap, qemu and vhost.
http://mid.gmane.org/1409881866-14780-1-git-send-email-hjxiaohust@gmail.com
Contact: Jason Wang

* vhost-net scalability tuning: threading for many VMs
Plan: switch to workqueue shared by many VMs
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html

http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument

Contact: Razya Ladelsky, Bandan Das
Testing: netperf guest to guest

* DPDK with vhost-user
Support vhost-user in addition to vhost net cuse device
Contact: Linhaifeng, MST

* DPDK with vhost-net/user: fix offloads
DPDK requires disabling offloads ATM,
need to fix this.
Contact: MST

* reduce per-device memory allocations
the vhost device is very large due to the need to
keep large arrays of iovecs around.
we do need large arrays for correctness,
but we could move them out of line,
and add short inline arrays for typical use-cases.
contact: MST

* batch tx completions in vhost
vhost already batches up to 64 tx completions for zero copy;
batch non-zero-copy completions as well
contact: Jason Wang

* better parallelize small queues
don't wait for the ring to be full before kicking.
add an api to detect when the ring is almost full (e.g. 3/4) and kick (see the sketch below)
depends on: BQL
contact: MST
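
A hedged sketch of what such an "almost full" early-kick helper could look like in a virtio driver; kick_if_almost_full() and the 3/4 threshold are illustrative, while virtqueue_get_vring_size(), vq->num_free and virtqueue_kick() are the existing virtio ring APIs/fields.

 #include <linux/virtio.h>

 /* Notify the device once roughly 3/4 of the descriptors are in flight,
  * so the backend starts draining while the driver keeps filling. */
 static bool kick_if_almost_full(struct virtqueue *vq)
 {
         unsigned int size = virtqueue_get_vring_size(vq);
         unsigned int in_flight = size - vq->num_free;

         if (in_flight * 4 >= size * 3)
                 return virtqueue_kick(vq);   /* kick now */
         return true;                         /* defer the kick */
 }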

* improve vhost-user unit test
support running on machines without hugetlbfs
support running with more vm memory layouts
Contact: MST

* tun: fix RX livelock
it's easy for a guest to starve out host networking;
one way to fix this is to use napi
Contact: MST

* large-order allocations
see 28d6427109d13b0f447cba5761f88d3548e83605
contact: MST

* reduce networking latency:
allow handling short packets from softirq or VCPU context
Plan:
We are going through the scheduler 3 times
(could be up to 5 if softirqd is involved)
Consider RX: host irq -> io thread -> VCPU thread ->
guest irq -> guest thread.
This adds a lot of latency.
We can cut it by some 1.5x if we do a bit of work
either in the VCPU or softirq context.
Testing: netperf TCP RR - should be improved drastically
netperf TCP STREAM guest to host - no regression
Contact: MST

* device failover to allow migration with assigned devices
https://fedoraproject.org/wiki/Features/Virt_Device_Failover
Contact: Gal Hammer, Cole Robinson, Laine Stump, MST

* Reuse vringh code for better maintainability
This project seems abandoned?
Contact: Rusty Russell

* use kvm eventfd support for injecting level-triggered interrupts
aim: enable vhost by default for level interrupts.
The benefit is security: we want to avoid using userspace
virtio net so that vhost-net is always used.

Alex emulated (post & re-enable) level-triggered interrupts in KVM to
skip userspace. VFIO already enjoyed the performance benefit;
let's do it for virtio-pci. Current virtio-pci devices still use
level interrupts in userspace.
see: kernel:
7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts
qemu:
68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers
(virtio-pci didn't use the wrappers)
e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration

Contact: Amos Kong, MST

* Head of line blocking issue with zerocopy
zerocopy has several constraints that cause a head-of-line blocking problem:
- the number of pending DMAs is limited
- completion is in order
This means that if some of the DMAs are delayed, all the others are delayed as well. This can be reproduced with the following case:
- boot two VMs, VM1 (tap1) and VM2 (tap2), on host1 (which has eth0)
- set up tbf to limit the tap2 bandwidth to 10Mbit/s
- start two netperf instances, one from VM1 to VM2, another from VM1 to an external host whose traffic goes through eth0 on the host
Then you can see that not only is VM1 to VM2 throttled, VM1 to the external host is throttled as well.
For this issue, one solution is to orphan the frags when enqueuing to a non-work-conserving qdisc.
But we have similar issues in other cases:
- The card has its own priority queues
- The host has two interfaces, one 1G and one 10G, so throttling the 1G one may cause traffic over the 10G one to be throttled.
The final solution is to remove receive buffering at tun, and convert it to use NAPI.
Contact: Jason Wang, MST
Reference: https://lkml.org/lkml/2014/1/17/105

* network traffic throttling
the block layer implemented a "continuous leaky bucket" for throttling;
we can apply the continuous leaky bucket to networking:
IOPS/BPS * RX/TX/TOTAL
Developer: Amos Kong

* Allocate mac_table dynamically

In the future, maybe we can allocate the mac_table dynamically instead
of embedding it in VirtIONet. Then we can just do a pointer swap plus
g_free() and save a memcpy() here (see the sketch below).
Contact: Amos Kong
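
A hedged sketch of the pointer-swap idea in QEMU-style C. The MacTable layout and the helper are simplified illustrations, not the actual hw/net/virtio-net.c structures; g_free() is the real glib call that would replace the in-place memcpy().

 #include <glib.h>
 #include <stdint.h>

 #define ETH_ALEN 6

 typedef struct MacTable {
     int in_use;
     uint8_t *macs;            /* in_use * ETH_ALEN bytes, heap-allocated */
 } MacTable;

 /* Install a freshly parsed table: swap pointers and free the old buffer,
  * instead of memcpy()ing into a fixed array embedded in VirtIONet. */
 static void mac_table_replace(MacTable *table, uint8_t *new_macs, int count)
 {
     uint8_t *old = table->macs;

     table->macs = new_macs;
     table->in_use = count;
     g_free(old);
 }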

* reduce conflict with VCPU thread
if the VCPU and networking run on the same CPU,
they conflict, resulting in bad performance.
Fix that, push the vhost thread out to another CPU
more aggressively.
Contact: Amos Kong

* rx mac filtering in tun
the need for this is still not understood, as we have filtering in the bridge
we have a small table of addresses, need to make it larger
if we only need filtering for unicast (multicast is handled by IMP filtering)
Contact: Amos Kong

* vlan filtering in tun
the need for this is still not understood, as we have filtering in the bridge
Contact: Amos Kong

* add documentation for macvlan and macvtap
recent docs here:
http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/
need to integrate into iproute and kernel docs.

* receive side zero copy
Ideally we'd have a NIC with accelerated RFS support,
so we can feed the virtio rx buffers into the correct NIC queue.
Depends on non-promisc NIC support in bridge.
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"
for a very old prototype.

* RDMA bridging

* DMA engine (IOAT) use in tun
Old patch here: [PATCH RFC] tun: dma engine support
It does not speed things up. Need to see why and
what can be done.

* virtio API extension: improve small packet/large buffer performance:
support "reposting" buffers for mergeable buffers,
support a pool for indirect buffers

* more GSO type support:
additional GSO types are not yet supported: FCOE, GRE, UDP_TUNNEL

* ring aliasing:
using vhost-net as a networking backend, with virtio-net in QEMU
as the guest-facing device.
This gives you the best of both worlds: QEMU acts as a first
line of defense against a malicious guest while still getting the
performance advantages of vhost-net (zero-copy).
In fact a bit of complexity in vhost was put there in the vague hope to
support something like this: virtio rings are not translated through
regular memory tables; instead, vhost gets a pointer to the ring address.
This allows qemu to act as a man in the middle,
verifying the descriptors but not touching the packet data.

* non-virtio device support with vhost
Use the vhost interface for guests that don't use virtio-net

* Extend sndbuf scope to int64
The current sndbuf limit is INT_MAX in tap_set_sndbuf();
large values (like 8388607T) are converted correctly by qapi from the qemu command line.
If we want to support the large values, we should extend the sndbuf limit from 'int' to 'int64' (see the sketch below).
Why is this useful?
Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html
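
For context, a hedged userspace sketch of why the limit is an int today: the kernel knob behind the option, TUNSETSNDBUF, takes a plain int, so a 64-bit qemu option would still need clamping (or a new kernel interface). This is a stand-alone illustration, not qemu's actual tap_set_sndbuf().

 #include <limits.h>
 #include <stdio.h>
 #include <sys/ioctl.h>
 #include <linux/if_tun.h>

 static int set_tap_sndbuf(int tap_fd, long long requested)
 {
         int sndbuf = requested > INT_MAX ? INT_MAX : (int)requested;  /* clamp to int */

         if (ioctl(tap_fd, TUNSETSNDBUF, &sndbuf) < 0) {
                 perror("TUNSETSNDBUF");
                 return -1;
         }
         return 0;
 }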

=== vague ideas: path to implementation not clear ===

* change tcp_tso_should_defer for kvm: batch more
aggressively.
in particular, see below

* tcp: increase gso buffering for cubic,reno
At the moment we push out an skb whenever the limit becomes
large enough to send a full-sized TSO skb, even if the skb,
in fact, is not full-sized.
The reason for this seems to be that some congestion avoidance
protocols rely on the number of packets in flight to calculate
CWND, so if we under-use the available CWND it shrinks,
which degrades performance:
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html

However, there seems to be no reason to do this for
protocols such as reno and cubic which don't rely on packets in flight,
and so will simply increase CWND a bit more to compensate for the
underuse.

* ring redesign:
find a way to test raw ring performance
fix cacheline bounces
reduce interrupts

* irq/numa affinity:
networking goes much faster with irq pinning:
both with and without numa.
what can be done to make the non-pinned setup go faster?

* vlan filtering in bridge
kernel part is done (Vlad Yasevich)
teach qemu to notify libvirt to enable the filter (still to do; the existing NIC_RX_FILTER_CHANGED event contains the vlan tables)

* tx coalescing
Delay several packets before kicking the device (see the sketch below).
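
A hedged sketch of one way a virtio-style driver could coalesce kicks: suppress the notification while the stack says more packets are on the way, or until a small batch has built up. The pending counter and batch size are illustrative; skb->xmit_more (as of the 3.18-era stack) and virtqueue_kick() are the real mechanisms.

 #include <linux/skbuff.h>
 #include <linux/virtio.h>

 #define KICK_BATCH 8                     /* illustrative batch size */

 /* Called at the end of .ndo_start_xmit, after the skb is on the ring. */
 static void maybe_kick(struct virtqueue *vq, struct sk_buff *skb,
                        unsigned int *pending)
 {
         (*pending)++;
         /* Only notify when no further packets are queued behind this one
          * (xmit_more is false) or when the batch got large enough. */
         if (!skb->xmit_more || *pending >= KICK_BATCH) {
                 virtqueue_kick(vq);
                 *pending = 0;
         }
 }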

* bridging on top of macvlan
add code to forward LRO status from macvlan (not macvtap)
back to the lowerdev, so that setting up forwarding
from macvlan disables LRO on the lowerdev

* virtio: preserve packets exactly with LRO
LRO is not normally compatible with forwarding.
With virtio we are getting packets from a linux host,
so we could conceivably preserve packets exactly
even with LRO. I am guessing other hardware could be
doing this as well.

* vxlan
What could we do here?

* bridging without promisc mode with OVS

=== high level issues: not clear what the project is, yet ===

* security: iptables
At the moment most people disable iptables to get
good performance on 10Gb/s networking.
Any way to improve the experience?

* performance
Going through the scheduler and the full networking stack twice
(host+guest) adds a lot of overhead.
Any way to allow bypassing some layers?

* manageability
It is still hard to figure out VM networking:
VM networking is through libvirt, host networking through NM.
Any way to integrate them?

=== testing projects ===
Keeping networking stable is the highest priority.

* Write some unit tests for vhost-net/vhost-scsi
* Run a weekly test on upstream HEAD covering the test matrix with autotest
* Measure the effect of each of the above-mentioned optimizations
- Use autotest network performance regression testing (that runs netperf)
- Also test any wild idea that works. Some may be useful.
* Migrate some of the performance regression autotest functionality into Netperf
- Get the CPU utilization of the Host and the other party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...
- Run Netperf in demo mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.
- Make the scripts more visible

=== non-virtio-net devices ===
* e1000: stabilize

=== test matrix ===

DOA test matrix (all combinations should work):
vhost: test both on and off, obviously
test: hotplug/unplug, vlan/mac filtering, netperf,
file copy both ways: scp, NFS, NTFS
guests: linux: release and debug kernels, windows
conditions: plain run, run while under migration,
vhost on/off migration
networking setup: simple, qos with cgroups
host configuration: host-guest, external-guest
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very wellcome! ===<br />
<br />
* virtio 1.0 support for linux guests<br />
required for maintainatibility<br />
mid.gmane.org/1414081380-14623-1-git-send-email-mst@redhat.com<br />
Developer: MST,Cornelia Huck<br />
<br />
* virtio 1.0 support in qemu<br />
required for maintainatibility<br />
mid.gmane.org/20141024103839.7162b93f.cornelia.huck@de.ibm.com<br />
Developer: Cornelia Huck, MST<br />
<br />
* improve net polling for cpu overcommit<br />
exit busy loop when another process is runnable<br />
mid.gmane.org/20140822073653.GA7372@gmail.com<br />
mid.gmane.org/1408608310-13579-2-git-send-email-jasowang@redhat.com<br />
Developer: Jason Wang, MST<br />
<br />
* vhost-net/tun/macvtap cross endian support<br />
mid.gmane.org/1414572130-17014-2-git-send-email-clg@fr.ibm.com<br />
Developer: Cédric Le Goater, MST<br />
<br />
* BQL/aggregation for virtio net<br />
dependencies: orphan packets less agressively, enable tx interrupt <br />
Developers: MST, Jason<br />
* orphan packets less agressively (was make pktgen works for virtio-net ( or partially orphan ))<br />
virtio-net orphans all skbs during tx, this used to be optimal.<br />
Recent changes in guest networking stack and hardware advances<br />
such as APICv changed optimal behaviour for drivers.<br />
We need to revisit optimizations such as orphaning all packets early<br />
to have optimal behaviour.<br />
<br />
this should also fix pktgen which is currently broken with virtio net:<br />
orphaning all skbs makes pktgen wait for ever to the refcnt.<br />
Jason's idea: bring back tx interrupt (partially)<br />
Jason's idea: introduce a flag to tell pktgen not for wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developers: Jason Wang, MST<br />
<br />
* enable tx interrupt (conditionally?)<br />
Small packet TCP stream performance is not good. This is because virtio-net orphan the packet during ndo_start_xmit() which disable the TCP small packet optimizations like TCP small Queue and AutoCork. The idea is enable the tx interrupt to TCP small packets.<br />
Jason's idea: switch between poll and tx interrupt mode based on recent statistics.<br />
MST's idea: use a per descriptor flag for virtio to force interrupt for a specific packet.<br />
Developer: Jason Wang, MST<br />
<br />
<br />
<br />
* vhost-net polling<br />
mid.gmane.org/20141029123831.A80F338002D@moren.haifa.ibm.com<br />
Developer: Razya Ladelsky<br />
<br />
<br />
* support more queues in tun<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
http://mid.gmane.org/1408369040-1216-1-git-send-email-pagupta@redhat.com<br />
Developers: Pankaj Gupta, Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Documentation/networking/scaling.txt<br />
Detect and enable/disable<br />
automatically so we can make it on by default?<br />
depends on: BQL<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* ethtool seftest support for virtio-net<br />
Implement selftest ethtool method for virtio-net for regression test e.g the CVEs found for tun/macvtap, qemu and vhost.<br />
http://mid.gmane.org/1409881866-14780-1-git-send-email-hjxiaohust@gmail.com<br />
Developers: Hengjinxiao,Jason Wang<br />
<br />
<br />
* bridge without promisc/allmulti mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Done for unicast, but not for multicast.<br />
Developer: Vlad Yasevich<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan?<br />
<br />
* Enable LRO with bridging<br />
Enable GRO for packets coming to bridge from a tap interface<br />
Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman?<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: Marcel Apfelbaum<br />
<br />
<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupt<br />
Rx interrupt coalescing should be good for rx stream throughput.<br />
Tx interrupt coalescing will help the optimization of enabling tx interrupt conditionally.<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
<br />
* Multi-queue macvtap with real multiple queues<br />
Macvtap only provides multiple queues to user in the form of multiple<br />
sockets. As each socket will perform dev_queue_xmit() and we don't<br />
really have multiple real queues on the device, we now have a lock<br />
contention. This contention needs to be addressed.<br />
Developer: Vlad Yasevich<br />
<br />
* better xmit queueing for tun<br />
when guest is slower than host, tun drops packets<br />
aggressively. This is because keeping packets on<br />
the internal queue does not work well.<br />
re-enable functionality to stop queue,<br />
probably with some watchdog to help with buggy guests.<br />
Developer: MST<br />
<br />
<br />
=== projects in need of an owner ===<br />
<br />
* improve netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- rx busy polling for virtio-net [DONE]<br />
see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9. 1 byte netperf TCP_RR shows 127% improvement.<br />
Future work is co-operate with host, and only does the busy polling when there's no other process in host cpu. <br />
contact: Jason Wang<br />
<br />
* drop vhostforce<br />
it's an optimization, probbaly not worth it anymore<br />
<br />
* feature negotiation for dpdk/vhost user<br />
feature negotiation seems to be broken<br />
<br />
* switch dpdk to qemu vhost user<br />
this seems like a better interface than<br />
character device in userspace,<br />
designed for out of process networking<br />
<br />
* netmap - like approach to zero copy networking<br />
is anything like this feasible on linux?<br />
<br />
* vhost-user: clean up protocol<br />
address multiple issues in vhost user protocol:<br />
missing VHOST_NET_SET_BACKEND<br />
make more messages synchronous (with a reply)<br />
VHOST_SET_MEM_TABLE, VHOST_SET_VRING_CALL<br />
mid.gmane.org/541956B8.1070203@huawei.com<br />
mid.gmane.org/54192136.2010409@huawei.com<br />
Contact: MST<br />
<br />
<br />
* Dev watchdog for virtio-net:<br />
Implement a watchdog for virtio-net. This will be useful for hunting host bugs early.<br />
Contact: Jason Wang<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Contact: Razya Ladelsky, Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* DPDK with vhost-user<br />
Support vhost-user in addition to vhost net cuse device<br />
Contact: Linhaifeng, MST<br />
<br />
* DPDK with vhost-net/user: fix offloads<br />
DPDK requires disabling offloads ATM,<br />
need to fix this.<br />
Contact: MST<br />
<br />
* reduce per-device memory allocations<br />
vhost device is very large due to need to<br />
keep large arrays of iovecs around.<br />
we do need large arrays for correctness,<br />
but we could move them out of line,<br />
and add short inline arrays for typical use-cases.<br />
contact: MST<br />
<br />
* batch tx completions in vhost<br />
vhost already batches up to 64 tx completions for zero copy<br />
batch non zero copy as well<br />
contact: Jason Wang<br />
<br />
* better parallelize small queues<br />
don't wait for ring full to kick.<br />
add api to detect ring almost full (e.g. 3/4) and kick<br />
depends on: BQL<br />
contact: MST<br />
<br />
* improve vhost-user unit test<br />
support running on machines without hugetlbfs<br />
support running with more vm memory layouts<br />
Contact: MST<br />
<br />
* tun: fix RX livelock<br />
it's easy for guest to starve out host networking<br />
open way to fix this is to use napi <br />
Contact: MST<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
contact: MST<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Contact: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Contact: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
This project seems abandoned?<br />
Contact: Rusty Russell<br />
<br />
* use kvm eventfd support for injecting level-triggered interrupts<br />
aim: enable vhost by default for level interrupts.<br />
The benefit is security: we want to avoid using userspace<br />
virtio net so that vhost-net is always used.<br />
<br />
Alex emulated (post & re-enable) level-triggered interrupt in KVM for<br />
skipping userspace. VFIO already enjoied the performance benefit,<br />
let's do it for virtio-pci. Current virtio-pci devices still use<br />
level-interrupt in userspace.<br />
see: kernel:<br />
7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts<br />
qemu:<br />
68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers<br />
(virtio-pci didn't use the wrappers)<br />
e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration<br />
<br />
Contact: Amos Kong, MST <br />
<br />
* Head of line blocking issue with zerocopy<br />
zerocopy has several defects that will cause head of line blocking problem:<br />
- limit the number of pending DMAs<br />
- complete in order<br />
This means is one of some of the DMAs were delayed, all other will also delayed. This could be reproduced with following case:<br />
- boot two VMS VM1(tap1) and VM2(tap2) on host1 (has eth0)<br />
- setup tbf to limit the tap2 bandwidth to 10Mbit/s<br />
- start two netperf instances one from VM1 to VM2, another from VM1 to an external host whose traffic go through eth0 on host<br />
Then you can see not only VM1 to VM2 is throttled, but also VM1 to external host were also throttled.<br />
For this issue, a solution is orphan the frags when en queuing to non work conserving qdisc.<br />
But we have have similar issues in other case:<br />
- The card has its own priority queues<br />
- Host has two interface, one is 1G another is 10G, so throttle 1G may lead traffic over 10G to be throttled.<br />
The final solution is to remove receive buffering at tun, and convert it to use NAPI<br />
Contact: Jason Wang, MST<br />
Reference: https://lkml.org/lkml/2014/1/17/105<br />
<br />
* network traffic throttling<br />
block implemented "continuous leaky bucket" for throttling<br />
we can use continuous leaky bucket to network<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embed it in VirtIONet. Then we can just does a pointer swap and<br />
gfree() and can save a memcpy() here.<br />
Contact: Amos Kong<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
Contact: Amos Kong<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
Contact: Amos Kong<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
Contact: Amos Kong<br />
<br />
<br />
<br />
* add documentation for macvlan and macvtap<br />
recent docs here:<br />
http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/<br />
need to integrate in iproute and kernel docs.<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA emgine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
Kernel not support more type of GSO: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
* Extend sndbuf scope to int64<br />
Current sndbuf limit is INT_MAX in tap_set_sndbuf(),<br />
large values (like 8388607T) can be converted rightly by qapi from qemu commandline,<br />
If we want to support the large values, we should extend sndbuf limit from 'int' to 'int64'<br />
Why is this useful?<br />
Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* change tcp_tso_should_defer for kvm: batch more<br />
aggressively.<br />
in particular, see below<br />
<br />
* tcp: increase gso buffering for cubic,reno<br />
At the moment we push out an skb whenever the limit becomes<br />
large enough to send a full-sized TSO skb even if the skb,<br />
in fact, is not full-sized.<br />
The reason for this seems to be that some congestion avoidance<br />
protocols rely on the number of packets in flight to calculate<br />
CWND, so if we underuse the available CWND it shrinks<br />
which degrades performance:<br />
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html<br />
<br />
However, there seems to be no reason to do this for<br />
protocols such as reno and cubic which don't rely on packets in flight,<br />
and so will simply increase CWND a bit more to compensate for the<br />
underuse.<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do) (existed NIC_RX_FILTER_CHANGED event contains vlan-tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kick the device.<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
virtio we are getting packets from a linux host,<br />
so we could thinkably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disables iptables to get<br />
good performance on 10G/s networking.<br />
Any way to improve experience?<br />
<br />
* performance<br />
Going through scheduler and full networking stack twice<br />
(host+guest) adds a lot of overhead<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking,<br />
VM networking is through libvirt, host networking through NM<br />
Any way to integrate?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=118505NetworkingTodo2014-11-10T11:37:09Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very wellcome! ===<br />
<br />
* virtio 1.0 support for linux guests<br />
required for maintainatibility<br />
mid.gmane.org/1414081380-14623-1-git-send-email-mst@redhat.com<br />
Developer: MST,Cornelia Huck<br />
<br />
* virtio 1.0 support in qemu<br />
required for maintainatibility<br />
mid.gmane.org/20141024103839.7162b93f.cornelia.huck@de.ibm.com<br />
Developer: Cornelia Huck, MST<br />
<br />
* improve net polling for cpu overcommit<br />
exit busy loop when another process is runnable<br />
mid.gmane.org/20140822073653.GA7372@gmail.com<br />
mid.gmane.org/1408608310-13579-2-git-send-email-jasowang@redhat.com<br />
Developer: Jason Wang, MST<br />
<br />
* vhost-net/tun/macvtap cross endian support<br />
mid.gmane.org/1414572130-17014-2-git-send-email-clg@fr.ibm.com<br />
Developer: Cédric Le Goater, MST<br />
<br />
* BQL/aggregation for virtio net<br />
dependencies: orphan packets less agressively, enable tx interrupt <br />
Developers: MST, Jason<br />
* orphan packets less agressively (was make pktgen works for virtio-net ( or partially orphan ))<br />
virtio-net orphans all skbs during tx, this used to be optimal.<br />
Recent changes in guest networking stack and hardware advances<br />
such as APICv changed optimal behaviour for drivers.<br />
We need to revisit optimizations such as orphaning all packets early<br />
to have optimal behaviour.<br />
<br />
this should also fix pktgen which is currently broken with virtio net:<br />
orphaning all skbs makes pktgen wait for ever to the refcnt.<br />
Jason's idea: bring back tx interrupt (partially)<br />
Jason's idea: introduce a flag to tell pktgen not for wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developers: Jason Wang, MST<br />
<br />
* enable tx interrupt (conditionally?)<br />
Small packet TCP stream performance is not good. This is because virtio-net orphan the packet during ndo_start_xmit() which disable the TCP small packet optimizations like TCP small Queue and AutoCork. The idea is enable the tx interrupt to TCP small packets.<br />
Jason's idea: switch between poll and tx interrupt mode based on recent statistics.<br />
MST's idea: use a per descriptor flag for virtio to force interrupt for a specific packet.<br />
Developer: Jason Wang, MST<br />
<br />
<br />
<br />
* vhost-net polling<br />
mid.gmane.org/20141029123831.A80F338002D@moren.haifa.ibm.com<br />
Developer: Razya Ladelsky<br />
<br />
<br />
* support more queues in tun<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
http://mid.gmane.org/1408369040-1216-1-git-send-email-pagupta@redhat.com<br />
Developers: Pankaj Gupta, Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Documentation/networking/scaling.txt<br />
Detect and enable/disable<br />
automatically so we can make it on by default?<br />
depends on: BQL<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* ethtool seftest support for virtio-net<br />
Implement selftest ethtool method for virtio-net for regression test e.g the CVEs found for tun/macvtap, qemu and vhost.<br />
http://mid.gmane.org/1409881866-14780-1-git-send-email-hjxiaohust@gmail.com<br />
Developers: Hengjinxiao,Jason Wang<br />
<br />
<br />
* bridge without promisc/allmulti mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Done for unicast, but not for multicast.<br />
Developer: Vlad Yasevich<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan?<br />
<br />
* Enable LRO with bridging<br />
Enable GRO for packets coming to bridge from a tap interface<br />
Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman?<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: Marcel Apfelbaum<br />
<br />
<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupt<br />
Rx interrupt coalescing should be good for rx stream throughput.<br />
Tx interrupt coalescing will help the optimization of enabling tx interrupt conditionally.<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
<br />
* Multi-queue macvtap with real multiple queues<br />
Macvtap only provides multiple queues to user in the form of multiple<br />
sockets. As each socket will perform dev_queue_xmit() and we don't<br />
really have multiple real queues on the device, we now have a lock<br />
contention. This contention needs to be addressed.<br />
Developer: Vlad Yasevich<br />
<br />
* better xmit queueing for tun<br />
when guest is slower than host, tun drops packets<br />
aggressively. This is because keeping packets on<br />
the internal queue does not work well.<br />
re-enable functionality to stop queue,<br />
probably with some watchdog to help with buggy guests.<br />
Developer: MST<br />
<br />
<br />
=== projects in need of an owner ===<br />
<br />
* improve netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- rx busy polling for virtio-net [DONE]<br />
see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9. 1 byte netperf TCP_RR shows 127% improvement.<br />
Future work is co-operate with host, and only does the busy polling when there's no other process in host cpu. <br />
contact: Jason Wang<br />
<br />
* drop vhostforce<br />
it's an optimization, probbaly not worth it anymore<br />
<br />
* feature negotiation for dpdk/vhost user<br />
feature negotiation seems to be broken<br />
<br />
<br />
* vhost-user: clean up protocol<br />
address multiple issues in vhost user protocol:<br />
missing VHOST_NET_SET_BACKEND<br />
make more messages synchronous (with a reply)<br />
VHOST_SET_MEM_TABLE, VHOST_SET_VRING_CALL<br />
mid.gmane.org/541956B8.1070203@huawei.com<br />
mid.gmane.org/54192136.2010409@huawei.com<br />
Contact: MST<br />
<br />
<br />
* Dev watchdog for virtio-net:<br />
Implement a watchdog for virtio-net. This will be useful for hunting host bugs early.<br />
Contact: Jason Wang<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Contact: Razya Ladelsky, Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* DPDK with vhost-user<br />
Support vhost-user in addition to vhost net cuse device<br />
Contact: Linhaifeng, MST<br />
<br />
* DPDK with vhost-net/user: fix offloads<br />
DPDK requires disabling offloads ATM,<br />
need to fix this.<br />
Contact: MST<br />
<br />
* reduce per-device memory allocations<br />
vhost device is very large due to need to<br />
keep large arrays of iovecs around.<br />
we do need large arrays for correctness,<br />
but we could move them out of line,<br />
and add short inline arrays for typical use-cases.<br />
contact: MST<br />
<br />
* batch tx completions in vhost<br />
vhost already batches up to 64 tx completions for zero copy<br />
batch non zero copy as well<br />
contact: Jason Wang<br />
<br />
* better parallelize small queues<br />
don't wait for ring full to kick.<br />
add api to detect ring almost full (e.g. 3/4) and kick<br />
depends on: BQL<br />
contact: MST<br />
<br />
* improve vhost-user unit test<br />
support running on machines without hugetlbfs<br />
support running with more vm memory layouts<br />
Contact: MST<br />
<br />
* tun: fix RX livelock<br />
it's easy for guest to starve out host networking<br />
open way to fix this is to use napi <br />
Contact: MST<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
contact: MST<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Contact: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Contact: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
This project seems abandoned?<br />
Contact: Rusty Russell<br />
<br />
* use kvm eventfd support for injecting level-triggered interrupts<br />
aim: enable vhost by default for level interrupts.<br />
The benefit is security: we want to avoid using userspace<br />
virtio net so that vhost-net is always used.<br />
<br />
Alex emulated (post & re-enable) level-triggered interrupt in KVM for<br />
skipping userspace. VFIO already enjoied the performance benefit,<br />
let's do it for virtio-pci. Current virtio-pci devices still use<br />
level-interrupt in userspace.<br />
see: kernel:<br />
7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts<br />
qemu:<br />
68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers<br />
(virtio-pci didn't use the wrappers)<br />
e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration<br />
<br />
Contact: Amos Kong, MST <br />
<br />
* Head of line blocking issue with zerocopy<br />
zerocopy has several defects that will cause head of line blocking problem:<br />
- limit the number of pending DMAs<br />
- complete in order<br />
This means is one of some of the DMAs were delayed, all other will also delayed. This could be reproduced with following case:<br />
- boot two VMS VM1(tap1) and VM2(tap2) on host1 (has eth0)<br />
- setup tbf to limit the tap2 bandwidth to 10Mbit/s<br />
- start two netperf instances one from VM1 to VM2, another from VM1 to an external host whose traffic go through eth0 on host<br />
Then you can see not only VM1 to VM2 is throttled, but also VM1 to external host were also throttled.<br />
For this issue, a solution is orphan the frags when en queuing to non work conserving qdisc.<br />
But we have have similar issues in other case:<br />
- The card has its own priority queues<br />
- Host has two interface, one is 1G another is 10G, so throttle 1G may lead traffic over 10G to be throttled.<br />
The final solution is to remove receive buffering at tun, and convert it to use NAPI<br />
Contact: Jason Wang, MST<br />
Reference: https://lkml.org/lkml/2014/1/17/105<br />
<br />
* network traffic throttling<br />
block implemented "continuous leaky bucket" for throttling<br />
we can use continuous leaky bucket to network<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embed it in VirtIONet. Then we can just does a pointer swap and<br />
gfree() and can save a memcpy() here.<br />
Contact: Amos Kong<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
Contact: Amos Kong<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
Contact: Amos Kong<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
Contact: Amos Kong<br />
<br />
<br />
<br />
* add documentation for macvlan and macvtap<br />
recent docs here:<br />
http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/<br />
need to integrate in iproute and kernel docs.<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA emgine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
Kernel not support more type of GSO: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
* Extend sndbuf scope to int64<br />
The current sndbuf limit is INT_MAX in tap_set_sndbuf();<br />
large values (like 8388607T) are converted correctly by qapi from the qemu command line.<br />
If we want to support such large values, we should extend the sndbuf limit from 'int' to 'int64' (see the sketch below).<br />
Why is this useful?<br />
Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html<br />
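A minimal sketch of what the int64 handling could look like, assuming the qapi option becomes int64 while the TUNSETSNDBUF ioctl keeps taking a plain int; the helper name is hypothetical, not the actual tap_set_sndbuf():<br />
<pre>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/if_tun.h>

/* Hypothetical helper: apply an int64-ranged sndbuf option to a tap fd. */
static int tap_apply_sndbuf64(int fd, int64_t sndbuf)
{
    int val;

    if (sndbuf <= 0 || sndbuf > INT_MAX) {
        val = INT_MAX;      /* TUNSETSNDBUF takes a plain int: clamp; 0 means unlimited today */
    } else {
        val = (int)sndbuf;
    }
    if (ioctl(fd, TUNSETSNDBUF, &val) < 0) {
        return -errno;
    }
    return 0;
}
</pre>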
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* change tcp_tso_should_defer for kvm: batch more<br />
aggressively.<br />
in particular, see below<br />
<br />
* tcp: increase gso buffering for cubic,reno<br />
At the moment we push out an skb whenever the limit becomes<br />
large enough to send a full-sized TSO skb even if the skb,<br />
in fact, is not full-sized.<br />
The reason for this seems to be that some congestion avoidance<br />
protocols rely on the number of packets in flight to calculate<br />
CWND, so if we underuse the available CWND it shrinks,<br />
which degrades performance:<br />
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html<br />
<br />
However, there seems to be no reason to do this for<br />
protocols such as reno and cubic which don't rely on packets in flight,<br />
and so will simply increase CWND a bit more to compensate for the<br />
underuse.<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do); the existing NIC_RX_FILTER_CHANGED event already contains the vlan tables<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device (see the sketch below).<br />
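A minimal sketch of the batching idea in generic C: defer the device notification until a packet count or time threshold is reached. The names and thresholds are illustrative, not the virtio driver API:<br />
<pre>
#include <stdbool.h>
#include <stdint.h>

#define TX_BATCH   8U              /* kick after this many queued packets... */
#define TX_MAX_NS  50000ULL        /* ...or after 50us, whichever comes first */

struct txq {
    unsigned pending;              /* packets queued since the last kick */
    uint64_t first_ns;             /* time the first pending packet was queued */
};

/* Stand-in for notifying the device (virtqueue kick / doorbell write). */
static void device_kick(struct txq *q)
{
    q->pending = 0;
}

/* Called per transmitted packet; returns true if the device was kicked now. */
static bool tx_coalesce(struct txq *q, uint64_t now_ns)
{
    if (q->pending++ == 0) {
        q->first_ns = now_ns;
    }
    if (q->pending >= TX_BATCH || now_ns - q->first_ns >= TX_MAX_NS) {
        device_kick(q);
        return true;
    }
    return false;                  /* keep batching; a timer must guarantee a later kick */
}
</pre>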
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disable iptables to get<br />
good performance on 10Gb/s networking.<br />
Any way to improve the experience?<br />
<br />
* performance<br />
Going through scheduler and full networking stack twice<br />
(host+guest) adds a lot of overhead<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking:<br />
VM networking is configured through libvirt, host networking through NetworkManager.<br />
Any way to integrate?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is the highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=118502NetworkingTodo2014-11-10T11:13:11Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very wellcome! ===<br />
<br />
* virtio 1.0 support for linux guests<br />
required for maintainatibility<br />
mid.gmane.org/1414081380-14623-1-git-send-email-mst@redhat.com<br />
Developer: MST,Cornelia Huck<br />
<br />
* virtio 1.0 support in qemu<br />
required for maintainatibility<br />
mid.gmane.org/20141024103839.7162b93f.cornelia.huck@de.ibm.com<br />
Developer: Cornelia Huck, MST<br />
<br />
* improve net polling for cpu overcommit<br />
exit busy loop when another process is runnable<br />
mid.gmane.org/20140822073653.GA7372@gmail.com<br />
mid.gmane.org/1408608310-13579-2-git-send-email-jasowang@redhat.com<br />
Developer: Jason Wang, MST<br />
<br />
* vhost-net/tun/macvtap cross endian support<br />
mid.gmane.org/1414572130-17014-2-git-send-email-clg@fr.ibm.com<br />
Developer: Cédric Le Goater, MST<br />
<br />
* BQL/aggregation for virtio net<br />
dependencies: orphan packets less agressively, enable tx interrupt <br />
Developers: MST, Jason<br />
* orphan packets less agressively (was make pktgen works for virtio-net ( or partially orphan ))<br />
virtio-net orphans all skbs during tx, this used to be optimal.<br />
Recent changes in guest networking stack and hardware advances<br />
such as APICv changed optimal behaviour for drivers.<br />
We need to revisit optimizations such as orphaning all packets early<br />
to have optimal behaviour.<br />
<br />
this should also fix pktgen which is currently broken with virtio net:<br />
orphaning all skbs makes pktgen wait for ever to the refcnt.<br />
Jason's idea: bring back tx interrupt (partially)<br />
Jason's idea: introduce a flag to tell pktgen not for wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developers: Jason Wang, MST<br />
<br />
* enable tx interrupt (conditionally?)<br />
Small packet TCP stream performance is not good. This is because virtio-net orphan the packet during ndo_start_xmit() which disable the TCP small packet optimizations like TCP small Queue and AutoCork. The idea is enable the tx interrupt to TCP small packets.<br />
Jason's idea: switch between poll and tx interrupt mode based on recent statistics.<br />
MST's idea: use a per descriptor flag for virtio to force interrupt for a specific packet.<br />
Developer: Jason Wang, MST<br />
<br />
<br />
<br />
* vhost-net polling<br />
mid.gmane.org/20141029123831.A80F338002D@moren.haifa.ibm.com<br />
Developer: Razya Ladelsky<br />
<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Razya Ladelsky, Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* support more queues in tun<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
http://mid.gmane.org/1408369040-1216-1-git-send-email-pagupta@redhat.com<br />
Developers: Pankaj Gupta, Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Documentation/networking/scaling.txt<br />
Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* ethtool seftest support for virtio-net<br />
Implement selftest ethtool method for virtio-net for regression test e.g the CVEs found for tun/macvtap, qemu and vhost.<br />
mid.gmane.org/1409881866-14780-1-git-send-email-hjxiaohust@gmail.com<br />
Developers: Hengjinxiao,Jason Wang<br />
<br />
* Dev watchdog for virtio-net:<br />
Implement a watchdog for virtio-net. This will be useful for hunting host bugs early.<br />
Developer: Jason Wang<br />
<br />
* bridge without promisc/allmulti mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Done for unicast, but not for multicast.<br />
Developer: Vlad Yasevich<br />
<br />
<br />
* vhost-user: clean up protocol<br />
address multiple issues in vhost user protocol:<br />
missing VHOST_NET_SET_BACKEND<br />
make more messages synchronous (with a reply)<br />
VHOST_SET_MEM_TABLE, VHOST_SET_VRING_CALL<br />
mid.gmane.org/541956B8.1070203@huawei.com<br />
mid.gmane.org/54192136.2010409@huawei.com<br />
Developer: MST?<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan?<br />
<br />
* Enable LRO with bridging<br />
Enable GRO for packets coming to bridge from a tap interface<br />
Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman?<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: Marcel Apfelbaum<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- rx busy polling for virtio-net [DONE]<br />
see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9. 1 byte netperf TCP_RR shows 127% improvement.<br />
Future work is co-operate with host, and only does the busy polling when there's no other process in host cpu. <br />
Developer: Jason Wang<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupt<br />
Rx interrupt coalescing should be good for rx stream throughput.<br />
Tx interrupt coalescing will help the optimization of enabling tx interrupt conditionally.<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
<br />
* Multi-queue macvtap with real multiple queues<br />
Macvtap only provides multiple queues to user in the form of multiple<br />
sockets. As each socket will perform dev_queue_xmit() and we don't<br />
really have multiple real queues on the device, we now have a lock<br />
contention. This contention needs to be addressed.<br />
Developer: Vlad Yasevich<br />
<br />
* better xmit queueing for tun<br />
when guest is slower than host, tun drops packets<br />
aggressively. This is because keeping packets on<br />
the internal queue does not work well.<br />
re-enable functionality to stop queue,<br />
probably with some watchdog to help with buggy guests.<br />
Developer: MST<br />
<br />
<br />
=== projects in need of an owner ===<br />
<br />
* DPDK with vhost-user<br />
Support vhost-user in addition to vhost net cuse device<br />
Contact: Linhaifeng, MST<br />
<br />
* DPDK with vhost-net/user: fix offloads<br />
DPDK requires disabling offloads ATM,<br />
need to fix this.<br />
Contact: MST<br />
<br />
* reduce per-device memory allocations<br />
vhost device is very large due to need to<br />
keep large arrays of iovecs around.<br />
we do need large arrays for correctness,<br />
but we could move them out of line,<br />
and add short inline arrays for typical use-cases.<br />
contact: MST<br />
<br />
* batch tx completions in vhost<br />
vhost already batches up to 64 tx completions for zero copy<br />
batch non zero copy as well<br />
contact: Jason Wang<br />
<br />
* better parallelize small queues<br />
don't wait for ring full to kick.<br />
add api to detect ring almost full (e.g. 3/4) and kick<br />
depends on: BQL<br />
contact: MST<br />
<br />
* improve vhost-user unit test<br />
support running on machines without hugetlbfs<br />
support running with more vm memory layouts<br />
Contact: MST<br />
<br />
* tun: fix RX livelock<br />
it's easy for guest to starve out host networking<br />
open way to fix this is to use napi <br />
Contact: MST<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
contact: MST<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Contact: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Contact: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
This project seems abandoned?<br />
Contact: Rusty Russell<br />
<br />
* use kvm eventfd support for injecting level-triggered interrupts<br />
aim: enable vhost by default for level interrupts.<br />
The benefit is security: we want to avoid using userspace<br />
virtio net so that vhost-net is always used.<br />
<br />
Alex emulated (post & re-enable) level-triggered interrupt in KVM for<br />
skipping userspace. VFIO already enjoied the performance benefit,<br />
let's do it for virtio-pci. Current virtio-pci devices still use<br />
level-interrupt in userspace.<br />
see: kernel:<br />
7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts<br />
qemu:<br />
68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers<br />
(virtio-pci didn't use the wrappers)<br />
e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration<br />
<br />
Contact: Amos Kong, MST <br />
<br />
* Head of line blocking issue with zerocopy<br />
zerocopy has several defects that will cause head of line blocking problem:<br />
- limit the number of pending DMAs<br />
- complete in order<br />
This means is one of some of the DMAs were delayed, all other will also delayed. This could be reproduced with following case:<br />
- boot two VMS VM1(tap1) and VM2(tap2) on host1 (has eth0)<br />
- setup tbf to limit the tap2 bandwidth to 10Mbit/s<br />
- start two netperf instances one from VM1 to VM2, another from VM1 to an external host whose traffic go through eth0 on host<br />
Then you can see not only VM1 to VM2 is throttled, but also VM1 to external host were also throttled.<br />
For this issue, a solution is orphan the frags when en queuing to non work conserving qdisc.<br />
But we have have similar issues in other case:<br />
- The card has its own priority queues<br />
- Host has two interface, one is 1G another is 10G, so throttle 1G may lead traffic over 10G to be throttled.<br />
The final solution is to remove receive buffering at tun, and convert it to use NAPI<br />
Contact: Jason Wang, MST<br />
Reference: https://lkml.org/lkml/2014/1/17/105<br />
<br />
* network traffic throttling<br />
block implemented "continuous leaky bucket" for throttling<br />
we can use continuous leaky bucket to network<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embedding it in VirtIONet. Then we could just do a pointer swap plus<br />
g_free() and save a memcpy() here (see the sketch below).<br />
Contact: Amos Kong<br />
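a hypothetical user-space sketch of the idea (struct and helper names are made up, not the qemu code, and plain malloc/free stand in for g_malloc/g_free): with a dynamically allocated table, the update handler can take ownership of the freshly parsed buffer with a pointer swap instead of copying it into an embedded array:<br />
<pre>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ETH_ALEN 6

/* stand-in for the device state; in the real code the table would hang
 * off VirtIONet and be released with g_free() */
struct nic_state {
    size_t mac_count;
    uint8_t (*mac_table)[ETH_ALEN];     /* dynamically allocated table */
};

/* the control-message handler has already parsed the guest request into
 * a freshly allocated buffer; take ownership of it with a pointer swap
 * instead of memcpy()ing it into an embedded array */
static void set_mac_table(struct nic_state *nic,
                          uint8_t (*new_table)[ETH_ALEN], size_t count)
{
    free(nic->mac_table);               /* drop the old table */
    nic->mac_table = new_table;         /* single pointer swap */
    nic->mac_count = count;
}

int main(void)
{
    struct nic_state nic = { 0, NULL };
    uint8_t (*macs)[ETH_ALEN] = malloc(2 * ETH_ALEN);

    /* parse step: fill the new table from the (fake) guest buffer */
    memset(macs, 0x52, 2 * ETH_ALEN);

    set_mac_table(&nic, macs, 2);       /* swap, no second copy */
    free(nic.mac_table);
    return 0;
}
</pre>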
<br />
* reduce conflict with VCPU thread<br />
if the VCPU and networking run on the same CPU,<br />
they conflict, resulting in bad performance.<br />
Fix that: push the vhost thread out to another CPU<br />
more aggressively.<br />
Contact: Amos Kong<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not fully understood, as we already have filtering in the bridge<br />
we have a small table of addresses and need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
Contact: Amos Kong<br />
<br />
* vlan filtering in tun<br />
the need for this is still not fully understood, as we already have filtering in the bridge<br />
Contact: Amos Kong<br />
<br />
<br />
<br />
* add documentation for macvlan and macvtap<br />
recent docs here:<br />
http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/<br />
need to integrate them into the iproute and kernel docs.<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
so we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non-promiscuous NIC support in the bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
more GSO types are not yet supported here: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being the guest-facing device.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables; instead, vhost gets a pointer to the ring address.<br />
This allows qemu to act as a man in the middle,<br />
verifying the descriptors but not touching the packet data (see the sketch below).<br />
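a toy user-space sketch of the "verify but do not touch" idea (structures and limits are invented for illustration): the man in the middle checks that each descriptor's guest-physical range lies inside a known memory region before passing the ring on, without ever reading the packet payload:<br />
<pre>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* one guest memory region, as the man in the middle would know it */
struct mem_region {
    uint64_t gpa_start;
    uint64_t len;
};

/* minimal virtio-style descriptor: guest address + length */
struct desc {
    uint64_t addr;
    uint32_t len;
};

/* validate that the descriptor lies entirely inside the region;
 * note that the payload itself is never read */
static bool desc_ok(const struct desc *d, const struct mem_region *r)
{
    if (d->len == 0 || d->len > r->len)
        return false;
    if (d->addr < r->gpa_start)
        return false;
    /* equivalent to addr + len <= gpa_start + region len, no overflow */
    return d->addr - r->gpa_start <= r->len - d->len;
}

int main(void)
{
    struct mem_region ram = { 0x100000, 64 * 1024 * 1024 };
    struct desc good = { 0x200000, 1500 };
    struct desc bad  = { 0x100000 + 64 * 1024 * 1024 - 100, 1500 };

    printf("good: %s\n", desc_ok(&good, &ram) ? "pass" : "reject");
    printf("bad:  %s\n", desc_ok(&bad, &ram) ? "pass" : "reject");
    return 0;
}
</pre>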
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
* Extend sndbuf scope to int64<br />
The current sndbuf limit is INT_MAX in tap_set_sndbuf();<br />
large values (like 8388607T) are converted correctly by qapi from the qemu command line,<br />
but if we want to support such large values, we should extend the sndbuf limit from 'int' to 'int64' (see the sketch below).<br />
Why is this useful?<br />
Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html<br />
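a small stand-alone illustration of why 'int' is the limiting factor (the value is taken from the discussion above; the code is ordinary C, not the qemu parser): a size like 8388607T fits in a 64-bit integer but overflows a 32-bit int:<br />
<pre>
#include <inttypes.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* 8388607T = 8388607 * 2^40 bytes, as accepted on the command line */
    int64_t requested = INT64_C(8388607) * (INT64_C(1) << 40);

    printf("requested sndbuf: %" PRId64 " bytes\n", requested);
    printf("INT_MAX:          %d bytes\n", INT_MAX);

    if (requested > INT_MAX)
        printf("does not fit in 'int' -> must be clamped or rejected today;\n"
               "an int64 limit would represent it exactly\n");
    return 0;
}
</pre>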
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* change tcp_tso_should_defer for kvm: batch more<br />
aggressively.<br />
in particular, see below<br />
<br />
* tcp: increase gso buffering for cubic,reno<br />
At the moment we push out an skb whenever the limit becomes<br />
large enough to send a full-sized TSO skb even if the skb,<br />
in fact, is not full-sized.<br />
The reason for this seems to be that some congestion avoidance<br />
protocols rely on the number of packets in flight to calculate<br />
CWND, so if we underuse the available CWND it shrinks<br />
which degrades performance:<br />
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html<br />
<br />
However, there seems to be no reason to do this for<br />
protocols such as reno and cubic which don't rely on packets in flight,<br />
and so will simply increase CWND a bit more to compensate for the<br />
underuse.<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do; the existing NIC_RX_FILTER_CHANGED event already contains the vlan tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
with virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. Presumably other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disable iptables to get<br />
good performance on 10Gbit/s networking.<br />
Any way to improve the experience?<br />
<br />
* performance<br />
Going through scheduler and full networking stack twice<br />
(host+guest) adds a lot of overhead<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking:<br />
VM networking is configured through libvirt, host networking through NetworkManager (NM).<br />
Any way to integrate the two?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU utilization of the host and the other party, and add them to the report. The same applies to other host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=118499NetworkingTodo2014-11-10T10:43:21Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very wellcome! ===<br />
<br />
* virtio 1.0 support for linux guests<br />
required for maintainatibility<br />
mid.gmane.org/1414081380-14623-1-git-send-email-mst@redhat.com<br />
Developer: MST,Cornelia Huck<br />
<br />
* virtio 1.0 support in qemu<br />
required for maintainatibility<br />
mid.gmane.org/20141024103839.7162b93f.cornelia.huck@de.ibm.com<br />
Developer: Cornelia Huck, MST<br />
<br />
* improve net polling for cpu overcommit<br />
exit busy loop when another process is runnable<br />
mid.gmane.org/1408608310-13579-2-git-send-email-jasowang@redhat.com<br />
Developer: Jason Wang, MST<br />
<br />
* vhost-net/tun/macvtap cross endian support<br />
mid.gmane.org/1414572130-17014-2-git-send-email-clg@fr.ibm.com<br />
Developer: Cédric Le Goater, MST<br />
<br />
* BQL/aggregation for virtio net<br />
dependencies: orphan packets less agressively, enable tx interrupt <br />
Developers: MST, Jason<br />
* orphan packets less agressively (was make pktgen works for virtio-net ( or partially orphan ))<br />
virtio-net orphans all skbs during tx, this used to be optimal.<br />
Recent changes in guest networking stack and hardware advances<br />
such as APICv changed optimal behaviour for drivers.<br />
We need to revisit optimizations such as orphaning all packets early<br />
to have optimal behaviour.<br />
<br />
this should also fix pktgen which is currently broken with virtio net:<br />
orphaning all skbs makes pktgen wait for ever to the refcnt.<br />
Jason's idea: bring back tx interrupt (partially)<br />
Jason's idea: introduce a flag to tell pktgen not for wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developers: Jason Wang, MST<br />
<br />
* enable tx interrupt (conditionally?)<br />
Small packet TCP stream performance is not good. This is because virtio-net orphan the packet during ndo_start_xmit() which disable the TCP small packet optimizations like TCP small Queue and AutoCork. The idea is enable the tx interrupt to TCP small packets.<br />
Jason's idea: switch between poll and tx interrupt mode based on recent statistics.<br />
MST's idea: use a per descriptor flag for virtio to force interrupt for a specific packet.<br />
Developer: Jason Wang, MST<br />
<br />
<br />
<br />
* vhost-net polling<br />
mid.gmane.org/20141029123831.A80F338002D@moren.haifa.ibm.com<br />
Developer: Razya Ladelsky<br />
<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Razya Ladelsky, Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* support more queues in tun<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
http://mid.gmane.org/1408369040-1216-1-git-send-email-pagupta@redhat.com<br />
Developers: Pankaj Gupta, Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Documentation/networking/scaling.txt<br />
Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* ethtool seftest support for virtio-net<br />
Implement selftest ethtool method for virtio-net for regression test e.g the CVEs found for tun/macvtap, qemu and vhost.<br />
mid.gmane.org/1409881866-14780-1-git-send-email-hjxiaohust@gmail.com<br />
Developers: Hengjinxiao,Jason Wang<br />
<br />
* Dev watchdog for virtio-net:<br />
Implement a watchdog for virtio-net. This will be useful for hunting host bugs early.<br />
Developer: Jason Wang<br />
<br />
* bridge without promisc/allmulti mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Done for unicast, but not for multicast.<br />
Developer: Vlad Yasevich<br />
<br />
<br />
* vhost-user: clean up protocol<br />
address multiple issues in vhost user protocol:<br />
missing VHOST_NET_SET_BACKEND<br />
make more messages synchronous (with a reply)<br />
VHOST_SET_MEM_TABLE, VHOST_SET_VRING_CALL<br />
mid.gmane.org/541956B8.1070203@huawei.com<br />
mid.gmane.org/54192136.2010409@huawei.com<br />
Developer: MST?<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan?<br />
<br />
* Enable LRO with bridging<br />
Enable GRO for packets coming to bridge from a tap interface<br />
Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman?<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: Marcel Apfelbaum<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- rx busy polling for virtio-net [DONE]<br />
see https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=91815639d8804d1eee7ce2e1f7f60b36771db2c9. 1 byte netperf TCP_RR shows 127% improvement.<br />
Future work is co-operate with host, and only does the busy polling when there's no other process in host cpu. <br />
Developer: Jason Wang<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupt<br />
Rx interrupt coalescing should be good for rx stream throughput.<br />
Tx interrupt coalescing will help the optimization of enabling tx interrupt conditionally.<br />
Developer: Jason Wang<br />
<br />
<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
<br />
* Multi-queue macvtap with real multiple queues<br />
Macvtap only provides multiple queues to user in the form of multiple<br />
sockets. As each socket will perform dev_queue_xmit() and we don't<br />
really have multiple real queues on the device, we now have a lock<br />
contention. This contention needs to be addressed.<br />
Developer: Vlad Yasevich<br />
<br />
* better xmit queueing for tun<br />
when guest is slower than host, tun drops packets<br />
aggressively. This is because keeping packets on<br />
the internal queue does not work well.<br />
re-enable functionality to stop queue,<br />
probably with some watchdog to help with buggy guests.<br />
Developer: MST<br />
<br />
<br />
=== projects in need of an owner ===<br />
<br />
<br />
* improve vhost-user unit test<br />
support running on machines without hugetlbfs<br />
support running with more vm memory layouts<br />
Developer: MST?<br />
<br />
* tun: fix RX livelock<br />
it's easy for guest to starve out host networking<br />
open way to fix this is to use napi <br />
Contact: MST<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
contact: MST<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Contact: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Contact: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
This project seems abandoned?<br />
Contact: Rusty Russell<br />
<br />
* use kvm eventfd support for injecting level-triggered interrupts<br />
aim: enable vhost by default for level interrupts.<br />
The benefit is security: we want to avoid using userspace<br />
virtio net so that vhost-net is always used.<br />
<br />
Alex emulated (post & re-enable) level-triggered interrupt in KVM for<br />
skipping userspace. VFIO already enjoied the performance benefit,<br />
let's do it for virtio-pci. Current virtio-pci devices still use<br />
level-interrupt in userspace.<br />
see: kernel:<br />
7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts<br />
qemu:<br />
68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers<br />
(virtio-pci didn't use the wrappers)<br />
e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration<br />
<br />
Contact: Amos Kong, MST <br />
<br />
* Head of line blocking issue with zerocopy<br />
zerocopy has several defects that cause a head-of-line blocking problem:<br />
- it limits the number of pending DMAs<br />
- it completes in order<br />
This means that if some of the DMAs are delayed, all the others are delayed as well. This can be reproduced with the following case:<br />
- boot two VMs, VM1 (tap1) and VM2 (tap2), on host1 (which has eth0)<br />
- set up tbf to limit the tap2 bandwidth to 10Mbit/s<br />
- start two netperf instances, one from VM1 to VM2 and another from VM1 to an external host whose traffic goes through eth0 on the host<br />
Then not only is VM1 to VM2 throttled, VM1 to the external host is throttled as well.<br />
For this issue, one solution is to orphan the frags when enqueuing to a non-work-conserving qdisc (see the sketch after this item).<br />
But we have similar issues in other cases:<br />
- The card has its own priority queues<br />
- The host has two interfaces, one 1G and one 10G, so throttling the 1G one may cause traffic over the 10G one to be throttled too.<br />
The final solution is to remove receive buffering in tun and convert it to use NAPI.<br />
Contact: Jason Wang, MST<br />
Reference: https://lkml.org/lkml/2014/1/17/105<br />
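A rough sketch of the interim mitigation described above; queue_may_backlog() stands in for whatever test picks out the slow/non-work-conserving case and is purely hypothetical:<br />
<pre>
#include <linux/skbuff.h>

/* Hedged sketch: copy zerocopy fragments and complete the guest's DMA
 * before the packet can sit behind a throttled queue. */
static int enqueue_maybe_orphan(struct sk_buff_head *queue, struct sk_buff *skb)
{
	if (queue_may_backlog(skb)) {
		if (skb_orphan_frags(skb, GFP_ATOMIC))
			return -ENOMEM;	/* caller drops the packet */
	}
	skb_queue_tail(queue, skb);
	return 0;
}
</pre>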
<br />
* network traffic throttling<br />
the block layer implemented a "continuous leaky bucket" for throttling;<br />
we can apply the same continuous leaky bucket to networking:<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
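A self-contained sketch of the continuous leaky-bucket accounting (simplified; not the actual QEMU block-layer throttle code):<br />
<pre>
#include <stdbool.h>
#include <stdint.h>

/* One bucket per limited quantity (rx bps, tx pps, total iops, ...). */
typedef struct {
	double rate;	 /* allowed units per second (bytes or operations) */
	double burst;	 /* bucket capacity: how much may pass at once */
	double level;	 /* current fill level */
	int64_t last_ns; /* timestamp of the last update */
} LeakyBucket;

/* Returns true if 'cost' units may pass now; the bucket leaks continuously. */
static bool bucket_allow(LeakyBucket *b, double cost, int64_t now_ns)
{
	double elapsed = (now_ns - b->last_ns) / 1e9;

	b->level -= elapsed * b->rate;	/* leak since the last check */
	if (b->level < 0)
		b->level = 0;
	b->last_ns = now_ns;

	if (b->level + cost > b->burst)
		return false;		/* over the limit: delay this packet */
	b->level += cost;
	return true;
}
</pre>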
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embedding it in VirtIONet. Then we can just do a pointer swap and<br />
g_free(), saving a memcpy() here.<br />
Contact: Amos Kong<br />
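Roughly what the swap could look like once mac_table is a separately allocated buffer (a sketch; field and function names are illustrative, not the current VirtIONet code):<br />
<pre>
#include <glib.h>
#include <stdint.h>

typedef struct {
	int in_use;
	uint8_t *macs;	/* in_use * 6 (ETH_ALEN) bytes, heap allocated */
} MacTable;

/* Install a table built while parsing the guest's control command:
 * take ownership of the new buffer, free the old one, no memcpy(). */
static void mac_table_install(MacTable *t, uint8_t *new_macs, int n)
{
	uint8_t *old = t->macs;

	t->macs = new_macs;
	t->in_use = n;
	g_free(old);
}
</pre>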
<br />
* reduce conflict with VCPU thread<br />
if the VCPU and networking run on the same CPU,<br />
they conflict, resulting in bad performance.<br />
Fix that: push the vhost thread out to another CPU<br />
more aggressively.<br />
Contact: Amos Kong<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood, as we already have filtering in the bridge;<br />
we have a small table of addresses and need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
Contact: Amos Kong<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood, as we already have filtering in the bridge<br />
Contact: Amos Kong<br />
<br />
<br />
<br />
* add documentation for macvlan and macvtap<br />
recent docs here:<br />
http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/<br />
need to integrate into iproute and the kernel docs.<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
so we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non-promisc NIC support in the bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not support more GSO types here: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact, a bit of complexity in vhost was put there in the vague hope of<br />
supporting something like this: virtio rings are not translated through<br />
regular memory tables; instead, vhost gets a pointer to the ring address.<br />
This allows qemu to act as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
* Extend sndbuf scope to int64<br />
The current sndbuf limit is INT_MAX in tap_set_sndbuf();<br />
large values (like 8388607T) are converted correctly by qapi from the qemu command line.<br />
If we want to support such large values, we should extend the sndbuf limit from 'int' to 'int64'.<br />
Why is this useful?<br />
Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html<br />
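A sketch of what accepting an int64 value could look like on the tap side; the TUNSETSNDBUF ioctl still takes an int, so larger values are clamped (function name and clamping policy are illustrative only):<br />
<pre>
#include <limits.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/if_tun.h>

/* Hedged sketch: accept an int64 from the command line, clamp for the kernel. */
static int tap_set_sndbuf64(int fd, int64_t sndbuf)
{
	int val;

	if (sndbuf <= 0 || sndbuf > INT_MAX)
		val = INT_MAX;		/* kernel interface is still an int */
	else
		val = sndbuf;

	return ioctl(fd, TUNSETSNDBUF, &val);
}
</pre>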
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* change tcp_tso_should_defer for kvm: batch more<br />
aggressively.<br />
in particular, see below<br />
<br />
* tcp: increase gso buffering for cubic,reno<br />
At the moment we push out an skb whenever the limit becomes<br />
large enough to send a full-sized TSO skb even if the skb,<br />
in fact, is not full-sized.<br />
The reason for this seems to be that some congestion avoidance<br />
protocols rely on the number of packets in flight to calculate<br />
CWND, so if we underuse the available CWND it shrinks<br />
which degrades performance:<br />
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html<br />
<br />
However, there seems to be no reason to do this for<br />
protocols such as reno and cubic which don't rely on packets in flight,<br />
and so will simply increase CWND a bit more to compensate for the<br />
underuse.<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do) (the existing NIC_RX_FILTER_CHANGED event contains the vlan tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disable iptables to get<br />
good performance on 10Gbit/s networking.<br />
Any way to improve experience?<br />
<br />
* performance<br />
Going through the scheduler and the full networking stack twice<br />
(host+guest) adds a lot of overhead.<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking:<br />
VM networking is through libvirt, host networking through NM.<br />
Any way to integrate?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very wellcome! ===<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
Developer: MST<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need do an extra copy of 128 bytes for every packets. <br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* orphan packets less agressively (was make pktgen works for virtio-net ( or partially orphan ))<br />
virtio-net orphans all skbs during tx, this used to be optimal.<br />
Recent changes in guest networking stack and hardware advances<br />
such as APICv changed optimal behaviour for drivers.<br />
We need to revisit optimizations such as orphaning all packets early<br />
to have optimal behaviour.<br />
<br />
this should also fix pktgen which is currently broken with virtio net:<br />
orphaning all skbs makes pktgen wait for ever to the refcnt.<br />
Jason's idea: brng back tx interrupt (partially)<br />
Jason's idea: introduce a flat to tell pktgen not for wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developers: Jason Wang, MST<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V8 new RFC posted here (limit the changes to virtio-net only)<br />
https://lists.gnu.org/archive/html/qemu-devel/2014-03/msg02648.html<br />
V7 patches is here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203 (applied by upstream)<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
http://git.qemu.org/?p=qemu.git;a=commit;h=b1be42803b31a913bab65bab563a8760ad2e7f7f<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
Jason has a draft path to enable low latency polling for virito-net.<br />
May also consider it for tun/macvtap.<br />
Developer: Jason Wang<br />
<br />
* use kvm eventfd support for injecting level-triggered interrupts<br />
aim: enable vhost by default for level interrupts.<br />
The benefit is security: we want to avoid using userspace<br />
virtio net so that vhost-net is always used.<br />
<br />
Alex emulated (post & re-enable) level-triggered interrupt in KVM for<br />
skipping userspace. VFIO already enjoied the performance benefit,<br />
let's do it for virtio-pci. Current virtio-pci devices still use<br />
level-interrupt in userspace.<br />
<br />
kernel:<br />
7a84428af [PATCH] KVM: Add resampling irqfds for level triggered interrupts<br />
qemu:<br />
68919cac [PATCH] hw/vfio: set interrupts using pci irq wrappers<br />
(virtio-pci didn't use the wrappers)<br />
e1d1e586 [PATCH] vfio-pci: Add KVM INTx acceleration<br />
<br />
Developer: Amos Kong<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
* network traffic throttling<br />
block implemented "continuous leaky bucket" for throttling<br />
we can use continuous leaky bucket to network<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embed it in VirtIONet. Then we can just does a pointer swap and<br />
gfree() and can save a memcpy() here.<br />
Developer: Amos Kong<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
Developer: Amos Kong<br />
<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* add documentation for macvlan and macvtap<br />
recent docs here:<br />
http://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/<br />
need to integrate in iproute and kernel docs.<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA emgine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
Kernel not support more type of GSO: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
* Extend sndbuf scope to int64<br />
<br />
Current sndbuf limit is INT_MAX in tap_set_sndbuf(),<br />
large values (like 8388607T) can be converted rightly by qapi from qemu commandline,<br />
If we want to support the large values, we should extend sndbuf limit from 'int' to 'int64'<br />
<br />
Upstream discussion: https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg04192.html<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* change tcp_tso_should_defer for kvm: batch more<br />
aggressively.<br />
in particular, see below<br />
<br />
* tcp: increase gso buffering for cubic,reno<br />
At the moment we push out an skb whenever the limit becomes<br />
large enough to send a full-sized TSO skb even if the skb,<br />
in fact, is not full-sized.<br />
The reason for this seems to be that some congestion avoidance<br />
protocols rely on the number of packets in flight to calculate<br />
CWND, so if we underuse the available CWND it shrinks<br />
which degrades performance:<br />
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html<br />
<br />
However, there seems to be no reason to do this for<br />
protocols such as reno and cubic which don't rely on packets in flight,<br />
and so will simply increase CWND a bit more to compensate for the<br />
underuse.<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
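For reference, a minimal unicast filter lookup of the kind tun would need; the table layout below is made up and deliberately small (it is exactly the "small table" this item wants enlarged):<br />
<pre>
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define ETH_ALEN   6
#define FILTER_MAX 16

struct mac_filter {
    unsigned int count;
    uint8_t addr[FILTER_MAX][ETH_ALEN];
};

/* Accept multicast/broadcast unconditionally (handled elsewhere) and
 * match unicast destinations against the exact-address table. */
static bool mac_filter_pass(const struct mac_filter *f, const uint8_t *dst)
{
    unsigned int i;

    if (dst[0] & 1) {           /* multicast/broadcast bit */
        return true;
    }
    for (i = 0; i < f->count; i++) {
        if (memcmp(f->addr[i], dst, ETH_ALEN) == 0) {
            return true;
        }
    }
    return false;
}
</pre>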
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do) (the existing NIC_RX_FILTER_CHANGED event contains the vlan tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
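A kernel-style sketch of the idea: add several buffers, then kick once. virtqueue_add_outbuf() and virtqueue_kick() are the real virtio ring calls; the batching counter and threshold are illustrative:<br />
<pre>
#include &lt;linux/gfp.h&gt;
#include &lt;linux/scatterlist.h&gt;
#include &lt;linux/virtio.h&gt;

#define TX_BATCH 8   /* illustrative threshold */

/* Queue up to TX_BATCH packets before notifying the device instead of
 * kicking after every add.  A real driver also needs a "flush now"
 * path (timer or last-packet flag) so short bursts are not stranded. */
static int xmit_coalesced(struct virtqueue *vq, struct scatterlist *sg,
                          unsigned int num, void *skb, unsigned int *pending)
{
    int err = virtqueue_add_outbuf(vq, sg, num, skb, GFP_ATOMIC);

    if (err)
        return err;

    if (++(*pending) >= TX_BATCH) {
        virtqueue_kick(vq);   /* one device notification for the whole batch */
        *pending = 0;
    }
    return 0;
}
</pre>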
<br />
* interrupt coalescing<br />
Reduce the number of interrupts.<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disable iptables to get<br />
good performance on 10Gb/s networking.<br />
Any way to improve the experience?<br />
<br />
* performance<br />
Going through scheduler and full networking stack twice<br />
(host+guest) adds a lot of overhead.<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking:<br />
VM networking is through libvirt, host networking through NM.<br />
Any way to integrate the two?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=5392NetworkingTodo2014-03-20T10:15:42Z<p>Mst: clarify partial orphaning</p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
Developer: MST<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
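The gist of the fix is to make the per-device queue storage an array of pointers rather than an array of embedded queue structs, so raising the limit only costs one pointer per possible queue; a hedged sketch (the struct names are placeholders, not the actual tun code):<br />
<pre>
#include &lt;linux/errno.h&gt;
#include &lt;linux/slab.h&gt;

#define MAX_TAP_QUEUES 256   /* e.g. enough for one queue per guest CPU */

struct tap_queue_sketch {    /* stand-in for the real per-queue struct */
    int placeholder;
};

struct tap_dev_sketch {
    /* Array of pointers: only MAX_TAP_QUEUES * sizeof(void *) up front;
     * each queue is allocated when a file descriptor attaches. */
    struct tap_queue_sketch **queues;
};

static int tap_alloc_queue_array(struct tap_dev_sketch *dev)
{
    dev->queues = kcalloc(MAX_TAP_QUEUES, sizeof(*dev->queues), GFP_KERNEL);
    return dev->queues ? 0 : -ENOMEM;
}
</pre>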
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of flow caches has several limitations:<br />
1) in the worst case, linear search will be bad<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
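A rough sketch of approach 1): wrap the already-received buffer with build_skb() instead of allocating an skb and copying. build_skb(), skb_reserve() and skb_put() are real kernel helpers; the buffer layout assumptions are exactly the open question this item raises:<br />
<pre>
#include &lt;linux/skbuff.h&gt;

/* Assumes "buf" was allocated with enough headroom for the vnet header
 * plus NET_SKB_PAD/NET_IP_ALIGN and with tailroom for skb_shared_info,
 * so the data can be handed to the stack without the 128-byte copy. */
static struct sk_buff *wrap_rx_buffer(void *buf, unsigned int buflen,
                                      unsigned int headroom, unsigned int pktlen)
{
    struct sk_buff *skb = build_skb(buf, buflen);

    if (!skb)
        return NULL;                 /* fall back to the copying path */
    skb_reserve(skb, headroom);      /* skip padding + virtio-net header */
    skb_put(skb, pktlen);            /* packet payload is already in place */
    return skb;
}
</pre>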
<br />
* orphan packets less aggressively (was: make pktgen work for virtio-net (or partially orphan))<br />
virtio-net orphans all skbs during tx, this used to be optimal.<br />
Recent changes in guest networking stack and hardware advances<br />
such as APICv changed optimal behaviour for drivers.<br />
We need to revisit optimizations such as orphaning all packets early<br />
to have optimal behaviour.<br />
<br />
this should also fix pktgen, which is currently broken with virtio net:<br />
orphaning all skbs makes pktgen wait forever for the refcnt.<br />
Jason's idea: bring back tx interrupt (partially)<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developers: Jason Wang, MST<br />
<br />
* Announce self by guest driver<br />
Send gARP from the guest driver. The guest part is finished.<br />
The qemu part is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203 (applied by upstream)<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
http://git.qemu.org/?p=qemu.git;a=commit;h=b1be42803b31a913bab65bab563a8760ad2e7f7f<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
Developer: Jason Wang<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single MSI vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
Developer: Amos Kong<br />
<br />
* network traffic throttling<br />
the block layer implemented a "continuous leaky bucket" for throttling;<br />
we can apply the same continuous leaky bucket to networking:<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embedding it in VirtIONet. Then we can just do a pointer swap and<br />
g_free(), and save a memcpy() here.<br />
Developer: Amos Kong<br />
<br />
<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support these GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being the guest-facing device.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu to act as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* change tcp_tso_should_defer for kvm: batch more<br />
aggressively.<br />
in particular, see below<br />
<br />
* tcp: increase gso buffering for cubic,reno<br />
At the moment we push out an skb whenever the limit becomes<br />
large enough to send a full-sized TSO skb even if the skb,<br />
in fact, is not full-sized.<br />
The reason for this seems to be that some congestion avoidance<br />
protocols rely on the number of packets in flight to calculate<br />
CWND, so if we underuse the available CWND it shrinks<br />
which degrades performance:<br />
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html<br />
<br />
However, there seems to be no reason to do this for<br />
protocols such as reno and cubic which don't rely on packets in flight,<br />
and so will simply increase CWND a bit more to compensate for the<br />
underuse.<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do) (the existing NIC_RX_FILTER_CHANGED event contains the vlan tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts.<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disable iptables to get<br />
good performance on 10Gb/s networking.<br />
Any way to improve the experience?<br />
<br />
* performance<br />
Going through scheduler and full networking stack twice<br />
(host+guest) adds a lot of overhead<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking,<br />
VM networking is through libvirt, host networking through NM<br />
Any way to integrate?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=5032NetworkingTodo2014-02-06T14:47:49Z<p>Mst: and more unowned projects</p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
Developer: MST<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches is here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203 (applied by upstream)<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
http://git.qemu.org/?p=qemu.git;a=commit;h=b1be42803b31a913bab65bab563a8760ad2e7f7f<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
Developer: Jason Wang<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single MSI vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
Developer: Amos Kong<br />
<br />
* network traffic throttling<br />
block implemented "continuous leaky bucket" for throttling<br />
we can use continuous leaky bucket to network<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embedding it in VirtIONet. Then we can just do a pointer swap and<br />
g_free(), and save a memcpy() here.<br />
Developer: Amos Kong<br />
<br />
<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support these GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* change tcp_tso_should_defer for kvm: batch more<br />
aggressively.<br />
in particular, see below<br />
<br />
* tcp: increase gso buffering for cubic,reno<br />
At the moment we push out an skb whenever the limit becomes<br />
large enough to send a full-sized TSO skb even if the skb,<br />
in fact, is not full-sized.<br />
The reason for this seems to be that some congestion avoidance<br />
protocols rely on the number of packets in flight to calculate<br />
CWND, so if we underuse the available CWND it shrinks<br />
which degrades performance:<br />
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html<br />
<br />
However, there seems to be no reason to do this for<br />
protocols such as reno and cubic which don't rely on packets in flight,<br />
and so will simply increase CWND a bit more to compensate for the<br />
underuse.<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do) (existed NIC_RX_FILTER_CHANGED event contains vlan-tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kick the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupt<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
virtio we are getting packets from a linux host,<br />
so we could thinkably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disables iptables to get<br />
good performance on 10G/s networking.<br />
Any way to improve experience?<br />
<br />
* performance<br />
Going through scheduler and full networking stack twice<br />
(host+guest) adds a lot of overhead<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking,<br />
VM networking is through libvirt, host networking through NM<br />
Any way to integrate?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=5028NetworkingTodo2014-02-02T21:47:10Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
Developer: MST<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches is here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203 (applied by upstream)<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
http://git.qemu.org/?p=qemu.git;a=commit;h=b1be42803b31a913bab65bab563a8760ad2e7f7f<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
Developer: Jason Wang<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single MSI vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
Developer: Amos Kong<br />
<br />
* network traffic throttling<br />
block implemented "continuous leaky bucket" for throttling<br />
we can use continuous leaky bucket to network<br />
IOPS/BPS * RX/TX/TOTAL<br />
Developer: Amos Kong<br />
<br />
* Allocate mac_table dynamically<br />
<br />
In the future, maybe we can allocate the mac_table dynamically instead<br />
of embedding it in VirtIONet. Then we can just do a pointer swap and<br />
g_free(), and save a memcpy() here.<br />
Developer: Amos Kong<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
Status: patches applied<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support these GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do) (existed NIC_RX_FILTER_CHANGED event contains vlan-tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kick the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupt<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
virtio we are getting packets from a linux host,<br />
so we could thinkably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
=== high level issues: not clear what the project is, yet ===<br />
<br />
* security: iptables<br />
At the moment most people disables iptables to get<br />
good performance on 10G/s networking.<br />
Any way to improve experience?<br />
<br />
* performance<br />
Going through scheduler and full networking stack twice<br />
(host+guest) adds a lot of overhead<br />
Any way to allow bypassing some layers?<br />
<br />
* manageability<br />
Still hard to figure out VM networking,<br />
VM networking is through libvirt, host networking through NM<br />
Any way to integrate?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4980NetworkingTodo2013-11-11T10:37:51Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* large-order allocations<br />
see 28d6427109d13b0f447cba5761f88d3548e83605<br />
Developer: MST<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches is here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203 (applied by upstream)<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
http://git.qemu.org/?p=qemu.git;a=commit;h=b1be42803b31a913bab65bab563a8760ad2e7f7f<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for per analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
Developer: Jason Wang<br />
<br />
* sharing config interrupts<br />
Support more devices by sharing a single MSI vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
Developer: Amos Kong<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
Status: patches applied<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support these GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do) (existed NIC_RX_FILTER_CHANGED event contains vlan-tables)<br />
<br />
* tx coalescing<br />
Delay several packets before kick the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupt<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
virtio we are getting packets from a linux host,<br />
so we could thinkably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU-utilization of the Host and the other-party, and add them to the report. This is also true for other Host measures, such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4873NetworkingTodo2013-09-17T14:54:53Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has an draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
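<br />
For readers unfamiliar with the announcement itself: a gratuitous ARP is simply a broadcast ARP request whose sender and target protocol addresses are both the announced IP. The sketch below only shows what such a frame looks like when built from userspace; the interface name, MAC and IP are placeholders, and the actual work for this item is in the guest driver and qemu.<br />
<pre>
/* Sketch: send one gratuitous ARP for a placeholder address.
 * Needs CAP_NET_RAW; "eth0", the MAC and 192.0.2.10 are placeholders. */
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/if_ether.h>
#include <netpacket/packet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if_ether.h>

int main(void)
{
    unsigned char mac[6] = { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 };
    unsigned char frame[sizeof(struct ether_header) + sizeof(struct ether_arp)];
    struct ether_header *eh = (struct ether_header *)frame;
    struct ether_arp *arp = (struct ether_arp *)(frame + sizeof(*eh));
    struct sockaddr_ll addr;
    struct in_addr ip;
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));

    if (fd < 0)
        return 1;
    inet_aton("192.0.2.10", &ip);

    memset(eh->ether_dhost, 0xff, 6);           /* broadcast */
    memcpy(eh->ether_shost, mac, 6);
    eh->ether_type = htons(ETH_P_ARP);

    arp->arp_hrd = htons(ARPHRD_ETHER);
    arp->arp_pro = htons(ETH_P_IP);
    arp->arp_hln = 6;
    arp->arp_pln = 4;
    arp->arp_op  = htons(ARPOP_REQUEST);
    memcpy(arp->arp_sha, mac, 6);
    memcpy(arp->arp_spa, &ip, 4);               /* sender IP == target IP */
    memset(arp->arp_tha, 0, 6);
    memcpy(arp->arp_tpa, &ip, 4);

    memset(&addr, 0, sizeof(addr));
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = if_nametoindex("eth0");
    addr.sll_halen   = 6;
    memset(addr.sll_addr, 0xff, 6);

    sendto(fd, frame, sizeof(frame), 0, (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
    return 0;
}
</pre>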
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
Developer: Jason Wang<br />
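<br />
On current kernels the second flavour is exposed through the net.core.busy_read/busy_poll sysctls and the per-socket SO_BUSY_POLL option; a minimal hedged sketch of opting a single socket into busy polling (the 50 microsecond budget is an arbitrary example value):<br />
<pre>
/* Sketch: request busy polling on one socket.  Raising the value may
 * require CAP_NET_ADMIN on some kernels; 50us is just an example. */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46         /* from asm-generic/socket.h */
#endif

int main(void)
{
    int usec = 50;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return 1;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0)
        perror("SO_BUSY_POLL");
    return 0;
}
</pre>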
<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
Developer: Amos Kong<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
Developer: Amos Kong<br />
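<br />
For context, a hedged sketch of the eventfd interface this item builds on: KVM_IRQFD binds an eventfd to a guest GSI, and its resample variant is what makes level-triggered interrupts workable (KVM de-asserts the line on EOI and signals the resample fd so the device model can re-assert it if the level is still high). The vm_fd and the GSI number are assumptions supplied by the caller.<br />
<pre>
/* Sketch: wire an eventfd to a guest GSI with the resample mechanism
 * used for level-triggered interrupts.  vm_fd must be a KVM VM fd with
 * an in-kernel irqchip; the GSI value is caller-supplied. */
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int assign_level_irqfd(int vm_fd, unsigned int gsi)
{
    struct kvm_irqfd irqfd;
    int trigger = eventfd(0, 0);        /* signalled to assert the line */
    int resample = eventfd(0, 0);       /* signalled by KVM after guest EOI */

    if (trigger < 0 || resample < 0)
        return -1;
    memset(&irqfd, 0, sizeof(irqfd));
    irqfd.fd = trigger;
    irqfd.gsi = gsi;
    irqfd.flags = KVM_IRQFD_FLAG_RESAMPLE;
    irqfd.resamplefd = resample;
    return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}
</pre>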
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
Search for "Xin Xiaohui: Provide a zero-copy method on KVM virtio-net"<br />
for a very old prototype<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support these GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu to act as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
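<br />
"irq pinning" above just means constraining an interrupt to chosen CPUs, typically by writing a hex cpumask to /proc/irq/N/smp_affinity (or letting irqbalance do it). A minimal sketch, with the IRQ number and the mask as placeholders:<br />
<pre>
/* Sketch: pin IRQ 42 to CPU0 by writing a cpumask to procfs.
 * The IRQ number and the mask "1" are placeholders; needs root. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/42/smp_affinity", "w");

    if (!f) {
        perror("smp_affinity");
        return 1;
    }
    fprintf(f, "1\n");          /* hex cpumask: CPU0 only */
    return fclose(f) ? 1 : 0;
}
</pre>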
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do); the existing NIC_RX_FILTER_CHANGED event contains the vlan tables<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
with virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
* Migrate some of the performance regression autotest functionality into Netperf<br />
- Get the CPU utilization of the host and of the other party and add them to the report; the same applies to other host measures such as vmexits, interrupts, ...<br />
- Run Netperf in demo-mode, and measure only the time when all the sessions are active (could be many seconds after the beginning of the tests)<br />
- Packaging of Netperf in Fedora / RHEL (exists in Fedora). Licensing could be an issue.<br />
- Make the scripts more visible<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=PCITodo&diff=4865PCITodo2013-08-22T09:19:02Z<p>Mst: </p>
<hr />
<div>This page should cover all PCI related activity in KVM.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* virtio device as PCI Express device<br />
Issue: the Express spec requires that a device can work without IO,<br />
virtio requires IO at the moment.<br />
Plan: add support for memory BARs.<br />
Developer: Michael S. Tsirkin<br />
<br />
* Hotplug for devices behind PCI bridges<br />
Issue: QEMU lacks support for device hotplug behind<br />
pci bridges.<br />
<br />
Plan:<br />
- each bus gets assigned a number 0-255<br />
- generated ACPI code writes this number<br />
to a new BSEL register, then uses existing<br />
UP/DOWN registers to probe slot status;<br />
to eject, write number to BSEL register,<br />
then write the slot number into the existing EJ register<br />
This is to address the ACPI spec requirement to<br />
avoid config cycle access to any bus except PCI roots.<br />
<br />
Note: ACPI doesn't support adding or removing bridges by hotplug.<br />
We should prevent removal of bridges by hotplug,<br />
unless they were added by hotplug previously<br />
(and so, are not described by ACPI).<br />
Developer: Michael S. Tsirkin<br />
<br />
* Hotplug for Q35<br />
Issue: QEMU does not support hotplug for Q35<br />
Plan: since we need to support hotplug of PCI devices,<br />
let's use ACPI hotplug for everything<br />
Use same interface as we do for PCI, this way<br />
same ACPI code can be reused.<br />
<br />
Developer: Michael S. Tsirkin<br />
<br />
* Support for different PCI express link width/speed settings<br />
Issue: QEMU currently emulates all links at minimal<br />
width and speed. This means we don't need to emulate<br />
link negotiation, but might in theory confuse guests<br />
for assigned devices.<br />
The issue is complicated by the fact that real link speed<br />
might be limited by the slot where assigned device is put.<br />
Plan: add management interface to control the max link<br />
speed and width for the slot.<br />
Teach management to query this at slot level.<br />
For device, query it from device itself.<br />
Support link width/speed negotiation as per spec.<br />
Developer: Alex Williamson<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* PCI interrupts should be active-low<br />
Issue: PCI INT#x interrupts are normally active-low.<br />
QEMU emulates them as active high. Works fine for<br />
windows and linux but not guaranteed for other guests. <br />
See http://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/<br />
<br />
Plan: add support for active-low interrupts in KVM.<br />
Enable this for PCI interrupts.<br />
Change DSDT appropriately.<br />
<br />
Developer: <br />
Testing: stress-test devices with INT#x interrupts<br />
with interrupt sharing in particular<br />
<br />
* PCI master-abort is not emulated correctly<br />
Issue: access to disabled PCI memory normally returns<br />
all-ones (on read) and sets the master-abort<br />
detected bit in bridge.<br />
For express, it can also trigger AER reporting<br />
when enabled.<br />
QEMU does not emulate any of this: reads return 0,<br />
writes are ignored.<br />
Plan: add catch-all memory region with low priority<br />
in bridge, and trigger the required actions.<br />
<br />
* Better modeling for PCI INT#x<br />
Issue: for a device deep down a bridge hierarchy,<br />
we scan the tree upwards on each interrupt,<br />
calling map_irq at each level, this is bad for performance.<br />
Behaviour is also open-coded at each level, this is ugly.<br />
Plan: something similar to MemoryRegion API:<br />
add objects that represent PCI INT#x pins<br />
(maybe pins in general) model their connection at<br />
each level. Each time there's a change, re-map<br />
them. On data path, use pre-computed irq# to<br />
send/clear the interrupt quickly.<br />
<br />
* Subtractive decoding support<br />
Support subtractive decoding in PCI bridges.<br />
<br />
* Support VGA behind a PCI bridge<br />
Support VGA devices behind PCI bridges.<br />
Good for things like multiple VGA cards.<br />
Requires subtractive decoding.<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
* Way to figure out proper PCI connectivity options.<br />
Issue: How do you know where you can connect a device?<br />
For PCI, this includes the legal bus addresses,<br />
hotplug support for bus,<br />
how the secondary bus is named,<br />
and whether bridges support required addressing modes.<br />
For PCI Express, there are additional options:<br />
root or downstream port,<br />
virtual bridge in root complex/upstream port.<br />
management tools end up hard-coding this information,<br />
based simply on device name, but that's ugly.<br />
Vague idea: add interfaces to figure out what can be<br />
connected to what and how, or at least the function of each device.<br />
People to contact: Laine Stump<br />
<br />
<br />
* Fix AHCI for stability<br />
Not related to PCI directly but modern chipsets<br />
with PCI Express support all use AHCI.<br />
Issue1: AHCI is unstable with windows guests<br />
(win7 fails to boot sometimes)<br />
Issue2: guests sometimes crash when doing ping pong migration<br />
People to contact: Alexander Graf</div>Msthttps://linux-kvm.org/index.php?title=PCITodo&diff=4864PCITodo2013-08-21T13:42:26Z<p>Mst: </p>
<hr />
<div>This page should cover all PCI related activity in KVM.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* virtio device as PCI Express device<br />
Issue: the Express spec requires that a device can work without IO,<br />
virtio requires IO at the moment.<br />
Plan: add support for memory BARs.<br />
Developer: Michael S. Tsirkin<br />
<br />
* Hotplug for devices behind PCI bridges<br />
Issue: QEMU lacks support for device hotplug behind<br />
pci bridges.<br />
<br />
Plan:<br />
- each bus gets assigned a number 0-255<br />
- generated ACPI code writes this number<br />
to a new BSEL register, then uses existing<br />
UP/DOWN registers to probe slot status;<br />
to eject, write number to BSEL register,<br />
then write the slot number into the existing EJ register<br />
This is to address the ACPI spec requirement to<br />
avoid config cycle access to any bus except PCI roots.<br />
<br />
Note: ACPI doesn't support adding or removing bridges by hotplug.<br />
We should prevent removal of bridges by hotplug,<br />
unless they were added by hotplug previously<br />
(and so, are not described by ACPI).<br />
Developer: Michael S. Tsirkin<br />
<br />
* Hotplug for Q35<br />
Issue: QEMU does not support hotplug for Q35<br />
Plan: since we need to support hotplug of PCI devices,<br />
let's use ACPI hotplug for everything<br />
Use same interface as we do for PCI, this way<br />
same ACPI code can be reused.<br />
<br />
Developer: Michael S. Tsirkin<br />
<br />
* Support for different PCI express link width/speed settings<br />
Issue: QEMU currently emulates all links at minimal<br />
width and speed. This means we don't need to emulate<br />
link negotiation, but might in theory confuse guests<br />
for assigned devices.<br />
The issue is complicated by the fact that real link speed<br />
might be limited by the slot where assigned device is put.<br />
Plan: add management interface to control the max link<br />
speed and width for the slot.<br />
Teach management to query this at slot level.<br />
For device, query it from device itself.<br />
Support link width/speed negotiation as per spec.<br />
Developer: Alex Williamson<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* PCI interrupts should be active-low<br />
Issue: PCI INT#x interrupts are normally active-low.<br />
QEMU emulates them as active high. Works fine for<br />
windows and linux but not guaranteed for other guests. <br />
See http://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/<br />
<br />
Plan: add support for active-low interrupts in KVM.<br />
Enable this for PCI interrupts.<br />
Change DSDT appropriately.<br />
<br />
Developer: <br />
Testing: stress-test devices with INT#x interrupts<br />
with interrupt sharing in particular<br />
<br />
* PCI master-abort is not emulated correctly<br />
Issue: access to disabled PCI memory normally returns<br />
all-ones (on read) and sets the master-abort<br />
detected bit in bridge.<br />
For express, it can also trigger AER reporting<br />
when enabled.<br />
QEMU does not emulate any of this: reads return 0,<br />
writes are ignored.<br />
Plan: add catch-all memory region with low priority<br />
in bridge, and trigger the required actions.<br />
<br />
* Better modeling for PCI INT#x<br />
Issue: for a device deep down a bridge hierarchy,<br />
we scan the tree upwards on each interrupt,<br />
calling map_irq at each level, this is bad for performance.<br />
Behaviour is also open-coded at each level, this is ugly.<br />
Plan: something similar to MemoryRegion API:<br />
add objects that represent PCI INT#x pins<br />
(maybe pins in general) model their connection at<br />
each level. Each time there's a change, re-map<br />
them. On data path, use pre-computed irq# to<br />
send/clear the interrupt quickly.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
* Way to figure out proper PCI connectivity options.<br />
Issue: How do you know where you can connect a device?<br />
For PCI, this includes the legal bus addresses,<br />
hotplug support for bus,<br />
how the secondary bus is named,<br />
and whether bridges support required addressing modes.<br />
For PCI Express, there are additional options:<br />
root or downstream port,<br />
virtual bridge in root complex/upstream port.<br />
management tools end up hard-coding this information,<br />
based simply on device name, but that's ugly.<br />
Vague idea: add interfaces to figure out what can be<br />
connected to what and how, or at least the function of each device.<br />
People to contact: Laine Stump<br />
<br />
<br />
* Fix AHCI for stability<br />
Not related to PCI directly but modern chipsets<br />
with PCI Express support all use AHCI.<br />
Issue1: AHCI is unstable with windows guests<br />
(win7 fails to boot sometimes)<br />
Issue2: guests sometimes crash when doing ping pong migration<br />
People to contact: Alexander Graf</div>Msthttps://linux-kvm.org/index.php?title=TODO&diff=4863TODO2013-08-21T10:43:19Z<p>Mst: </p>
<hr />
<div>=ToDo=<br />
<br />
The following items need some love. Please post to the list if you are interested in helping out: <br />
<br />
* Emulate MSR_IA32_DEBUGCTL for guests which use it<br />
* Bring up Windows 95 and Windows 98 guests<br />
* Implement ACPI memory hotplug<br />
* Improve ballooning to try to use 2MB pages when possible ( in progress - kern.devel@gmail.com )<br />
<br />
==== Networking TODO: ====<br />
* Has its [[NetworkingTodo|own page]]<br />
<br />
==== PCI TODO: ====<br />
* Has its [[PCITodo|own page]]<br />
<br />
==== MMU related: ====<br />
* Improve mmu page eviction algorithm (currently FIFO, change to approximate LRU).<br />
* Add a read-only memory type.<br />
** possible using mprotect()?<br />
* Implement A20 for DOS and the like.<br />
* O(1) write protection by protecting the PML4Es, then on demand PDPTEs, PDEs, and PTEs<br />
* Simpler variant: don't drop large ptes when write protecting; just write protect them. When taking a write fault, either drop the large pte, or convert it to small ptes and write protect those (like O(1) write protection).<br />
* O(1) mmu invalidation using a generation number<br />
<br />
==== x86 emulator updates: ====<br />
* On-demand register access, really, copying all registers all the time is gross.<br />
** Can be done by adding 'available' and 'dirty' bitmasks<br />
* Implement mmx and sse memory move instructions; useful for guests that use multimedia extensions for accessing vga (partially done)<br />
* Implement an operation queue for the emulator. The emulator often calls userspace to perform a read or a write, but due to inversion of control it actually restarts instead of continuing. The queue would allow it to replay all previous operations until it reaches the point it last stopped.<br />
** if this is done, we can retire ->read_std() in favour of ->read_emulated().<br />
* convert more instructions to direct dispatch (function pointer in decode table)<br />
* move init_emulate_ctxt() into x86_decode_insn() and other emulator entry points<br />
<br />
==== Interactivity improvements: ====<br />
* If for several frames in a row a large proportion of the framebuffer pages are changing, then for the next few frames don't bother to get the dirty page log from kvm, but instead assume that all pages are dirty. This will reduce page fault overhead on highly interactive workloads (a sketch of the dirty-log call follows this list).<br />
* When detecting keyboard/video/mouse activity, scale up the frame rate; when activity dies down, scale it back down (applicable to qemu as well).<br />
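<br />
A hedged sketch of the dirty-log call mentioned in the first item above: KVM_GET_DIRTY_LOG fetches (and clears) the dirty-page bitmap of one memory slot that was registered with KVM_MEM_LOG_DIRTY_PAGES. The vm_fd, slot number and page count are assumptions supplied by the caller.<br />
<pre>
/* Sketch: fetch the dirty bitmap of one memory slot.  vm_fd must be a
 * KVM VM fd whose slot was created with KVM_MEM_LOG_DIRTY_PAGES;
 * npages is the slot size in pages. */
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static unsigned long *get_dirty_bitmap(int vm_fd, unsigned int slot, size_t npages)
{
    struct kvm_dirty_log log;
    size_t bytes = ((npages + 63) / 64) * 8;    /* rounded up to whole longs */
    unsigned long *bitmap = calloc(1, bytes);

    if (!bitmap)
        return NULL;
    memset(&log, 0, sizeof(log));
    log.slot = slot;
    log.dirty_bitmap = bitmap;
    if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
        free(bitmap);
        return NULL;
    }
    return bitmap;      /* bit N set => page N of the slot was written */
}
</pre>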
<br />
==== Pass-through/VT-d related: ====<br />
* Enhance KVM QEMU to return error messages if user attempts to pass-through unsupported devices:<br />
** Devices with shared host IOAPIC interrupt<br />
** Conventional PCI devices<br />
** Devices without FLR capability<br />
* QEMU PCI pass-through patch needs to be enhanced to same functionality as corresponding file in Xen<br />
** Remove direct HW access by QEMU for probing PCI BAR size<br />
** PCI handling of various PCI configuration registers<br />
** Other enhancements that were done in Xen<br />
* Host shared interrupt support<br />
* VT-d2 support (WIP in Linux Kernel)<br />
** Queued invalidation<br />
** Interrupt remapping<br />
** ATS<br />
* USB 2.0 (EHCI) support<br />
<br />
==== Bug fixes: ====<br />
* Less sexy but ever important, fixing bugs is one of the most important contributions<br />
<br />
==== Random improvements ====<br />
* Utilize the SVM interrupt queue to avoid extra exits when guest interrupts are disabled<br />
<br />
==== For the adventurous: ====<br />
* Emulate the VMX instruction sets on qemu. This would be very beneficial to debugging kvm ( working on this - kern.devel@gmail.com ).<br />
* Add [http://lagarcavilla.org/vmgl/ vmgl] support to qemu. Port to virtio. Write a Windows driver.<br />
* Keep this TODO up to date<br />
<br />
==== Nested VMX ====<br />
* Implement performance features such as EPT and VPID<br />
<br />
== KVM Safe Mode ==<br />
<br />
An ioctl() from userspace that tells KVM to disable one or more of the following features:<br />
<br />
* shadow paging (force direct mapping)<br />
* instruction emulation (require virtio or mmio hypercall)<br />
* task switches<br />
* mode switches (long mode / legacy mode / real mode)<br />
* IDT/GDT/LDT changes<br />
* IDT/GDT/LDT write protect<br />
* write protect important MSRs (*STAR etc)<br />
<br />
The idea is both to protect the guest from attacks, and to protect the host from the guest.<br />
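<br />
One plausible userspace shape for such an ioctl, sketched on top of the existing KVM_ENABLE_CAP mechanism; the capability and flag names below are purely hypothetical and do not exist in the kernel.<br />
<pre>
/* Illustration only: KVM_ENABLE_CAP is real, but KVM_CAP_SAFE_MODE and
 * the KVM_SAFE_* bits are hypothetical names invented for this sketch. */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define KVM_CAP_SAFE_MODE       9999            /* hypothetical */
#define KVM_SAFE_NO_SHADOW      (1ULL << 0)     /* hypothetical: force direct map */
#define KVM_SAFE_NO_EMULATION   (1ULL << 1)     /* hypothetical: no instruction emulation */

static int enable_safe_mode(int vm_fd)
{
    struct kvm_enable_cap cap;

    memset(&cap, 0, sizeof(cap));
    cap.cap = KVM_CAP_SAFE_MODE;
    cap.args[0] = KVM_SAFE_NO_SHADOW | KVM_SAFE_NO_EMULATION;
    return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);  /* assumes a VM-level cap */
}
</pre>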
<br />
__NOTOC__</div>Msthttps://linux-kvm.org/index.php?title=PCITodo&diff=4862PCITodo2013-08-21T10:38:55Z<p>Mst: add PCI TODO</p>
<hr />
<div>This page should cover all PCI related activity in KVM.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* virtio device as PCI Express device<br />
Issue: the Express spec requires that a device can work without IO,<br />
virtio requires IO at the moment.<br />
Plan: add support for memory BARs.<br />
Developer: Michael S. Tsirkin<br />
<br />
* Hotplug for devices behind PCI bridges<br />
Issue: QEMU lacks support for device hotplug behind<br />
pci bridges.<br />
<br />
Plan:<br />
- each bus gets assigned a number 0-255<br />
- generated ACPI code writes this number<br />
to a new BSEL register, then uses existing<br />
UP/DOWN registers to probe slot status;<br />
to eject, write number to BSEL register,<br />
then write the slot number into the existing EJ register<br />
This is to address the ACPI spec requirement to<br />
avoid config cycle access to any bus except PCI roots.<br />
<br />
Note: ACPI doesn't support adding or removing bridges by hotplug.<br />
We should prevent removal of bridges by hotplug,<br />
unless they were added by hotplug previously<br />
(and so, are not described by ACPI).<br />
Developer: Michael S. Tsirkin<br />
<br />
* Hotplug for Q35<br />
Issue: QEMU does not support hotplug for Q35<br />
Plan: since we need to support hotplug of PCI devices,<br />
let's use ACPI hotplug for everything<br />
Use same interface as we do for PCI, this way<br />
same ACPI code can be reused.<br />
<br />
Developer: Michael S. Tsirkin<br />
<br />
* Support for different PCI express link width/speed settings<br />
Issue: QEMU currently emulates all links at minimal<br />
width and speed. This means we don't need to emulate<br />
link negotiation, but might in theory confuse guests<br />
for assigned devices.<br />
The issue is complicated by the fact that real link speed<br />
might be limited by the slot where assigned device is put.<br />
Plan: add management interface to control the max link<br />
speed and width for the slot.<br />
Teach management to query this at slot level.<br />
For device, query it from device itself.<br />
Support link width/speed negotiation as per spec.<br />
Developer: Alex Williamson<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* PCI interrupts should be active-low<br />
Issue: PCI INT#x interrupts are normally active-low.<br />
QEMU emulates them as active high. Works fine for<br />
windows and linux but not guaranteed for other guests. <br />
See http://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/<br />
<br />
Plan: add support for active-low interrupts in KVM.<br />
Enable this for PCI interrupts.<br />
Change DSDT appropriately.<br />
<br />
Developer: <br />
Testing: stress-test devices with INT#x interrupts<br />
with interrupt sharing in particular<br />
<br />
* PCI master-abort is not emulated correctly<br />
Issue: access to disabled PCI memory normally returns<br />
all-ones (on read) and sets the master-abort<br />
detected bit in bridge.<br />
For express, it can also trigger AER reporting<br />
when enabled.<br />
QEMU does not emulate any of this: reads return 0,<br />
writes are ignored.<br />
Plan: add catch-all memory region with low priority<br />
in bridge, and trigger the required actions.<br />
<br />
* Better modeling for PCI INT#x<br />
Issue: for a device deep down a bridge hierarchy,<br />
we scan the tree upwards on each interrupt,<br />
calling map_irq at each level, this is bad for performance.<br />
Behaviour is also open-coded at each level, this is ugly.<br />
Plan: something similar to MemoryRegion API:<br />
add objects that represent PCI INT#x pins<br />
(maybe pins in general) model their connection at<br />
each level. Each time there's a change, re-map<br />
them. On data path, use pre-computed irq# to<br />
send/clear the interrupt quickly.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
* Way to figure out proper PCI connectivity options.<br />
Issue: How do you know where you can connect a device?<br />
For PCI, this includes the legal bus addresses,<br />
hotplug support for bus,<br />
how the secondary bus is named,<br />
and whether bridges support required addressing modes.<br />
For PCI Express, there are additional options:<br />
root or downstream port,<br />
virtual bridge in root complex/upstream port.<br />
management tools end up hard-coding this information,<br />
based simply on device name, but that's ugly.<br />
Vague idea: add interfaces to figure out what can be<br />
connected to what and how, or at least the function of each device.<br />
People to contact: Laine Stump<br />
<br />
<br />
* Fix AHCI for stability<br />
Not related to PCI directly but modern chipsets<br />
with PCI Express support all use AHCI.<br />
Issue1: AHCI is unstable with windows guests<br />
(win7 fails to boot sometimes)<br />
Issue2: guests sometimes crash when doing ping pong migration<br />
People to contact: Alexander Graf</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4847NetworkingTodo2013-07-22T13:53:44Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core: we need to teach it to allocate an array of<br />
pointers and not an array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regressions in some workloads, thus<br />
it is off by default. Detect this and enable/disable multiqueue<br />
automatically so we can turn it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We currently need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
=== projects that are not started yet - no owner ===<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support these GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu to act as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do); the existing NIC_RX_FILTER_CHANGED event contains the vlan tables<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
with virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
* bridging without promisc mode with OVS<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4846NetworkingTodo2013-07-22T13:51:55Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core: we need to teach it to allocate an array of<br />
pointers and not an array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regressions in some workloads, thus<br />
it is off by default. Detect this and enable/disable multiqueue<br />
automatically so we can turn it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We currently need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
=== projects that are not started yet - no owner ===<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support these GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu to act as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do); the existing NIC_RX_FILTER_CHANGED event contains the vlan tables<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* virtio: preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
with virtio we are getting packets from a linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4844NetworkingTodo2013-07-22T13:35:13Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core: we need to teach it to allocate an array of<br />
pointers and not an array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regressions in some workloads, thus<br />
it is off by default. Detect this and enable/disable multiqueue<br />
automatically so we can turn it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We currently need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
=== projects that are not started yet - no owner ===<br />
* sharing config interrupts<br />
Support more devices by sharing a single msi vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support other GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
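For reference, the existing mechanism alluded to above is the vhost ioctl interface, where userspace hands the ring addresses directly to vhost. A minimal sketch with placeholder addresses and most error handling omitted:<br />
<pre>
/* Minimal sketch of the existing mechanism referred to above: userspace
 * (e.g. QEMU) passes ring addresses straight to vhost via
 * VHOST_SET_VRING_ADDR, so vhost reaches the rings without going through
 * the guest memory table translation. Addresses are placeholders. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

int setup_vring(unsigned long desc, unsigned long avail, unsigned long used)
{
    int vhost_fd = open("/dev/vhost-net", O_RDWR);
    struct vhost_vring_addr addr = {
        .index = 0,                 /* queue 0 */
        .desc_user_addr  = desc,    /* descriptor table */
        .avail_user_addr = avail,   /* available ring */
        .used_user_addr  = used,    /* used ring */
    };

    if (vhost_fd < 0)
        return -1;
    ioctl(vhost_fd, VHOST_SET_OWNER);            /* bind this fd to the caller */
    ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
    return vhost_fd;
}
</pre>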
<br />
* non-virtio device support with vhost<br />
Use vhost interface for guests that don't use virtio-net<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
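For anyone reproducing the pinned setup, irq affinity is normally set by writing a CPU mask to /proc/irq/$IRQ/smp_affinity; a minimal sketch, where the irq number is a placeholder taken from /proc/interrupts on the actual host:<br />
<pre>
/* Minimal sketch: pin one irq to CPU 2 by writing a cpu mask to procfs.
 * The irq number (45) is a placeholder - pick the NIC/vhost irq from
 * /proc/interrupts on the actual host. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/45/smp_affinity", "w");

    if (!f) {
        perror("open smp_affinity");
        return 1;
    }
    fputs("4\n", f);   /* bitmask 0x4 == CPU 2 */
    fclose(f);
    return 0;
}
</pre>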
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
kernel part is done (Vlad Yasevich)<br />
teach qemu to notify libvirt to enable the filter (still to do)<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
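The batching idea can be sketched as follows; vq_add()/vq_kick() below are hypothetical stand-ins for the real virtqueue calls, used only to show how one kick is amortized over several packets:<br />
<pre>
/* Rough sketch of the coalescing idea only: instead of kicking the device
 * for every packet, queue a batch and kick once. vq_add()/vq_kick() are
 * hypothetical stand-ins for the real virtqueue calls. */
#include <stdio.h>

#define KICK_BATCH 8

static int kicks;
static void vq_add(int pkt)  { (void)pkt; }   /* queue one packet (stub) */
static void vq_kick(void)    { kicks++; }     /* notify the device (stub) */

static void xmit_batched(const int *pkts, int n)
{
    int queued = 0;

    for (int i = 0; i < n; i++) {
        vq_add(pkts[i]);
        if (++queued == KICK_BATCH) {   /* amortize one kick over a batch */
            vq_kick();
            queued = 0;
        }
    }
    if (queued)                         /* flush the leftovers */
        vq_kick();
}

int main(void)
{
    int pkts[20] = {0};

    xmit_batched(pkts, 20);
    printf("20 packets sent with %d kicks\n", kicks);  /* 3 instead of 20 */
    return 0;
}
</pre>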
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a Linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi (a starting-point sketch follows this list)<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
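A vhost-net unit test could start from a smoke test like the sketch below (it assumes /dev/vhost-net exists and the caller has permission to open it); a real test would go on to set up a memory table and rings:<br />
<pre>
/* Minimal vhost-net smoke-test sketch: open the device, take ownership,
 * read its feature bits. Assumes /dev/vhost-net and sufficient privileges. */
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

int main(void)
{
    uint64_t features = 0;
    int fd = open("/dev/vhost-net", O_RDWR);

    if (fd < 0) {
        perror("open /dev/vhost-net");
        return 1;
    }
    if (ioctl(fd, VHOST_SET_OWNER) < 0 ||
        ioctl(fd, VHOST_GET_FEATURES, &features) < 0) {
        perror("vhost ioctl");
        return 1;
    }
    printf("vhost-net features: 0x%llx\n", (unsigned long long)features);
    return 0;
}
</pre>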
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4835NetworkingTodo2013-07-08T09:21:50Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gratuitous ARP from the guest driver. The guest part is finished.<br />
QEMU support is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
=== projects that are not started yet - no owner ===<br />
* sharing config interrupts<br />
Support more devices by sharing a single MSI vector<br />
between multiple virtio devices.<br />
(Applies to virtio-blk too).<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support other GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a Linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
* vxlan<br />
What could we do here?<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4833NetworkingTodo2013-06-25T15:12:19Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gratuitous ARP from the guest driver. The guest part is finished.<br />
QEMU support is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support other GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a Linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Write some unit tests for vhost-net/vhost-scsi<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4832NetworkingTodo2013-06-24T13:58:46Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Bandan Das<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gratuitous ARP from the guest driver. The guest part is finished.<br />
QEMU support is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Amos Kong<br />
qemu: https://bugzilla.redhat.com/show_bug.cgi?id=848203<br />
libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=848199<br />
https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/patch/?id=1c0fa6b709d02fe4f98d4ce7b55a6cc3c925791c<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
https://bugzilla.redhat.com/show_bug.cgi?id=922589<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
Developer: Dmitry Fleytman<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
Developer: Dmitry Fleytman<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
Developer: MST<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support other GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a Linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4802NetworkingTodo2013-06-10T07:06:47Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Shirley Ma?, MST?<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gratuitous ARP from the guest driver. The guest part is finished.<br />
QEMU support is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support other GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
* bridging on top of macvlan <br />
add code to forward LRO status from macvlan (not macvtap)<br />
back to the lowerdev, so that setting up forwarding<br />
from macvlan disables LRO on the lowerdev<br />
<br />
* preserve packets exactly with LRO<br />
LRO is not normally compatible with forwarding.<br />
With virtio we are getting packets from a Linux host,<br />
so we could conceivably preserve packets exactly<br />
even with LRO. I am guessing other hardware could be<br />
doing this as well.<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4801NetworkingTodo2013-06-10T06:55:27Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Shirley Ma?, MST?<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gratuitous ARP from the guest driver. The guest part is finished.<br />
QEMU support is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* Enable GRO for packets coming to bridge from a tap interface<br />
<br />
* Better support for windows LRO<br />
Extend virtio-header with statistics for GRO packets:<br />
number of packets coalesced and number of duplicate ACKs coalesced<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not yet support other GSO types: FCOE, GRE, UDP_TUNNEL<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend with virtio-net in QEMU<br />
being what's guest facing.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact a bit of complexity in vhost was put there in the vague hope to<br />
support something like this: virtio rings are not translated through<br />
regular memory tables, instead, vhost gets a pointer to ring address.<br />
This allows qemu acting as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts<br />
<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
* Measure the effect of each of the above-mentioned optimizations<br />
- Use autotest network performance regression testing (that runs netperf)<br />
- Also test any wild idea that works. Some may be useful.<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4787NetworkingTodo2013-05-24T14:02:25Z<p>Mst: another project with no owner</p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Shirley Ma?, MST?<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
The current hlist implementation of the flow caches has several limitations:<br />
1) in the worst case, linear search is slow<br />
2) it does not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net (or partially orphan)<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gratuitous ARP from the guest driver. The guest part is finished.<br />
QEMU support is ongoing.<br />
V7 patches are here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: changing the MAC address in the guest does not update QEMU (info network)<br />
Developer: Amos Kong<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
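For reference, the per-socket knob this polling work exposes to userspace<br />
(what went upstream as SO_BUSY_POLL; virtio-net would additionally need a<br />
driver-side poll hook) looks roughly like this:<br />
<pre>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46		/* added in Linux 3.11 */
#endif

/* Ask the kernel to busy-poll the device queue for up to 'usecs'
 * microseconds when this socket has no data, trading CPU for latency. */
static int enable_busy_poll(int fd, int usecs)
{
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0) {
		perror("setsockopt(SO_BUSY_POLL)");
		return -1;
	}
	return 0;
}
</pre>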
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
so we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non-promiscuous NIC support in the bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not support more GSO types: FCOE, GRE, UDP_TUNNEL.<br />
<br />
* ring aliasing:<br />
using vhost-net as a networking backend, with virtio-net in QEMU<br />
being the guest-facing device.<br />
This gives you the best of both worlds: QEMU acts as a first<br />
line of defense against a malicious guest while still getting the<br />
performance advantages of vhost-net (zero-copy).<br />
In fact, some complexity in vhost was put there in the vague hope of<br />
supporting something like this: virtio rings are not translated through<br />
the regular memory tables; instead, vhost gets a pointer to the ring address.<br />
This allows QEMU to act as a man in the middle,<br />
verifying the descriptors but not touching the packet data.<br />
<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use a flex array (sketched below).<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
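A sketch of the flex array idea (names and limits are illustrative, not<br />
Jason's actual patch): keep the per-queue pointers in a flex_array so the net<br />
core avoids one huge contiguous allocation per device:<br />
<pre>
#include <linux/flex_array.h>
#include <linux/gfp.h>

struct tun_file;			/* per-queue state, as in drivers/net/tun.c */

#define MAX_TAP_QUEUES	256		/* illustrative; today's limit is 8 */

static struct flex_array *tun_alloc_queue_array(void)
{
	struct flex_array *fa;

	fa = flex_array_alloc(sizeof(struct tun_file *), MAX_TAP_QUEUES,
			      GFP_KERNEL);
	if (!fa)
		return NULL;

	/* Preallocate all parts so datapath lookups never allocate. */
	if (flex_array_prealloc(fa, 0, MAX_TAP_QUEUES, GFP_KERNEL)) {
		flex_array_free(fa);
		return NULL;
	}
	return fa;
}
</pre>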
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device (see the sketch below).<br />
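One possible shape for this in a virtio driver (a sketch only; it assumes a<br />
timer elsewhere flushes partial batches, and the names are made up):<br />
<pre>
#include <linux/virtio.h>
#include <linux/scatterlist.h>
#include <linux/gfp.h>

#define TX_BATCH	8	/* arbitrary batch size for illustration */

/* Queue one tx buffer; only notify (kick) the host once per TX_BATCH
 * buffers.  A real implementation also needs a timer or a "flush now"
 * path so a partial batch is not delayed forever. */
static int xmit_coalesced(struct virtqueue *vq, struct scatterlist *sg,
			  unsigned int num, void *data, unsigned int *pending)
{
	int err = virtqueue_add_outbuf(vq, sg, num, data, GFP_ATOMIC);

	if (err)
		return err;

	if (++(*pending) >= TX_BATCH) {
		*pending = 0;
		if (virtqueue_kick_prepare(vq))		/* honours the event index */
			virtqueue_notify(vq);		/* one exit for the whole batch */
	}
	return 0;
}
</pre>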
<br />
* interrupt coalescing<br />
Reduce the number of interrupts.<br />
<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4785NetworkingTodo2013-05-24T11:06:27Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Shirley Ma?, MST?<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default.<br />
This is because GSO tends to batch less when mq is enabled.<br />
https://patchwork.kernel.org/patch/2235191/<br />
Developer: Jason Wang<br />
<br />
* rework on flow caches<br />
Current hlist implementation of flow caches has several limitations:<br />
1) at worst case, linear search will be bad<br />
2) not scale<br />
https://patchwork.kernel.org/patch/2025121/<br />
Developer: Jason Wang<br />
<br />
* eliminate the extra copy in virtio-net driver<br />
We need to do an extra copy of 128 bytes for every packet.<br />
This could be eliminated for small packets by:<br />
1) use build_skb() and head frag<br />
2) bigger vnet header length ( >= NET_SKB_PAD + NET_IP_ALIGN )<br />
Or use a dedicated queue for small packet receiving ? (reordering)<br />
Developer: Jason Wang<br />
<br />
* make pktgen work for virtio-net ( or orphan only partially )<br />
virtio-net orphans the skb during tx,<br />
which makes pktgen wait forever for the refcnt to drop.<br />
Jason's idea: introduce a flag to tell pktgen not to wait<br />
Discussion here: https://patchwork.kernel.org/patch/1800711/<br />
MST's idea: add a .ndo_tx_polling not only for pktgen<br />
Developer: Jason Wang<br />
<br />
* Add HW_VLAN_TX support for tap<br />
Eliminate the extra data moving for tagged packets<br />
Developer: Jason Wang<br />
<br />
* Announce self by guest driver<br />
Send gARP by guest driver. Guest part is finished.<br />
Qemu is ongoing.<br />
V7 patches is here:<br />
http://lists.nongnu.org/archive/html/qemu-devel/2013-03/msg01127.html<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
There are two kinds of netdev polling:<br />
- netpoll - used for debugging<br />
- proposed low latency net polling<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
* more GSO type support:<br />
The kernel does not support more GSO types: FCOE, GRE, UDP_TUNNEL.<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
Jason has a draft patch to use a flex array.<br />
Another thing is to move the flow caches out of tun_struct.<br />
Developer: Jason Wang<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* tx coalescing<br />
Delay several packets before kicking the device.<br />
<br />
* interrupt coalescing<br />
Reduce the number of interrupts.<br />
<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4782NetworkingTodo2013-05-23T21:52:46Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
<br />
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument<br />
<br />
Developer: Shirley Ma?, MST?<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4781NetworkingTodo2013-05-23T21:44:27Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
Developer: Shirley Ma?, MST<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
Old patch here: [PATCH RFC] tun: dma engine support<br />
It does not speed things up. Need to see why and<br />
what can be done.<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4780NetworkingTodo2013-05-23T21:42:30Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
Developer: Shirley Ma?, MST<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
* Bug: e1000 & rtl8139: Change macaddr in guest, but not update to qemu (info network)<br />
Developer: Amos Kong<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* DMA engine (IOAT) use in tun<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues, but we really want<br />
1 queue per guest CPU. The limit comes from net<br />
core, need to teach it to allocate array of<br />
pointers and not array of queues.<br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4774NetworkingTodo2013-05-23T10:43:24Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
www.mail-archive.com/kvm@vger.kernel.org/msg69868.html<br />
Developer: Shirley Ma?, MST<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* DMA engine (IOAT) use in tun<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues <br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4773NetworkingTodo2013-05-23T10:41:36Z<p>Mst: add links and more info. link to low latency net patches</p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
Developer: Shirley Ma?, MST<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
http://comments.gmane.org/gmane.linux.network/266546<br />
Developer: Vlad Yasevich<br />
<br />
* reduce networking latency:<br />
allow handling short packets from softirq or VCPU context<br />
Plan:<br />
We are going through the scheduler 3 times<br />
(could be up to 5 if softirqd is involved)<br />
Consider RX: host irq -> io thread -> VCPU thread -><br />
guest irq -> guest thread.<br />
This adds a lot of latency.<br />
We can cut it by some 1.5x if we do a bit of work<br />
either in the VCPU or softirq context.<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
https://patchwork.kernel.org/patch/1540471/<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* netdev polling for virtio.<br />
See http://lkml.indiana.edu/hypermail/linux/kernel/1303.0/00553.html<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* DMA engine (IOAT) use in tun<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues <br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4772NetworkingTodo2013-05-23T08:48:37Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
TODO: add bugzilla entry links.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
Developer: Shirley Ma?, MST<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
Developer: Vlad Yasevich<br />
<br />
* allow handling short packets from softirq or VCPU context<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* DMA engine (IOAT) use in tun<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues <br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4771NetworkingTodo2013-05-23T08:47:25Z<p>Mst: add Narasimhan, Sriram</p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
Developer: Shirley Ma?, MST<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
Developer: Vlad Yasevich<br />
<br />
* allow handling short packets from softirq or VCPU context<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
* Improve stats, make them more helpful for performance analysis<br />
Developer: Sriram Narasimhan<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* DMA engine (IOAT) use in tun<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues <br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=4770NetworkingTodo2013-05-23T08:42:38Z<p>Mst: rewrote the page. TODO: add BZs, detailed project descriptions.</p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
=== projects in progress. contributions are still very welcome! ===<br />
<br />
* vhost-net scalability tuning: threading for many VMs<br />
Plan: switch to workqueue shared by many VMs<br />
Developer: Shirley Ma?, MST<br />
Testing: netperf guest to guest<br />
<br />
* multiqueue support in macvtap<br />
multiqueue is only supported for tun.<br />
Add support for macvtap.<br />
Developer: Jason Wang<br />
<br />
* enable multiqueue by default<br />
Multiqueue causes regression in some workloads, thus<br />
it is off by default. Detect and enable/disable<br />
automatically so we can make it on by default<br />
Developer: Jason Wang<br />
<br />
* guest programmable mac/vlan filtering with macvtap<br />
Developer: Dragos Tatulea?, Amos Kong<br />
Status: [[GuestProgrammableMacVlanFiltering]]<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
Helps performance and security on noisy LANs<br />
Developer: Vlad Yasevich<br />
<br />
* allow handling short packets from softirq or VCPU context<br />
Testing: netperf TCP RR - should be improved drastically<br />
netperf TCP STREAM guest to host - no regression<br />
Developer: MST<br />
<br />
* Flexible buffers: put virtio header inline with packet data<br />
Developer: MST<br />
<br />
* device failover to allow migration with assigned devices<br />
https://fedoraproject.org/wiki/Features/Virt_Device_Failover<br />
Developer: Gal Hammer, Cole Robinson, Laine Stump, MST<br />
<br />
* Reuse vringh code for better maintainability<br />
Developer: Rusty Russell<br />
<br />
=== projects that are not started yet - no owner ===<br />
<br />
* receive side zero copy<br />
The ideal is a NIC with accelerated RFS support,<br />
So we can feed the virtio rx buffers into the correct NIC queue.<br />
Depends on non promisc NIC support in bridge.<br />
<br />
* IPoIB infiniband bridging<br />
Plan: implement macvtap for ipoib and virtio-ipoib<br />
<br />
* RDMA bridging<br />
<br />
* use kvm eventfd support for injecting level interrupts,<br />
enable vhost by default for level interrupts<br />
<br />
* DMA engine (IOAT) use in tun<br />
<br />
* virtio API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
<br />
=== vague ideas: path to implementation not clear ===<br />
<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
<br />
<br />
* support more queues<br />
We limit TUN to 8 queues <br />
<br />
* irq/numa affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
<br />
* reduce conflict with VCPU thread<br />
if VCPU and networking run on same CPU,<br />
they conflict resulting in bad performance.<br />
Fix that, push vhost thread out to another CPU<br />
more aggressively.<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
<br />
=== testing projects ===<br />
Keeping networking stable is highest priority.<br />
<br />
* Run weekly test on upstream HEAD covering test matrix with autotest<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
<br />
=== test matrix ===<br />
<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest</div>Msthttps://linux-kvm.org/index.php?title=KVM_Forum_2011&diff=3646KVM Forum 20112011-06-27T10:28:50Z<p>Mst: Redirect to KVM_Forum_2011_WIP</p>
<hr />
<div>#REDIRECT [[KVM_Forum_2011_WIP]]<br />
<br />
= KVM Forum 2011: outdated page =<br />
= Vancouver Canada, August 15-16, 2011 =<br />
The KVM Forum 2011 will be held <br />
at the Hyatt Regency Vancouver in Vancouver, Canada on August 15-16, 2011. We will be co-located with LinuxCon North America 2011<br />
<br />
http://events.linuxfoundation.org/events/linuxcon<br />
<br />
== Scope ==<br />
KVM is an industry leading open source hypervisor that provides an ideal<br />
platform for datacenter virtualization, virtual desktop infrastructure,<br />
and cloud computing. Once again, it's time to bring together the<br />
community of developers and users that define the KVM ecosystem for<br />
our annual technical conference. We will discuss the current state of<br />
affairs and plan for the future of KVM, its surrounding infrastructure,<br />
and management tools. So mark your calendar and join us in advancing KVM.<br />
<br />
http://events.linuxfoundation.org/events/kvm-forum/<br />
<br />
== CFP ==<br />
[[KVMForum2011CFP|KVM Forum 2011 CFP]] (now closed, see [[#Schedule|Schedule]])<br />
<br />
== Registration ==<br />
<br />
Please visit this page to register:<br />
<br />
http://events.linuxfoundation.org/events/kvm-forum/register<br />
<br />
== Hotel and Travel ==<br />
The KVM Forum 2011 will be held in Vancouver BC at the Hyatt Regency Vancouver.<br />
See the Linux Foundation's KVM Forum page for more details on hotels and travel.<br />
<br />
http://events.linuxfoundation.org/events/kvm-forum/travel<br />
<br />
== Schedule ==<br />
<br />
'''Monday, August 15th'''<br />
{|<br />
! Time !! Title !! Speaker <br />
|-<br />
|09:00 - 09:15 || colspan="2" align="center"| Welcome<br />
|-<br />
|09:15 - 09:30 || Keynote || <br />
|-<br />
|09:30 - 10:00 || || <br />
|-<br />
|10:00 - 10:30 || || <br />
|-<br />
| 10:30 - 10:45 || colspan="2" align="center"| Break<br />
|-<br />
| 10:45 - 11:15 || || <br />
|-<br />
| 11:15 - 11:45 || || <br />
|-<br />
| 11:45 - 12:15 || || <br />
|-<br />
| 12:15 - 13:30 || colspan="2" align="center"| Lunch<br />
|}<br />
{|<br />
! !! colspan="2"|Track 1 !! colspan="2"|Track 2<br />
|-<br />
! Time !! Title !! Speaker !! Title !! Speaker<br />
|-<br />
| 13:30 - 14:00 || || || || <br />
|-<br />
| 14:00 - 14:30 || || || || <br />
|-<br />
| 14:30 - 15:00 || || || || <br />
|-<br />
| 15:00 - 15:20 || colspan="4" align="center"|Break<br />
|-<br />
|15:20 - 15:50 || || || || <br />
|-<br />
|15:50 - 16:20 || || || || <br />
|-<br />
|16:20 - 16:50 || || || || <br />
|-<br />
|16:50 - 17:10 || colspan="4" align="center"|Break<br />
|-<br />
|17:10 - 19:00 || colspan="4" align="center"|BoFs<br />
|}<br />
<br />
'''Tuesday, August 16th'''<br />
{|<br />
! Time !! Title !! Speaker<br />
|-<br />
| 9:00 - 9:15 || Keynote || <br />
|-<br />
| 9:15 - 9:45 || || <br />
|-<br />
| 9:45 - 10:15 || || <br />
|-<br />
| 10:15 - 10:45 || || <br />
|-<br />
| 10:45 - 11:00 || colspan="2" align="center" | Break<br />
|-<br />
| 11:00 - 11:30 || || <br />
|-<br />
| 11:30 - 12:00 || || <br />
|-<br />
| 12:00 - 12:30 || || <br />
|-<br />
| 12:30 - 13:45 || colspan="2" align="center" | Lunch<br />
|}<br />
{|<br />
! !! colspan="2"|Track 1 !! colspan="2"|Track 2<br />
|-<br />
! Time !! Title !! Speaker !! Title !! Speaker<br />
|-<br />
| 13:45 - 14:15 || || || ||<br />
|-<br />
| 14:15 - 14:45 || || || || <br />
|-<br />
| 14:45 - 15:15 || || || || <br />
|-<br />
| 15:15 - 15:30 || colspan="4" align="center"|Break<br />
|-<br />
| 15:30 - 16:00 || || || || <br />
|-<br />
| 16:00 - 16:30 || || || || <br />
|-<br />
| 16:30 - 17:00 || || || || <br />
|-<br />
| 17:00 - 17:30 || || || || <br />
|-<br />
| 17:15 - 17:30 || colspan="4" align="center"|Closing<br />
|-<br />
| 17:30 - 19:00 || colspan="4" align="center"|BoFs<br />
|}</div>Msthttps://linux-kvm.org/index.php?title=File:Apic-wiring-mess.odp&diff=3645File:Apic-wiring-mess.odp2011-06-26T08:23:56Z<p>Mst: test</p>
<hr />
<div>test</div>Msthttps://linux-kvm.org/index.php?title=NetworkingPerformanceTesting&diff=3454NetworkingPerformanceTesting2010-12-15T19:27:04Z<p>Mst: </p>
<hr />
<div>== Networking Performance Testing ==<br />
This is a summary of performance acceptance criteria for changes in hypervisor virt networking. The matrix of configurations we are interested in is built by combining the possible options. Naturally, the bigger the change, the more exhaustive we want the coverage to be.<br />
<br />
We can get different configurations by selecting different options in the following categories: [[#Networking setup|Networking setup]], [[#CPU setup|CPU setup]], [[#Guest setup|Guest setup]], [[#Traffic load|Traffic load]].<br />
For each of these we are interested in a set of [[#Performance metrics|Performance metrics]].<br />
A test would need to be performed under a controlled Hardware configuration,<br />
for each relevant [[#Hypervisor setup|Hypervisor setup]] and/or [[#Guest setup|Guest setup]] (depending on which change is tested) on the same hardware.<br />
Ideally we'd note the [[#Hardware configuration|Hardware configuration]] and person performing the test to increase the chance it can be reproduced later.<br />
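As a rough illustration (not part of the original criteria), here is how such a matrix can be enumerated as a cross product of the categories above; the concrete option values are placeholders loosely drawn from the notes further down this page.<br />
<pre>
# Sketch only: enumerate the test-configuration matrix as a cross product of
# the option categories.  The concrete values are illustrative placeholders.
from itertools import product

matrix = {
    'networking setup': ['host-to-guest', 'guest-to-host', 'ext-to-guest',
                         'ext-to-host', 'guest-to-guest-local'],
    'cpu setup':        ['defaults (no pinning)', 'qemu pinned', 'qemu + irq pinned'],
    'guest setup':      ['virtio-net', 'virtio-net + vhost'],
    'traffic load':     ['tcp 64B', 'tcp 1500B', 'udp 64B', 'tcp_rr'],
}

for combo in product(*matrix.values()):
    config = dict(zip(matrix.keys(), combo))
    print(config)    # each combination is one run to collect metrics for
</pre>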
<br />
== Performance metrics ==<br />
Generally for a given setup and traffic<br />
we want to know the [[#Latency|Latency]] and the [[#CPU load|CPU load]].<br />
We might care about the minimum, average (or median) and maximum latencies.<br />
<br />
=== Latency ===<br />
Latency is generally the time until you get a response. For some workloads you don't measure latencies directly; instead you measure peak throughput.<br />
<br />
=== CPU load ===<br />
The only metric that makes sense is probably host system load,<br />
of which the only somewhat quantifiable component seems to be the CPU load.<br />
We need to take into account the fact that CPU speed might change<br />
over time, so load should probably be expressed in seconds<br />
(%CPU/speed) rather than plain %CPU.<br />
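A minimal sketch of measuring this by sampling /proc/stat before and after a run (the simple approach mentioned in the log below); it assumes a Linux host, and the 30-second sleep stands in for the actual benchmark run.<br />
<pre>
# Sketch only: report host CPU consumption in CPU-seconds by sampling
# /proc/stat around the benchmark, so the number does not depend on
# reporting a percentage of a possibly varying CPU speed.
import os, time

HZ = os.sysconf('SC_CLK_TCK')          # jiffies per second

def cpu_jiffies():
    # First /proc/stat line: "cpu user nice system idle iowait irq softirq ..."
    fields = [int(x) for x in open('/proc/stat').readline().split()[1:]]
    return sum(fields), fields[3] + fields[4]    # total, idle + iowait

total0, idle0 = cpu_jiffies()
time.sleep(30)                         # ... run netperf / flood ping here ...
total1, idle1 = cpu_jiffies()

busy = (total1 - total0) - (idle1 - idle0)
print('host CPU consumed: %.2f CPU-seconds' % (busy / float(HZ)))
</pre>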
<br />
Some metrics derived from this are:<br />
==== peak throughput ====<br />
How high we can push the load on the system<br />
before latencies sharply become unreasonable<br />
==== service demand ====<br />
Load divided by CPU utilization<br />
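As a rough worked example, using the definition from the netperf manual paraphrased in the log below (CPU time spent per KB transferred): if a run moves about 10 GB while the host burns 20 CPU-seconds, the service demand is roughly 20,000,000 microseconds / 10,000,000 KB, i.e. about 2 microseconds per KB.<br />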
<br />
== Networking setup ==<br />
<br />
== CPU setup ==<br />
<br />
== Guest setup ==<br />
<br />
== Hypervisor setup ==<br />
<br />
== Traffic load ==<br />
<br />
== Available tools ==<br />
<br />
== Hardware configuration ==<br />
<br />
<br />
<mst> yes<br />
<jasonwang> can we let the perf team to do that?<br />
<mst> they likely won't do it in time<br />
<mst> I started making up a list of what we need to measure<br />
<mst> have a bit of time to discuss?<br />
<jasonwang> you mean we need to do it ourself?<br />
<mst> at least part of it<br />
<jasonwang> I'm sorry, I need to attend the autotest meeting in 10 minutes<br />
<jasonwang> mst ok<br />
<mst> will have time afterward?<br />
<mst> I know it's late in your TZ<br />
<jasonwang> ok<br />
<mst> cool, then I'll stay connected on irc just ping me<br />
<jasonwang> ok<br />
<mst> thanks!<br />
<jasonwang> you are welcome<br />
<jasonwang> hi, just back from the meeting<br />
<mst> hi<br />
<mst> okay so let's see what we have<br />
<jasonwang> okay<br />
<mst> first we have the various connection options<br />
<jasonwang> yes<br />
<mst> we can do:<br />
<mst> host to guest<br />
<mst> guest to host<br />
<mst> ext to guest<br />
<mst> ext to host<br />
<mst> guest to guest on local<br />
<jasonwang> ok<br />
<mst> guest to guest across the net<br />
<mst> for comparison it's probably useful to do "baremetal": loopback and external<->host<br />
<jasonwang> yes<br />
<mst> a bit more advanced: bidirectional tests<br />
<mst> many to many is probably too hard to set up<br />
<jasonwang> yes, so we need only test some key options<br />
<mst> yes, for now let's focus on things that are easy to define<br />
<mst> ok now what kind of traffic we care about<br />
<jasonwang> (ext)host to guest, guest to (ext)host ?<br />
<mst> no I mean scheduler is heavily involved<br />
<jasonwang> so guest to guest on local is also needed?<br />
<mst> yes, think so<br />
<mst> so I think we need to try just defaults<br />
<mst> (no pinning)<br />
<jasonwang> yes, that is usual case<br />
<mst> as well as pinned scenario where qemu is pinned to cpus<br />
<jasonwang> ok<br />
<mst> and for external pinning irqs as well<br />
<jasonwang> set irq affinity?<br />
<mst> do you know whether virsh let you pin the iothread?<br />
<mst> yes, affinity<br />
<jasonwang> no, I don't use virsh<br />
<mst> need to find out, only pin what virsh let us pin<br />
<jasonwang> okay<br />
<mst> note vhost-net thread is created on demand, so it is not very practical to pin it<br />
<mst> if we do need this capability it will have to be added, I am hoping scheduler does the right thing<br />
<jasonwang> yes, it's a workqueue in RHEL6.1<br />
<mst> workqueue is just a list + thread, or we can change it if we like<br />
<jasonwang> do you mean if we need it we can use a dedicated thread like upstream, which is easy to pin?<br />
<mst> upstream is not easier to pin<br />
<mst> the issue is mostly that thread is only created on driver OK now<br />
<jasonwang> yes<br />
<mst> so guest can destroy it and recreate and it loses what you set<br />
<mst> in benchmark it works but not for real users<br />
<jasonwang> yes, agree<br />
<mst> maybe cgroups can be used somehow since it inherits the cgroups of the owner<br />
<mst> another option is to let qemu control the pinning<br />
<mst> either let it specify the thread to do the work<br />
<mst> or just add ioctl for pinning<br />
<jasonwang> looks possible<br />
<mst> in mark wagner's tests it seemed to work well without<br />
<mst> so need to see if it's needed, it's not hard to add this interface<br />
<mst> but once we add it must maintain forever<br />
<mst> so I think irq affinity and cpu pinning are two options to try tweaking<br />
<jasonwang> yes, have seen some performance discussion of vhost upstream<br />
<mst> need to make sure we try on a numa box<br />
<mst> at the moment kernel structures are allocated on first use<br />
<jasonwang> yes<br />
<mst> I hope it all fits in cache so should not matter<br />
<mst> but need to check, not yet sure what exactly<br />
<jasonwang> yes, things would be more complicated when using numa<br />
<mst> not sure what exactly are the configurations to check<br />
<mst> ok so we have the network setup and we have the cpu setup<br />
<mst> let thing is traffic to check<br />
<mst> let->last<br />
<jasonwang> yes, TCP_STREAM/UDP_STREAM/TCP_RR and something else?<br />
<mst> let's focus on the protocols first<br />
<mst> so we can do TCP, this has a strange property of coalescing messages<br />
<mst> but OTOH it's the most used protocol<br />
<mst> and it has hard requirements e.g. on the ordering of packets<br />
<jasonwang> yes, TCP must to be tested<br />
<mst> UDP only works well up to MTU packet size<br />
<mst> but otherwise it lets us do pretty low level stuff<br />
<jasonwang> yes, agree<br />
<mst> ICMP is very low level (good), has a disadvantage that it might be special-cased in hardware and software (bad)<br />
<mst> what kind of traffic we care about? ideally a range of message sizes, and a range of loads<br />
<mst> (in terms of messages per second)<br />
<jasonwang> yes<br />
<mst> what do we want to measure?<br />
<jasonwang> bandwidth and latency<br />
<mst> I think this is not really it<br />
<mst> this is what tools like to give us<br />
<jasonwang> yes and maybe also the cpu usage<br />
<mst> if you think about it in terms of an application, it is always latency that you care about in the end<br />
<mst> e.g. I have this huge file what is the latency to send it over the network<br />
<mst> and for us also what is the cpu load, you are right<br />
<jasonwang> yes<br />
<mst> so for a given traffic, which we can approximate by setting message size (both ways) protocol and messages per second<br />
<mst> we want to know the latency and the cpu load<br />
<jasonwang> yes<br />
<mst> and we want the peak e.g. we want to know how high we can go in messages per second until latencies become unreasonable<br />
<mst> this last is a bit subjective<br />
<mst> but generally any system would gradually become less responsive with more load<br />
<mst> then at some point it just breaks<br />
<mst> cou load is a bit hard to define<br />
<mst> cpu<br />
<jasonwang> yes and it looks hard to do the measuring then<br />
<mst> I think in the end, what we care about is how many cpu cycles the host burns<br />
<jasonwang> yes, but how to measure that?<br />
<mst> well we have simple things like /proc/stat<br />
<jasonwang> understood and maybe perf can also help<br />
<mst> yes quite possibly<br />
<mst> in other words we'll need to measure this in parallel while test is running<br />
<mst> netperf can report local/remote CPU<br />
<mst> but I do not understand what it really means<br />
<mst> especially for a guest<br />
<jasonwang> yes, if we want to use netperf it's better to know how it does the calculation<br />
<mst> well it just looks at /proc/stat AFAIK<br />
<jasonwang> yes, I try to take a look at its source<br />
<mst> this is the default but it has other heuristics<br />
<mst> that can be configured at compile time<br />
<jasonwang> ok, understand<br />
<mst> ok and I think load divided by CPU is a useful metric<br />
<jasonwang> so the ideal result is to get how many cpu cycles vhost spends to send or receive a KB<br />
<mst> netperf can report service demand<br />
<mst> I do not understand what it is<br />
<jasonwang> From its manual, it's how many microseconds the cpu spends on a KB<br />
<mst> well the answer will be it depends :)<br />
<mst> also, we have packet loss<br />
<mst> I think at some level we only care about packets that were delivered<br />
<mst> so e.g. with UDP we only care about received messages<br />
<jasonwang> yes, the packet loss may have concerns with guest drivers<br />
<mst> with TCP if you look at messages, there's no loss<br />
<jasonwang> yes TCP has flow control itself<br />
<mst> ok so let's see what tools we have<br />
<mst> the simplest is flood ping<br />
<jasonwang> yes, it's very simple and easy to use<br />
<mst> it gives you control over message size, packets per second, gets you back latency<br />
<mst> it is always bidirectional I think<br />
<mst> and we need to measure CPU ourselves<br />
<mst> that last seems to be true anyway<br />
<jasonwang> yes, maybe easier to understand and analyze than netperf<br />
<mst> packet loss when it occurs complicates things<br />
<mst> e.g. with 50% packet loss the real load is anywhere in between<br />
<jasonwang> yes<br />
<mst> that's the only problem: it's always bidirectional so tx/rx problems are hard to separate<br />
<jasonwang> yes, vhost is currently half-duplex<br />
<mst> I am also not sure it detects reordering<br />
<jasonwang> yes, it has sequence no.<br />
<jasonwang> but for ping, as you've said, it's ICMP, which is not the common case<br />
<mst> ok, next we have netperf<br />
<mst> afaik it can do two things<br />
<mst> it can try sending as many packets as it can<br />
<jasonwang> yes<br />
<mst> or it can send a single one back and forth<br />
<mst> not a lot of data, but ok<br />
<jasonwang> yes<br />
<mst> and similar with UDP<br />
<mst> got to go have lunch<br />
<mst> So I will try and write all this up<br />
<mst> do you have any hardware for testing?<br />
<mst> if yes we'll add it too, I'll put up a wiki<br />
<mst> back in half an hour<br />
<jasonwang> yes, writing it all up would help<br />
<jasonwang> go home now, please send me mail<br />
* jasonwang has quit (Quit: Leaving)<br />
<br />
* Loaded log from Wed Dec 15 15:07:24 2010</div>Msthttps://linux-kvm.org/index.php?title=NetworkingPerformanceTesting&diff=3450NetworkingPerformanceTesting2010-12-15T19:22:26Z<p>Mst: </p>
<hr />
<div>== Networking Performance Testing ==<br />
This is a summary of performance acceptance criteria for changes in hypervisor virt networking. The matrix of configurations we are interested in is built combining possible options. Naturally the bigger a change the more exhaustive would we want the coverage to be.<br />
<br />
We can get different configurations by selecting different options in the following categories: [[#Networking setup|Networking setup]], [[#CPU setup|CPU setup]], [[#Guest setup|Guest setup]], [[#Traffic load|Traffic load]].<br />
For each of these we are interested in a set of [[#Performance metrics|Performance metrics]].<br />
A test would need to be performed under a controlled Hardware configuration,<br />
for each relevant [[#Hypervisor setup|Hypervisor setup]] and/or [[#Guest setup|Guest setup]] (depending on which change is tested) on the same hardware.<br />
Ideally we'd note the [[#Hardware configuration|Hardware configuration]] and person performing the test to increase the chance it can be reproduced later.<br />
<br />
== Performance metrics ==<br />
Generally for a given setup and traffic<br />
we want to know the [[#Latency|Latency]] and the [[#CPU load|CPU load]].<br />
We generally might care about minimal, average (or median) and maximum<br />
latencies.<br />
<br />
Some derive metrics from this are:<br />
==== *peak throughput* i.e. how high we can go<br />
until latencies sharply become unreasonable<br />
==== *service demand*: load divided by CPU<br />
<br />
=== Latency ===<br />
Latency is generally time until you get a response. For some workloads you don't measure latencies directly, instead you measure peak throughput.<br />
<br />
=== CPU load ===<br />
The only metric that makes sense is probably host system load,<br />
of which the only someone quantifiable component seems to be the CPU load.<br />
Need take into account the fact that CPU speed might change<br />
with time, so load should probably be in seconds<br />
(%CPU/speed) rather than plain %CPU.<br />
<br />
== Networking setup ==<br />
<br />
== CPU setup ==<br />
<br />
== Guest setup ==<br />
<br />
== Hypervisor setup ==<br />
<br />
== Traffic load ==<br />
<br />
== Hardware configuration ==<br />
<br />
<br />
<mst> yes<br />
<jasonwang> can we let the perf team to do that?<br />
<mst> they likely won't do it in time<br />
<mst> I started making up a list of what we need to measure<br />
<mst> have a bit of time to discuss?<br />
<jasonwang> you mean we need to do it ourself?<br />
<mst> at least part of it<br />
<jasonwang> I'm sorry, I need to attend the autotest meeting in 10 minutes<br />
<jasonwang> mst ok<br />
<mst> will have time afterward?<br />
<mst> I know it's late in your TZ<br />
<jasonwang> ok<br />
<mst> cool, then I'll stay connected on irc just ping me<br />
<jasonwang> ok<br />
<mst> thanks!<br />
<jasonwang> you are welcome<br />
<jasonwang> hi, just back from the meeting<br />
<mst> hi<br />
<mst> okay so let's see what we have<br />
<jasonwang> okay<br />
<mst> first we have the various connection options<br />
<jasonwang> yes<br />
<mst> we can do:<br />
<mst> host to guest<br />
<mst> guest to host<br />
<mst> ext to guest<br />
<mst> ext to host<br />
<mst> guest to guest on local<br />
<jasonwang> ok<br />
<mst> guest to guest across the net<br />
<mst> for comparison it's probably useful to do "baremetal": loopback and external<->host<br />
<jasonwang> yes<br />
<mst> a bit more advanced: bidirectional tests<br />
<mst> many to many is probably to hard to setup<br />
<jasonwang> yes, so we need only test some key options<br />
<mst> yes, for now let's focus on things that are easy to define<br />
<mst> ok now what kind of traffic we care about<br />
<jasonwang> (ext)host to guest, guest to (ext)host ?<br />
<mst> no I mean scheduler is heavily involved<br />
<jasonwang> so guest to guest on local is also needed?<br />
<mst> yes, think so<br />
<mst> so I think we need to try just defaults<br />
<mst> (no pinning)<br />
<jasonwang> yes, that is usual case<br />
<mst> as well as pinned scenario where qemu is pinned to cpus<br />
<jasonwang> ok<br />
<mst> and for external pinning irqs as well<br />
<jasonwang> set irq affinity?<br />
<mst> do you know whether virsh let you pin the iothread?<br />
<mst> yes, affinity<br />
<jasonwang> no, I don't use virsh<br />
<mst> need to find out, only pin what virsh let us pin<br />
<jasonwang> okay<br />
<mst> note vhost-net thread is created on demand, so it is not very practical to pin it<br />
<mst> if we do need this capability it will have to be added, I am hoping scheduler does the right thing<br />
<jasonwang> yes, it's a workqueue in RHEL6.1<br />
<mst> workqueue is just a list + thread, or we can change it if we like<br />
<jasonwang> do you man if we need we can use a dedicated thread like upstream which is easy to be pinned?<br />
<mst> upstream is not easier to be pinned<br />
<mst> the issue is mostly that thread is only created on driver OK now<br />
<jasonwang> yes<br />
<mst> so guest can destroy it and recreate and it loses what you set<br />
<mst> in benchmark it works but not for real users<br />
<jasonwang> yes, agree<br />
<mst> maybe cgroups can be used somehow since it inherits the cgroups of the owner<br />
<mst> another option is to let qemu control the pinning<br />
<mst> either let it specify the thread to do the work<br />
<mst> or just add ioctl for pinning<br />
<jasonwang> looks possible<br />
<mst> in mark wagner's tests it seemed to work well without<br />
<mst> so need to see if it's needed, it's not hard to add this interface<br />
<mst> but once we add it must maintain forever<br />
<mst> so I think irq affinity and cpu pinning are two options to try tweaking<br />
<jasonwang> yes, have saw some performance discussion of vhost upstream<br />
<mst> need to make sure we try on a numa box<br />
<mst> at the moment kernel structures are allocated on first use<br />
<jasonwang> yes<br />
<mst> I hope it all fits in cache so should not matter<br />
<mst> but need to check, not yet sure what exactly<br />
<jasonwang> yes, things would be more complicated when using numa<br />
<mst> not sure what exactly are the configurations to check<br />
<mst> ok so we have the network setup and we have the cpu setup<br />
<mst> let thing is traffic to check<br />
<mst> let->last<br />
<jasonwang> yes, TCP_STREAM/UDP_STREAM/TCP_RR and something else?<br />
<mst> let's focus on the protocols first<br />
<mst> so we can do TCP, this has a strange property of coalescing messages<br />
<mst> but OTOH it's the most used protocol<br />
<mst> and it has hard requirements e.g. on the ordering of packets<br />
<jasonwang> yes, TCP must to be tested<br />
<mst> UDP is only working well up to mtu packet size<br />
<mst> but otherwise it let us do pretty low level stuff<br />
<jasonwang> yes, agree<br />
<mst> ICMP is very low level (good), has a disadvantage that it might be special-cased in hardware and software (bad)<br />
<mst> what kind of traffic we care about? ideally a range of message sizes, and a range of loads<br />
<mst> (in terms of messages per second)<br />
<jasonwang> yes<br />
<mst> what do we want to measure?<br />
<jasonwang> bandwidth and latency<br />
<mst> I think this not really it<br />
<mst> this is what tools like to give us<br />
<jasonwang> yes and maybe also the cpu usage<br />
<mst> if you think about it in terms of an application, it is always latency that you care about in the end<br />
<mst> e.g. I have this huge file what is the latency to send it over the network<br />
<mst> and for us also what is the cpu load, you are right<br />
<jasonwang> yes<br />
<mst> so for a given traffic, which we can approximate by setting message size (both ways) protocol and messages per second<br />
<mst> we want to know the latency and the cpu load<br />
<jasonwang> yes<br />
<mst> and we want the peak e.g. we want to know how high we can go in messages per second until latencies become unreasonable<br />
<mst> this last is a bit subjective<br />
<mst> but generally any system would gradually become less responsive with more load<br />
<mst> then at some point it just breaks<br />
<mst> cou load is a bit hard to define<br />
<mst> cpu<br />
<jasonwang> yes and it looks hard to do the measuring then<br />
<mst> I think in the end, what we care about is how many cpu cycles the host burns<br />
<jasonwang> yes, but how to measure that?<br />
<mst> well we have simple things like /proc/stat<br />
<jasonwang> understood and maybe perf can also help<br />
<mst> yes quite possibly<br />
<mst> in other words we'll need to measure this in parallel while test is running<br />
<mst> netperf can report local/remote CPU<br />
<mst> but I do not understand what it really means<br />
<mst> especially for a guest<br />
<jasonwang> yes, if we want to use netperf it's better to know how it does the calculation<br />
<mst> well it just looks at /proc/stat AFAIK<br />
<jasonwang> yes, I try to take a look at its source<br />
<mst> this is the default but it has other heuristics<br />
<mst> that can be configured at compile time<br />
<jasonwang> ok, understand<br />
<mst> ok and I think load divided by CPU is a useful metric<br />
<jasonwang> so the ideal result is to get how many cpu cycles vhost spends to send or receive a KB<br />
<mst> netperf can report service demand<br />
<mst> I do not understand what it is<br />
<jasonwang> From its manual it's how many us the cpu spends on a KB<br />
<mst> well the answer will be it depends :)<br />
<mst> also, we have packet loss<br />
<mst> I think at some level we only care about packets that were delivered<br />
<mst> so e.g. with UDP we only care about received messages<br />
<jasonwang> yes, the packet loss may have concerns with guest drivers<br />
<mst> with TCP if you look at messages, there's no loss<br />
<jasonwang> yes, TCP has flow control itself<br />
<mst> ok so let's see what tools we have<br />
<mst> the simplest is flood ping<br />
<jasonwang> yes, it's very simple and easy to use<br />
<mst> it gives you control over message size, packets per second, gets you back latency<br />
<mst> it is always bidirectional I think<br />
<mst> and we need to measure CPU ourselves<br />
<mst> that last seems to be true anyway<br />
<jasonwang> yes, maybe easier to understand and analyze than netperf<br />
<mst> packet loss when it occurs complicates things<br />
<mst> e.g. with 50% packet loss the real load is anywhere in between<br />
<jasonwang> yes<br />
<mst> that's the only problem: it's always bidirectional so tx/rx problems are hard to separate<br />
<jasonwang> yes, vhost is currently half-duplex<br />
<mst> I am also not sure it detects reordering<br />
<jasonwang> yes, it has sequence no.<br />
<jasonwang> but for ping, as you've said it's ICMP and was not the most of the cases<br />
<mst> ok, next we have netperf<br />
<mst> afaik it can do two things<br />
<mst> it can try sending as many packets as it can<br />
<jasonwang> yes<br />
<mst> or it can send a single one back and forth<br />
<mst> not a lot of data, but ok<br />
<jasonwang> yes<br />
<mst> and similar with UDP<br />
<mst> got to go have lunch<br />
<mst> So I will try and write all this up<br />
<mst> do you have any hardware for testing?<br />
<mst> if yes we'll add it too, I'll put up a wiki<br />
<mst> back in half an hour<br />
<jasonwang> yes, write all things up would help<br />
<jasonwang> go home now, please send me mail<br />
* jasonwang has quit (Quit: Leaving)<br />
<br />
* Loaded log from Wed Dec 15 15:07:24 2010</div>Msthttps://linux-kvm.org/index.php?title=NetworkingPerformanceTesting&diff=3449NetworkingPerformanceTesting2010-12-15T19:20:59Z<p>Mst: </p>
<hr />
<div>== Networking Performance Testing ==<br />
This is a summary of performance acceptance criteria for changes in hypervisor virt networking. The matrix of configurations we are interested in is built by combining possible options. Naturally, the bigger the change, the more exhaustive we want the coverage to be.<br />
<br />
We can get different configurations by selecting different options in the following categories: [[#Networking setup|Networking setup]], [[#CPU setup|CPU setup]], [[#Guest setup|Guest setup]], [[#Traffic load|Traffic load]].<br />
For each of these we are interested in a set of [[#Performance metrics|Performance metrics]].<br />
A test would need to be performed under a controlled Hardware configuration,<br />
for each relevant [[#Hypervisor setup|Hypervisor setup]] and/or [[#Guest setup|Guest setup]] (depending on which change is tested) on the same hardware.<br />
Ideally we'd note the [[#Hardware configuration|Hardware configuration]] and person performing the test to increase the chance it can be reproduced later.<br />
<br />
== Performance metrics ==<br />
Generally for a given setup and traffic<br />
we want to know the [[#Latency|Latency]] and the [[#CPU load|CPU load]].<br />
We generally might care about minimal, average (or median) and maximum<br />
latencies.<br />
<br />
Some derived metrics from these are:<br />
- *peak throughput* i.e. how high we can go<br />
until latencies sharply become unreasonable<br />
- *service demand*: CPU load divided by throughput, i.e. CPU time per unit of work<br />
<br />
=== Latency ===<br />
Latency is generally the time until you get a response. For some workloads you don't measure latencies directly; instead you measure peak throughput.<br />
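For illustration, a minimal sketch of measuring request/response latency directly against an echo-style server (the address, port, message size and round count below are placeholders, not part of any existing setup):<br />
<pre>
#!/usr/bin/env python
# Sketch: measure request/response latency against an echo-style server
# and report min/avg/max. HOST, PORT, MSG_SIZE and ROUNDS are placeholders.
import socket
import time

HOST, PORT, MSG_SIZE, ROUNDS = "192.168.122.10", 7, 64, 1000

def measure_rr_latency():
    s = socket.create_connection((HOST, PORT))
    msg = b"x" * MSG_SIZE
    samples = []
    for _ in range(ROUNDS):
        t0 = time.time()
        s.sendall(msg)
        received = 0
        while received < MSG_SIZE:          # wait for the full echo
            chunk = s.recv(MSG_SIZE - received)
            if not chunk:
                raise RuntimeError("connection closed")
            received += len(chunk)
        samples.append(time.time() - t0)
    s.close()
    samples.sort()
    return samples[0], sum(samples) / len(samples), samples[-1]

if __name__ == "__main__":
    lo, avg, hi = measure_rr_latency()
    print("min/avg/max latency (ms): %.3f %.3f %.3f" % (lo * 1e3, avg * 1e3, hi * 1e3))
</pre>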
<br />
=== CPU load ===<br />
The only metric that makes sense is probably host system load,<br />
of which the only somewhat quantifiable component seems to be the CPU load.<br />
Need to take into account the fact that CPU speed might change<br />
with time, so load should probably be in CPU seconds<br />
(%CPU/speed) rather than plain %CPU.<br />
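One rough way to sample that is to read /proc/stat around the test and report busy CPU-seconds; a minimal sketch (field layout per proc(5); the sleep stands in for whatever workload is being measured):<br />
<pre>
#!/usr/bin/env python
# Sketch: sample host CPU consumption from /proc/stat around a test run and
# report busy CPU-seconds, so results stay comparable across CPU speeds.
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")    # kernel ticks per second (usually 100)

def read_cpu_ticks():
    with open("/proc/stat") as f:
        fields = [int(v) for v in f.readline().split()[1:]]
    idle = fields[3] + fields[4]      # idle + iowait
    return sum(fields), idle

def measure(run_test):
    total0, idle0 = read_cpu_ticks()
    t0 = time.time()
    run_test()                        # e.g. kick off netperf or flood ping here
    wall = time.time() - t0
    total1, idle1 = read_cpu_ticks()
    busy_seconds = ((total1 - total0) - (idle1 - idle0)) / float(CLK_TCK)
    return wall, busy_seconds

if __name__ == "__main__":
    wall, busy = measure(lambda: time.sleep(10))
    print("wall %.1fs, host cpu %.2f cpu-seconds" % (wall, busy))
</pre>
Dividing the busy CPU-seconds by the amount of data moved gives a service-demand style number.<br />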
<br />
== Networking setup ==<br />
<br />
== CPU setup ==<br />
<br />
== Guest setup ==<br />
<br />
== Hypervisor setup ==<br />
<br />
== Traffic load ==<br />
<br />
== Hardware configuration ==<br />
<br />
<br />
<mst> yes<br />
<jasonwang> can we let the perf team do that?<br />
<mst> they likely won't do it in time<br />
<mst> I started making up a list of what we need to measure<br />
<mst> have a bit of time to discuss?<br />
<jasonwang> you mean we need to do it ourselves?<br />
<mst> at least part of it<br />
<jasonwang> I'm sorry, I need to attend the autotest meeting in 10 minutes<br />
<jasonwang> mst ok<br />
<mst> will have time afterward?<br />
<mst> I know it's late in your TZ<br />
<jasonwang> ok<br />
<mst> cool, then I'll stay connected on irc just ping me<br />
<jasonwang> ok<br />
<mst> thanks!<br />
<jasonwang> you are welcome<br />
<jasonwang> hi, just back from the meeting<br />
<mst> hi<br />
<mst> okay so let's see what we have<br />
<jasonwang> okay<br />
<mst> first we have the various connection options<br />
<jasonwang> yes<br />
<mst> we can do:<br />
<mst> host to guest<br />
<mst> guest to host<br />
<mst> ext to guest<br />
<mst> ext to host<br />
<mst> guest to guest on local<br />
<jasonwang> ok<br />
<mst> guest to guest across the net<br />
<mst> for comparison it's probably useful to do "baremetal": loopback and external<->host<br />
<jasonwang> yes<br />
<mst> a bit more advanced: bidirectional tests<br />
<mst> many to many is probably too hard to set up<br />
<jasonwang> yes, so we only need to test some key options<br />
<mst> yes, for now let's focus on things that are easy to define<br />
<mst> ok now what kind of traffic we care about<br />
<jasonwang> (ext)host to guest, guest to (ext)host ?<br />
<mst> no I mean scheduler is heavily involved<br />
<jasonwang> so guest to guest on local is also needed?<br />
<mst> yes, think so<br />
<mst> so I think we need to try just defaults<br />
<mst> (no pinning)<br />
<jasonwang> yes, that is usual case<br />
<mst> as well as pinned scenario where qemu is pinned to cpus<br />
<jasonwang> ok<br />
<mst> and for external pinning irqs as well<br />
<jasonwang> set irq affinity?<br />
<mst> do you know whether virsh let you pin the iothread?<br />
<mst> yes, affinity<br />
<jasonwang> no, I don't use virsh<br />
<mst> need to find out, only pin what virsh let us pin<br />
<jasonwang> okay<br />
<mst> note vhost-net thread is created on demand, so it is not very practical to pin it<br />
<mst> if we do need this capability it will have to be added, I am hoping scheduler does the right thing<br />
<jasonwang> yes, it's a workqueue in RHEL6.1<br />
<mst> workqueue is just a list + thread, or we can change it if we like<br />
<jasonwang> do you mean if we need it we can use a dedicated thread like upstream, which is easy to pin?<br />
<mst> upstream is not easier to pin<br />
<mst> the issue is mostly that the thread is only created on DRIVER_OK now<br />
<jasonwang> yes<br />
<mst> so guest can destroy it and recreate and it loses what you set<br />
<mst> in benchmark it works but not for real users<br />
<jasonwang> yes, agree<br />
<mst> maybe cgroups can be used somehow since it inherits the cgroups of the owner<br />
<mst> another option is to let qemu control the pinning<br />
<mst> either let it specify the thread to do the work<br />
<mst> or just add ioctl for pinning<br />
<jasonwang> looks possible<br />
<mst> in mark wagner's tests it seemed to work well without<br />
<mst> so need to see if it's needed, it's not hard to add this interface<br />
<mst> but once we add it must maintain forever<br />
<mst> so I think irq affinity and cpu pinning are two options to try tweaking<br />
<jasonwang> yes, have seen some performance discussion of vhost upstream<br />
<mst> need to make sure we try on a numa box<br />
<mst> at the moment kernel structures are allocated on first use<br />
<jasonwang> yes<br />
<mst> I hope it all fits in cache so should not matter<br />
<mst> but need to check, not yet sure what exactly<br />
<jasonwang> yes, things would be more complicated when using numa<br />
<mst> not sure what exactly are the configurations to check<br />
<mst> ok so we have the network setup and we have the cpu setup<br />
<mst> let thing is traffic to check<br />
<mst> let->last<br />
<jasonwang> yes, TCP_STREAM/UDP_STREAM/TCP_RR and something else?<br />
<mst> let's focus on the protocols first<br />
<mst> so we can do TCP, this has a strange property of coalescing messages<br />
<mst> but OTOH it's the most used protocol<br />
<mst> and it has hard requirements e.g. on the ordering of packets<br />
<jasonwang> yes, TCP must be tested<br />
<mst> UDP only works well up to mtu packet size<br />
<mst> but otherwise it lets us do pretty low level stuff<br />
<jasonwang> yes, agree<br />
<mst> ICMP is very low level (good), has a disadvantage that it might be special-cased in hardware and software (bad)<br />
<mst> what kind of traffic we care about? ideally a range of message sizes, and a range of loads<br />
<mst> (in terms of messages per second)<br />
<jasonwang> yes<br />
<mst> what do we want to measure?<br />
<jasonwang> bandwidth and latency<br />
<mst> I think this is not really it<br />
<mst> this is what tools like to give us<br />
<jasonwang> yes and maybe also the cpu usage<br />
<mst> if you think about it in terms of an application, it is always latency that you care about in the end<br />
<mst> e.g. I have this huge file what is the latency to send it over the network<br />
<mst> and for us also what is the cpu load, you are right<br />
<jasonwang> yes<br />
<mst> so for a given traffic, which we can approximate by setting message size (both ways) protocol and messages per second<br />
<mst> we want to know the latency and the cpu load<br />
<jasonwang> yes<br />
<mst> and we want the peak e.g. we want to know how high we can go in messages per second until latencies become unreasonable<br />
<mst> this last is a bit subjective<br />
<mst> but generally any system would gradually become less responsive with more load<br />
<mst> then at some point it just breaks<br />
<mst> cou load is a bit hard to define<br />
<mst> cpu<br />
<jasonwang> yes and it looks hard to do the measuring then<br />
<mst> I think in the end, what we care about is how many cpu cycles the host burns<br />
<jasonwang> yes, but how to measure that?<br />
<mst> well we have simple things like /proc/stat<br />
<jasonwang> understood and maybe perf can also help<br />
<mst> yes quite possibly<br />
<mst> in other words we'll need to measure this in parallel while test is running<br />
<mst> netperf can report local/remote CPU<br />
<mst> but I do not understand what it really means<br />
<mst> especially for a guest<br />
<jasonwang> yes, if we want to use netperf it's better to know how it does the calculation<br />
<mst> well it just looks at /proc/stat AFAIK<br />
<jasonwang> yes, I try to take a look at its source<br />
<mst> this is the default but it has other heuristics<br />
<mst> that can be configured at compile time<br />
<jasonwang> ok, understand<br />
<mst> ok and I think load divided by CPU is a useful metric<br />
<jasonwang> so the ideal result is to get how many cpu cycles vhost spends to send or receive a KB<br />
<mst> netperf can report service demand<br />
<mst> I do not understand what it is<br />
<jasonwang> From its manual it's how many us the cpu spends on a KB<br />
<mst> well the answer will be it depends :)<br />
<mst> also, we have packet loss<br />
<mst> I think at some level we only care about packets that were delivered<br />
<mst> so e.g. with UDP we only care about received messages<br />
<jasonwang> yes, the packet loss may have concerns with guest drivers<br />
<mst> with TCP if you look at messages, there's no loss<br />
<jasonwang> yes, TCP has flow control itself<br />
<mst> ok so let's see what tools we have<br />
<mst> the simplest is flood ping<br />
<jasonwang> yes, it's very simple and easy to use<br />
<mst> it gives you control over message size, packets per second, gets you back latency<br />
<mst> it is always bidirectional I think<br />
<mst> and we need to measure CPU ourselves<br />
<mst> that last seems to be true anyway<br />
<jasonwang> yes, maybe easier to understand and analyze than netperf<br />
<mst> packet loss when it occurs complicates things<br />
<mst> e.g. with 50% packet loss the real load is anywhere in between<br />
<jasonwang> yes<br />
<mst> that's the only problem: it's always bidirectional so tx/rx problems are hard to separate<br />
<jasonwang> yes, vhost is currently half-duplex<br />
<mst> I am also not sure it detects reordering<br />
<jasonwang> yes, it has sequence no.<br />
<jasonwang> but for ping, as you've said it's ICMP and was not the most of the cases<br />
<mst> ok, next we have netperf<br />
<mst> afaik it can do two things<br />
<mst> it can try sending as many packets as it can<br />
<jasonwang> yes<br />
<mst> or it can send a single one back and forth<br />
<mst> not a lot of data, but ok<br />
<jasonwang> yes<br />
<mst> and similar with UDP<br />
<mst> got to go have lunch<br />
<mst> So I will try and write all this up<br />
<mst> do you have any hardware for testing?<br />
<mst> if yes we'll add it too, I'll put up a wiki<br />
<mst> back in half an hour<br />
<jasonwang> yes, write all things up would help<br />
<jasonwang> go home now, please send me mail<br />
* jasonwang has quit (Quit: Leaving)<br />
<br />
* Loaded log from Wed Dec 15 15:07:24 2010</div>Msthttps://linux-kvm.org/index.php?title=NetworkingPerformanceTesting&diff=3448NetworkingPerformanceTesting2010-12-15T17:29:37Z<p>Mst: </p>
<hr />
<div>== Networking Performance Testing ==<br />
This is a summary of performance acceptance criteria for changes in hypervisor virt networking. The matrix of configurations we are interested in is built by combining possible options. Naturally, the bigger the change, the more exhaustive we want the coverage to be.<br />
<br />
We can get different configurations by selecting different options in the following categories: [[#Networking setup|Networking setup]], [[#CPU setup|CPU setup]], [[#Guest setup|Guest setup]], [[#Traffic load|Traffic load]].<br />
For each of these we are interested in a set of [[#Performance metrics|Performance metrics]].<br />
A test would need to be performed under a controlled Hardware configuration,<br />
for each relevant [[#Hypervisor setup|Hypervisor setup]] and/or [[#Guest setup|Guest setup]] (depending on which change is tested) on the same hardware.<br />
Ideally we'd note the [[#Hardware configuration|Hardware configuration]] and person performing the test to increase the chance it can be reproduced later.<br />
<br />
== Performance metrics ==<br />
<br />
== Networking setup ==<br />
<br />
== CPU setup ==<br />
<br />
== Guest setup ==<br />
<br />
== Hypervisor setup ==<br />
<br />
== Traffic load ==<br />
<br />
== Hardware configuration ==<br />
<br />
<br />
<mst> yes<br />
<jasonwang> can we let the perf team do that?<br />
<mst> they likely won't do it in time<br />
<mst> I started making up a list of what we need to measure<br />
<mst> have a bit of time to discuss?<br />
<jasonwang> you mean we need to do it ourselves?<br />
<mst> at least part of it<br />
<jasonwang> I'm sorry, I need to attend the autotest meeting in 10 minutes<br />
<jasonwang> mst ok<br />
<mst> will have time afterward?<br />
<mst> I know it's late in your TZ<br />
<jasonwang> ok<br />
<mst> cool, then I'll stay connected on irc just ping me<br />
<jasonwang> ok<br />
<mst> thanks!<br />
<jasonwang> you are welcome<br />
<jasonwang> hi, just back from the meeting<br />
<mst> hi<br />
<mst> okay so let's see what we have<br />
<jasonwang> okay<br />
<mst> first we have the various connection options<br />
<jasonwang> yes<br />
<mst> we can do:<br />
<mst> host to guest<br />
<mst> guest to host<br />
<mst> ext to guest<br />
<mst> ext to host<br />
<mst> guest to guest on local<br />
<jasonwang> ok<br />
<mst> guest to guest across the net<br />
<mst> for comparison it's probably useful to do "baremetal": loopback and external<->host<br />
<jasonwang> yes<br />
<mst> a bit more advanced: bidirectional tests<br />
<mst> many to many is probably too hard to set up<br />
<jasonwang> yes, so we only need to test some key options<br />
<mst> yes, for now let's focus on things that are easy to define<br />
<mst> ok now what kind of traffic we care about<br />
<jasonwang> (ext)host to guest, guest to (ext)host ?<br />
<mst> no I mean scheduler is heavily involved<br />
<jasonwang> so guest to guest on local is also needed?<br />
<mst> yes, think so<br />
<mst> so I think we need to try just defaults<br />
<mst> (no pinning)<br />
<jasonwang> yes, that is usual case<br />
<mst> as well as pinned scenario where qemu is pinned to cpus<br />
<jasonwang> ok<br />
<mst> and for external pinning irqs as well<br />
<jasonwang> set irq affinity?<br />
<mst> do you know whether virsh let you pin the iothread?<br />
<mst> yes, affinity<br />
<jasonwang> no, I don't use virsh<br />
<mst> need to find out, only pin what virsh let us pin<br />
<jasonwang> okay<br />
<mst> note vhost-net thread is created on demand, so it is not very practical to pin it<br />
<mst> if we do need this capability it will have to be added, I am hoping scheduler does the right thing<br />
<jasonwang> yes, it's a workqueue in RHEL6.1<br />
<mst> workqueue is just a list + thread, or we can change it if we like<br />
<jasonwang> do you mean if we need it we can use a dedicated thread like upstream, which is easy to pin?<br />
<mst> upstream is not easier to pin<br />
<mst> the issue is mostly that the thread is only created on DRIVER_OK now<br />
<jasonwang> yes<br />
<mst> so guest can destroy it and recreate and it loses what you set<br />
<mst> in benchmark it works but not for real users<br />
<jasonwang> yes, agree<br />
<mst> maybe cgroups can be used somehow since it inherits the cgroups of the owner<br />
<mst> another option is to let qemu control the pinning<br />
<mst> either let it specify the thread to do the work<br />
<mst> or just add ioctl for pinning<br />
<jasonwang> looks possible<br />
<mst> in mark wagner's tests it seemed to work well without<br />
<mst> so need to see if it's needed, it's not hard to add this interface<br />
<mst> but once we add it must maintain forever<br />
<mst> so I think irq affinity and cpu pinning are two options to try tweaking<br />
<jasonwang> yes, have seen some performance discussion of vhost upstream<br />
<mst> need to make sure we try on a numa box<br />
<mst> at the moment kernel structures are allocated on first use<br />
<jasonwang> yes<br />
<mst> I hope it all fits in cache so should not matter<br />
<mst> but need to check, not yet sure what exactly<br />
<jasonwang> yes, things would be more complicated when using numa<br />
<mst> not sure what exactly are the configurations to check<br />
<mst> ok so we have the network setup and we have the cpu setup<br />
<mst> let thing is traffic to check<br />
<mst> let->last<br />
<jasonwang> yes, TCP_STREAM/UDP_STREAM/TCP_RR and something else?<br />
<mst> let's focus on the protocols first<br />
<mst> so we can do TCP, this has a strange property of coalescing messages<br />
<mst> but OTOH it's the most used protocol<br />
<mst> and it has hard requirements e.g. on the ordering of packets<br />
<jasonwang> yes, TCP must be tested<br />
<mst> UDP only works well up to mtu packet size<br />
<mst> but otherwise it lets us do pretty low level stuff<br />
<jasonwang> yes, agree<br />
<mst> ICMP is very low level (good), has a disadvantage that it might be special-cased in hardware and software (bad)<br />
<mst> what kind of traffic we care about? ideally a range of message sizes, and a range of loads<br />
<mst> (in terms of messages per second)<br />
<jasonwang> yes<br />
<mst> what do we want to measure?<br />
<jasonwang> bandwidth and latency<br />
<mst> I think this is not really it<br />
<mst> this is what tools like to give us<br />
<jasonwang> yes and maybe also the cpu usage<br />
<mst> if you think about it in terms of an application, it is always latency that you care about in the end<br />
<mst> e.g. I have this huge file what is the latency to send it over the network<br />
<mst> and for us also what is the cpu load, you are right<br />
<jasonwang> yes<br />
<mst> so for a given traffic, which we can approximate by setting message size (both ways) protocol and messages per second<br />
<mst> we want to know the latency and the cpu load<br />
<jasonwang> yes<br />
<mst> and we want the peak e.g. we want to know how high we can go in messages per second until latencies become unreasonable<br />
<mst> this last is a bit subjective<br />
<mst> but generally any system would gradually become less responsive with more load<br />
<mst> then at some point it just breaks<br />
<mst> cou load is a bit hard to define<br />
<mst> cpu<br />
<jasonwang> yes and it looks hard to do the measuring then<br />
<mst> I think in the end, what we care about is how many cpu cycles the host burns<br />
<jasonwang> yes, but how to measure that?<br />
<mst> well we have simple things like /proc/stat<br />
<jasonwang> understood and maybe perf can also help<br />
<mst> yes quite possibly<br />
<mst> in other words we'll need to measure this in parallel while test is running<br />
<mst> netperf can report local/remote CPU<br />
<mst> but I do not understand what it really means<br />
<mst> especially for a guest<br />
<jasonwang> yes, if we want to use netperf it's better to know how it does the calculation<br />
<mst> well it just looks at /proc/stat AFAIK<br />
<jasonwang> yes, I try to take a look at its source<br />
<mst> this is the default but it has other heuristics<br />
<mst> that can be configured at compile time<br />
<jasonwang> ok, understand<br />
<mst> ok and I think load divided by CPU is a useful metric<br />
<jasonwang> so the ideal result is to get how many cpu cycles vhost spends to send or receive a KB<br />
<mst> netperf can report service demand<br />
<mst> I do not understand what it is<br />
<jasonwang> From its manual it's how many us the cpu spends on a KB<br />
<mst> well the answer will be it depends :)<br />
<mst> also, we have packet loss<br />
<mst> I think at some level we only care about packets that were delivered<br />
<mst> so e.g. with UDP we only care about received messages<br />
<jasonwang> yes, the packet loss may have concerns with guest drivers<br />
<mst> with TCP if you look at messages, there's no loss<br />
<jasonwang> yes, TCP has flow control itself<br />
<mst> ok so let's see what tools we have<br />
<mst> the simplest is flood ping<br />
<jasonwang> yes, it's very simple and easy to use<br />
<mst> it gives you control over message size, packets per second, gets you back latency<br />
<mst> it is always bidirectional I think<br />
<mst> and we need to measure CPU ourselves<br />
<mst> that last seems to be true anyway<br />
<jasonwang> yes, maybe easier to understand and analyze than netperf<br />
<mst> packet loss when it occurs complicates things<br />
<mst> e.g. with 50% packet loss the real load is anywhere in between<br />
<jasonwang> yes<br />
<mst> that's the only problem: it's always bidirectional so tx/rx problems are hard to separate<br />
<jasonwang> yes, vhost is currently half-duplex<br />
<mst> I am also not sure it detects reordering<br />
<jasonwang> yes, it has sequence no.<br />
<jasonwang> but for ping, as you've said it's ICMP and was not the most of the cases<br />
<mst> ok, next we have netperf<br />
<mst> afaik it can do two things<br />
<mst> it can try sending as many packets as it can<br />
<jasonwang> yes<br />
<mst> or it can send a single one back and forth<br />
<mst> not a lot of data, but ok<br />
<jasonwang> yes<br />
<mst> and similar with UDP<br />
<mst> got to go have lunch<br />
<mst> So I will try and write all this up<br />
<mst> do you have any hardware for testing?<br />
<mst> if yes we'll add it too, I'll put up a wiki<br />
<mst> back in half an hour<br />
<jasonwang> yes, write all things up would help<br />
<jasonwang> go home now, please send me mail<br />
* jasonwang has quit (Quit: Leaving)<br />
<br />
* Loaded log from Wed Dec 15 15:07:24 2010</div>Msthttps://linux-kvm.org/index.php?title=NetworkingPerformanceTesting&diff=3447NetworkingPerformanceTesting2010-12-15T17:25:50Z<p>Mst: headers filed in</p>
<hr />
<div>== Networking Performance Testing ==<br />
This is a summary of performance acceptance criteria for changes in hypervisor virt networking. The matrix of configurations we are interested in is built by combining possible options. Naturally, the bigger the change, the more exhaustive we want the coverage to be.<br />
<br />
We can get different configurations by selecting different options in the following categories: [[#Networking setup]], [[#CPU setup]], [[#Guest setup]], [[#Traffic load]].<br />
For each of these we are interested in a set of [[#Performance metrics]].<br />
A test would need to be performed under a controlled Hardware configuration,<br />
for each relevant [[#Hypervisor setup]] and/or [[#Guest setup]] (depending on which change is tested) on the same hardware.<br />
Ideally we'd note the [[#Hardware configuration]] and person performing the test to increase the chance it can be reproduced later.<br />
<br />
== Performance metrics ==<br />
<br />
== Networking setup ==<br />
<br />
== CPU setup ==<br />
<br />
== Guest setup ==<br />
<br />
== Hypervisor setup ==<br />
<br />
== Traffic load ==<br />
<br />
== Hardware configuration ==<br />
<br />
<br />
<mst> yes<br />
<jasonwang> can we let the perf team do that?<br />
<mst> they likely won't do it in time<br />
<mst> I started making up a list of what we need to measure<br />
<mst> have a bit of time to discuss?<br />
<jasonwang> you mean we need to do it ourselves?<br />
<mst> at least part of it<br />
<jasonwang> I'm sorry, I need to attend the autotest meeting in 10 minutes<br />
<jasonwang> mst ok<br />
<mst> will have time afterward?<br />
<mst> I know it's late in your TZ<br />
<jasonwang> ok<br />
<mst> cool, then I'll stay connected on irc just ping me<br />
<jasonwang> ok<br />
<mst> thanks!<br />
<jasonwang> you are welcome<br />
<jasonwang> hi, just back from the meeting<br />
<mst> hi<br />
<mst> okay so let's see what we have<br />
<jasonwang> okay<br />
<mst> first we have the various connection options<br />
<jasonwang> yes<br />
<mst> we can do:<br />
<mst> host to guest<br />
<mst> guest to host<br />
<mst> ext to guest<br />
<mst> ext to host<br />
<mst> guest to guest on local<br />
<jasonwang> ok<br />
<mst> guest to guest across the net<br />
<mst> for comparison it's probably useful to do "baremetal": loopback and external<->host<br />
<jasonwang> yes<br />
<mst> a bit more advanced: bidirectional tests<br />
<mst> many to many is probably too hard to set up<br />
<jasonwang> yes, so we only need to test some key options<br />
<mst> yes, for now let's focus on things that are easy to define<br />
<mst> ok now what kind of traffic we care about<br />
<jasonwang> (ext)host to guest, guest to (ext)host ?<br />
<mst> no I mean scheduler is heavily involved<br />
<jasonwang> so guest to guest on local is also needed?<br />
<mst> yes, think so<br />
<mst> so I think we need to try just defaults<br />
<mst> (no pinning)<br />
<jasonwang> yes, that is usual case<br />
<mst> as well as pinned scenario where qemu is pinned to cpus<br />
<jasonwang> ok<br />
<mst> and for external pinning irqs as well<br />
<jasonwang> set irq affinity?<br />
<mst> do you know whether virsh let you pin the iothread?<br />
<mst> yes, affinity<br />
<jasonwang> no, I don't use virsh<br />
<mst> need to find out, only pin what virsh let us pin<br />
<jasonwang> okay<br />
<mst> note vhost-net thread is created on demand, so it is not very practical to pin it<br />
<mst> if we do need this capability it will have to be added, I am hoping scheduler does the right thing<br />
<jasonwang> yes, it's a workqueue in RHEL6.1<br />
<mst> workqueue is just a list + thread, or we can change it if we like<br />
<jasonwang> do you mean if we need it we can use a dedicated thread like upstream, which is easy to pin?<br />
<mst> upstream is not easier to pin<br />
<mst> the issue is mostly that the thread is only created on DRIVER_OK now<br />
<jasonwang> yes<br />
<mst> so guest can destroy it and recreate and it loses what you set<br />
<mst> in benchmark it works but not for real users<br />
<jasonwang> yes, agree<br />
<mst> maybe cgroups can be used somehow since it inherits the cgroups of the owner<br />
<mst> another option is to let qemu control the pinning<br />
<mst> either let it specify the thread to do the work<br />
<mst> or just add ioctl for pinning<br />
<jasonwang> looks possible<br />
<mst> in mark wagner's tests it seemed to work well without<br />
<mst> so need to see if it's needed, it's not hard to add this interface<br />
<mst> but once we add it must maintain forever<br />
<mst> so I think irq affinity and cpu pinning are two options to try tweaking<br />
<jasonwang> yes, have seen some performance discussion of vhost upstream<br />
<mst> need to make sure we try on a numa box<br />
<mst> at the moment kernel structures are allocated on first use<br />
<jasonwang> yes<br />
<mst> I hope it all fits in cache so should not matter<br />
<mst> but need to check, not yet sure what exactly<br />
<jasonwang> yes, things would be more complicated when using numa<br />
<mst> not sure what exactly are the configurations to check<br />
<mst> ok so we have the network setup and we have the cpu setup<br />
<mst> let thing is traffic to check<br />
<mst> let->last<br />
<jasonwang> yes, TCP_STREAM/UDP_STREAM/TCP_RR and something else?<br />
<mst> let's focus on the protocols first<br />
<mst> so we can do TCP, this has a strange property of coalescing messages<br />
<mst> but OTOH it's the most used protocol<br />
<mst> and it has hard requirements e.g. on the ordering of packets<br />
<jasonwang> yes, TCP must be tested<br />
<mst> UDP only works well up to mtu packet size<br />
<mst> but otherwise it lets us do pretty low level stuff<br />
<jasonwang> yes, agree<br />
<mst> ICMP is very low level (good), has a disadvantage that it might be special-cased in hardware and software (bad)<br />
<mst> what kind of traffic we care about? ideally a range of message sizes, and a range of loads<br />
<mst> (in terms of messages per second)<br />
<jasonwang> yes<br />
<mst> what do we want to measure?<br />
<jasonwang> bandwidth and latency<br />
<mst> I think this is not really it<br />
<mst> this is what tools like to give us<br />
<jasonwang> yes and maybe also the cpu usage<br />
<mst> if you think about it in terms of an application, it is always latency that you care about in the end<br />
<mst> e.g. I have this huge file what is the latency to send it over the network<br />
<mst> and for us also what is the cpu load, you are right<br />
<jasonwang> yes<br />
<mst> so for a given traffic, which we can approximate by setting message size (both ways) protocol and messages per second<br />
<mst> we want to know the latency and the cpu load<br />
<jasonwang> yes<br />
<mst> and we want the peak e.g. we want to know how high we can go in messages per second until latencies become unreasonable<br />
<mst> this last is a bit subjective<br />
<mst> but generally any system would gradually become less responsive with more load<br />
<mst> then at some point it just breaks<br />
<mst> cou load is a bit hard to define<br />
<mst> cpu<br />
<jasonwang> yes and it looks hard to do the measuring then<br />
<mst> I think in the end, what we care about is how many cpu cycles the host burns<br />
<jasonwang> yes, but how to measure that?<br />
<mst> well we have simple things like /proc/stat<br />
<jasonwang> understood and maybe perf can also help<br />
<mst> yes quite possibly<br />
<mst> in other words we'll need to measure this in parallel while test is running<br />
<mst> netperf can report local/remote CPU<br />
<mst> but I do not understand what it really means<br />
<mst> especially for a guest<br />
<jasonwang> yes, if we want to use netperf it's better to know how it does the calculation<br />
<mst> well it just looks at /proc/stat AFAIK<br />
<jasonwang> yes, I try to take a look at its source<br />
<mst> this is the default but it has other heuristics<br />
<mst> that can be configured at compile time<br />
<jasonwang> ok, understand<br />
<mst> ok and I think load divided by CPU is a useful metric<br />
<jasonwang> so the ideal result is to get how many cpu cycles vhost spends to send or receive a KB<br />
<mst> netperf can report service demand<br />
<mst> I do not understand what it is<br />
<jasonwang> From its manual it's how many us the cpu spends on a KB<br />
<mst> well the answer will be it depends :)<br />
<mst> also, we have packet loss<br />
<mst> I think at some level we only care about packets that were delivered<br />
<mst> so e.g. with UDP we only care about received messages<br />
<jasonwang> yes, the packet loss may have concerns with guest drivers<br />
<mst> with TCP if you look at messages, there's no loss<br />
<jasonwang> yes, TCP has flow control itself<br />
<mst> ok so let's see what tools we have<br />
<mst> the simplest is flood ping<br />
<jasonwang> yes, it's very simple and easy to use<br />
<mst> it gives you control over message size, packets per second, gets you back latency<br />
<mst> it is always bidirectional I think<br />
<mst> and we need to measure CPU ourselves<br />
<mst> that last seems to be true anyway<br />
<jasonwang> yes, maybe easier to understand and analyze than netperf<br />
<mst> packet loss when it occurs complicates things<br />
<mst> e.g. with 50% packet loss the real load is anywhere in between<br />
<jasonwang> yes<br />
<mst> that's the only problem: it's always bidirectional so tx/rx problems are hard to separate<br />
<jasonwang> yes, vhost is currently half-duplex<br />
<mst> I am also not sure it detects reordering<br />
<jasonwang> yes, it has sequence no.<br />
<jasonwang> but for ping, as you've said it's ICMP and was not the most of the cases<br />
<mst> ok, next we have netperf<br />
<mst> afaik it can do two things<br />
<mst> it can try sending as many packets as it can<br />
<jasonwang> yes<br />
<mst> or it can send a single one back and forth<br />
<mst> not a lot of data, but ok<br />
<jasonwang> yes<br />
<mst> and similar with UDP<br />
<mst> got to go have lunch<br />
<mst> So I will try and write all this up<br />
<mst> do you have any hardware for testing?<br />
<mst> if yes we'll add it too, I'll put up a wiki<br />
<mst> back in half an hour<br />
<jasonwang> yes, write all things up would help<br />
<jasonwang> go home now, please send me mail<br />
* jasonwang has quit (Quit: Leaving)<br />
<br />
* Loaded log from Wed Dec 15 15:07:24 2010</div>Msthttps://linux-kvm.org/index.php?title=NetworkingTodo&diff=3272NetworkingTodo2010-09-21T16:55:10Z<p>Mst: </p>
<hr />
<div>This page should cover all networking related activity in KVM,<br />
currently most info is related to virtio-net.<br />
<br />
Stabilization is highest priority currently.<br />
DOA test matrix (all combinations should work):<br />
vhost: test both on and off, obviously<br />
test: hotplug/unplug, vlan/mac filtering, netperf,<br />
file copy both ways: scp, NFS, NTFS<br />
guests: linux: release and debug kernels, windows<br />
conditions: plain run, run while under migration,<br />
vhost on/off migration<br />
networking setup: simple, qos with cgroups<br />
host configuration: host-guest, external-guest<br />
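For illustration, a sketch of what "all combinations" amounts to, e.g. as input to autotest or a manual checklist (the option lists simply mirror the matrix above):<br />
<pre>
#!/usr/bin/env python
# Sketch: enumerate the DOA test matrix above; each tuple is one configuration
# that should pass.
from itertools import product

vhost      = ["vhost on", "vhost off"]
tests      = ["hotplug/unplug", "vlan/mac filtering", "netperf",
              "file copy scp", "file copy NFS", "file copy NTFS"]
guests     = ["linux release", "linux debug", "windows"]
conditions = ["plain run", "under migration", "vhost on/off migration"]
networking = ["simple", "qos with cgroups"]
hosts      = ["host-guest", "external-guest"]

matrix = list(product(vhost, tests, guests, conditions, networking, hosts))
print("%d combinations to cover" % len(matrix))
for combo in matrix[:3]:
    print(combo)
</pre>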
<br />
=== vhost-net driver projects ===<br />
* iovec length limitations<br />
Developer: Jason Wang <jasowang@redhat.com><br />
Testing: guest to host file transfer on windows.<br />
<br />
* mergeable buffers: fix host->guest BW regression<br />
Testing: netperf host to guest default flags<br />
<br />
* scalability tuning: threading for guest to guest<br />
Developer: MST<br />
Testing: netperf guest to guest<br />
<br />
=== qemu projects ===<br />
* fix hotplug issues<br />
Developer: MST<br />
https://bugzilla.redhat.com/show_bug.cgi?id=623735<br />
<br />
* migration with multiple macs/vlans<br />
qemu only sends ping with the first mac/no vlan:<br />
need to send it for all macs/vlan<br />
<br />
* bugfix: crash with illegal fd= value on command line<br />
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=581750<br />
<br />
=== virtio projects ===<br />
* suspend/resume support<br />
<br />
* API extension: improve small packet/large buffer performance:<br />
support "reposting" buffers for mergeable buffers,<br />
support pool for indirect buffers<br />
* ring redesign:<br />
find a way to test raw ring performance <br />
fix cacheline bounces <br />
reduce interrupts<br />
Developer: MST<br />
see patchset: virtio: put last seen used index into ring itself<br />
<br />
=== projects involving other kernel components and/or networking stack ===<br />
* guest programmable mac/vlan filtering with macvtap<br />
<br />
* bridge without promisc mode in NIC<br />
given hardware support, teach bridge<br />
to program mac/vlan filtering in NIC<br />
<br />
* rx mac filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
we have a small table of addresses, need to make it larger<br />
if we only need filtering for unicast (multicast is handled by IGMP filtering)<br />
<br />
* vlan filtering in tun<br />
the need for this is still not understood as we have filtering in bridge<br />
for a small # of vlans we can use BPF<br />
<br />
* vlan filtering in bridge<br />
IGMP snooping in bridge should take vlans into account<br />
<br />
* zero copy tx/rx for macvtap<br />
Developers: tx zero copy Shirley Ma; rx zero copy Xin Xiaohui<br />
<br />
* multiqueue (involves all of vhost, qemu, virtio, networking stack)<br />
Developer: Krishna Jumar<br />
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=632751<br />
<br />
* kvm MSI interrupt injection fast path<br />
Developer: MST<br />
<br />
* kvm eventfd support for injecting level interrupts<br />
<br />
* DMA engine (IOAT) use in tun<br />
<br />
* allow handling short packets from softirq context<br />
Testing: netperf TCP STREAM guest to host<br />
netperf TCP RR<br />
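A possible way to drive those two netperf tests from a script, assuming netserver is already running on the target (the address and 60s run length are placeholders):<br />
<pre>
#!/usr/bin/env python
# Sketch: run the TCP_STREAM and TCP_RR netperf tests named above against a
# target running netserver, and print the raw netperf output.
import subprocess

TARGET = "192.168.122.10"   # placeholder guest/host address

def run_netperf(test, length=60):
    cmd = ["netperf", "-H", TARGET, "-t", test, "-l", str(length)]
    out = subprocess.check_output(cmd).decode()
    print("== %s ==" % test)
    print(out)

if __name__ == "__main__":
    for test in ("TCP_STREAM", "TCP_RR"):
        run_netperf(test)
</pre>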
<br />
* irq affinity:<br />
networking goes much faster with irq pinning:<br />
both with and without numa.<br />
what can be done to make the non-pinned setup go faster?<br />
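For reference, pinning an irq comes down to writing a cpu mask into /proc/irq/<n>/smp_affinity; a minimal sketch (the irq and cpu numbers are example arguments, it needs root, and irqbalance may rewrite the mask):<br />
<pre>
#!/usr/bin/env python
# Sketch: pin one irq to one cpu by writing a hex cpu mask to
# /proc/irq/<n>/smp_affinity. Run as root; irqbalance may override this later.
import sys

def pin_irq(irq, cpu):
    mask = 1 << cpu                      # one bit per cpu
    path = "/proc/irq/%d/smp_affinity" % irq
    with open(path, "w") as f:
        f.write("%x\n" % mask)
    print("irq %d pinned to cpu %d (mask %x)" % (irq, cpu, mask))

if __name__ == "__main__":
    irq, cpu = int(sys.argv[1]), int(sys.argv[2])
    pin_irq(irq, cpu)
</pre>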
<br />
=== testing projects ===<br />
* Cover test matrix with autotest<br />
* Test with windows drivers, pass WHQL<br />
<br />
=== non-virtio-net devices ===<br />
* e1000: stabilize<br />
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=602205<br />
<br />
=== bugzilla entries for bugs fixed ===<br />
* verify these are ok upstream<br />
https://bugzilla.redhat.com/show_bug.cgi?id=623552<br />
https://bugzilla.redhat.com/show_bug.cgi?id=632747<br />
https://bugzilla.redhat.com/show_bug.cgi?id=632745<br />
<br />
<br />
=== abandoned projects: ===<br />
* Add GSO/checksum offload support to AF_PACKET(raw) sockets.<br />
status: incomplete<br />
* guest kernel 2.6.31 seems to work well. Under certain workloads,<br />
virtio performance has regressed with guest kernels 2.6.32 and up<br />
(but still better than userspace). A patch has been posted:<br />
http://www.spinics.net/lists/netdev/msg115292.html<br />
status: might be fixed, need to test</div>Mst