Discussion:
[libvirt-users] NUMA issues on virtualized hosts
Lukas Hejtmanek
2018-09-14 12:06:26 UTC
Permalink
Hello,

I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, set up in the
performance-oriented 8-NUMA configuration:

This is from the hypervisor:
[***@hde10 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 1800.000
CPU max MHz: 2400.0000
CPU min MHz: 1200.0000
BogoMIPS: 4800.05
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-3,32-35
NUMA node1 CPU(s): 4-7,36-39
NUMA node2 CPU(s): 8-11,40-43
NUMA node3 CPU(s): 12-15,44-47
NUMA node4 CPU(s): 16-19,48-51
NUMA node5 CPU(s): 20-23,52-55
NUMA node6 CPU(s): 24-27,56-59
NUMA node7 CPU(s): 28-31,60-63

I'm running one big virtual machine on this hypervisor - almost the whole memory plus all
physical CPUs.

This is what I'm seeing inside:

***@zenon10:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 8
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15
NUMA node4 CPU(s): 16-19
NUMA node5 CPU(s): 20-23
NUMA node6 CPU(s): 24-27
NUMA node7 CPU(s): 28-31

This is the virtual node configuration (I tried different numatune settings but
the result was still the same):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
<name>one-55782</name>
<vcpu><![CDATA[32]]></vcpu>
<cputune>
<shares>32768</shares>
</cputune>
<memory>507904000</memory>
<os>
<type arch='x86_64'>hvm</type>
</os>
<devices>
<emulator><![CDATA[/usr/bin/kvm]]></emulator>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
<target dev='vda'/>
<driver name='qemu' type='qcow2' cache='unsafe'/>
</disk>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
<target dev='vdc'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
<target dev='vdd'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
<target dev='vde'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>
<disk type='file' device='cdrom'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
<target dev='vdb'/>
<readonly/>
<driver name='qemu' type='raw'/>
</disk>
<interface type='bridge'>
<source bridge='br0'/>
<mac address='02:00:93:fb:3b:78'/>
<target dev='one-55782-0'/>
<model type='virtio'/>
<filterref filter='no-arp-mac-spoofing'>
<parameter name='IP' value='147.251.59.120'/>
</filterref>
</interface>
</devices>
<features>
<pae/>
<acpi/>
</features>
<!-- RAW data follows: -->
<cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
<cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='2' /><vcpupin vcpu='2' cpuset='4' /><vcpupin vcpu='3' cpuset='6' /><vcpupin vcpu='4' cpuset='8' /><vcpupin vcpu='5' cpuset='10' /><vcpupin vcpu='6' cpuset='12' /><vcpupin vcpu='7' cpuset='14' /><vcpupin vcpu='8' cpuset='16' /><vcpupin vcpu='9' cpuset='18' /><vcpupin vcpu='10' cpuset='20' /><vcpupin vcpu='11' cpuset='22' /><vcpupin vcpu='12' cpuset='24' /><vcpupin vcpu='13' cpuset='26' /><vcpupin vcpu='14' cpuset='28' /><vcpupin vcpu='15' cpuset='30' /><vcpupin vcpu='16' cpuset='1' /><vcpupin vcpu='17' cpuset='3' /><vcpupin vcpu='18' cpuset='5' /><vcpupin vcpu='19' cpuset='7' /><vcpupin vcpu='20' cpuset='9' /><vcpupin vcpu='21' cpuset='11' /><vcpupin vcpu='22' cpuset='13' /><vcpupin vcpu='23' cpuset='15' /><vcpupin vcpu='24' cpuset='17' /><vcpupin vcpu='25' cpuset='19' /><vcpupin vcpu='26' cpuset='21' /><vcpupin vcpu='27' cpuset='23' /><vcpupin vcpu='28' cpuset='25' /><vcpupin vcpu='29' cpuset='27' /><vcpupin vcpu='30' cpuset='29' /><vcpupin vcpu='31' cpuset='31' /></cputune>
<numatune><memory mode='preferred' nodeset='0'/></numatune>
<devices><serial type='pty'><target port='0'/></serial><console type='pty'><target type='serial' port='0'/></console><channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel></devices>
<devices><hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source></hostdev></devices>

<devices><controller type='pci' index='1' model='pci-bridge'/><controller type='pci' index='2' model='pci-bridge'/><controller type='pci' index='3' model='pci-bridge'/><controller type='pci' index='4' model='pci-bridge'/><controller type='pci' index='5' model='pci-bridge'/></devices>
<metadata>
<system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]> </system_datastore>
</metadata>
</domain>

If I run, e.g., SPEC CPU 2017 in the virtual machine, I can see:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1350 root 20 0 843136 830068 2524 R 78.1 0.2 513:16.16 bwaves_r_base.m
2456 root 20 0 804608 791264 2524 R 76.6 0.2 491:39.92 bwaves_r_base.m
4631 root 20 0 843136 829892 2344 R 75.8 0.2 450:16.04 bwaves_r_base.m
6441 root 20 0 802580 790212 2532 R 75.0 0.2 120:37.54 bwaves_r_base.m
7991 root 20 0 784676 772092 2576 R 75.0 0.2 387:15.39 bwaves_r_base.m
8142 root 20 0 843136 830044 2496 R 75.0 0.2 384:39.02 bwaves_r_base.m
8234 root 20 0 843136 830064 2524 R 75.0 0.2 99:04.48 bwaves_r_base.m
8578 root 20 0 749240 736604 2468 R 73.4 0.2 375:45.66 bwaves_r_base.m
9974 root 20 0 784676 771984 2468 R 73.4 0.2 348:01.36 bwaves_r_base.m
10396 root 20 0 802580 790264 2576 R 73.4 0.2 340:08.40 bwaves_r_base.m
12932 root 20 0 843136 830024 2480 R 73.4 0.2 288:39.76 bwaves_r_base.m
13113 root 20 0 784676 771864 2348 R 71.9 0.2 284:47.34 bwaves_r_base.m
13518 root 20 0 784676 762816 2540 R 71.9 0.2 276:31.58 bwaves_r_base.m
14443 root 20 0 784676 771984 2468 R 71.9 0.2 260:01.82 bwaves_r_base.m
12791 root 20 0 784676 772060 2544 R 70.3 0.2 291:43.96 bwaves_r_base.m
10544 root 20 0 843136 830068 2520 R 68.8 0.2 336:47.43 bwaves_r_base.m
15464 root 20 0 784676 762880 2608 R 60.9 0.2 239:19.14 bwaves_r_base.m
15487 root 20 0 784676 772048 2532 R 60.2 0.2 238:37.07 bwaves_r_base.m
16824 root 20 0 784676 772120 2604 R 55.5 0.2 212:10.92 bwaves_r_base.m
17255 root 20 0 843136 830012 2468 R 54.7 0.2 203:22.89 bwaves_r_base.m
17962 root 20 0 784676 772004 2488 R 54.7 0.2 188:26.07 bwaves_r_base.m
17505 root 20 0 843136 830068 2520 R 53.1 0.2 198:04.25 bwaves_r_base.m
27767 root 20 0 784676 771860 2344 R 52.3 0.2 592:25.95 bwaves_r_base.m
24458 root 20 0 843136 829888 2344 R 50.8 0.2 658:23.70 bwaves_r_base.m
30746 root 20 0 747376 735160 2604 R 43.0 0.2 556:47.67 bwaves_r_base.m

The CPU time (TIME+) should be roughly the same for all processes, but huge differences are obvious.

This is what I see on the hypervisor:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18201 oneadmin 20 0 474.0g 473.3g 1732 S 2459 94.0 33332:54 kvm
369 root 20 0 0 0 0 R 100.0 0.0 768:12.85 kswapd1
368 root 20 0 0 0 0 R 94.1 0.0 869:05.61 kswapd0

i.e., kswapd is eating a whole CPU even though swap is turned off.

[***@hde10 ~]# free
total used free shared buff/cache available
Mem: 528151432 503432580 1214048 34740 23504804 21907800
Swap: 0 0 0
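
Side note: since swap is off, the kswapd activity is presumably per-NUMA-node page
cache reclaim rather than swapping. A per-node breakdown of host memory (MemFree,
FilePages and friends) could be obtained with numastat -m from the numactl package,
e.g.:

numastat -m

I have not included its output here.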

Hypervisor is
[***@hde10 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

qemu-kvm-1.5.3-156.el7_5.5.x86_64

The virtual machine is Debian 9.


Moreover, I'm using this type of disk for the virtual machines:
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
<target dev='vde'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>

If I keep cache='unsafe' and run an iozone test on really big files (e.g.,
8x 100 GB), I can see huge cache pressure on the hypervisor - all 8 kswapd threads are
running at 100 % and slowing things down. The disk under the datastore is an
NVMe SSD (Intel 4500).

If I set cache='none', the kswapds are idle and disk writes are pretty fast;
however, with the 8-NUMA configuration, writes slow down to less than 10 MB/s as
soon as the amount of written data is roughly the same as the memory size of the virtual
node. iozone then runs at 100 % CPU and it seems that it is traversing page
lists. If I do the same with the 1-NUMA configuration, everything is OK except for a
performance penalty of about 25 %.
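
For reference, the cache='none' variant of that scratch disk stanza is the same
definition with the cache attribute changed (io='native' is added here only as a
commonly suggested companion for raw files with cache='none'; I have not measured it
separately):

<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
<target dev='vde'/>
<!-- io='native' shown as an illustration only -->
<driver name='qemu' type='raw' cache='none' io='native'/>
</disk>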
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
Lukas Hejtmanek
2018-09-14 13:36:59 UTC
Permalink
Hello,

OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue
with iozone remains the same.

The SPEC benchmark is running; however, it runs slower than in the 1-NUMA case.

The corrected XML looks as follows:
<cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
<cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='1' /><vcpupin vcpu='2' cpuset='2' /><vcpupin vcpu='3' cpuset='3' /><vcpupin vcpu='4' cpuset='4' /><vcpupin vcpu='5' cpuset='5' /><vcpupin vcpu='6' cpuset='6' /><vcpupin vcpu='7' cpuset='7' /><vcpupin vcpu='8' cpuset='8' /><vcpupin vcpu='9' cpuset='9' /><vcpupin vcpu='10' cpuset='10' /><vcpupin vcpu='11' cpuset='11' /><vcpupin vcpu='12' cpuset='12' /><vcpupin vcpu='13' cpuset='13' /><vcpupin vcpu='14' cpuset='14' /><vcpupin vcpu='15' cpuset='15' /><vcpupin vcpu='16' cpuset='16' /><vcpupin vcpu='17' cpuset='17' /><vcpupin vcpu='18' cpuset='18' /><vcpupin vcpu='19' cpuset='19' /><vcpupin vcpu='20' cpuset='20' /><vcpupin vcpu='21' cpuset='21' /><vcpupin vcpu='22' cpuset='22' /><vcpupin vcpu='23' cpuset='23' /><vcpupin vcpu='24' cpuset='24' /><vcpupin vcpu='25' cpuset='25' /><vcpupin vcpu='26' cpuset='26' /><vcpupin vcpu='27' cpuset='27' /><vcpupin vcpu='28' cpuset='28' /><vcpupin vcpu='29' cpuset='29' /><vcpupin vcpu='30' cpuset='30' /><vcpupin vcpu='31' cpuset='31' /></cputune>
<numatune><memory mode='strict' nodeset='0-7'/></numatune>

In this case, the first part took more than 1700 seconds; the 1-NUMA config
finishes it in 1646 seconds.

On the hypervisor itself, the 1-NUMA config finishes in 1470 seconds and the
8-NUMA config finishes in 900 seconds.
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
Lukas Hejtmanek
2018-09-14 13:40:56 UTC
Permalink
Hello again,

When the iozone writes slow down, this is what slabtop looks like:
62476752 62476728 0% 0.10K 1601968 39 6407872K buffer_head
1000678 999168 0% 0.56K 142954 7 571816K radix_tree_node
132184 125911 0% 0.03K 1066 124 4264K kmalloc-32
118496 118224 0% 0.12K 3703 32 14812K kmalloc-node
73206 56467 0% 0.19K 3486 21 13944K dentry
34816 33247 0% 0.12K 1024 34 4096K kernfs_node_cache
34496 29031 0% 0.06K 539 64 2156K kmalloc-64
23283 22707 0% 1.05K 7761 3 31044K ext4_inode_cache
16940 16052 0% 0.57K 2420 7 9680K inode_cache
14464 4124 0% 0.06K 226 64 904K anon_vma_chain
11900 11841 0% 0.14K 425 28 1700K ext4_groupinfo_4k
11312 9861 0% 0.50K 1414 8 5656K kmalloc-512
10692 10066 0% 0.04K 108 99 432K ext4_extent_status
10688 4238 0% 0.25K 668 16 2672K kmalloc-256
8120 2420 0% 0.07K 145 56 580K anon_vma
8040 4563 0% 0.20K 402 20 1608K vm_area_struct
7488 3845 0% 0.12K 234 32 936K kmalloc-96
7456 7061 0% 1.00K 1864 4 7456K kmalloc-1024
7234 7227 0% 4.00K 7234 1 28936K kmalloc-4096


and this is /proc/$PID/stack of the iozone process eating CPU but not writing any data:

[<ffffffffba78151b>] find_get_entry+0x1b/0x100
[<ffffffffba781de0>] pagecache_get_page+0x30/0x2a0
[<ffffffffc06ec12b>] ext4_da_get_block_prep+0x27b/0x440 [ext4]
[<ffffffffba840d8b>] __find_get_block_slow+0x3b/0x150
[<ffffffffba840ebd>] unmap_underlying_metadata+0x1d/0x70
[<ffffffffc06ec960>] ext4_block_write_begin+0x2e0/0x520 [ext4]
[<ffffffffc06ebeb0>] ext4_inode_attach_jinode.part.72+0xa0/0xa0 [ext4]
[<ffffffffc041f9f9>] jbd2__journal_start+0xd9/0x1e0 [jbd2]
[<ffffffffba80511a>] __check_object_size+0xfa/0x1d8
[<ffffffffba946b85>] iov_iter_copy_from_user_atomic+0xa5/0x330
[<ffffffffba780dcb>] generic_perform_write+0xfb/0x1d0
[<ffffffffba7831ca>] __generic_file_write_iter+0x16a/0x1b0
[<ffffffffc06e7220>] ext4_file_write_iter+0x90/0x370 [ext4]
[<ffffffffc06e7190>] ext4_dax_fault+0x140/0x140 [ext4]
[<ffffffffba6aef01>] update_curr+0xe1/0x160
[<ffffffffba808890>] new_sync_write+0xe0/0x130
[<ffffffffba809010>] vfs_write+0xb0/0x190
[<ffffffffba80a452>] SyS_write+0x52/0xc0
[<ffffffffba603b7d>] do_syscall_64+0x8d/0xf0
[<ffffffffbac15c4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[<ffffffffffffffff>] 0xffffffffffffffff
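
As a crude diagnostic (an idea only, not a fix), one could drop the guest page cache
while iozone is stalled and watch whether the buffer_head count in slabtop collapses
and the writes resume:

sync; echo 3 > /proc/sys/vm/drop_caches   # frees clean page cache and reclaimable slab caches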
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
Lukas Hejtmanek
2018-09-17 07:02:14 UTC
Permalink
Hello,

I did some performance measurements with SPEC CPU 2017 in the fp rate variant
(i.e., utilizing all CPU cores). The results look like this:

8-NUMA hypervisor specfp2017 - 124
1-NUMA hypervisor specfp2017 - 103
2-NUMA hypervisor specfp2017 - 120

8-NUMA virtual (on 8-NUMA hypervisor) specfp2017 - 92
1-NUMA virtual (on 1-NUMA hypervisor) specfp2017 - 95.2
2-NUMA virtual (on 2-NUMA hypervisor) specfp2017 - 98 (memory strict)
2-NUMA virtual (on 2-NUMA hypervisor) specfp2017 - 98.1 (memory interleave)
2x 1-NUMA virtual (on 2-NUMA hypervisor) specfp2017 - 117.2 (sum of both)
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
Michal Privoznik
2018-09-17 13:08:34 UTC
Permalink
Post by Lukas Hejtmanek
Hello,
OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue
with iozone remains the same.
The SPEC benchmark is running; however, it runs slower than in the 1-NUMA case.
[Reformatted XML for better reading]

<cpu mode="host-passthrough">
<topology sockets="8" cores="4" threads="1"/>
<numa>
<cell cpus="0-3" memory="62000000"/>
<cell cpus="4-7" memory="62000000"/>
<cell cpus="8-11" memory="62000000"/>
<cell cpus="12-15" memory="62000000"/>
<cell cpus="16-19" memory="62000000"/>
<cell cpus="20-23" memory="62000000"/>
<cell cpus="24-27" memory="62000000"/>
<cell cpus="28-31" memory="62000000"/>
</numa>
</cpu>
<cputune>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="2"/>
<vcpupin vcpu="3" cpuset="3"/>
<vcpupin vcpu="4" cpuset="4"/>
<vcpupin vcpu="5" cpuset="5"/>
<vcpupin vcpu="6" cpuset="6"/>
<vcpupin vcpu="7" cpuset="7"/>
<vcpupin vcpu="8" cpuset="8"/>
<vcpupin vcpu="9" cpuset="9"/>
<vcpupin vcpu="10" cpuset="10"/>
<vcpupin vcpu="11" cpuset="11"/>
<vcpupin vcpu="12" cpuset="12"/>
<vcpupin vcpu="13" cpuset="13"/>
<vcpupin vcpu="14" cpuset="14"/>
<vcpupin vcpu="15" cpuset="15"/>
<vcpupin vcpu="16" cpuset="16"/>
<vcpupin vcpu="17" cpuset="17"/>
<vcpupin vcpu="18" cpuset="18"/>
<vcpupin vcpu="19" cpuset="19"/>
<vcpupin vcpu="20" cpuset="20"/>
<vcpupin vcpu="21" cpuset="21"/>
<vcpupin vcpu="22" cpuset="22"/>
<vcpupin vcpu="23" cpuset="23"/>
<vcpupin vcpu="24" cpuset="24"/>
<vcpupin vcpu="25" cpuset="25"/>
<vcpupin vcpu="26" cpuset="26"/>
<vcpupin vcpu="27" cpuset="27"/>
<vcpupin vcpu="28" cpuset="28"/>
<vcpupin vcpu="29" cpuset="29"/>
<vcpupin vcpu="30" cpuset="30"/>
<vcpupin vcpu="31" cpuset="31"/>
</cputune>
<numatune>
<memory mode="strict" nodeset="0-7"/>
</numatune>


However, this is not enough. This XML pins only the vCPUs and not the guest
memory. So while, say, vCPU #0 is pinned onto physical CPU #0, the memory
for guest NUMA node #0 might be allocated on host NUMA node #7 (for instance). You
need to add:

<numatune>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
...
</numatune>

This will also ensure guest memory pinning. But wait, there is more.
In your later e-mails you mention slow disk I/O. This might be caused by
various factors, but the most obvious one in this case is the QEMU I/O
loop, I'd say. Without iothreads, QEMU has only one I/O loop, so if
your guest issues writes from all 32 cores at once, this loop is unable
to handle them (performance-wise), hence the performance drop. You
can try enabling iothreads:

https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation

This is a QEMU feature that allows you to create more I/O threads and
also pin them. This is an example of how to use them:

https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothreads-disk.xml;h=0aa32c392300c0a86ad26185292ebc7a0d85d588;hb=HEAD

And this is an example of how to pin them:

https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputune-iothreads.xml;h=311a1d3604177d9699edf7132a75f387aa57ad6f;hb=HEAD

Also, since iothreads are capable of handling any I/O, they can be
used for other devices too, not only disks - for instance, interfaces.
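
Roughly, the relevant bits of the domain XML could look like this (just a sketch;
the iothread count, ids and cpusets below are placeholders to be tuned):

<iothreads>2</iothreads>
<cputune>
<!-- placeholder cpusets; pick host CPUs close to the disk's NUMA node -->
<iothreadpin iothread='1' cpuset='0-3'/>
<iothreadpin iothread='2' cpuset='4-7'/>
</cputune>
<devices>
<disk type='file' device='disk'>
<driver name='qemu' type='raw' cache='none' iothread='1'/>
...
</disk>
</devices>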

Hopefully, this will boost your performance.

Regards,
Michal (who is a bit envious of your machine :-P)
Lukas Hejtmanek
2018-09-17 14:59:39 UTC
Permalink
Hello,

so this is the current domain configuration:
<cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
<cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='1' /><vcpupin vcpu='2' cpuset='2' /><vcpupin vcpu='3' cpuset='3' /><vcpupin vcpu='4' cpuset='4' /><vcpupin vcpu='5' cpuset='5' /><vcpupin vcpu='6' cpuset='6' /><vcpupin vcpu='7' cpuset='7' /><vcpupin vcpu='8' cpuset='8' /><vcpupin vcpu='9' cpuset='9' /><vcpupin vcpu='10' cpuset='10' /><vcpupin vcpu='11' cpuset='11' /><vcpupin vcpu='12' cpuset='12' /><vcpupin vcpu='13' cpuset='13' /><vcpupin vcpu='14' cpuset='14' /><vcpupin vcpu='15' cpuset='15' /><vcpupin vcpu='16' cpuset='16' /><vcpupin vcpu='17' cpuset='17' /><vcpupin vcpu='18' cpuset='18' /><vcpupin vcpu='19' cpuset='19' /><vcpupin vcpu='20' cpuset='20' /><vcpupin vcpu='21' cpuset='21' /><vcpupin vcpu='22' cpuset='22' /><vcpupin vcpu='23' cpuset='23' /><vcpupin vcpu='24' cpuset='24' /><vcpupin vcpu='25' cpuset='25' /><vcpupin vcpu='26' cpuset='26' /><vcpupin vcpu='27' cpuset='27' /><vcpupin vcpu='28' cpuset='28' /><vcpupin vcpu='29' cpuset='29' /><vcpupin vcpu='30' cpuset='30' /><vcpupin vcpu='31' cpuset='31' /></cputune>
<numatune>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
<memnode cellid="2" mode="strict" nodeset="2"/>
<memnode cellid="3" mode="strict" nodeset="3"/>
<memnode cellid="4" mode="strict" nodeset="4"/>
<memnode cellid="5" mode="strict" nodeset="5"/>
<memnode cellid="6" mode="strict" nodeset="6"/>
<memnode cellid="7" mode="strict" nodeset="7"/>
</numatune>

Hopefully, I got it right.

The good news is that the SPEC benchmark looks promising. The first test, bwaves,
finished in 1003 seconds compared to 1700 seconds in the previous wrong case.
So far so good.

The bad news is that iozone is still the same. There might be some
misunderstanding.

I have two cases:

1) cache=unsafe. In this case, I can see that the hypervisor is prone to swapping -
swapping a lot. It usually eats the whole swap partition and kswapd runs at 100 %
CPU. swappiness, dirty_ratio and company do not improve things at all.
However, I believe this is simply the wrong option for scratch disks where one can
expect huge I/O load. Moreover, the hypervisor is a poor machine with only a little
memory left (OK, in my case about 10 GB available), so it does not make sense
to use that memory for additional cache/disk buffers.
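
For the record, these are the kind of knobs I mean (the exact values here are just
illustrative; none of them changed the behaviour noticeably):

sysctl -w vm.swappiness=10              # value shown only as an example
sysctl -w vm.dirty_ratio=10             # value shown only as an example
sysctl -w vm.dirty_background_ratio=5   # value shown only as an example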

2) cache=none. In this case, performance is better (only a few percent behind
bare metal). However, as soon as the amount of stored data is about the size of the
virtual machine's memory, writes stop and iozone eats a whole CPU; it looks like
it is searching for more free pages and that gets harder and harder. But I'm not sure,
I am not skilled in this area.
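
While the writes are stalled, per-node free memory and fragmentation inside the
guest could be watched with standard tools, e.g.:

watch -n1 'numastat -m; cat /proc/buddyinfo'

to see whether one guest node runs out of free pages first; I have not captured that
output here.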

Here you can clearly see that it starts writing, does the writes, then
takes a pause, writes again, and so on, but the pauses get longer and longer:
https://pastebin.com/2gfPFgb9
The output runs until the very end of iozone (I cancelled it with Ctrl-C).

It seems that this is not happening on a 2-NUMA node with rotational disks only.
It is partly happening on a 2-NUMA node with 2 NVMe SSDs - partly meaning that
there are also pauses in the writes but it finishes, although the speed is reduced. On
a 1-NUMA node, with the same test, I can see steady writes from the very
beginning to the very end at roughly the same speed.

Maybe it could be related to the fact that the NVMe is a PCI device that is linked
to one NUMA node only?
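
(Which node a given NVMe controller is attached to can be checked on the hypervisor
with something like:

cat /sys/class/nvme/nvme0/device/numa_node

where nvme0 is just an example controller name.)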


As for iothreads, I have only one disk (vde) that is exposed to high I/O
load, so I believe more I/O threads are not applicable here. If I understand
correctly, I cannot assign more than one iothread to a single device. And it does not
seem to be iothread-related, as the same scenario in the 1-NUMA configuration works
OK (I mean that the memory penalties can be huge because it does not reflect the real NUMA
topology, but the disk speed is OK anyway).


And as for that machine, what about this one? :)

[***@urga1 ~]$ free -g
total used free shared buff/cache available
Mem: 5857 75 5746 0 35 5768
...
NUMA node47 CPU(s): 376-383

this is not virtualized, though :)
Post by Michal Privoznik
Post by Lukas Hejtmanek
Hello,
ok, I found that cpu pinning was wrong, so I corrected it to be 1:1. The issue
with iozone remains the same.
The spec is running, however, it runs slower than 1-NUMA case.
[Reformated XML for better reading]
<cpu mode="host-passthrough">
<topology sockets="8" cores="4" threads="1"/>
<numa>
<cell cpus="0-3" memory="62000000"/>
<cell cpus="4-7" memory="62000000"/>
<cell cpus="8-11" memory="62000000"/>
<cell cpus="12-15" memory="62000000"/>
<cell cpus="16-19" memory="62000000"/>
<cell cpus="20-23" memory="62000000"/>
<cell cpus="24-27" memory="62000000"/>
<cell cpus="28-31" memory="62000000"/>
</numa>
</cpu>
<cputune>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="2"/>
<vcpupin vcpu="3" cpuset="3"/>
<vcpupin vcpu="4" cpuset="4"/>
<vcpupin vcpu="5" cpuset="5"/>
<vcpupin vcpu="6" cpuset="6"/>
<vcpupin vcpu="7" cpuset="7"/>
<vcpupin vcpu="8" cpuset="8"/>
<vcpupin vcpu="9" cpuset="9"/>
<vcpupin vcpu="10" cpuset="10"/>
<vcpupin vcpu="11" cpuset="11"/>
<vcpupin vcpu="12" cpuset="12"/>
<vcpupin vcpu="13" cpuset="13"/>
<vcpupin vcpu="14" cpuset="14"/>
<vcpupin vcpu="15" cpuset="15"/>
<vcpupin vcpu="16" cpuset="16"/>
<vcpupin vcpu="17" cpuset="17"/>
<vcpupin vcpu="18" cpuset="18"/>
<vcpupin vcpu="19" cpuset="19"/>
<vcpupin vcpu="20" cpuset="20"/>
<vcpupin vcpu="21" cpuset="21"/>
<vcpupin vcpu="22" cpuset="22"/>
<vcpupin vcpu="23" cpuset="23"/>
<vcpupin vcpu="24" cpuset="24"/>
<vcpupin vcpu="25" cpuset="25"/>
<vcpupin vcpu="26" cpuset="26"/>
<vcpupin vcpu="27" cpuset="27"/>
<vcpupin vcpu="28" cpuset="28"/>
<vcpupin vcpu="29" cpuset="29"/>
<vcpupin vcpu="30" cpuset="30"/>
<vcpupin vcpu="31" cpuset="31"/>
</cputune>
<numatune>
<memory mode="strict" nodeset="0-7"/>
</numatune>
However, this is not enough. This XML pins only vCPUs and not guest
memory. So while, say, vCPU #0 is pinned onto physical CPU #0, the memory
for guest NUMA #0 might be allocated at host NUMA #7 (for instance). You need to add something like:
<numatune>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
...
</numatune>
This will also ensure the guest memory pinning. But wait, there is more.
In your later e-mails you mention slow disk I/O. This might be caused by
various things, but the most obvious one in this case is the qemu I/O
loop, I'd say. Without iothreads, qemu has only one I/O loop, and thus if
your guest issues writes from all 32 cores at once this loop is unable
to handle it (performance wise), and hence the performance drop. You can enable iothreads:
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
This is a qemu feature that allows you to create more I/O threads and assign devices to them, for example:
https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothreads-disk.xml;h=0aa32c392300c0a86ad26185292ebc7a0d85d588;hb=HEAD
https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputune-iothreads.xml;h=311a1d3604177d9699edf7132a75f387aa57ad6f;hb=HEAD
Also, since iothreads are capable of handling just about any I/O, they can be
used for other devices too, not only disks; interfaces, for instance.
Hopefully, this will boost your performance.
Regards,
Michal (who is a bit envious about your machine :-P)
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
Michal Privoznik
2018-09-18 07:50:44 UTC
Permalink
Post by Lukas Hejtmanek
Hello,
<cpu mode='host-passthrough'>
<topology sockets='8' cores='4' threads='1'/>
<numa>
<cell cpus='0-3' memory='62000000'/>
<cell cpus='4-7' memory='62000000'/>
<cell cpus='8-11' memory='62000000'/>
<cell cpus='12-15' memory='62000000'/>
<cell cpus='16-19' memory='62000000'/>
<cell cpus='20-23' memory='62000000'/>
<cell cpus='24-27' memory='62000000'/>
<cell cpus='28-31' memory='62000000'/>
</numa>
</cpu>
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
<vcpupin vcpu='8' cpuset='8'/>
<vcpupin vcpu='9' cpuset='9'/>
<vcpupin vcpu='10' cpuset='10'/>
<vcpupin vcpu='11' cpuset='11'/>
<vcpupin vcpu='12' cpuset='12'/>
<vcpupin vcpu='13' cpuset='13'/>
<vcpupin vcpu='14' cpuset='14'/>
<vcpupin vcpu='15' cpuset='15'/>
<vcpupin vcpu='16' cpuset='16'/>
<vcpupin vcpu='17' cpuset='17'/>
<vcpupin vcpu='18' cpuset='18'/>
<vcpupin vcpu='19' cpuset='19'/>
<vcpupin vcpu='20' cpuset='20'/>
<vcpupin vcpu='21' cpuset='21'/>
<vcpupin vcpu='22' cpuset='22'/>
<vcpupin vcpu='23' cpuset='23'/>
<vcpupin vcpu='24' cpuset='24'/>
<vcpupin vcpu='25' cpuset='25'/>
<vcpupin vcpu='26' cpuset='26'/>
<vcpupin vcpu='27' cpuset='27'/>
<vcpupin vcpu='28' cpuset='28'/>
<vcpupin vcpu='29' cpuset='29'/>
<vcpupin vcpu='30' cpuset='30'/>
<vcpupin vcpu='31' cpuset='31'/>
</cputune>
Post by Lukas Hejtmanek
<numatune>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
<memnode cellid="2" mode="strict" nodeset="2"/>
<memnode cellid="3" mode="strict" nodeset="3"/>
<memnode cellid="4" mode="strict" nodeset="4"/>
<memnode cellid="5" mode="strict" nodeset="5"/>
<memnode cellid="6" mode="strict" nodeset="6"/>
<memnode cellid="7" mode="strict" nodeset="7"/>
</numatune>
hopefully, I got it right.
Yes, looking good.
Post by Lukas Hejtmanek
The good news is that the spec benchmark looks promising. The first test, bwaves,
finished in 1003 seconds compared to 1700 seconds in the previous, wrong case.
So far so good.
Very well, this means that the config above is correct.
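One way to double-check the memory placement at runtime is to look at the per-node memory usage of the guest's qemu process on the host, for instance (assuming the numactl package is installed; adjust the pattern to whatever your qemu binary is called):

# per-NUMA-node memory usage of the guest's qemu process
numastat -p $(pgrep -f qemu | head -n 1)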
Post by Lukas Hejtmanek
The bad news is that iozone is still the same. There might be some
misunderstanding.
1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap.
Swap a lot. It usually eats the whole swap partition and kswapd runs at 100%
CPU. swappiness, dirty_ratio and company do not improve things at all.
However, I believe this is just the wrong option for scratch disks where one can
expect a huge I/O load. Moreover, the hypervisor is a poor machine with only little
memory left (OK, in my case about 10 GB available), so it does not make sense
to use that memory for additional cache/disk buffers.
One thing that just occurred to me - is the qcow2 file fully allocated?

# qemu-img info /var/lib/libvirt/images/fedora.qcow2
..
virtual size: 20G (21474836480 bytes)
disk size: 7.0G
..

This is NOT a fully allocated qcow2.
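A fully allocated image can be created (or an existing sparse one converted) roughly like this; file names and size are placeholders:

# new, fully allocated qcow2
qemu-img create -f qcow2 -o preallocation=full /var/lib/libvirt/images/scratch.qcow2 100G

# or make a fully allocated copy of an existing sparse image
qemu-img convert -O qcow2 -o preallocation=full sparse.qcow2 full.qcow2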
Post by Lukas Hejtmanek
2) cache=none. In this case, performance is better (only a few percent behind
bare metal). However, as soon as the size of the stored data approaches the size of
the virtual machine's memory, writes stop and iozone eats a whole CPU; it looks like
it is searching for more free pages and that gets harder and harder. But I am not sure,
I am not skilled in this area.
Hmm. Could it be that the SSD doesn't have enough free blocks and thus
writes are throttled? Can you fstrim it and see if that helps?
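For example (the mount point is just a placeholder for wherever the disk images live):

# discard unused blocks so the SSD gets its free blocks back
fstrim -v /path/to/datastore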
Post by Lukas Hejtmanek
here, you can clearly see that it starts writing, does the writes, then it
takes a pause, writes again, and so on, but the pauses get longer and longer..
https://pastebin.com/2gfPFgb9
The output runs until the very end of iozone (I cancelled it with Ctrl-C).
It seems that this is not happening on a 2-NUMA node with rotational disks only.
It is partly happening on a 2-NUMA node with 2 NVMe SSDs. By partly I mean that
there are also pauses in the writes, but it finishes; the speed is reduced, though. On
a 1-NUMA node, with the same test, I can see steady writes from the very
beginning to the very end at roughly the same speed.
Maybe it could be related to the fact that the NVMe is a PCI device that is linked
to one NUMA node only?
Could be. I don't know qemu internals well enough to know whether it's capable of
doing zero-copy disk writes.
Post by Lukas Hejtmanek
As for iothreads, I have only 1 disk (the vde) that is exposed to high I/O
load, so I believe more I/O threads are not applicable here. If I understand
correctly, I cannot assign more than one iothread to a single device.. And it does not
seem to be iothread-related, as the same scenario in the 1-NUMA configuration works
OK (I mean that the memory penalties can be huge as it does not reflect the real NUMA
topology, but the disk speed is OK anyway.)
Ah, since it's only one disk, iothreads will not help much here.
Still worth giving it a shot ;-) Remember, iothreads are for all I/O,
not disk I/O only.
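A minimal sketch of what that could look like in the domain XML (the number of iothreads, the host CPUs they are pinned to, and the choice of the vde disk are just examples):

<iothreads>2</iothreads>
<cputune>
  <!-- pin the iothreads to host CPUs that the vCPUs do not use; cpusets are examples -->
  <iothreadpin iothread='1' cpuset='32-35'/>
  <iothreadpin iothread='2' cpuset='36-39'/>
</cputune>
<disk type='file' device='disk'>
  <target dev='vde'/>
  <!-- iothread='1' hands this disk's I/O to the first iothread -->
  <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
</disk>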

Anyway, this is the point where I have to say "I don't know". Sorry. Try
contacting qemu guys:

qemu-***@nongnu.org
qemu-***@nongnu.org

Michal
Lukas Hejtmanek
2018-09-20 14:28:01 UTC
Permalink
Hello,

so the final working solution for the 8-NUMA node configuration is:

<cpu mode='host-passthrough'>
<topology sockets='8' cores='4' threads='1'/>
<numa>
<cell id='0' cpus='0-3' memory='62000000'>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='16'/>
<sibling id='2' value='16'/>
<sibling id='3' value='16'/>
<sibling id='4' value='32'/>
<sibling id='5' value='32'/>
<sibling id='6' value='32'/>
<sibling id='7' value='32'/>
</distances>
</cell>
<cell id='1' cpus='4-7' memory='62000000'>
<distances>
<sibling id='0' value='16'/>
<sibling id='1' value='10'/>
<sibling id='2' value='16'/>
<sibling id='3' value='16'/>
<sibling id='4' value='32'/>
<sibling id='5' value='32'/>
<sibling id='6' value='32'/>
<sibling id='7' value='32'/>
</distances>
</cell>
<cell id='2' cpus='8-11' memory='62000000'>
<distances>
<sibling id='0' value='16'/>
<sibling id='1' value='16'/>
<sibling id='2' value='10'/>
<sibling id='3' value='16'/>
<sibling id='4' value='32'/>
<sibling id='5' value='32'/>
<sibling id='6' value='32'/>
<sibling id='7' value='32'/>
</distances>
</cell>
<cell id='3' cpus='12-15' memory='62000000'>
<distances>
<sibling id='0' value='16'/>
<sibling id='1' value='16'/>
<sibling id='2' value='16'/>
<sibling id='3' value='10'/>
<sibling id='4' value='32'/>
<sibling id='5' value='32'/>
<sibling id='6' value='32'/>
<sibling id='7' value='32'/>
</distances>
</cell>
<cell id='4' cpus='16-19' memory='62000000'>
<distances>
<sibling id='0' value='32'/>
<sibling id='1' value='32'/>
<sibling id='2' value='32'/>
<sibling id='3' value='32'/>
<sibling id='4' value='10'/>
<sibling id='5' value='16'/>
<sibling id='6' value='16'/>
<sibling id='7' value='16'/>
</distances>
</cell>
<cell id='5' cpus='20-23' memory='62000000'>
<distances>
<sibling id='0' value='32'/>
<sibling id='1' value='32'/>
<sibling id='2' value='32'/>
<sibling id='3' value='32'/>
<sibling id='4' value='16'/>
<sibling id='5' value='10'/>
<sibling id='6' value='16'/>
<sibling id='7' value='16'/>
</distances>
</cell>
<cell id='6' cpus='24-27' memory='62000000'>
<distances>
<sibling id='0' value='32'/>
<sibling id='1' value='32'/>
<sibling id='2' value='32'/>
<sibling id='3' value='32'/>
<sibling id='4' value='16'/>
<sibling id='5' value='16'/>
<sibling id='6' value='10'/>
<sibling id='7' value='16'/>
</distances>
</cell>
<cell id='7' cpus='28-31' memory='62000000'>
<distances>
<sibling id='0' value='32'/>
<sibling id='1' value='32'/>
<sibling id='2' value='32'/>
<sibling id='3' value='32'/>
<sibling id='4' value='16'/>
<sibling id='5' value='16'/>
<sibling id='6' value='16'/>
<sibling id='7' value='10'/>
</distances>
</cell>
</numa>
</cpu>
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
<vcpupin vcpu='8' cpuset='8'/>
<vcpupin vcpu='9' cpuset='9'/>
<vcpupin vcpu='10' cpuset='10'/>
<vcpupin vcpu='11' cpuset='11'/>
<vcpupin vcpu='12' cpuset='12'/>
<vcpupin vcpu='13' cpuset='13'/>
<vcpupin vcpu='14' cpuset='14'/>
<vcpupin vcpu='15' cpuset='15'/>
<vcpupin vcpu='16' cpuset='16'/>
<vcpupin vcpu='17' cpuset='17'/>
<vcpupin vcpu='18' cpuset='18'/>
<vcpupin vcpu='19' cpuset='19'/>
<vcpupin vcpu='20' cpuset='20'/>
<vcpupin vcpu='21' cpuset='21'/>
<vcpupin vcpu='22' cpuset='22'/>
<vcpupin vcpu='23' cpuset='23'/>
<vcpupin vcpu='24' cpuset='24'/>
<vcpupin vcpu='25' cpuset='25'/>
<vcpupin vcpu='26' cpuset='26'/>
<vcpupin vcpu='27' cpuset='27'/>
<vcpupin vcpu='28' cpuset='28'/>
<vcpupin vcpu='29' cpuset='29'/>
<vcpupin vcpu='30' cpuset='30'/>
<vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memnode cellid='1' mode='strict' nodeset='1'/>
<memnode cellid='2' mode='strict' nodeset='2'/>
<memnode cellid='3' mode='strict' nodeset='3'/>
<memnode cellid='4' mode='strict' nodeset='4'/>
<memnode cellid='5' mode='strict' nodeset='5'/>
<memnode cellid='6' mode='strict' nodeset='6'/>
<memnode cellid='7' mode='strict' nodeset='7'/>
</numatune>

With this configuration, virtualized Debian 9 even slightly outperforms the same Debian
9 on the bare metal.
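The sibling distance values above just mirror what the host itself reports; on the hypervisor they can be read, for example, with:

# host NUMA topology including the node distance matrix
numactl --hardware

# or per node straight from sysfs
cat /sys/devices/system/node/node0/distance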

As for the iozone and cache=none case: it seems that the problem is with KVM, which
stalls iozone on its first run while not all memory pages are
populated yet, or something like that. The number of pages is not small on a 512 GB machine.
However, letting KVM populate the pages and running iozone again does not
bring any performance loss.
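If the first-run stall is a problem, the guest memory can be populated up front at domain start; a sketch, assuming a libvirt version that supports the allocation element under memoryBacking:

<memoryBacking>
  <!-- allocate all guest RAM immediately instead of faulting pages in on first touch -->
  <allocation mode='immediate'/>
</memoryBacking>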
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title