Lukas Hejtmanek
2018-09-14 12:06:26 UTC
Hello,
I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, set up in the
performance 8-NUMA configuration.
This is from the hypervisor:
[***@hde10 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 1800.000
CPU max MHz: 2400.0000
CPU min MHz: 1200.0000
BogoMIPS: 4800.05
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-3,32-35
NUMA node1 CPU(s): 4-7,36-39
NUMA node2 CPU(s): 8-11,40-43
NUMA node3 CPU(s): 12-15,44-47
NUMA node4 CPU(s): 16-19,48-51
NUMA node5 CPU(s): 20-23,52-55
NUMA node6 CPU(s): 24-27,56-59
NUMA node7 CPU(s): 28-31,60-63
I'm running one big virtual machine on this hypervisor, using almost the whole
memory and all physical CPUs.
This is what I see inside it:
***@zenon10:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 8
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15
NUMA node4 CPU(s): 16-19
NUMA node5 CPU(s): 20-23
NUMA node6 CPU(s): 24-27
NUMA node7 CPU(s): 28-31
This is the virtual machine configuration (I tried different numatune settings,
but the result was still the same):
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
<name>one-55782</name>
<vcpu><![CDATA[32]]></vcpu>
<cputune>
<shares>32768</shares>
</cputune>
<memory>507904000</memory>
<os>
<type arch='x86_64'>hvm</type>
</os>
<devices>
<emulator><![CDATA[/usr/bin/kvm]]></emulator>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
<target dev='vda'/>
<driver name='qemu' type='qcow2' cache='unsafe'/>
</disk>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
<target dev='vdc'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
<target dev='vdd'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
<target dev='vde'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>
<disk type='file' device='cdrom'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
<target dev='vdb'/>
<readonly/>
<driver name='qemu' type='raw'/>
</disk>
<interface type='bridge'>
<source bridge='br0'/>
<mac address='02:00:93:fb:3b:78'/>
<target dev='one-55782-0'/>
<model type='virtio'/>
<filterref filter='no-arp-mac-spoofing'>
<parameter name='IP' value='147.251.59.120'/>
</filterref>
</interface>
</devices>
<features>
<pae/>
<acpi/>
</features>
<!-- RAW data follows: -->
<cpu mode='host-passthrough'><topology sockets='8' cores='4' threads='1'/><numa><cell cpus='0-3' memory='62000000' /><cell cpus='4-7' memory='62000000' /><cell cpus='8-11' memory='62000000' /><cell cpus='12-15' memory='62000000' /><cell cpus='16-19' memory='62000000' /><cell cpus='20-23' memory='62000000' /><cell cpus='24-27' memory='62000000' /><cell cpus='28-31' memory='62000000' /></numa></cpu>
<cputune><vcpupin vcpu='0' cpuset='0' /><vcpupin vcpu='1' cpuset='2' /><vcpupin vcpu='2' cpuset='4' /><vcpupin vcpu='3' cpuset='6' /><vcpupin vcpu='4' cpuset='8' /><vcpupin vcpu='5' cpuset='10' /><vcpupin vcpu='6' cpuset='12' /><vcpupin vcpu='7' cpuset='14' /><vcpupin vcpu='8' cpuset='16' /><vcpupin vcpu='9' cpuset='18' /><vcpupin vcpu='10' cpuset='20' /><vcpupin vcpu='11' cpuset='22' /><vcpupin vcpu='12' cpuset='24' /><vcpupin vcpu='13' cpuset='26' /><vcpupin vcpu='14' cpuset='28' /><vcpupin vcpu='15' cpuset='30' /><vcpupin vcpu='16' cpuset='1' /><vcpupin vcpu='17' cpuset='3' /><vcpupin vcpu='18' cpuset='5' /><vcpupin vcpu='19' cpuset='7' /><vcpupin vcpu='20' cpuset='9' /><vcpupin vcpu='21' cpuset='11' /><vcpupin vcpu='22' cpuset='13' /><vcpupin vcpu='23' cpuset='15' /><vcpupin vcpu='24' cpuset='17' /><vcpupin vcpu='25' cpuset='19' /><vcpupin vcpu='26' cpuset='21' /><vcpupin vcpu='27' cpuset='23' /><vcpupin vcpu='28' cpuset='25' /><vcpupin vcpu='29' cpuset='27' /><vcpupin vcpu='30' cpuset='29' /><vcpupin vcpu='31' cpuset='31' /></cputune>
<numatune><memory mode='preferred' nodeset='0'/></numatune>
<devices><serial type='pty'><target port='0'/></serial><console type='pty'><target type='serial' port='0'/></console><channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel></devices>
<devices><hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source></hostdev></devices>
<devices><controller type='pci' index='1' model='pci-bridge'/><controller type='pci' index='2' model='pci-bridge'/><controller type='pci' index='3' model='pci-bridge'/><controller type='pci' index='4' model='pci-bridge'/><controller type='pci' index='5' model='pci-bridge'/></devices>
<metadata>
<system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]> </system_datastore>
</metadata>
</domain>
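As a cross-check of the pinning, here is a minimal sketch that maps each guest NUMA cell to the host NUMA nodes its vCPUs are pinned to. It assumes the host layout from the hypervisor lscpu output and the vcpupin and cell values exactly as transcribed from the XML above; the names (HOST_NODE_CPUS, guest_cell_host_nodes and so on) are made up for illustration and are not part of libvirt.

```python
# Host NUMA layout, copied from the hypervisor's lscpu output.
HOST_NODE_CPUS = {
    0: "0-3,32-35",
    1: "4-7,36-39",
    2: "8-11,40-43",
    3: "12-15,44-47",
    4: "16-19,48-51",
    5: "20-23,52-55",
    6: "24-27,56-59",
    7: "28-31,60-63",
}


def expand_cpulist(cpulist):
    """Expand a list such as '0-3,32-35' into a set of CPU numbers."""
    cpus = set()
    for part in cpulist.split(","):
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


# Reverse lookup: host CPU number -> host NUMA node.
HOST_CPU_TO_NODE = {
    cpu: node
    for node, cpulist in HOST_NODE_CPUS.items()
    for cpu in expand_cpulist(cpulist)
}

# <vcpupin> entries from the domain XML: vCPUs 0-15 are pinned to the even
# host CPUs 0..30, vCPUs 16-31 to the odd host CPUs 1..31.
VCPU_PIN = {vcpu: 2 * vcpu for vcpu in range(16)}
VCPU_PIN.update({vcpu: 2 * (vcpu - 16) + 1 for vcpu in range(16, 32)})

# Guest <numa> cells from the domain XML: cell i holds vCPUs 4i..4i+3.
GUEST_CELL_VCPUS = {cell: range(4 * cell, 4 * cell + 4) for cell in range(8)}


def guest_cell_host_nodes():
    """Return {guest cell: set of host NUMA nodes its vCPUs are pinned to}."""
    return {
        cell: {HOST_CPU_TO_NODE[VCPU_PIN[vcpu]] for vcpu in vcpus}
        for cell, vcpus in GUEST_CELL_VCPUS.items()
    }


if __name__ == "__main__":
    for cell, nodes in guest_cell_host_nodes().items():
        print(f"guest cell {cell}: host nodes {sorted(nodes)}")
```

With the values as transcribed here, every guest cell comes out pinned across two host nodes (guest cell 0 on host nodes 0 and 1, for example), and host CPUs 32-63 are not used by any vCPU. Whether that mismatch explains the behaviour described below is not established by this sketch.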
If I run, e.g., SPEC CPU 2017 in the virtual machine, I see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1350 root 20 0 843136 830068 2524 R 78.1 0.2 513:16.16 bwaves_r_base.m
2456 root 20 0 804608 791264 2524 R 76.6 0.2 491:39.92 bwaves_r_base.m
4631 root 20 0 843136 829892 2344 R 75.8 0.2 450:16.04 bwaves_r_base.m
6441 root 20 0 802580 790212 2532 R 75.0 0.2 120:37.54 bwaves_r_base.m
7991 root 20 0 784676 772092 2576 R 75.0 0.2 387:15.39 bwaves_r_base.m
8142 root 20 0 843136 830044 2496 R 75.0 0.2 384:39.02 bwaves_r_base.m
8234 root 20 0 843136 830064 2524 R 75.0 0.2 99:04.48 bwaves_r_base.m
8578 root 20 0 749240 736604 2468 R 73.4 0.2 375:45.66 bwaves_r_base.m
9974 root 20 0 784676 771984 2468 R 73.4 0.2 348:01.36 bwaves_r_base.m
10396 root 20 0 802580 790264 2576 R 73.4 0.2 340:08.40 bwaves_r_base.m
12932 root 20 0 843136 830024 2480 R 73.4 0.2 288:39.76 bwaves_r_base.m
13113 root 20 0 784676 771864 2348 R 71.9 0.2 284:47.34 bwaves_r_base.m
13518 root 20 0 784676 762816 2540 R 71.9 0.2 276:31.58 bwaves_r_base.m
14443 root 20 0 784676 771984 2468 R 71.9 0.2 260:01.82 bwaves_r_base.m
12791 root 20 0 784676 772060 2544 R 70.3 0.2 291:43.96 bwaves_r_base.m
10544 root 20 0 843136 830068 2520 R 68.8 0.2 336:47.43 bwaves_r_base.m
15464 root 20 0 784676 762880 2608 R 60.9 0.2 239:19.14 bwaves_r_base.m
15487 root 20 0 784676 772048 2532 R 60.2 0.2 238:37.07 bwaves_r_base.m
16824 root 20 0 784676 772120 2604 R 55.5 0.2 212:10.92 bwaves_r_base.m
17255 root 20 0 843136 830012 2468 R 54.7 0.2 203:22.89 bwaves_r_base.m
17962 root 20 0 784676 772004 2488 R 54.7 0.2 188:26.07 bwaves_r_base.m
17505 root 20 0 843136 830068 2520 R 53.1 0.2 198:04.25 bwaves_r_base.m
27767 root 20 0 784676 771860 2344 R 52.3 0.2 592:25.95 bwaves_r_base.m
24458 root 20 0 843136 829888 2344 R 50.8 0.2 658:23.70 bwaves_r_base.m
30746 root 20 0 747376 735160 2604 R 43.0 0.2 556:47.67 bwaves_r_base.m
The CPU times should be roughly the same, but the differences are huge.
This is what I see on the hypervisor:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18201 oneadmin 20 0 474.0g 473.3g 1732 S 2459 94.0 33332:54 kvm
369 root 20 0 0 0 0 R 100.0 0.0 768:12.85 kswapd1
368 root 20 0 0 0 0 R 94.1 0.0 869:05.61 kswapd0
I.e., kswapd is eating a whole CPU. Swap is turned off.
[***@hde10 ~]# free
total used free shared buff/cache available
Mem: 528151432 503432580 1214048 34740 23504804 21907800
Swap: 0 0 0
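To put the memory figures side by side, here is a minimal sketch of the arithmetic. It assumes all values are in KiB (the default unit of free and of libvirt's memory elements) and uses the numbers as they appear in this post; the function and constant names are illustrative only.

```python
KIB_PER_GIB = 1024 * 1024

# Values copied from the post; all are assumed to be in KiB.
HOST_MEM_TOTAL_KIB = 528151432   # "Mem: total" from free on the hypervisor
DOMAIN_MEMORY_KIB = 507904000    # <memory> in the domain XML
GUEST_CELL_KIB = 62000000        # memory= of each guest <cell>
GUEST_CELL_COUNT = 8             # number of guest <cell> elements


def kib_to_gib(kib):
    """Convert KiB to GiB as a float."""
    return kib / KIB_PER_GIB


def memory_budget():
    """Compare the guest memory settings with the host's total memory."""
    cells_total = GUEST_CELL_KIB * GUEST_CELL_COUNT
    return {
        "cells_total_kib": cells_total,
        "host_minus_domain_kib": HOST_MEM_TOTAL_KIB - DOMAIN_MEMORY_KIB,
        "host_minus_cells_kib": HOST_MEM_TOTAL_KIB - cells_total,
        "domain_minus_cells_kib": DOMAIN_MEMORY_KIB - cells_total,
    }


if __name__ == "__main__":
    for name, kib in memory_budget().items():
        print(f"{name}: {kib} KiB ({kib_to_gib(kib):.2f} GiB)")
```

The eight cells add up to 496000000 KiB, about 473 GiB, which is close to the resident size of the kvm process in the top output above. That leaves roughly 19 to 31 GiB of the host's memory for everything else, depending on whether the domain memory value or the cell sum is taken as the guest size.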
The hypervisor is:
[***@hde10 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
qemu-kvm-1.5.3-156.el7_5.5.x86_64
The virtual machine runs Debian 9.
Moreover, I'm using this type of disk for the virtual machines:
<disk type='file' device='disk'>
<source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
<target dev='vde'/>
<driver name='qemu' type='raw' cache='unsafe'/>
</disk>
If I keep cache='unsafe' and run an iozone test on really big files (e.g.,
8x 100GB), I see huge cache pressure on the hypervisor: all 8 kswapd threads
run at 100 % CPU and slow things down. The disk under the datastore is an
Intel 4500 NVMe SSD.
If I set cache='none', the kswapd threads are idle and disk writes are pretty
fast. However, with the 8-NUMA configuration, writes slow down to less than
10 MB/s as soon as the amount of written data is roughly the same as the memory
size of the virtual node. From then on iozone uses 100 % CPU, and it seems to be
traversing page lists. If I do the same with the 1-NUMA configuration,
everything is OK except for a performance penalty of about 25 %.
--
Lukáš Hejtmánek
Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title