Odroid XU4 (Cortex-A15) vs HiKey 960 (Cortex-A73) speed

I mounted a big heat sink with a fan. I also checked the CPU frequency and it is stable.
Intel cores insert NOP instructions when they overheat. Is it the same on ARM?
How can I check whether an ARM CPU has the same heat problem? From what I understand, the OS will change the CPU frequency.

watch -n1 "cat /sys/class/thermal/thermal_zone0/temp"

gives me 52000 (52 °C?)
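
One way to check, assuming the usual cpufreq sysfs layout on this kernel (the exact paths may differ per board), is to watch the current frequency next to the temperature while the benchmark runs:

watch -n1 "cat /sys/class/thermal/thermal_zone0/temp /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq"

If scaling_cur_freq drops below the configured maximum under load, the core is being thermally capped.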

I wasn't aware that the CPU SoC is covered by the RAM module.

@leo-yan feel free to do your own benchmark with libjpeg-turbo.
I have posted my results.
I can also post results from a Raspberry Pi 3.

ARM SoCs cap frequencies (CPU/GPU/DDR) to handle thermal issues. The 'NOP' method sounds like idle injection; this is not enabled on HiKey960, but it's a common technique and can also be applied to ARM CPUs.
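
As a rough check, assuming the standard Linux thermal sysfs interface, you can list the registered cooling devices and their current throttling state:

grep . /sys/class/thermal/cooling_device*/type
grep . /sys/class/thermal/cooling_device*/cur_state

A non-zero cur_state on a cpufreq cooling device means that cluster is currently being frequency-capped.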

Yes, on HiKey960 the DDR is stacked on top of the SoC, and this also impacts heat dissipation.

I am a bit curious about the libjpeg-turbo case: it seems its tasks are small, but if the tasks are placed onto the SMP cores, that can boost its performance, right?

@Karl_M thanks for the performance testing. There are some settings and kernel patches for boosting multi-threaded benchmark performance; until I can really work out these patches for the Android kernel branch, you could try the settings below, which from previous experience help with CPU benchmarks:

echo 1 > /proc/sys/kernel/sched_migration_cost_ns
echo 99999999 > /sys/class/thermal/thermal_zone0/sustainable_power
echo 0-7 > /dev/cpuset/background/cpus
echo 0-7 > /dev/cpuset/foreground/cpus
echo 0-7 > /dev/cpuset/top-app/cpus
echo 0-7 > /dev/cpuset/system-background/cpus

@leo-yan
I chose libjpeg-turbo as it is a single-threaded benchmark and I want to compare the single-thread speed of the two different cores.
And Linaro was a big contributor to it.
./tjbench has a switch to run it multiple times.

I don't see /dev/cpuset/.
How can I enable the cpuset device?

echo NO_ENERGY_AWARE > /sys/kernel/debug/sched_features - no significant change
echo 99999999 > /sys/class/thermal/thermal_zone0/sustainable_power - no significant change

I hoped that the Cortex-A73 would be a big boost compared to the A15.

For a single-threaded benchmark, you could try the 'taskset' command:

On Android: taskset 16 benchmark benchmark_arguments
On Debian: taskset 0x10 benchmark benchmark_arguments

This binds the task to CPU4 (bit 4 is set in the commands above). CPU4/5/6/7 are the Cortex-A73 CPUs.
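
As a concrete sketch, using the -c (CPU list) syntax of util-linux taskset on Debian and the tjbench invocation from earlier in this thread:

taskset -c 4 ./tjbench testimages/testimgari.jpg -benchtime 20   # pin to one A73 core
taskset -c 0 ./tjbench testimages/testimgari.jpg -benchtime 20   # pin to one little core for comparison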

Please check that the configs below have been enabled:

CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_SCHEDTUNE=y

Please note, AOSP images have CPUSET enabled and mounted in the /dev folder; if your image does not, then after the system boots up you need to manually mount the cpuset virtual fs. For detailed info you can refer to the doc: https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt; the chapter "1.9 How do I use cpusets ?" explains how to use it.
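
A minimal sketch of that manual mount, following the cpusets.txt doc (this assumes CONFIG_CPUSETS=y and uses the /dev/cpuset path that the echo commands above expect):

mkdir -p /dev/cpuset
mount -t cpuset cpuset /dev/cpuset
cat /dev/cpuset/cpus    # the root cpuset should list all CPUs, e.g. 0-7

Note that the background/foreground/top-app/system-background sub-cpusets used in the earlier commands are created by Android's init; on a plain Debian rootfs only the root cpuset exists after mounting, so any sub-cpusets would have to be created by hand.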

After a quick look at arch/arm64/boot/dts/hisilicon/hi3660.dtsi (kernel/hikey-linaro - Git at Google),
I don't find any result when I search for the 'cache' keyword.

So I'm wondering whether the A73 cluster's 2 MB L2 cache is correctly set up and used by the system.

If not, this might explain your performance issue.
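
A quick userspace check, assuming the kernel's cacheinfo sysfs interface is available, is to look at what the kernel reports for one of the big cores:

grep . /sys/devices/system/cpu/cpu4/cache/index*/*

If the size/ways attributes for index2 (the unified L2) are missing or empty, the device tree is not describing that cache to the kernel.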

I looked at ARM's default CPU DTS files, and they do set up the cache nodes there.

And then I updated my dts file, and it really went up.

@nobe, thanks a lot for the info.

I just wonder: the L2 caches are in the two clusters and they are built in. Basically, when we enable the CPU SMP bit and CCI coherency for the two clusters, the L2 caches are enabled by default; so the L2 caches are already set up by ARM Trusted Firmware at boot time. So if we add the 'cache' node in the DTS, why does this impact performance so heavily?

@leo-yan
To be honest, I shouldn't answer this question because it is clearly outside the scope of my understanding.

But maybe this blog post could be helpful for understanding what's happening: Get Cache Info in Linux on ARMv8 64-bit Platform

It references this commit from Sudeep Holla: https://android.googlesource.com/kernel/hikey-linaro/+/246246cbde5e840012f853e27630ebb59f409486
Here's a quote from his commit message:

It also implements the shared_cpu_map attribute, which is essential for
enabling both kernel and user-space to discover the system’s overall cache
topology.

hope this helps

@xlazom00
Could you please share your new x264 and libjpeg-turbo benchmarks after your device tree change?

@nobe thanks a lot for the shared info.

Sorry, but it looks like I get the same result after all.

But here are some results:

big cores only / governor ondemand

./tjbench testimages/testimgari.jpg -benchtime 20

Image size: 227 x 149
Decompress → Frame rate: 469.202742 fps
Throughput: 15.869844 Megapixels/sec

small cores only / governor ondemand

./tjbench testimages/testimgari.jpg -benchtime 20

Image size: 227 x 149
Decompress → Frame rate: 456.859094 fps
Throughput: 15.452345 Megapixels/sec

big cores only / governor performance, min and max core freq 2.36 GHz

./tjbench testimages/testimgari.jpg -benchtime 20

Image size: 227 x 149
Decompress → Frame rate: 469.965926 fps
Throughput: 15.895658 Megapixels/sec

small cores only / governor performance, min and max core freq 1.84 GHz

./tjbench testimages/testimgari.jpg -benchtime 20

Image size: 227 x 149
Decompress → Frame rate: 457.087566 fps
Throughput: 15.460073 Megapixels/sec

x264 encoder
threads 8 all cores enabled and all cores max freq
24.78

threads 8 big cores only
21.87

threads 8 small cores only
11.67

threads 6 all cores enabled
22.27

threads 6 big cores only
20.96

threads 6 small cores only
11.32

taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 4.5521s
total number of events: 10000
total time taken by event execution: 18.1632
per-request statistics:
min: 1.80ms
avg: 1.82ms
max: 37.82ms
approx. 95 percentile: 1.81ms

Threads fairness:
events (avg/stddev): 2500.0000/5.74
execution time (avg/stddev): 4.5408/0.01

taskset -c 0-3 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 4.9342s
total number of events: 10000
total time taken by event execution: 19.7033
per-request statistics:
min: 1.96ms
avg: 1.97ms
max: 33.98ms
approx. 95 percentile: 1.97ms

Threads fairness:
events (avg/stddev): 2500.0000/4.74
execution time (avg/stddev): 4.9258/0.00

taskset -c 4-7 ./tinymembench

C copy backwards : 6613.6 MB/s (0.6%)
C copy backwards (32 byte blocks) : 6330.2 MB/s
C copy backwards (64 byte blocks) : 6322.5 MB/s
C copy : 6292.1 MB/s
C copy prefetched (32 bytes step) : 6845.9 MB/s
C copy prefetched (64 bytes step) : 6954.4 MB/s
C 2-pass copy : 2507.5 MB/s (0.3%)
C 2-pass copy prefetched (32 bytes step) : 2292.9 MB/s (3.5%)
C 2-pass copy prefetched (64 bytes step) : 2387.8 MB/s (4.8%)
C fill : 8675.2 MB/s
C fill (shuffle within 16 byte blocks) : 8661.5 MB/s
C fill (shuffle within 32 byte blocks) : 8659.0 MB/s
C fill (shuffle within 64 byte blocks) : 8661.8 MB/s

standard memcpy : 6727.1 MB/s (0.1%)
standard memset : 10114.2 MB/s (0.1%)

NEON LDP/STP copy : 6793.4 MB/s (0.2%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 6861.7 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 6973.5 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 6878.4 MB/s (0.3%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 6934.3 MB/s (0.1%)
NEON LD1/ST1 copy : 6576.7 MB/s
NEON STP fill : 10097.2 MB/s
NEON STNP fill : 10067.8 MB/s
ARM LDP/STP copy : 6543.7 MB/s
ARM STP fill : 10106.2 MB/s
ARM STNP fill : 10078.2 MB/s

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.1 ns / 0.1 ns
131072 : 7.4 ns / 13.7 ns
262144 : 12.1 ns / 19.8 ns
524288 : 15.3 ns / 21.3 ns
1048576 : 17.0 ns / 21.4 ns
2097152 : 19.2 ns / 23.4 ns
4194304 : 122.4 ns / 180.2 ns
8388608 : 177.5 ns / 226.9 ns
16777216 : 205.4 ns / 243.7 ns
33554432 : 219.9 ns / 253.8 ns
67108864 : 228.4 ns / 258.4 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.1 ns / 0.1 ns
131072 : 7.4 ns / 13.7 ns
262144 : 11.2 ns / 20.0 ns
524288 : 13.0 ns / 23.0 ns
1048576 : 14.1 ns / 24.5 ns
2097152 : 15.9 ns / 27.3 ns
4194304 : 118.3 ns / 177.0 ns
8388608 : 170.1 ns / 219.2 ns
16777216 : 196.2 ns / 232.1 ns
33554432 : 209.2 ns / 236.7 ns
67108864 : 218.2 ns / 239.3 ns

When I modified the dts file, I was able to see the cache sizes:

grep . /sys/devices/system/cpu/cpu*/cache/index*/*

Odroid XU4
The memory latency test looks much, much better than on the HiKey960.

odroid@odroid:~/tinymembench$ taskset -c 4-7 ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for ‘copy’ tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source → L1 cache, L1 cache → destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==

C copy backwards : 1205.9 MB/s
C copy backwards (32 byte blocks) : 1189.5 MB/s (0.2%)
C copy backwards (64 byte blocks) : 2577.1 MB/s (0.2%)
C copy : 2762.6 MB/s
C copy prefetched (32 bytes step) : 3047.8 MB/s (0.2%)
C copy prefetched (64 bytes step) : 3136.0 MB/s
C 2-pass copy : 1384.4 MB/s
C 2-pass copy prefetched (32 bytes step) : 1661.3 MB/s
C 2-pass copy prefetched (64 bytes step) : 1668.1 MB/s
C fill : 4934.5 MB/s (1.5%)
C fill (shuffle within 16 byte blocks) : 1866.3 MB/s
C fill (shuffle within 32 byte blocks) : 1863.8 MB/s (0.5%)
C fill (shuffle within 64 byte blocks) : 1944.8 MB/s (0.2%)

standard memcpy : 2417.6 MB/s (0.3%)
standard memset : 4937.1 MB/s (1.0%)

NEON read : 3392.9 MB/s
NEON read prefetched (32 bytes step) : 4284.9 MB/s
NEON read prefetched (64 bytes step) : 4296.6 MB/s
NEON read 2 data streams : 3463.8 MB/s
NEON read 2 data streams prefetched (32 bytes step) : 4440.4 MB/s
NEON read 2 data streams prefetched (64 bytes step) : 4446.9 MB/s
NEON copy : 2627.7 MB/s
NEON copy prefetched (32 bytes step) : 2923.8 MB/s
NEON copy prefetched (64 bytes step) : 2912.8 MB/s
NEON unrolled copy : 2254.9 MB/s
NEON unrolled copy prefetched (32 bytes step) : 3251.8 MB/s (3.9%)
NEON unrolled copy prefetched (64 bytes step) : 3269.6 MB/s
NEON copy backwards : 1221.1 MB/s
NEON copy backwards prefetched (32 bytes step) : 1430.2 MB/s
NEON copy backwards prefetched (64 bytes step) : 1430.0 MB/s
NEON 2-pass copy : 2070.6 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 2233.8 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 2238.3 MB/s
NEON unrolled 2-pass copy : 1396.4 MB/s
NEON unrolled 2-pass copy prefetched (32 bytes step) : 1747.0 MB/s
NEON unrolled 2-pass copy prefetched (64 bytes step) : 1763.6 MB/s
NEON fill : 4928.0 MB/s (0.8%)
NEON fill backwards : 1854.0 MB/s
VFP copy : 2446.2 MB/s (0.1%)
VFP 2-pass copy : 1335.8 MB/s
ARM fill (STRD) : 4945.0 MB/s (0.8%)
ARM fill (STM with 8 registers) : 4949.8 MB/s (0.3%)
ARM fill (STM with 4 registers) : 4946.5 MB/s (0.2%)
ARM copy prefetched (incr pld) : 2949.2 MB/s (0.7%)
ARM copy prefetched (wrap pld) : 2781.6 MB/s
ARM 2-pass copy prefetched (incr pld) : 1662.6 MB/s
ARM 2-pass copy prefetched (wrap pld) : 1632.9 MB/s

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can’t handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 4.4 ns / 6.7 ns
131072 : 6.7 ns / 8.9 ns
262144 : 9.6 ns / 11.8 ns
524288 : 11.0 ns / 13.5 ns
1048576 : 11.9 ns / 14.5 ns
2097152 : 23.2 ns / 30.4 ns
4194304 : 95.2 ns / 143.5 ns
8388608 : 133.8 ns / 182.9 ns
16777216 : 153.6 ns / 199.6 ns
33554432 : 168.9 ns / 218.9 ns
67108864 : 178.4 ns / 231.6 ns

Tests should always be done 5-10 times on the same configuration, extremely odd results should be eliminated, and then the average scores of all the tests should be compared, just to remove false positives.
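
A minimal sketch of such a run, reusing the tjbench invocation and big-core pinning from earlier in this thread (outliers would still need to be discarded by hand before trusting the average):

for i in $(seq 1 10); do
    taskset -c 4-7 ./tjbench testimages/testimgari.jpg -benchtime 20
done | awk '/Throughput/ {sum += $2; n++} END {print "average:", sum/n, "Megapixels/sec over", n, "runs"}'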

Below are my test results with thermal throttling disabled:

hikey960:/data # ./taskset 0xf ./sysbench --test=cpu --cpu-max-prime=20000 r>
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 15.0594s
total number of events: 10000
total time taken by event execution: 60.1927
per-request statistics:
min: 6.00ms
avg: 6.02ms
max: 33.73ms
approx. 95 percentile: 6.01ms

Threads fairness:
events (avg/stddev): 2500.0000/1.87
execution time (avg/stddev): 15.0482/0.01

hikey960:/data # ./taskset 0xf0 ./sysbench --test=cpu --cpu-max-prime=20000 ru>
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 9.0935s
total number of events: 10000
total time taken by event execution: 36.3460
per-request statistics:
min: 3.46ms
avg: 3.63ms
max: 32.57ms
approx. 95 percentile: 4.22ms

Threads fairness:
events (avg/stddev): 2500.0000/1.87
execution time (avg/stddev): 9.0865/0.00

@leo-yan
I get better results

Please also provide tinymembench numbers (to check for L2 cache; background information: http://www.cnx-software.com/2017/04/26/96boards-compliant-hikey-960-arm-cortex-a73-development-board-is-now-available-for-239/#comment-543357)

The most interesting test for me personally would be using taskset to run sysbench on either 1, 2, or 3 A73 cores simultaneously, to get an idea of how dynamic cpufreq/dvfs scaling works here (details why also above).
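
A rough sketch of that, assuming the same sysbench binary and the CPU4-7 big-core numbering used above (watching scaling_cur_freq in a second terminal while it runs shows what the frequencies actually do):

for n in 1 2 3 4; do
    cpus=$(seq -s, 4 $((3 + n)))   # 4 / 4,5 / 4,5,6 / 4,5,6,7
    echo "=== $n A73 core(s): CPUs $cpus ==="
    taskset -c "$cpus" sysbench --test=cpu --cpu-max-prime=20000 --num-threads="$n" run | grep "total time:"
done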

And one of the things I almost immediately do when fiddling around with a new platform/kernel is this:

find /sys \( -iname "*clock*" -o -iname "*freq*" \)

to get monitoring sources for that.