Odroid XU4(Cortex A15) vs HiKey 960(A73) speed

@tkaiser

linaro@linaro-developer:~$ taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 10.8512s
total number of events: 10000
total time taken by event execution: 10.8501
per-request statistics:
min: 1.08ms
avg: 1.09ms
max: 1.71ms
approx. 95 percentile: 1.09ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 10.8501/0.00

linaro@linaro-developer:~$ taskset -c 3 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 19.6397s
total number of events: 10000
total time taken by event execution: 19.6373
per-request statistics:
min: 1.96ms
avg: 1.96ms
max: 2.08ms
approx. 95 percentile: 1.97ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 19.6373/0.00

So the small core runs at roughly half the speed of the big core, which doesn’t quite fit 1.8 GHz vs 2.4 GHz.

cat /sys/devices/system/cpu/cpufreq/policy4/stats/time_in_state
903000 1292
1421000 88
1805000 0
2112000 30
2362000 22893

cat /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state
533000 1272
999000 30
1402000 30
1709000 38
1844000 24120

So something is still slowing down the Cortex-A73.
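
To put numbers on that comparison, here is a minimal sketch of the arithmetic, using only the sysbench run times and the top cpufreq entries from the time_in_state tables above. The helper name `ratio` is made up for illustration, and the per-clock figure says nothing about other workloads.

```shell
#!/bin/sh
# Ratio of two numbers with two decimals (awk does the floating-point math).
# "ratio" is a hypothetical helper name, not part of any tool.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Values copied from the sysbench runs and time_in_state tables above.
little_time=19.6397   # seconds, taskset -c 3 (little core)
big_time=10.8512      # seconds, taskset -c 4 (big core)
big_khz=2362000       # highest cpufreq listed for policy4
little_khz=1844000    # highest cpufreq listed for policy0

speedup=$(ratio "$little_time" "$big_time")
clock_ratio=$(ratio "$big_khz" "$little_khz")
per_clock=$(ratio "$speedup" "$clock_ratio")

echo "measured speedup:  ${speedup}x"
echo "clock ratio:       ${clock_ratio}x"
echo "per-clock speedup: ${per_clock}x"
```

With these inputs it prints 1.81x, 1.28x and 1.41x, so in this particular test the big core is ahead of what the clock ratio alone would give, just by less than one might expect from an A73.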

@xlazom00 your suspicion is reasonable.

Please note that on the kernel side the thermal framework throttles the CPU frequency within the temperature range 65°C to 75°C; beyond this range the MCU firmware also throttles the CPU frequency and tries to limit the temperature to 85°C. So the MCU firmware may still be capping CPU capacity. One thing I am not sure about, though: I see you are only using one CA73 CPU, and from power measurements I observed that a single CA73 CPU doesn’t cause a serious thermal issue. If you have HDMI enabled, you can try unplugging it to save about 1 W of power and see whether you get a better result.

I used the ‘dhry2 2’ benchmark to test CPU performance: the CA53 at 1.8GHz scores 12918333.57 and the CA73 at 2.4GHz scores 28327272.35, so the CA73@2.4GHz is about 2.2 times as fast as the CA53@1.8GHz. This result seems reasonable to me: the CA73 has a much more complex pipeline than the CA53, so it should do better than the pure frequency ratio.

Please forget about sysbench numbers. The cpu test can only be used to isolate problems but not to measure performance (it only does prime number calculations in a specific way). Sysbench is only useful to test workloads without being affected by memory throughput (not that realistic for anything else) and it can be used to test for scheduling issues (see your results with the 4.4 kernel, where execution time remained the same whether running on 1, 2 or all cores).

A53 and A73 are two completely different architectures (even made by different teams: Cambridge vs. Sophia-Antipolis) and the useless sysbench test might not show the A73 improvements.

At least it seems cpufreq scaling and (SMP/HMP) scheduling are working now. To find out whether you’re affected by throttling, switch to the performance governor and run the following both before and after benchmarking:

cat /sys/devices/system/cpu/cpufreq/policy4/stats/time_in_state
cat /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state

If cpufreqs other than 2362000 or 1844000 increase their numbers, throttling occurred (if we can rely on sysfs here – to be confirmed. We know of at least two implementations where ‘firmwares’ were cheating on us or still cheat: RPi 3 lies here and Amlogic 9xx did lie in the past)
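
The before/after comparison can be automated. This is a minimal sketch, assuming the two `cat` outputs were saved to files. The helper name `throttle_report` and the /tmp paths are made up for illustration, and the counters are taken to be in the kernel’s usual 10 ms units. It is only as trustworthy as sysfs itself.

```shell
#!/bin/sh
# Print every cpufreq whose time_in_state counter grew between two snapshots,
# ignoring the expected top frequency.
# Usage: throttle_report BEFORE_FILE AFTER_FILE TOP_FREQ_KHZ
# "throttle_report" is a hypothetical helper name.
throttle_report() {
    awk -v top="$3" '
        NR == FNR { before[$1] = $2; next }       # first file: remember counters
        ($1 + 0) != (top + 0) && ($2 + 0) > (before[$1] + 0) {
            # second file: any lower frequency that gained time
            printf "%s kHz: +%d units\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Example for the big cluster (paths under /tmp are an assumption):
#   cat /sys/devices/system/cpu/cpufreq/policy4/stats/time_in_state > /tmp/before
#   ... run the benchmark ...
#   cat /sys/devices/system/cpu/cpufreq/policy4/stats/time_in_state > /tmp/after
#   throttle_report /tmp/before /tmp/after 2362000
```

No output means only the top frequency accumulated time during the run; any line printed names a lower frequency the cluster dropped to.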

I personally would be interested in minerd --benchmark numbers running on the big, the little, and all cores. To compare numbers on Ubuntu Xenial this should be sufficient:

sudo apt -f -qq -y install libcurl4-gnutls-dev
wget http://downloads.sourceforge.net/project/cpuminer/pooler-cpuminer-2.4.5.tar.gz
tar xf pooler-cpuminer-2.4.5.tar.gz && rm pooler-cpuminer-2.4.5.tar.gz
cd cpuminer-2.4.5/
./configure CFLAGS="-O3 -mfpu=neon"
make
sudo make install

I always use minerd to optimize cpufreq/dvfs OPP tables (since you get the efficiency in khash/s and you also see when throttling happens since performance starts to decrease!) but for a general performance picture it might also not be sufficient. At least minerd needs memory throughput so we should see nice high numbers here.

I forgot: most probably this will not work with taskset/cgroups but only CPU hotplugging (if possible) eg.

for i in 0 1 2 3 ; do echo 0 > /sys/devices/system/cpu/cpu${i}/online ; done

Don’t forget to bring up the killed CPU cores prior to switching clusters! :wink: It’s similar:

for i in 0 1 2 3 ; do echo 1 > /sys/devices/system/cpu/cpu${i}/online ; done

And a final remark: Sometimes dvfs tables are insufficient (eg. voltage for a specific dvfs operation point is too low which can lead to instability or data corruption – happened with RPi 3 last year with early firmware versions, later versions increased VDD_CPUX voltage for the 1200MHz OPP).

A statically linked Linpack (OpenBLAS/NEON) is provided at https://github.com/ThomasKaiser/StabilityTester, along with the scripts we normally use to test through dvfs tables (you only need xhpl64 and HPL.dat; look into stabilityTester.sh to see how to call the former). It can be used here too, to provide some performance numbers and to test for throttling, but the scripts would have to be adjusted since they are currently not big.LITTLE aware. The nice thing about this Linpack is that it detects undervoltage by validating its own results and reports data corruption.

@leo-yan
Is it somehow possible to disable the MCU frequency capping?
When I run a benchmark on 1 or 2 cores, everything should still be fine.

@tkaiser
When I run
taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4
the CPU really crashes from overheating once it gets over 85°C.
With down-clocking to max 2.1 GHz everything works “fine” :slight_smile:

Maybe try adding a fan over the heat sink; 12 V PC fans work well.

I consider ‘sysbench’ a pretty light load BTW. If this with 2.4GHz already exceeds 85°C or even crashes you don’t need to think about running heavy stuff like cpuminer or the NEON optimized Linpack version I mentioned above. Or you need to massively improve heat dissipation.

On the other hand I’m not that surprised since this SoC is made for phones where sometimes high single thread performance is needed and everything else happens on GPU cores and inside video engine anyway.

The more interesting question is: Where does the dvfs table live? Usually this table maps each cpufreq to the voltage the CPU cores are supplied with at that clockspeed. If the voltages are too low you run into stability issues; if they are too high the thing overheats. So you need a perfectly balanced dvfs table with voltages as low as possible while having some safety headroom added so they’ll work everywhere.

When starting with vendor kernels/settings (which favour Android behaviour → keeping single-threaded CPU performance up even if the thing starts to overheat), SoCs are usually 2 to 3 times faster in normal heavy multi-threaded benchmarks after improving the dvfs table and budget cooling settings, at least in situations with mediocre heat dissipation. Just by tuning settings for the workload. With liquid cooling or large fans the difference is not that big.

@tkaiser operating points are defined in the device tree file

thermal zones are also defined in there, starting at line 1350

but I still wonder whether the 85°C limit from the MCU firmware can be changed.

@nobe thank you but I’ve dealt with way too many devices in the meantime to trust in definitions $somewhere. On many platforms the values defined in DT are only used by u-boot (so using dtc you won’t succeed but need to do some obscure bootloader stuff and overwrite sectors with dd), on other platforms this is done by ATF, SCPI, on others it’s some $firmware.

And then there’s the other problem that you’re sometimes not even able to rely on the information available through sysfs since some driver interacting with some $firmware gets wrong information anyway. That’s why I recommended cpuminer since this tool will tell you when throttling occurs since khash/s values decrease even if cpufreq information available tells you something different (on RPi 3 for example the firmware dynamically throttles down from 1200 MHz to 601 MHz while /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq all the time still reports 1200 MHz)
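
Watching khash/s rather than sysfs can be scripted as a small filter over the minerd log. This is a minimal sketch under a few assumptions: the `Total: N khash/s` line format matches the minerd benchmark output posted in this thread, minerd writes its log to stderr (hence the `2>&1`; drop it if your build logs to stdout), and the helper name `khash_watch` and the 10% default threshold are made up for illustration.

```shell
#!/bin/sh
# Read minerd log lines on stdin, print each "Total:" rate next to the best
# rate seen so far, and mark rates that fell more than PCT percent below it.
# Usage: ./minerd --benchmark -t 4 2>&1 | khash_watch 10
# "khash_watch" is a hypothetical helper name.
khash_watch() {
    awk -v pct="${1:-10}" '
        $3 == "Total:" {
            rate = $4 + 0
            if (rate > peak) peak = rate          # track the best rate so far
            flag = (rate < peak * (1 - pct / 100)) ? "  <-- below peak, possible throttling" : ""
            printf "%s %s %.2f khash/s (peak %.2f)%s\n", $1, $2, rate, peak, flag
        }
    '
}
```

A sustained drop flagged here while the reported cpufreq stays at its maximum would point at throttling that sysfs does not show.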

@leo-yan
Why is Linaro’s branch “tracking-john-hikey960-rebase-4.9” faster than
kernel/hikey-linaro (Git at Google)?
What is different apart from 4.4 vs 4.9?

As can be seen above: non-working cpufreq scaling was one issue (CPU cores not clocking up to their maximum speed but remaining at very conservative low clockspeeds) and the other was f*cked up scheduling, at least on the big cluster.

BTW: cpufreq scaling won’t work properly when device tree nodes don’t match kernel. So switching between kernels without always also switching the .dtb files is also a great recipe to ruin performance :slight_smile:

BTW, I just ran my x264 benchmark (encoding a 1-minute PAL video to H.264):
A53 core at 1.8 GHz: 3.5 fps
A73 core at 2.4 GHz: 10.5 fps

So I get quite different numbers than with sysbench.
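
For a side-by-side view, here is a minimal sketch that turns the three big-vs-little results quoted in this thread into speedup factors. The helper name `speedup_table` is made up for illustration, and the figures come from different posters and setups, so the comparison is only rough.

```shell
#!/bin/sh
# Print "name speedup" per line, where speedup = big-core value / little-core value.
# Input lines: name big_value little_value (higher value = faster).
# "speedup_table" is a hypothetical helper name.
speedup_table() {
    awk '{ printf "%s %.2fx\n", $1, $2 / $3 }'
}

# Figures quoted in this thread. sysbench reports run time (lower is better),
# so its two values are swapped to give the same kind of ratio.
speedup_table <<'EOF'
sysbench 19.6397 10.8512
dhry2 28327272.35 12918333.57
x264 10.5 3.5
EOF
```

This prints 1.81x for sysbench, 2.19x for dhry2 and 3.00x for x264, which shows how much the big/little ratio depends on the workload.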

@tkaiser with big cores only enabled and configured without your parameters as I get less khash/s

linaro@linaro-developer:~/cpuminer-2.4.5$ ./minerd --benchmark -t 4
[2017-06-30 14:25:35] 4 miner threads started, using ‘scrypt’ algorithm.
[2017-06-30 14:25:36] thread 0: 4096 hashes, 2.20 khash/s
[2017-06-30 14:25:36] thread 3: 4096 hashes, 2.20 khash/s
[2017-06-30 14:25:37] thread 1: 4096 hashes, 2.18 khash/s
[2017-06-30 14:25:37] thread 2: 4096 hashes, 2.17 khash/s
[2017-06-30 14:25:39] thread 1: 6542 hashes, 2.26 khash/s
[2017-06-30 14:25:39] thread 2: 6507 hashes, 2.23 khash/s
[2017-06-30 14:25:40] thread 3: 8786 hashes, 2.26 khash/s
[2017-06-30 14:25:40] Total: 8.96 khash/s
[2017-06-30 14:25:40] thread 0: 8797 hashes, 2.26 khash/s
[2017-06-30 14:25:40] thread 1: 2262 hashes, 2.27 khash/s
[2017-06-30 14:25:40] thread 2: 2231 hashes, 2.24 khash/s
[2017-06-30 14:25:45] thread 3: 11318 hashes, 2.22 khash/s
[2017-06-30 14:25:45] Total: 8.99 khash/s
[2017-06-30 14:25:45] thread 0: 11314 hashes, 2.22 khash/s
[2017-06-30 14:25:46] thread 1: 11346 hashes, 2.22 khash/s
[2017-06-30 14:25:46] thread 2: 11199 hashes, 2.19 khash/s
[2017-06-30 14:25:49] thread 1: 8877 hashes, 2.24 khash/s
[2017-06-30 14:25:50] thread 2: 8747 hashes, 2.22 khash/s
[2017-06-30 14:25:50] thread 3: 11109 hashes, 2.24 khash/s
[2017-06-30 14:25:50] Total: 8.91 khash/s
[2017-06-30 14:25:50] thread 0: 11102 hashes, 2.24 khash/s
[2017-06-30 14:25:50] thread 1: 2239 hashes, 2.24 khash/s
[2017-06-30 14:25:55] thread 2: 11086 hashes, 2.19 khash/s
[2017-06-30 14:25:55] thread 3: 11192 hashes, 2.23 khash/s
[2017-06-30 14:25:55] Total: 8.90 khash/s
[2017-06-30 14:25:55] thread 0: 11191 hashes, 2.23 khash/s
[2017-06-30 14:25:56] thread 1: 11193 hashes, 2.23 khash/s
[2017-06-30 14:25:59] thread 2: 10975 hashes, 2.23 khash/s
[2017-06-30 14:26:00] thread 1: 8919 hashes, 2.22 khash/s
[2017-06-30 14:26:00] thread 2: 2229 hashes, 2.25 khash/s
[2017-06-30 14:26:00] thread 3: 11153 hashes, 2.22 khash/s
[2017-06-30 14:26:00] Total: 8.92 khash/s
[2017-06-30 14:26:00] thread 0: 11153 hashes, 2.22 khash/s
[2017-06-30 14:26:04] thread 1: 11082 hashes, 2.24 khash/s
[2017-06-30 14:26:05] thread 0: 11116 hashes, 2.25 khash/s
[2017-06-30 14:26:05] thread 3: 11117 hashes, 2.24 khash/s
[2017-06-30 14:26:05] Total: 8.99 khash/s
[2017-06-30 14:26:05] thread 1: 2245 hashes, 2.28 khash/s
[2017-06-30 14:26:05] thread 2: 11254 hashes, 2.25 khash/s
[2017-06-30 14:26:10] thread 0: 11270 hashes, 2.27 khash/s
[2017-06-30 14:26:10] thread 3: 11182 hashes, 2.25 khash/s
[2017-06-30 14:26:10] Total: 9.06 khash/s
[2017-06-30 14:26:10] thread 2: 11263 hashes, 2.27 khash/s
[2017-06-30 14:26:10] thread 1: 11419 hashes, 2.27 khash/s

Of course! I already explained above that sysbench --test=cpu does NOT test CPU appropriately. There are only two use cases for sysbench as also outlined above.

Using sysbench for any more general performance measuring is crap. This tool horribly sucks as a ‘general purpose benchmark’. Eg. using the Debian Jessie distro package you get 30% lower scores compared to Ubuntu Xenial (GCC 4.7 vs. 5.4), if you use more recent 0.5 version you get a lot lower values compared to 0.4.12, if you use on a RPi 3 sysbench from Raspbian it takes 48 seconds while only 3 seconds when using an ARMv8 optimized distro (or 6 seconds if your RPi 3 runs under-volted and therefore ‘frequency-capped’ – same link)

sysbench --test=cpu was only useful to identify specific typical problem sources. For everything else it’s absolutely useless.

Well, I really don’t get the meaning of this sentence. When I did this test on an ODROID-XU4 running Ubuntu Xenial (GCC 5.4) 2 months ago with exactly the same cpuminer version I wrote down ‘When forced to run on the little cores cpuminer gets 2.27 khash/s (no throttling occurring), running on the big cores it starts with 8.23 khash/s’ (‘starts with’ since back then we experimented with the 4.9 ODROID-XU kernel, and dvfs and thermal settings were underwhelming; this XU4 thing almost immediately throttled down even with heatsink + fan).

In other words: I don’t trust that much in the A73 numbers above since I would expect somewhat better results.

@tkaiser
I am on debian 9 with gcc 6.3.0
I wasn’t able to configure with ./configure CFLAGS="-O3 -mfpu=neon"
so I made it with this ./configure CFLAGS="-O3 -mtune=cortex-a72"

I see, I (again) forgot that -mfpu=neon isn’t needed with aarch64 since NEON is GCC’s default there anyway.

I still find the performance numbers somewhat underwhelming but it doesn’t look worth the effort to dig deeper (at least ‘remotely’). If I were sitting in front of the machine I would test further whether some budget cooling strategies are active (testing with different core counts with throttling active: no fan, maybe even no heatsink).

But I think I better spend the time on playing with server grade SoCs instead (Armada 8040 coming to my mind). At least it was interesting! Thanks for providing the numbers! :slight_smile:

Edit: It’s also interesting to explore specific throttling behaviour, especially when an ‘MCU firmware’ is involved. I played around with a rather beefy octa-core A53 device last year. Throttling there starts at 85°C and is implemented pretty stupidly: not downclocking dynamically but always jumping down to one lower cpufreq (800 MHz here). So when testing I discovered that I got way better results limiting max cpufreq to 1300 MHz than allowing the maximum 1400 MHz (since then performance dropped way lower, with the CPU cores constantly jumping between 1400 and 800 MHz instead of the throttling trying 1.3 and 1.2 GHz first). Maybe somewhere here: https://forum.armbian.com/index.php?/topic/1285-nanopi-m3-cheap-8-core-35/&do=findComment&comment=13803 (just did a quick forum search).