Odroid XU4 (Cortex-A15) vs HiKey 960 (Cortex-A73) speed

@xlazom00, @leo-yan
Could you please describe your test environment?
Which kernel (git hash) and rootfs are you using?

@domi
The results are pretty much the same for
https://github.com/96boards-hikey/linux/tree/working-android-hikey-linaro-4.4
branch “working-android-hikey-linaro-4.4”
with thermal driver http://paste.ubuntu.com/24963018/
L2 cache controller http://paste.ubuntu.com/24963068/
.config http://paste.ubuntu.com/24963065/

or
kernel/hikey-linaro - Git at Google
branch “android-hikey-linaro-4.4”
Only these kernels have working USB!
L2 cache controller http://paste.ubuntu.com/24963030/
and .config http://paste.ubuntu.com/24963034/

cmdline without debug:

/opt/workspace/boot/uefi/hikey960/tools-images-hikey960/build-from-source/mkbootimg --kernel $OUT/$BOOT/Image.gz --ramdisk $RAMDISK --cmdline "console=ttyAMA6,115200n8 root=/dev/sdd10 rootwait rw quiet efi=noruntime" --base 0 --tags_offset 0x07a00000 --kernel_offset 0x00080000 --ramdisk_offset 0x07c00000 --output boot.img

/opt/workspace/boot/uefi/hikey960/tools-images-hikey960/build-from-source/mkdtimg -d $OUT/$BOOT/dts/hisilicon/hi3660-hikey960.dtb -s 2048 -c -o dt.img

and as rootfs I used
http://builds.96boards.org/releases/dragonboard410c/linaro/debian/latest/linaro-stretch-developer-qcom-snapdragon-arm64-20170607-246.img.gz

and as ramdisk: http://www.mdragon.org/ramdisk.gz

Thanks! This is very helpful.

@tkaiser
taskset -c 1 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 19.6489s
total number of events: 10000
total time taken by event execution: 19.6464
per-request statistics:
min: 1.96ms
avg: 1.96ms
max: 3.11ms
approx. 95 percentile: 1.97ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 19.6464/0.00

taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 18.0456s
total number of events: 10000
total time taken by event execution: 18.0438
per-request statistics:
min: 1.80ms
avg: 1.80ms
max: 1.94ms
approx. 95 percentile: 1.81ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 18.0438/0.00

taskset -c 4-5 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=2
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 18.0461s
total number of events: 10000
total time taken by event execution: 36.0758
per-request statistics:
min: 1.80ms
avg: 3.61ms
max: 25.82ms
approx. 95 percentile: 17.81ms

Threads fairness:
events (avg/stddev): 5000.0000/2.00
execution time (avg/stddev): 18.0379/0.01

taskset -c 4-6 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=3
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 18.0534s
total number of events: 10000
total time taken by event execution: 54.1261
per-request statistics:
min: 1.80ms
avg: 5.41ms
max: 33.82ms
approx. 95 percentile: 25.81ms

Threads fairness:
events (avg/stddev): 3333.3333/1.89
execution time (avg/stddev): 18.0420/0.01


That took 4.9342s before, so now, running on only a single little core, getting 19.6489s (roughly 4 times as long) is exactly what should happen, and these are exactly the numbers one can expect from an A53 clocked at 1.8GHz.
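Quick sanity check of the scaling (plain shell arithmetic, nothing board-specific):

echo 'scale=2; 19.6489 / 4.9342' | bc   # ~3.98, i.e. one A53 core takes ~4x as long as four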

That took 4.5521s back then (too long: if the A73 runs at 2.4GHz it has to be 2 seconds or maybe even less). Now you get 18.0456s on one single core as expected, but the same result when running on 2 or 3 big cores, which points to scheduler weirdness (especially when looking at sysbench's increasing avg, max and 'approx. 95 percentile' numbers). It would be interesting how the numbers look when you test with taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4 again, i.e. whether you also get 18.x seconds or 4.5. If it's 18 seconds I assume you now run a different kernel that behaves totally differently wrt scheduling.
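One way to watch this (just a sketch, assuming sysstat/mpstat is installed; htop works just as well for eyeballing per-core load):

taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4 &
mpstat -P ALL 1 20   # check whether cores 4-7 are really all busy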

Anyway, interpreting the results it looks like the A73 cores currently clock at 1000 MHz max and there's something seriously wrong with scheduling (it might be fun to install/use htop to watch what happens on the individual cores). Added to that, there might still be an L2 cache problem, or not.
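A quick way to see what the big cluster is really clocked at (assuming the usual cpufreq sysfs layout with the big cluster behind policy4):

cat /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq
cat /sys/devices/system/cpu/cpufreq/policy4/cpuinfo_cur_freq   # may need root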

To get a better understanding of what's happening, you might want to play with the PMU.

Here is a list of the Cortex-A73 perf events:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100048_0002_05_en/jfa1406793332829.html
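For instance (a minimal sketch using generic event names; it assumes perf is installed and the kernel exposes the PMU):

taskset -c 4 perf stat -e cycles,instructions,cache-references,cache-misses \
  sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1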

I will try Linaro's Android kernel to see if I get the same results.

@tkaiser
I don't understand it, but if I change the min/max frequency to any value,
the results are always the same:
taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Test execution summary:
total time: 18.0456s
total number of events: 10000
total time taken by event execution: 18.0439
per-request statistics:
min: 1.80ms
avg: 1.80ms
max: 1.88ms
approx. 95 percentile: 1.81ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 18.0439/0.00

It looks like the kernel just sticks to some fixed value.
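For reference, this is roughly how I change them, just a plain cpufreq sysfs write (assuming policy4 covers the big cluster; values in kHz):

echo 2362000 | sudo tee /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq
echo 903000  | sudo tee /sys/devices/system/cpu/cpufreq/policy4/scaling_min_freq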

@tkaiser
I moved to the 4.9 branch
https://github.com/96boards-hikey/linux/tree/hikey960-v4.9
but this branch doesn't have a working USB driver :frowning:

and
taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4

Test execution summary:
total time: 2.7884s
total number of events: 10000
total time taken by event execution: 11.1175
per-request statistics:
min: 1.08ms
avg: 1.11ms
max: 33.09ms
approx. 95 percentile: 1.09ms

Threads fairness:
events (avg/stddev): 2500.0000/0.71
execution time (avg/stddev): 2.7794/0.01

taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Test execution summary:
total time: 10.8483s
total number of events: 10000
total time taken by event execution: 10.8473
per-request statistics:
min: 1.08ms
avg: 1.08ms
max: 1.10ms
approx. 95 percentile: 1.09ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 10.8473/0.00

x264 encoder benchmark: 29 fps

taskset -c 4 ./tjbench testimages/testimg

Image size: 227 x 149
Decompress → Frame rate: 782.208770 fps
Throughput: 26.456647 Megapixels/sec

taskset -c 4-7 ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for ‘copy’ tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source → L1 cache, L1 cache → destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==

C copy backwards : 9610.6 MB/s (0.6%)
C copy backwards (32 byte blocks) : 9442.9 MB/s
C copy backwards (64 byte blocks) : 9499.4 MB/s
C copy : 9297.4 MB/s
C copy prefetched (32 bytes step) : 9068.2 MB/s (0.1%)
C copy prefetched (64 bytes step) : 9072.6 MB/s (0.1%)
C 2-pass copy : 3678.5 MB/s
C 2-pass copy prefetched (32 bytes step) : 3306.5 MB/s
C 2-pass copy prefetched (64 bytes step) : 3423.4 MB/s (0.1%)
C fill : 14780.5 MB/s (0.2%)
C fill (shuffle within 16 byte blocks) : 14727.6 MB/s
C fill (shuffle within 32 byte blocks) : 14702.7 MB/s
C fill (shuffle within 64 byte blocks) : 14714.5 MB/s

standard memcpy : 9272.2 MB/s
standard memset : 14820.6 MB/s (0.1%)

NEON LDP/STP copy : 9261.7 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 9265.6 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 9261.9 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 8950.4 MB/s (1.3%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 8913.0 MB/s
NEON LD1/ST1 copy : 8738.3 MB/s
NEON STP fill : 14953.9 MB/s
NEON STNP fill : 14959.2 MB/s
ARM LDP/STP copy : 8887.7 MB/s
ARM STP fill : 14942.8 MB/s
ARM STNP fill : 14948.9 MB/s

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can’t handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 5.5 ns / 10.1 ns
262144 : 8.9 ns / 14.5 ns
524288 : 11.2 ns / 15.6 ns
1048576 : 12.4 ns / 15.8 ns
2097152 : 13.8 ns / 16.9 ns
4194304 : 86.9 ns / 123.6 ns
8388608 : 125.1 ns / 153.4 ns
16777216 : 143.4 ns / 161.8 ns
33554432 : 152.7 ns / 165.3 ns
67108864 : 158.4 ns / 168.3 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 5.5 ns / 10.1 ns
262144 : 8.2 ns / 14.7 ns
524288 : 9.6 ns / 16.9 ns
1048576 : 10.3 ns / 18.0 ns
2097152 : 11.4 ns / 19.7 ns
4194304 : 83.0 ns / 119.5 ns
8388608 : 119.2 ns / 144.8 ns
16777216 : 137.6 ns / 152.1 ns
33554432 : 146.7 ns / 154.2 ns
67108864 : 152.3 ns / 155.9 ns

@leo-yan
Any chance that we will get working USB in Linaro's 4.9 or 4.12 kernel?
There is a commit with HiKey 960 USB support, but it still isn't working.

There is a new branch
https://github.com/96boards-hikey/linux/tree/tracking-john-hikey960-rebase-4.9
with working USB
and nice performance :slight_smile:

My .config:
http://paste.ubuntu.com/24976454/

@xlazom00 It's great you found a working USB driver on the 4.9 branch.

For the 4.12 kernel branch, I was told that while booting the system you need to disconnect the USB cable on the OTG port, otherwise it's unstable. FYI. If you see the USB failure on the 4.12 kernel branch, could you paste the log so we can check it?

That means cpufreq scaling is not working and the CPU cores remain at the same clock speed all the time (set by U-Boot or ATF; no idea how it works on this platform).

The results with the other kernel still don't look good IMO (too slow). What does the output of

find /sys -name time_in_state

look like? There should be two files, one with cpufreq statistics for the little cluster and another one for the big cluster. The output would be interesting.
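Something like this dumps governor, current frequency and the statistics for both clusters in one go (a sketch assuming the standard cpufreq sysfs layout):

for p in /sys/devices/system/cpu/cpufreq/policy*; do
  echo "== $p =="
  cat $p/scaling_governor $p/scaling_cur_freq
  cat $p/stats/time_in_state
done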

@leo-yan
The 4.12 version doesn't boot.
Maybe there is some stuff missing in the defconfig,
I don't know, but the kernel doesn't start.
Should I move to UEFI?

@tkaiser

linaro@linaro-developer:~$ taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 10.8512s
total number of events: 10000
total time taken by event execution: 10.8501
per-request statistics:
min: 1.08ms
avg: 1.09ms
max: 1.71ms
approx. 95 percentile: 1.09ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 10.8501/0.00

linaro@linaro-developer:~$ taskset -c 3 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 19.6397s
total number of events: 10000
total time taken by event execution: 19.6373
per-request statistics:
min: 1.96ms
avg: 1.96ms
max: 2.08ms
approx. 95 percentile: 1.97ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 19.6373/0.00

So a small core is about 1/2 the speed of a big core, but 1.8GHz vs 2.4GHz doesn't quite match that ratio.
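Quick check of the ratios (plain shell arithmetic, using the max frequencies from the time_in_state output below):

echo 'scale=3; 19.6397 / 10.8512' | bc   # ~1.81 measured big-vs-little speedup
echo 'scale=3; 2362 / 1844' | bc         # ~1.28 from the clock ratio alone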

cat /sys/devices/system/cpu/cpufreq/policy4/stats/time_in_state
903000 1292
1421000 88
1805000 0
2112000 30
2362000 22893

cat /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state
533000 1272
999000 30
1402000 30
1709000 38
1844000 24120

So there is still something that slows down the Cortex-A73.

@xlazom00 your suspicion is reasonable.

Please note that on the kernel side there is a thermal framework for throttling the CPU frequency and its temperature range is [65°C…75°C]; beyond this range, the MCU firmware will also throttle the CPU frequency and try to limit the temperature to 85°C. So it is still possible that the MCU firmware caps CPU capacity. One thing I am not sure about: I see you only use one CA73 CPU, and from power measurement I observed that one CA73 CPU doesn't introduce a serious thermal issue. If you have HDMI enabled you can try to unplug it, which saves about ~1W of power, and see whether you get a better result or not.
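To see whether the thermal side kicks in, you can watch the reported temperatures while benchmarking (a sketch assuming the usual thermal sysfs nodes; values are millidegrees Celsius):

watch -n 1 'cat /sys/class/thermal/thermal_zone*/temp'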

I used the 'dhry2 2' benchmark to test CPU performance. The CA53 @ 1.8GHz score is 12918333.57 and the CA73 @ 2.4GHz score is 28327272.35, so the CA73@2.4GHz is about 2.2 times faster than the CA53@1.8GHz. This result seems reasonable to me: the CA73 has a much more complex pipeline than the CA53, so it should do better than the pure frequency ratio suggests.

Please forget about sysbench numbers. The cpu test can only be used to isolate problems, not to measure performance (it only does prime number calculations in a specific way). Sysbench is only useful for testing workloads that are not affected by memory throughput (not that realistic for anything else), and it can be used to test for scheduling issues (your results with the 4.4 kernel where execution time remained the same whether running on 1, 2 or all cores).

A53 and A73 are two completely different microarchitectures (even made by different teams: Cambridge vs. Sophia Antipolis) and the rather useless sysbench test might not show the A73's improvements.

At least it seems cpufreq scaling and (SMP/HMP) scheduling are working now. To find out whether you're affected by throttling, it's easy to switch to the performance governor and, prior to benchmarking and afterwards, do

cat /sys/devices/system/cpu/cpufreq/policy4/stats/time_in_state
cat /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state

If cpufreqs other than 2362000 or 1844000 increase their counters, throttling occurred (if we can rely on sysfs here, which is to be confirmed: we know of at least two implementations where 'firmwares' were cheating on us or still cheat, the RPi 3 lies here and Amlogic 9xx did lie in the past).
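Switching both clusters to the performance governor is just a sysfs write (assuming policy0 = little and policy4 = big, as in your output above):

echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy4/scaling_governor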

I personally would be interested in minerd --benchmark numbers running on the big cores, the little cores and all cores. To compare numbers on an Ubuntu Xenial this should be sufficient:

sudo apt -f -qq -y install libcurl4-gnutls-dev
wget http://downloads.sourceforge.net/project/cpuminer/pooler-cpuminer-2.4.5.tar.gz
tar xf pooler-cpuminer-2.4.5.tar.gz && rm pooler-cpuminer-2.4.5.tar.gz
cd cpuminer-2.4.5/
./configure CFLAGS="-O3 -mfpu=neon"
make
sudo make install

I always use minerd to optimize cpufreq/dvfs OPP tables (since you get the performance in khash/s and you also see when throttling happens because performance starts to decrease!), but for a general performance picture it might also not be sufficient. At least minerd needs memory throughput, so we should see nice high numbers here.
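For the actual runs, something along these lines would do (a sketch; minerd's --benchmark mode with the thread count matched to the pinned cores):

taskset -c 0-3 minerd --benchmark -t 4   # little cluster only
taskset -c 4-7 minerd --benchmark -t 4   # big cluster only
minerd --benchmark -t 8                  # all cores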