Odroid XU4 (Cortex-A15) vs HiKey 960 (Cortex-A73) speed

After modifying the dts file, I was able to see the cache sizes with:

grep . /sys/devices/system/cpu/cpu*/cache/index*/*
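For readability, a small helper like the one below (hypothetical, not from this thread) prints one line per cache index; it walks the same standard sysfs layout the grep above does:

```shell
# Hypothetical helper: one line per cache index under a CPU's cache dir.
# Defaults to cpu0's standard sysfs path; pass another directory to inspect it.
print_caches() {
  base="${1:-/sys/devices/system/cpu/cpu0/cache}"
  for d in "$base"/index*/; do
    [ -e "$d/size" ] || continue
    printf 'L%s %s: %s\n' "$(cat "$d/level")" "$(cat "$d/type")" "$(cat "$d/size")"
  done
}
```

Run e.g. `print_caches /sys/devices/system/cpu/cpu4/cache` to check a big core.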

Odroid XU4
The memory latency test results look much better than on the HiKey 960.

odroid@odroid:~/tinymembench$ taskset -c 4-7 ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and written          ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

C copy backwards : 1205.9 MB/s
C copy backwards (32 byte blocks) : 1189.5 MB/s (0.2%)
C copy backwards (64 byte blocks) : 2577.1 MB/s (0.2%)
C copy : 2762.6 MB/s
C copy prefetched (32 bytes step) : 3047.8 MB/s (0.2%)
C copy prefetched (64 bytes step) : 3136.0 MB/s
C 2-pass copy : 1384.4 MB/s
C 2-pass copy prefetched (32 bytes step) : 1661.3 MB/s
C 2-pass copy prefetched (64 bytes step) : 1668.1 MB/s
C fill : 4934.5 MB/s (1.5%)
C fill (shuffle within 16 byte blocks) : 1866.3 MB/s
C fill (shuffle within 32 byte blocks) : 1863.8 MB/s (0.5%)
C fill (shuffle within 64 byte blocks) : 1944.8 MB/s (0.2%)

standard memcpy : 2417.6 MB/s (0.3%)
standard memset : 4937.1 MB/s (1.0%)

NEON read : 3392.9 MB/s
NEON read prefetched (32 bytes step) : 4284.9 MB/s
NEON read prefetched (64 bytes step) : 4296.6 MB/s
NEON read 2 data streams : 3463.8 MB/s
NEON read 2 data streams prefetched (32 bytes step) : 4440.4 MB/s
NEON read 2 data streams prefetched (64 bytes step) : 4446.9 MB/s
NEON copy : 2627.7 MB/s
NEON copy prefetched (32 bytes step) : 2923.8 MB/s
NEON copy prefetched (64 bytes step) : 2912.8 MB/s
NEON unrolled copy : 2254.9 MB/s
NEON unrolled copy prefetched (32 bytes step) : 3251.8 MB/s (3.9%)
NEON unrolled copy prefetched (64 bytes step) : 3269.6 MB/s
NEON copy backwards : 1221.1 MB/s
NEON copy backwards prefetched (32 bytes step) : 1430.2 MB/s
NEON copy backwards prefetched (64 bytes step) : 1430.0 MB/s
NEON 2-pass copy : 2070.6 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 2233.8 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 2238.3 MB/s
NEON unrolled 2-pass copy : 1396.4 MB/s
NEON unrolled 2-pass copy prefetched (32 bytes step) : 1747.0 MB/s
NEON unrolled 2-pass copy prefetched (64 bytes step) : 1763.6 MB/s
NEON fill : 4928.0 MB/s (0.8%)
NEON fill backwards : 1854.0 MB/s
VFP copy : 2446.2 MB/s (0.1%)
VFP 2-pass copy : 1335.8 MB/s
ARM fill (STRD) : 4945.0 MB/s (0.8%)
ARM fill (STM with 8 registers) : 4949.8 MB/s (0.3%)
ARM fill (STM with 4 registers) : 4946.5 MB/s (0.2%)
ARM copy prefetched (incr pld) : 2949.2 MB/s (0.7%)
ARM copy prefetched (wrap pld) : 2781.6 MB/s
ARM 2-pass copy prefetched (incr pld) : 1662.6 MB/s
ARM 2-pass copy prefetched (wrap pld) : 1632.9 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 4.4 ns / 6.7 ns
131072 : 6.7 ns / 8.9 ns
262144 : 9.6 ns / 11.8 ns
524288 : 11.0 ns / 13.5 ns
1048576 : 11.9 ns / 14.5 ns
2097152 : 23.2 ns / 30.4 ns
4194304 : 95.2 ns / 143.5 ns
8388608 : 133.8 ns / 182.9 ns
16777216 : 153.6 ns / 199.6 ns
33554432 : 168.9 ns / 218.9 ns
67108864 : 178.4 ns / 231.6 ns

Tests should always be run 5-10 times on the same configuration, extreme outliers should be discarded, and then the average scores of all runs should be compared, just to eliminate false positives.
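The drop-the-outliers-and-average step fits in a one-liner. The sketch below (my own, with made-up run times) sorts the values and discards the min and max before averaging:

```shell
# Drop min and max, then average the remaining run times (values are made up)
printf '%s\n' 18.04 18.05 25.00 18.03 18.06 | sort -n |
  awk '{v[NR]=$1} END {s=0; for (i=2; i<NR; i++) s+=v[i]; printf "%.3f\n", s/(NR-2)}'
```

In practice you would pipe in the "total time" values collected from repeated sysbench runs.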

Below are my test results with thermal throttling disabled:

hikey960:/data # ./taskset 0xf ./sysbench --test=cpu --cpu-max-prime=20000 r>
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 15.0594s
total number of events: 10000
total time taken by event execution: 60.1927
per-request statistics:
min: 6.00ms
avg: 6.02ms
max: 33.73ms
approx. 95 percentile: 6.01ms

Threads fairness:
events (avg/stddev): 2500.0000/1.87
execution time (avg/stddev): 15.0482/0.01

hikey960:/data # ./taskset 0xf0 ./sysbench --test=cpu --cpu-max-prime=20000 ru>
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
total time: 9.0935s
total number of events: 10000
total time taken by event execution: 36.3460
per-request statistics:
min: 3.46ms
avg: 3.63ms
max: 32.57ms
approx. 95 percentile: 4.22ms

Threads fairness:
events (avg/stddev): 2500.0000/1.87
execution time (avg/stddev): 9.0865/0.00

@leo-yan
I get better results

Please also provide tinymembench numbers (to check the L2 cache – background information: http://www.cnx-software.com/2017/04/26/96boards-compliant-hikey-960-arm-cortex-a73-development-board-is-now-available-for-239/#comment-543357)

The most interesting test for me personally would be using taskset to run sysbench on 1, 2 or 3 A73 cores simultaneously, to get an idea of how dynamic cpufreq/dvfs scaling works here (details on why are also above).
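The sweep I mean looks like this; the loop below only prints the commands (big cores assumed to be 4-7, as elsewhere in this thread), so you can review them before removing the leading echo to actually run the benchmarks:

```shell
# Print the sysbench sweep over 1..4 big cores (cores 4-7 assumed).
# Remove the leading 'echo' to execute for real.
for n in 1 2 3 4; do
  echo taskset -c "4-$((3 + n))" sysbench --test=cpu --cpu-max-prime=20000 run --num-threads="$n"
done
```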

And one of the first things I do when fiddling around with a new platform/kernel is this:

find /sys \( -iname "*clock*" -o -iname "*freq*" \)

to get monitoring sources for that.

@xlazom00 , @leo-yan
Could you please describe your test environment?
Which kernel (git hash) and rootfs are you using?

@domi
Results are pretty much the same for
https://github.com/96boards-hikey/linux/tree/working-android-hikey-linaro-4.4
branch "working-android-hikey-linaro-4.4"
with thermal driver http://paste.ubuntu.com/24963018/
L2 cache controller http://paste.ubuntu.com/24963068/
.config http://paste.ubuntu.com/24963065/

or
https://android.googlesource.com/kernel/hikey-linaro
branch "android-hikey-linaro-4.4"
Only these kernels have working USB!
L2 cache controller http://paste.ubuntu.com/24963030/
and .config http://paste.ubuntu.com/24963034/

Command line without debug:

/opt/workspace/boot/uefi/hikey960/tools-images-hikey960/build-from-source/mkbootimg --kernel $OUT/$BOOT/Image.gz --ramdisk $RAMDISK --cmdline "console=ttyAMA6,115200n8 root=/dev/sdd10 rootwait rw quiet efi=noruntime" --base 0 --tags_offset 0x07a00000 --kernel_offset 0x00080000 --ramdisk_offset 0x07c00000 --output boot.img

/opt/workspace/boot/uefi/hikey960/tools-images-hikey960/build-from-source/mkdtimg -d $OUT/$BOOT/dts/hisilicon/hi3660-hikey960.dtb -s 2048 -c -o dt.img

and as rootfs I used
http://builds.96boards.org/releases/dragonboard410c/linaro/debian/latest/linaro-stretch-developer-qcom-snapdragon-arm64-20170607-246.img.gz

and as ramdisk http://www.mdragon.org/ramdisk.gz

Thanks! This is very helpful.

@tkaiser
taskset -c 1 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 19.6489s
total number of events: 10000
total time taken by event execution: 19.6464
per-request statistics:
min: 1.96ms
avg: 1.96ms
max: 3.11ms
approx. 95 percentile: 1.97ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 19.6464/0.00

taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 18.0456s
total number of events: 10000
total time taken by event execution: 18.0438
per-request statistics:
min: 1.80ms
avg: 1.80ms
max: 1.94ms
approx. 95 percentile: 1.81ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 18.0438/0.00

taskset -c 4-5 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=2
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 18.0461s
total number of events: 10000
total time taken by event execution: 36.0758
per-request statistics:
min: 1.80ms
avg: 3.61ms
max: 25.82ms
approx. 95 percentile: 17.81ms

Threads fairness:
events (avg/stddev): 5000.0000/2.00
execution time (avg/stddev): 18.0379/0.01

taskset -c 4-6 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=3
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 18.0534s
total number of events: 10000
total time taken by event execution: 54.1261
per-request statistics:
min: 1.80ms
avg: 5.41ms
max: 33.82ms
approx. 95 percentile: 25.81ms

Threads fairness:
events (avg/stddev): 3333.3333/1.89
execution time (avg/stddev): 18.0420/0.01


That took 4.9342s, so now, running only on a single little core, getting 19.6489s (4 times as long) is exactly what should happen, and these are exactly the numbers one can expect from an A53 clocking at 1.8GHz.

That took 4.5521s back then (too long; if the A73 runs at 2.4GHz it has to be 2 seconds or maybe even less). Now you get 18.0456s on one single core as expected, but the same results when running on 2 or 3 big cores, which points to scheduler weirdness (especially when looking at sysbench's increasing avg, max and 'approx. 95 percentile' numbers). It would be interesting to see how the numbers look when you test with taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4 again: whether you also get 18.x seconds, or 4.5. If it's 18 seconds, I assume you are now running a different kernel that behaves totally differently wrt scheduling.
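For reference, the slowdown factors implied by the single-core numbers can be checked with a quick division (values copied from the posts above, awk just does the arithmetic):

```shell
# big core: 18.0456 s now vs 4.5521 s back then; little core: 19.6489 s vs 4.9342 s
awk 'BEGIN { printf "big: %.2fx  little: %.2fx\n", 18.0456/4.5521, 19.6489/4.9342 }'
```

Both come out close to 4x, consistent with the cores running at roughly a quarter of the earlier throughput.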

Anyway, interpreting the results, it looks like the A73 cores currently clock at 1000 MHz max, and there's something seriously wrong with scheduling (it might be fun to install htop and watch what happens on individual cores). On top of that, there might still be an L2 cache problem, or not.
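Besides htop, two /proc/stat snapshots are enough to see what a single core is doing. This sketch inlines two made-up sample lines for cpu4; on the board you would grep cpu4 from /proc/stat twice, a second apart, and feed those lines in instead:

```shell
# Busy % of cpu4 from two /proc/stat snapshots (idle time is field 5).
# The two lines in the here-doc are made-up samples, not real board data.
awk '/^cpu4 / { n++; idle[n] = $5; for (i = 2; i <= NF; i++) tot[n] += $i }
     END { printf "cpu4 busy: %.0f%%\n", 100 * (1 - (idle[2]-idle[1]) / (tot[2]-tot[1])) }' <<'EOF'
cpu4 100 0 50 850 0 0 0 0 0 0
cpu4 190 0 100 910 0 0 0 0 0 0
EOF
```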

To get a better understanding of what's happening, you might want to play with the PMU,

and here is the list of Cortex-A73 perf events:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100048_0002_05_en/jfa1406793332829.html

I will try Linaro's Android kernel to see if I get the same results.

@tkaiser
I don't understand it, but if I change the min/max frequency to any value,
the results are always the same:
taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Test execution summary:
total time: 18.0456s
total number of events: 10000
total time taken by event execution: 18.0439
per-request statistics:
min: 1.80ms
avg: 1.80ms
max: 1.88ms
approx. 95 percentile: 1.81ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 18.0439/0.00

It looks like the kernel is stuck at some fixed value.

@tkaiser
I moved to the 4.9 branch
https://github.com/96boards-hikey/linux/tree/hikey960-v4.9
but this branch doesn't have a working USB driver :frowning:

and
taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4

Test execution summary:
total time: 2.7884s
total number of events: 10000
total time taken by event execution: 11.1175
per-request statistics:
min: 1.08ms
avg: 1.11ms
max: 33.09ms
approx. 95 percentile: 1.09ms

Threads fairness:
events (avg/stddev): 2500.0000/0.71
execution time (avg/stddev): 2.7794/0.01

taskset -c 4 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1

Test execution summary:
total time: 10.8483s
total number of events: 10000
total time taken by event execution: 10.8473
per-request statistics:
min: 1.08ms
avg: 1.08ms
max: 1.10ms
approx. 95 percentile: 1.09ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 10.8473/0.00

x264 encoder benchmark: 29 fps

taskset -c 4 ./tjbench testimages/testimg

Image size: 227 x 149
Decompress --> Frame rate: 782.208770 fps
Throughput: 26.456647 Megapixels/sec

taskset -c 4-7 ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and written          ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

C copy backwards : 9610.6 MB/s (0.6%)
C copy backwards (32 byte blocks) : 9442.9 MB/s
C copy backwards (64 byte blocks) : 9499.4 MB/s
C copy : 9297.4 MB/s
C copy prefetched (32 bytes step) : 9068.2 MB/s (0.1%)
C copy prefetched (64 bytes step) : 9072.6 MB/s (0.1%)
C 2-pass copy : 3678.5 MB/s
C 2-pass copy prefetched (32 bytes step) : 3306.5 MB/s
C 2-pass copy prefetched (64 bytes step) : 3423.4 MB/s (0.1%)
C fill : 14780.5 MB/s (0.2%)
C fill (shuffle within 16 byte blocks) : 14727.6 MB/s
C fill (shuffle within 32 byte blocks) : 14702.7 MB/s
C fill (shuffle within 64 byte blocks) : 14714.5 MB/s

standard memcpy : 9272.2 MB/s
standard memset : 14820.6 MB/s (0.1%)

NEON LDP/STP copy : 9261.7 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 9265.6 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 9261.9 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 8950.4 MB/s (1.3%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 8913.0 MB/s
NEON LD1/ST1 copy : 8738.3 MB/s
NEON STP fill : 14953.9 MB/s
NEON STNP fill : 14959.2 MB/s
ARM LDP/STP copy : 8887.7 MB/s
ARM STP fill : 14942.8 MB/s
ARM STNP fill : 14948.9 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 5.5 ns / 10.1 ns
262144 : 8.9 ns / 14.5 ns
524288 : 11.2 ns / 15.6 ns
1048576 : 12.4 ns / 15.8 ns
2097152 : 13.8 ns / 16.9 ns
4194304 : 86.9 ns / 123.6 ns
8388608 : 125.1 ns / 153.4 ns
16777216 : 143.4 ns / 161.8 ns
33554432 : 152.7 ns / 165.3 ns
67108864 : 158.4 ns / 168.3 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 5.5 ns / 10.1 ns
262144 : 8.2 ns / 14.7 ns
524288 : 9.6 ns / 16.9 ns
1048576 : 10.3 ns / 18.0 ns
2097152 : 11.4 ns / 19.7 ns
4194304 : 83.0 ns / 119.5 ns
8388608 : 119.2 ns / 144.8 ns
16777216 : 137.6 ns / 152.1 ns
33554432 : 146.7 ns / 154.2 ns
67108864 : 152.3 ns / 155.9 ns

@leo-yan
Any chance that we will get working USB in Linaro's 4.9 or 4.12 kernel?
There is a commit with HiKey 960 USB support, but it still isn't working.

There is a new branch
https://github.com/96boards-hikey/linux/tree/tracking-john-hikey960-rebase-4.9
with working USB
and nice performance :slight_smile:

my .config
http://paste.ubuntu.com/24976454/

@xlazom00 It's great you found a working USB driver on the 4.9 branch.

For the 4.12 kernel branch, I was told that when booting the system you need to disconnect the USB cable on the OTG port, otherwise it's unstable. FYI. If you see the USB failure on the 4.12 kernel branch, could you paste the log so we can check it?

That means cpufreq scaling is not working and the CPU cores remain at the same clock speed all the time (set by u-boot or ATF; no idea how it works on this platform).

The results with the other kernel still don’t look good IMO (too slow). How does the output of

find /sys -name time_in_state

look like? There should be 2 files, one with cpufreq statistics for the little cluster and another for the big one. Output would be interesting.
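Once you find those files, turning time_in_state (frequency in kHz, residency in 10 ms units) into percentages is a short awk. The frequencies in the here-doc below are made-up sample data; on the board you would feed in the real file instead:

```shell
# Share of time per frequency step from a cpufreq time_in_state file.
# The here-doc holds made-up sample data, not real HiKey 960 output.
awk '{ t[NR] = $2; f[NR] = $1; s += $2 }
     END { for (i = 1; i <= NR; i++) printf "%s kHz: %.1f%%\n", f[i], 100*t[i]/s }' <<'EOF'
903000 500
1421000 250
1805000 250
EOF
```

If one big-cluster frequency holds nearly 100% of the time regardless of load, that confirms cpufreq scaling isn't working.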