I just ran some x264 benchmarks:
same input video, same x264 git version.
I get 21 fps on the HiKey 960 (both 64-bit and 32-bit x264) vs 19 fps on the Odroid XU4 (32-bit x264).
I disabled the small cores.
The HiKey runs at 2.4 GHz.
How is this possible?
I also ran some NEON benchmarks from
http://www.roylongbottom.org.uk/Raspberry_Pi_Benchmarks.zip
and the 960 isn't a big winner there either.
I ran all benchmarks with a big heatsink.
And here is a libjpeg-turbo benchmark:
./tjbench testimages/testimgari.jpg
HiKey 960
Image size: 227 x 149
Decompress --> Frame rate: 458.435876 fps
Throughput: 15.505677 Megapixels/sec
Odroid XU4
Image size: 227 x 149
Decompress --> Frame rate: 503.725179 fps
Throughput: 17.037497 Megapixels/sec
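As a quick sanity check on the two tjbench results above, the reported throughput should simply be image width times height times frame rate. The sketch below recomputes it and prints the XU4-to-HiKey ratio; the function name is made up for illustration and the numbers are copied from the output above.

```python
# Sanity check: tjbench throughput = width * height * frames per second.
WIDTH, HEIGHT = 227, 149  # image size reported by tjbench above


def megapixels_per_sec(fps, width=WIDTH, height=HEIGHT):
    """Decompression throughput in megapixels per second."""
    return width * height * fps / 1e6


hikey_fps = 458.435876  # HiKey 960 frame rate from the post above
xu4_fps = 503.725179    # Odroid XU4 frame rate from the post above

print(f"HiKey 960 : {megapixels_per_sec(hikey_fps):.6f} Mpx/s")
print(f"Odroid XU4: {megapixels_per_sec(xu4_fps):.6f} Mpx/s")
print(f"XU4 / HiKey ratio: {xu4_fps / hikey_fps:.3f}")
```

Both recomputed throughputs agree with the tjbench output, and the ratio shows the XU4 ahead by roughly 10% on this small image.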
Thanks for sharing this.
Sorry, I am not familiar with multimedia work. I want to check: can these tests use a hardware codec to boost performance, or do these benchmarks run purely in software (on the CPU)? AFAIK, the GPU has been enabled on the Hikey960, but the codec driver has not been enabled yet.
If this is a CPU bottleneck, it is likely related to two things: one is the scheduler, the other is thermal throttling. If that is the case, you could try the settings below:
echo 99999999 > /sys/class/thermal/thermal_zone0/sustainable_power
echo 0-7 > /dev/cpuset/background/cpus
echo 0-7 > /dev/cpuset/foreground/cpus
echo 0-7 > /dev/cpuset/top-app/cpus
echo 0-7 > /dev/cpuset/system-background/cpus
echo NO_ENERGY_AWARE > /sys/kernel/debug/sched_features
BTW, please note that the boot images and the Android 4.4 kernel were recently updated to fix some power-management features, so you should update your firmware. Otherwise CPU frequency scaling may not work properly at times. It is much more stable with these fixes.
The XU4 is no slouch for an ARM board by any means, but on paper, the Hikey 960 should be roughly 2.4x faster in terms of raw CPU. Thermal can make a big difference (I've often seen fanless ARM boards choking back 20-40% speed), and adding a better heat sink and/or fan should improve your performance considerably. But even taking this into account, something is non-optimal on the Hikey 960: it also underperforms considerably in Geekbench 4.
With echo NO_ENERGY_AWARE > /sys/kernel/debug/sched_features, the kernel from
https://android.googlesource.com/kernel/hikey-linaro
(branch android-hikey-linaro-4.4)
starts to run on all cores.
The kernel from GitHub runs on all cores from the beginning.
x264 is a pure software H.264 encoder, so it is a nice NEON benchmark,
but I chose libjpeg-turbo as an alternative.
I mounted a big heat sink with a fan. I also checked the CPU frequency and it is stable.
Intel cores insert NOP instructions when they overheat. Is it the same on ARM?
How can I check whether the ARM CPU has the same heat problem? From what I understand, the OS will change the CPU frequency.
watch -n1 "cat /sys/class/thermal/thermal_zone0/temp"
gives me 52000 (52 °C?)
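On the 52000 reading: the Linux thermal sysfs interface reports temperatures in millidegrees Celsius, so 52000 does correspond to 52 °C. A minimal sketch for polling it from a script is below; the helper names are invented for illustration, thermal_zone0 is taken from the command above, and the cpufreq path assumes the standard cpufreq sysfs layout is present on the board.

```python
from pathlib import Path

# Zone used in the watch command above; other boards may expose other zones.
ZONE = Path("/sys/class/thermal/thermal_zone0")


def millidegrees_to_celsius(raw):
    """Convert a sysfs 'temp' reading (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def read_zone_temp(zone=ZONE):
    """Read the current temperature of a thermal zone in degrees Celsius."""
    return millidegrees_to_celsius((zone / "temp").read_text())


def read_cpu_freq_khz(cpu):
    """Read the current frequency of one CPU in kHz (assumes cpufreq sysfs)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq")
    return int(path.read_text().strip())


# The value quoted in the post above:
print(millidegrees_to_celsius("52000\n"))
```

Logging read_zone_temp() together with read_cpu_freq_khz(4) during a benchmark run would show whether the frequency drops as the temperature climbs.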
I wasn't aware that the SoC is covered by the RAM module.
@leo-yan feel free to run your own benchmark with libjpeg-turbo.
I posted my results.
I can also post results from a Raspberry Pi 3.
ARM SoCs cap frequencies (CPU/GPU/DDR) to deal with thermal issues. The "NOP" method sounds like idle injection; this is not enabled on the Hikey960, but it is a common technique and can also be applied to ARM CPUs.
Yes, the Hikey960 DDR sits on top of the SoC, which also impacts heat dissipation.
I am a bit curious about the libjpeg-turbo case: it seems its tasks are small, but if the tasks are placed onto SMP cores, that can boost its performance, right?
@Karl_M thanks for the performance testing. There are some settings and kernel patches that boost multi-threaded benchmark performance; until I can work out these patches for the Android kernel branch, you could try the settings below. From previous experience they help with CPU benchmarks:
echo 1 > /proc/sys/kernel/sched_migration_cost_ns
echo 99999999 > /sys/class/thermal/thermal_zone0/sustainable_power
echo 0-7 > /dev/cpuset/background/cpus
echo 0-7 > /dev/cpuset/foreground/cpus
echo 0-7 > /dev/cpuset/top-app/cpus
echo 0-7 > /dev/cpuset/system-background/cpus
@leo-yan
I chose libjpeg-turbo because it is a single-threaded benchmark and I want to compare the single-thread speed of two different cores.
Linaro was also a big contributor to it.
./tjbench has a switch to run it multiple times.
I don't see /dev/cpuset/.
How can I enable the cpuset device?
echo NO_ENERGY_AWARE > /sys/kernel/debug/sched_features - no significant change
echo 99999999 > /sys/class/thermal/thermal_zone0/sustainable_power - no significant change
I hoped that the Cortex-A73 would be a big boost compared to the A15.
For a single-threaded benchmark, you could try the "taskset" command:
On android: taskset 16 benchmark benchmark_arguments
On debian: taskset 0x10 benchmark benchmark_arguments
This binds the task to CPU4 (bit 4 is set in the mask in the commands above). CPU4/5/6/7 are the Cortex-A73 (CA73) CPUs.
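To make the mask arithmetic explicit, here is a small sketch that builds an affinity bitmask from a list of CPU numbers. The function name is hypothetical. Note that util-linux taskset interprets a bare mask as hexadecimal, so the CPU-list form (taskset -c 4) sidesteps any hex-versus-decimal ambiguity.

```python
def cpu_mask(cpus):
    """Build an affinity bitmask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# CPU4 only: bit 4 set.
print(cpu_mask([4]), hex(cpu_mask([4])))

# All four big cores (CPU4 to CPU7).
print(hex(cpu_mask(range(4, 8))))

# All four small cores (CPU0 to CPU3).
print(hex(cpu_mask(range(0, 4))))
```

Bit 4 gives 16 in decimal, which is 0x10 in hexadecimal; the whole big cluster would be 0xf0.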
Please check that the configs below are enabled:
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_SCHEDTUNE=y
Please note that AOSP images enable CPUSETS and mount the cpuset filesystem under /dev. If your system does not do this, you need to mount the cpuset virtual filesystem manually after boot. For details, refer to the doc: https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt; section "1.9 How do I use cpusets ?" describes how to use it.
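One way to check the option list above is to scan the kernel config text for each entry. The sketch below is only an illustration: the function names are made up, and load_running_config() assumes the kernel exposes /proc/config.gz, which is only the case when CONFIG_IKCONFIG_PROC is enabled. Otherwise, pass in the contents of the .config file used for the build.

```python
import gzip

# Options listed in the post above.
REQUIRED = [
    "CONFIG_CGROUPS", "CONFIG_CGROUP_DEBUG", "CONFIG_CGROUP_FREEZER",
    "CONFIG_CGROUP_DEVICE", "CONFIG_CPUSETS", "CONFIG_PROC_PID_CPUSET",
    "CONFIG_CGROUP_CPUACCT", "CONFIG_CGROUP_SCHEDTUNE",
]


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as '# CONFIG_X is not set'
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


def load_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (assumes CONFIG_IKCONFIG_PROC=y)."""
    with gzip.open(path, "rt") as handle:
        return handle.read()


# Made-up sample text, only to show the output format.
sample = "CONFIG_CGROUPS=y\n# CONFIG_CPUSETS is not set\nCONFIG_CGROUP_FREEZER=y\n"
print(missing_options(sample))
```

On the board itself, missing_options(load_running_config()) should return an empty list if everything in the post is enabled.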
After a quick look at arch/arm64/boot/dts/hisilicon/hi3660.dtsi (kernel/hikey-linaro, Git at Google),
I don't find any result when I search for the "cache" keyword.
So I'm wondering whether the A73 cluster's 2 MB L2 cache is correctly set up and used by the system.
If not, this might explain your performance issue.
I looked at ARM's default CPU definitions,
and they do set up the cache,
and then
I updated the dts file and it really went up.
@nobe, thanks a lot for the info.
I am just wondering: the L2 caches are built into the two clusters. Basically, once we enable the CPU SMP bit and CCI coherency for the two clusters, the L2 cache is enabled by default. So the L2 cache is already set up by ARM Trusted Firmware at boot time. If so, why would adding the "cache" node to the DTS impact performance so heavily?
@leo-yan
To be honest, I shouldn't answer this question because it is clearly outside the scope of my understanding.
But maybe this blog post could help to understand what's happening: Get Cache Info in Linux on ARMv8 64-bit Platform
It references this commit from Sudeep Holla: https://android.googlesource.com/kernel/hikey-linaro/+/246246cbde5e840012f853e27630ebb59f409486
Here's a quote from his commit message:
It also implements the shared_cpu_map attribute, which is essential for
enabling both kernel and user-space to discover the system's overall cache
topology.
hope this helps
@xlazom00
Could you please share your new x264 and libjpeg-turbo benchmarks after your device tree change?
Sorry, but it looks like I get the same results after all.
Here are some results anyway:
big cores only, ondemand governor
./tjbench testimages/testimgari.jpg -benchtime 20
Image size: 227 x 149
Decompress --> Frame rate: 469.202742 fps
Throughput: 15.869844 Megapixels/sec
small cores only, ondemand governor
./tjbench testimages/testimgari.jpg -benchtime 20
Image size: 227 x 149
Decompress --> Frame rate: 456.859094 fps
Throughput: 15.452345 Megapixels/sec
big cores only, performance governor, min and max core frequency 2.36 GHz
./tjbench testimages/testimgari.jpg -benchtime 20
Image size: 227 x 149
Decompress --> Frame rate: 469.965926 fps
Throughput: 15.895658 Megapixels/sec
small cores only, performance governor, min and max core frequency 1.84 GHz
./tjbench testimages/testimgari.jpg -benchtime 20
Image size: 227 x 149
Decompress --> Frame rate: 457.087566 fps
Throughput: 15.460073 Megapixels/sec
x264 encoder (fps)
8 threads, all cores enabled, all cores at max frequency: 24.78
8 threads, big cores only: 21.87
8 threads, small cores only: 11.67
6 threads, all cores enabled: 22.27
6 threads, big cores only: 20.96
6 threads, small cores only: 11.32
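To put the x264 numbers above side by side, the sketch below computes two simple ratios per thread count: how much faster the big cluster is than the small one, and how much is gained by enabling all cores instead of only the big ones. It assumes the values are frames per second, as in the first post; the dictionary layout and function name are made up for illustration.

```python
# x264 results from the post above, keyed by (threads, cores enabled).
# Assumed to be frames per second, as in the first post of the thread.
RESULTS = {
    (8, "all"): 24.78, (8, "big"): 21.87, (8, "small"): 11.67,
    (6, "all"): 22.27, (6, "big"): 20.96, (6, "small"): 11.32,
}


def cluster_ratios(results, threads):
    """Return (big / small, all / big) throughput ratios for a thread count."""
    big = results[(threads, "big")]
    small = results[(threads, "small")]
    both = results[(threads, "all")]
    return big / small, both / big


for threads in (8, 6):
    big_vs_small, all_vs_big = cluster_ratios(RESULTS, threads)
    print(f"{threads} threads: big/small = {big_vs_small:.2f}, "
          f"all/big = {all_vs_big:.2f}")
```

In these runs the big cluster is a bit under twice as fast as the small one, while adding the small cores on top of the big ones gains only around 6 to 13%.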
taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 4.5521s
total number of events: 10000
total time taken by event execution: 18.1632
per-request statistics:
min: 1.80ms
avg: 1.82ms
max: 37.82ms
approx. 95 percentile: 1.81ms
Threads fairness:
events (avg/stddev): 2500.0000/5.74
execution time (avg/stddev): 4.5408/0.01
taskset -c 0-3 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 4.9342s
total number of events: 10000
total time taken by event execution: 19.7033
per-request statistics:
min: 1.96ms
avg: 1.97ms
max: 33.98ms
approx. 95 percentile: 1.97ms
Threads fairness:
events (avg/stddev): 2500.0000/4.74
execution time (avg/stddev): 4.9258/0.00
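For comparing the two sysbench runs above, a small parser can pull the total time and event count out of each report and turn them into events per second. This is only a sketch against the output format pasted here; the function names are invented, and the two strings hold just the relevant lines copied from the reports.

```python
import re


def total_time_seconds(report):
    """Extract the 'total time: N.NNNNs' value from a sysbench report."""
    match = re.search(r"total time:\s*([0-9.]+)s", report)
    if match is None:
        raise ValueError("no 'total time' line found")
    return float(match.group(1))


def events_per_second(report):
    """Events completed per second of wall-clock time."""
    match = re.search(r"total number of events:\s*(\d+)", report)
    if match is None:
        raise ValueError("no 'total number of events' line found")
    return int(match.group(1)) / total_time_seconds(report)


# Relevant lines copied from the two runs above.
BIG = "total time: 4.5521s\ntotal number of events: 10000\n"      # CPUs 4-7
LITTLE = "total time: 4.9342s\ntotal number of events: 10000\n"   # CPUs 0-3

big_rate = events_per_second(BIG)
little_rate = events_per_second(LITTLE)
print(f"CPUs 4-7: {big_rate:.1f} events/s")
print(f"CPUs 0-3: {little_rate:.1f} events/s")
print(f"ratio   : {big_rate / little_rate:.3f}")
```

By this measure the run pinned to CPUs 4-7 is only about 8% faster than the run pinned to CPUs 0-3.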
taskset -c 4-7 ./tinymembench
C copy backwards : 6613.6 MB/s (0.6%)
C copy backwards (32 byte blocks) : 6330.2 MB/s
C copy backwards (64 byte blocks) : 6322.5 MB/s
C copy : 6292.1 MB/s
C copy prefetched (32 bytes step) : 6845.9 MB/s
C copy prefetched (64 bytes step) : 6954.4 MB/s
C 2-pass copy : 2507.5 MB/s (0.3%)
C 2-pass copy prefetched (32 bytes step) : 2292.9 MB/s (3.5%)
C 2-pass copy prefetched (64 bytes step) : 2387.8 MB/s (4.8%)
C fill : 8675.2 MB/s
C fill (shuffle within 16 byte blocks) : 8661.5 MB/s
C fill (shuffle within 32 byte blocks) : 8659.0 MB/s
C fill (shuffle within 64 byte blocks) : 8661.8 MB/s
standard memcpy : 6727.1 MB/s (0.1%)
standard memset : 10114.2 MB/s (0.1%)
NEON LDP/STP copy : 6793.4 MB/s (0.2%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 6861.7 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 6973.5 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 6878.4 MB/s (0.3%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 6934.3 MB/s (0.1%)
NEON LD1/ST1 copy : 6576.7 MB/s
NEON STP fill : 10097.2 MB/s
NEON STNP fill : 10067.8 MB/s
ARM LDP/STP copy : 6543.7 MB/s
ARM STP fill : 10106.2 MB/s
ARM STNP fill : 10078.2 MB/s
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.1 ns / 0.1 ns
131072 : 7.4 ns / 13.7 ns
262144 : 12.1 ns / 19.8 ns
524288 : 15.3 ns / 21.3 ns
1048576 : 17.0 ns / 21.4 ns
2097152 : 19.2 ns / 23.4 ns
4194304 : 122.4 ns / 180.2 ns
8388608 : 177.5 ns / 226.9 ns
16777216 : 205.4 ns / 243.7 ns
33554432 : 219.9 ns / 253.8 ns
67108864 : 228.4 ns / 258.4 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.1 ns / 0.1 ns
131072 : 7.4 ns / 13.7 ns
262144 : 11.2 ns / 20.0 ns
524288 : 13.0 ns / 23.0 ns
1048576 : 14.1 ns / 24.5 ns
2097152 : 15.9 ns / 27.3 ns
4194304 : 118.3 ns / 177.0 ns
8388608 : 170.1 ns / 219.2 ns
16777216 : 196.2 ns / 232.1 ns
33554432 : 209.2 ns / 236.7 ns
67108864 : 218.2 ns / 239.3 ns
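Since the thread asks whether the 2 MB L2 is actually in use, one rough way to read the latency table is to look for block sizes where the single random read latency jumps sharply, which usually marks the working set spilling out of a cache level. The sketch below does this for the MADV_NOHUGEPAGE column above. The function name, the factor-of-3 threshold, and the 0.1 ns floor are arbitrary choices for illustration, not anything tinymembench defines.

```python
# (block size in bytes, single random read latency in ns), copied from the
# MADV_NOHUGEPAGE table above.
NOHUGEPAGE = [
    (1024, 0.0), (2048, 0.0), (4096, 0.0), (8192, 0.0), (16384, 0.0),
    (32768, 0.0), (65536, 0.1), (131072, 7.4), (262144, 12.1),
    (524288, 15.3), (1048576, 17.0), (2097152, 19.2), (4194304, 122.4),
    (8388608, 177.5), (16777216, 205.4), (33554432, 219.9),
    (67108864, 228.4),
]


def latency_knees(samples, factor=3.0, floor_ns=0.1):
    """Return block sizes after which latency grows by more than `factor`.

    `floor_ns` avoids dividing by the 0.0 ns entries at small sizes.
    """
    knees = []
    for (size, ns), (_next_size, next_ns) in zip(samples, samples[1:]):
        if next_ns / max(ns, floor_ns) > factor:
            knees.append(size)
    return knees


for size in latency_knees(NOHUGEPAGE):
    print(f"latency jumps after {size // 1024} KiB")
```

With these numbers the jumps fall after 64 KiB and after 2 MiB. The second is consistent with the 2 MB L2 mentioned earlier in the thread being active on CPUs 4-7; the first would fit a 64 KiB L1 data cache, although that size is not stated anywhere in the thread.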