APQ8096: Need to port patches for random boot failure issue, from kernel version 5.7.19 to 4.14.96

Hiren · December 7, 2020, 10:24am

Hi,

We have a frequency stability issue on kernel 4.14.96; we previously opened the thread “Frequency switching stability issues on Q820” about it. After applying the changes suggested by @Loic in his loic.poulain/linux.git branch, our cpufreq-bench issue was resolved, but there is still a random boot failure.

We have tested kernel version 5.7.19, and the random boot hang appears to be resolved there (we haven’t tested cpufreq-bench yet). We ran a reboot test for approximately 70 hours and did not observe a boot failure.

Our constraint is that we have already released our product on the 4.14.96 kernel and don’t want to upgrade the whole kernel at this stage. So we would like to find a way to port the changes that solved this random boot failure. It would be really helpful to know the specific patches or the branch in which this issue was resolved.

Thanks,
Hiren

doitright · December 7, 2020, 1:55pm

I looked into backporting the changes some time ago, and the conclusion I came to was that it wasn’t feasible – much less work involved in moving forward to 5.4, which is an LTS release and free of the boot failure problem.

Any particular reason it would be a huge challenge for you to use a newer kernel?

Loic · December 12, 2020, 9:00am

Hi @Hiren, unfortunately I don’t know which change fixed the boot issue; finding it would require some git bisecting… I would suggest moving to 5.4 or later. Another solution (a workaround) would be to disable ramdump/download mode, which prevents the board from rebooting normally when the boot issue occurs. This requires a tiny change in the XBL bootloader (closed source) to discard boot_dload mode.

Hiren · December 14, 2020, 5:00am

Hi, sorry for the late response. @doitright, we are checking with our customer about using the latest kernel and why they don’t want to upgrade.

@Loic, we have already given our customer the workaround of bypassing ramdump mode, but they also want this issue fixed in the 4.14.96 kernel, so we are looking for patches that could resolve it.

Hiren · February 17, 2021, 10:23am

Hi @Loic, @doitright,

Our customer has suggested the changes below. After applying them to kernel version 4.14.96, we have not observed the random boot failure in 24 hours of testing. We will test more to make sure that these changes resolve the random boot failure.

diff --git a/drivers/clk/qcom/clk-cpu-8996.c b/drivers/clk/qcom/clk-cpu-8996.c
index 9a43bdf..0e8823c 100644
--- a/drivers/clk/qcom/clk-cpu-8996.c
+++ b/drivers/clk/qcom/clk-cpu-8996.c
@@ -281,7 +281,7 @@ int cpu_clk_notifier_cb(struct notifier_block *nb, unsigned long event,
                              DIV_2_INDEX);
        else
            ret = clk_cpu_8996_mux_set_parent(&cpuclk->clkr.hw,
-                             ACD_INDEX);
+                             PLL_INDEX);
        break;
    default:
        ret = 0;
@@ -345,7 +345,8 @@ static struct clk_cpu_8996_mux pwrcl_pmux = {
        },
        .num_parents = 4,
        .ops = &clk_cpu_8996_mux_ops,
-       .flags = CLK_SET_RATE_PARENT | CLK_IGNORE_UNUSED,
+       /* CPU clock is critical and should never be gated */
+       .flags = CLK_SET_RATE_PARENT | CLK_IS_CRITICAL,
    },
 };
 
@@ -366,7 +367,8 @@ static struct clk_cpu_8996_mux perfcl_pmux = {
        },
        .num_parents = 4,
        .ops = &clk_cpu_8996_mux_ops,
-       .flags = CLK_SET_RATE_PARENT | CLK_IGNORE_UNUSED,
+       /* CPU clock is critical and should never be gated */
+       .flags = CLK_SET_RATE_PARENT | CLK_IS_CRITICAL,
    },
 };

Could someone please advise whether we can go with this change, and whether these changes have any side effects?

Thanks,
Hiren

Loic · February 19, 2021, 7:15am

The changes seem fine; this is what is used in the mainline clk-cpu-8996 driver, except for the PLL_INDEX change. So yes, please confirm whether you have stable boot sequences. If so, the PLL change should probably be submitted upstream.
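To make the flag part of the diff concrete, here is a minimal userspace sketch of the difference between CLK_IGNORE_UNUSED and CLK_IS_CRITICAL. It is a toy model, not kernel code: the `toy_clk` struct and the `toy_clk_*` helpers are hypothetical, and the flag bit positions are assumed to mirror `include/linux/clk-provider.h`. The idea it illustrates is that CLK_IGNORE_UNUSED only exempts a clock from the late-boot sweep of unused clocks, whereas CLK_IS_CRITICAL makes the framework hold its own enable reference from registration onward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag bit positions: assumed to mirror include/linux/clk-provider.h. */
#define CLK_SET_RATE_PARENT (1UL << 2)
#define CLK_IGNORE_UNUSED   (1UL << 3)
#define CLK_IS_CRITICAL     (1UL << 11)

/* Hypothetical stand-in for the framework's per-clock bookkeeping. */
struct toy_clk {
    unsigned long flags;
    unsigned int enable_count;
    bool gated;
};

/* Registration: a critical clock gets a permanent reference up front. */
void toy_clk_register(struct toy_clk *clk, unsigned long flags)
{
    assert(clk != NULL);
    clk->flags = flags;
    clk->gated = false; /* assume the bootloader left the clock running */
    clk->enable_count = (flags & CLK_IS_CRITICAL) ? 1 : 0;
}

void toy_clk_enable(struct toy_clk *clk)
{
    assert(clk != NULL);
    clk->enable_count++;
    clk->gated = false;
}

/* Dropping the last reference gates the clock. */
void toy_clk_disable(struct toy_clk *clk)
{
    assert(clk != NULL);
    if (clk->enable_count == 0)
        return;
    if (--clk->enable_count == 0)
        clk->gated = true;
}

/* Late-boot sweep: gates unreferenced clocks unless told to ignore them. */
void toy_clk_disable_unused(struct toy_clk *clk)
{
    assert(clk != NULL);
    if (clk->enable_count == 0 && !(clk->flags & CLK_IGNORE_UNUSED))
        clk->gated = true;
}
```

In this model, a clock registered with CLK_IGNORE_UNUSED survives `toy_clk_disable_unused()` but is gated as soon as a consumer enables and then disables it, while a clock registered with CLK_IS_CRITICAL keeps an enable count of at least 1 and is never gated. Whether that distinction is what fixed the boot failure on this board is not established in the thread.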

Hiren · February 22, 2021, 12:45pm

Hi @Loic,

It seems that after the PLL_INDEX change, we have stable boot sequences. I have tested it for more than 24 hours. I did observe a crash after 24 hours or more, but it occurred at a different point, after boot had completed, rather than after the cpufreq log where it usually occurs, so it may be a separate issue.

I am trying to understand more about ACD_INDEX and PLL_INDEX. I have seen the Patchwork entry “[v6,07/14] clk: qcom: Add ACD path to CPU clock driver for msm8996”, which adds an ACD path for the CPU clock. As I understand it, when ACD_INDEX is used, ACD provides PLL/2 whenever a voltage droop is detected, and otherwise provides the PLL_EARLY clock. Please let me know whether my understanding is correct; it would be really helpful to get more information about the difference between PLL_INDEX and ACD_INDEX.
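The understanding described above can be written down as a small C sketch. It only encodes the behaviour as stated in this post, which has not been confirmed in the thread, and it is not taken from the driver: the function `toy_mux_output_hz`, the `droop_detected` input, and the numeric values of the parent indices are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Parent names come from the diff in this thread; the values are assumed. */
enum toy_mux_parent {
    DIV_2_INDEX = 0,
    PLL_INDEX = 1,
    ACD_INDEX = 2,
};

/*
 * Hypothetical model of the mux output frequency.
 * pll_hz is the undivided PLL output (called PLL_EARLY in the post).
 */
unsigned long toy_mux_output_hz(enum toy_mux_parent parent,
                                unsigned long pll_hz,
                                bool droop_detected)
{
    assert(pll_hz > 0);

    switch (parent) {
    case DIV_2_INDEX:
        return pll_hz / 2;      /* always the divided clock */
    case PLL_INDEX:
        return pll_hz;          /* always the full PLL clock */
    case ACD_INDEX:
        /* As understood in the post: drop to PLL/2 only during a droop. */
        return droop_detected ? pll_hz / 2 : pll_hz;
    }
    return 0;                   /* parent not covered by this sketch */
}
```

Under this model, PLL_INDEX and ACD_INDEX give the same frequency in steady state and differ only while a droop is flagged, for example `toy_mux_output_hz(ACD_INDEX, 1593600000UL, true)` yields half of what `toy_mux_output_hz(PLL_INDEX, 1593600000UL, true)` does.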

One more thing: in the clk-cpu-8996.c driver of kernel v4.14.96, the following comment appears:

ACD stands for Adaptive Clock Distribution and is used to
detect voltage droops. We do not add support for ACD as yet.

In the clk-cpu-8996.c driver of kernel v5.10.7, however, the comment reads:

ACD stands for Adaptive Clock Distribution and is used to
detect voltage droops.

So does that mean that ACD works only with the latest kernel and not in 4.14.96?

Thanks,
Hiren

tamo2 · July 2, 2021, 10:52pm

Hi @Hiren,
I am curious what happened to your patches. Is your patched 4.14 kernel working fine without any issues?
Thanks,
tamo2

Hiren · July 5, 2021, 6:10am

Hello @tamo2,

These changes seem to have resolved our random boot failure. I have observed a crash after a long run, but I think it is due to some other failure, as it occurred at a different point, after boot-up.

Thanks,
Hiren