
Forum Post: RE: am335x watchdog disabled by Linux boot

Hi Tor,

Thanks for the patches, they are working! However, when stress testing our kernel, after a few days or weeks of operation we ran into accesses to /sys and /proc starting to oops the kernel. I traced the issue back to this patch, and specifically to enabling the WDIOF_KEEPALIVEPING flag, or possibly to the WDIOC_SETOPTIONS ioctl support. It looked like a scheduling problem, since the oops always happened inside try_to_wake_up while dereferencing an invalid task entry.

After removing this support (see patch) our devices run OK, even under heavy stress testing, and the watchdog still runs through the boot process. We use it with NOWAYOUT as well.

Here is an example script that triggered the kernel oops after 5-8 hours of running continuously. Running multiple instances at once sometimes made it oops faster. After the first oops the script would oops again quickly, so something inside the kernel seems to get corrupted. Only a reboot seemed to fix the issue.

#!/bin/sh
while true; do
        echo default-on > /sys/class/leds/<any>_led/trigger
        echo 1 > /sys/class/leds/<any>_led/brightness
done

== or ==

#!/bin/sh
while true; do
        cat /sys/class/net/eth0/statistics/rx_packets > /dev/null
        echo 0 > /sys/class/gpio/gpio47/value
done


Really, lots of /sys or /proc accesses would trigger this. On our devices it sometimes took weeks to first manifest itself and sometimes showed up within a day. Our app reads from and writes to the /sys filesystem a lot.
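Since running several instances at once sometimes got to the oops sooner, a small launcher along these lines could start them together. This is only a sketch: the function names stress_reads and run_instances are made up for illustration, and the read count is a parameter so the loop terminates instead of running forever like the scripts above.

```shell
#!/bin/sh
# Hypothetical helper: read one file repeatedly, like the loops above.
# $1 = file to read, $2 = number of reads
stress_reads() {
    i=0
    while [ "$i" -lt "$2" ]; do
        cat "$1" > /dev/null || return 1
        i=$((i + 1))
    done
}

# Hypothetical helper: start $1 background copies of stress_reads on
# file $2 with $3 reads each, then wait for all of them.
# Returns non-zero if any copy failed.
run_instances() {
    count=$1
    file=$2
    reads=$3
    pids=""
    n=0
    while [ "$n" -lt "$count" ]; do
        stress_reads "$file" "$reads" &
        pids="$pids $!"
        n=$((n + 1))
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return $rc
}
```

For example, `run_instances 4 /sys/class/net/eth0/statistics/rx_packets 1000000` would run four readers in parallel against the same statistics file used in the second script.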

Of course, this could be a compiler issue or something else. We are using Yocto/Dora with a kernel based on the TI PSP for the am335x, as of d5720d33bc7c434f9a023dbb62c795538f976b7a, with some additional patches from our SOM vendor.

The telltale sign is an oops like this, always in try_to_wake_up and with address e5832000 (on our system):

[94307.627105] Unable to handle kernel paging request at virtual address e5832000
[94307.634735] pgd = cf4e0000
[94307.637573] [e5832000] *pgd=00000000
[94307.641357] Internal error: Oops: 5 [#1]
[94307.645477] Modules linked in: option usb_wwan
[94307.650177] CPU: 0    Not tainted  (3.2.0 #1)
[94307.654785] PC is at try_to_wake_up+0x18/0x8c
[94307.659362] LR is at wake_up_process+0x18/0x1c
[94307.664062] pc : [<c0037688>]    lr : [<c0037714>]    psr: 800f0093
[94307.664062] sp : cf71de90  ip : cf71deb0  fp : cf71deac
[94307.676147] r10: cf4eac00  r9 : cf46e1c0  r8 : cf71df78
[94307.681640] r7 : 00000019  r6 : 00000019  r5 : 800f0013  r4 : e5832000
[94307.688537] r3 : c044fe60  r2 : 00000000  r1 : 0000000f  r0 : e5832000
[94307.695404] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[94307.703002] Control: 10c5387d  Table: 8f4e0019  DAC: 00000015
[94307.709075] Process monit (pid: 1137, stack limit = 0xcf71c2f0)
[94307.715301] Stack: (0xcf71de90 to 0xcf71e000)
[94307.719879] de80:                                     c044fe60 00000019 be952bb4 00000019
[94307.728515] dea0: cf71debc cf71deb0 c0037714 c003767c cf71decc cf71dec0 c044fda0 c0037708
[94307.737152] dec0: cf71dedc cf71ded0 c044fdc8 c044fd84 cf71df24 cf71dee0 c00b75b8 c044fdb0
...
[94307.814727] dfe0: 00000003 be952bb0 000234d8 4b6e7eac 200f0010 00000006 00000000 00000000
[94307.823333] Backtrace:
[94307.825927] [<c0037670>] (try_to_wake_up+0x0/0x8c) from [<c0037714>] (wake_up_process+0x18/0x1c)
[94307.835174]  r6:00000019 r5:be952bb4 r4:00000019 r3:c044fe60
[94307.841186] [<c00376fc>] (wake_up_process+0x0/0x1c) from [<c044fda0>] (__mutex_unlock_slowpath+0x28/0x2c)
[94307.851257] [<c044fd78>] (__mutex_unlock_slowpath+0x0/0x2c) from [<c044fdc8>] (mutex_unlock+0x24/0x28)
[94307.861083] [<c044fda4>] (mutex_unlock+0x0/0x28) from [<c00b75b8>] (seq_read+0x408/0x418)
[94307.869720] [<c00b71b0>] (seq_read+0x0/0x418) from [<c00e1564>] (proc_reg_read+0x44/0x60)
[94307.878326] [<c00e1520>] (proc_reg_read+0x0/0x60) from [<c009d6c0>] (vfs_read+0xb0/0x140)
[94307.886962]  r5:be952bb4 r4:cf46e1c0
[94307.890747] [<c009d610>] (vfs_read+0x0/0x140) from [<c009da78>] (sys_read+0x40/0x74)
[94307.898895]  r8:00000040 r7:be952bb4 r6:cf46e1c0 r5:00000000 r4:00000000
[94307.906005] [<c009da38>] (sys_read+0x0/0x74) from [<c00131c0>] (ret_fast_syscall+0x0/0x30)
[94307.914703]  r8:c0013368 r7:00000003 r6:00086a00 r5:00086bcc r4:000869f0
[94307.921783] Code: e24cb004 e1a04000 e10f5000 f10c0080 (e5900000)
[94307.928283] ---[ end trace ab3aa54f6031918a ]---
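For anyone who wants to check their own logs for this signature, a minimal sketch could look like the following. The function name has_wakeup_oops is made up for illustration, and it only matches the two message strings seen in the trace above. It does not check the faulting address, which will likely differ between systems.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel log on stdin contains both
# a paging-request fault and a PC inside try_to_wake_up.
has_wakeup_oops() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'Unable to handle kernel paging request' &&
    printf '%s\n' "$log" | grep -q 'PC is at try_to_wake_up'
}
```

It could be used as `dmesg | has_wakeup_oops && echo "try_to_wake_up oops seen"`.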

Just a note in case someone else sees an error like this.

Thanks!

Kevin

