开发者

Embedded systems : last gasp before reboot

开发者 https://www.devze.com 2022-12-19 08:58 出处:网络
When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there\'s not much option if, say, you run out of memory).

When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).

I realize even that can go wrong, so I try to minimize it (by not allocating any memory during the final write, and boosting the write processes priority).

But that relies on someone retrieving the log file. Now I was considering sending a message over the intertubes to report the error before rebooting.

On second thoughts, of course, it would be better to send that message aft开发者_运维知识库er reboot, but it did get me to thinking...

What sort of things ought I be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system which is in an unstable state?


One strategy is to use a section of RAM that is not initialised by during power-on/reboot. That can be used to store data that survives a reboot, and then when your app restarts, early on in the code it can check that memory and see if it contains any useful data. If it does, then write it to a log, or send it over a comms channel.

How to reserve a section of RAM that is non-initialised is platform-dependent, and depends if you're running a full-blown OS (Linux) that manages RAM initialisation or not. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual e.g. .bss) which is not initialised by the C start-up code.

If the data is not initialised, then it will probably contain random data at power-up. To determine whether it contains random data or valid data, use a hash, e.g. CRC-32, to determine its validity. If your processor has a way to tell you if you're in a reboot vs a power-up reset, then you should also use that to decide that the data is invalid after a power-up.


There is no single answer to this. I would start with a Watchdog timer. This reboots the system if things go terribly awry.

Something else to consider - what is not in a log file is also important. If you have routine updates from various tasks/actions logged then you can learn from what is missing.

Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS a possible, shut down peripherals, log as much state info as possible, then reboot!


The one thing you want to make sure you do is to not corrupt data that might legitimately be in flash, so if you try to write information in a crash situation you need to do so carefully and with the knowledge that the system might be an a very bad state so anything you do needs to be done in a way that doesn't make things worse.

Generally, when I detect a crash state I try to spit information out a serial port. A UART driver that's accessible from a crashed state is usually pretty simple - it just needs to be a simple polling driver that writes characters to the transmit data register when the busy bit is clear - a crash handler generally doesn't need to play nice with multitasking, so polling is fine. And it generally doesn't need to worry about incoming data; or at least not needing to worry about incoming data in a fashion that can't be handled by polling. In fact, a crash handler generally cannot expect that multitasking and interrupt handling will be working since the system is screwed up.

I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer usually is responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything, so dump the most important stuff first (do not have the crash handler kick the watchdog - you don't want to have some bug mistakenly prevent the watchdog from resetting the system).

Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash - just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so it can be detected and sent somewhere useful or there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.


I think the most well known example of proper exception handling is a missile self-destruction. The exception was caused by arithmetic overflow in software. There obviously was a lot of tracing/recording media involved because the root cause is known. It was discovered debugged.

So, every embedded design must include 2 features: recording media like your log file and graceful halt, like disabling all timers/interrupts, shutting all ports and sitting in infinite loop or in case of a missile - self-destruction.


Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.

When the system is in an inconsistent state, there is almost nothing you can do reliably and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can, set up the serial port and then emit a stack dump and register contents without requiring extra stack space or clobbering registers.

A simple restart with a dump like that is reasonable because if the problem is transient the restart will resolve the problem and you want to keep it simple and let the device continue. If the problem is not transient you are not going to make forward progress anyway and someone can come along and connect a diagnostic device.

Very interesting paper on failures and recovery: WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT?


For a very simple system, do you have a pin you can wiggle? For example, when you start up configure it to have high output, if things go way south (i.e. watchdog reset pending) then set it to low.


Have you ever considered using a garbage collector ?

And I'm not joking.

If you do dynamic allocation at runtime in embedded systems, why not reserve a mark buffer and mark and sweep when the excrement hits the rotating air blower.

You've probably got the malloc (or whatever) implementation's source, right ?

If you don't have library sources for your embedded system forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).

If you're system is already dead.... who cares how long it takes. It obviously isn't critical that it be running this instant; if it was you couldn't risk "dieing" like this anyway ?

0

精彩评论

暂无评论...
验证码 换一张
取 消