Finding division by zero in a big project

Recently, our big project began crashing on unhandled division by zero. No recent code seems to contain any likely elements so it may be new data sets affecting old code. The problem is the code base is pretty big, and running on an embedded device with no comfortable debug access (debug is done by a lot of printf()s over serial console, there is no gdb for the device and even if there was, the binary compiled with debug symbols wouldn't fit).

The most viable way would likely be to find all the division operations (they are relatively infrequent), and analyze code surrounding each of them to see if any of the divisor variables was left unguarded.

The question is then either how to find all division operations in a big (~200 files, some big) C++ project, or, if you have a better idea how to locate the error, please share it.

Extra info: the project runs on an embedded ARM9 under a small custom Linux distro, cross-compiled with Cygwin/Windows cross tools. The IDE is Eclipse, but there's also Cygwin with all the respective goodies. The thing is, the project is very hardware-specific, and the crashes occur only when running at full capacity, with all the essential interconnected modules active. A restricted "fault mode" where only the bare bones are active doesn't produce them.

I think the most direct step would be to try to catch the unhandled exception and generate a dump, or printf stack information or similar.

Take a look at this question or just search Google for information on exception catching in your particular environment.

By the way, I think that the division could happen as a result of a call to an external library, so it's not 100% sure that you'll find the culprit just by grepping your code.

If I remember right, the ARM9 doesn't have hardware divide so it's going to be implemented in a function call the compiler makes whenever it has to perform a division.

See if your toolset implements the divide by zero handling in the same way as ARM's toolset does (it's likely that it does something at least similar). If so, you can install a handler that gets called when the problem occurs and you can printf() registers and stack so that you can determine where the problem is occurring. A possible similar alternative is that your small Linux distro is throwing a signal you can catch.
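For the signal route, a minimal sketch might look like the following. It assumes the runtime on your device reports the zero divide as SIGFPE, which you would need to confirm. The names g_fpe_seen, on_sigfpe and install_fpe_handler are made up for illustration.

```cpp
#include <signal.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Hypothetical names for illustration only.
static volatile sig_atomic_t g_fpe_seen = 0;

static void on_sigfpe(int signo, siginfo_t* info, void* /*ucontext*/) {
    g_fpe_seen = 1;
    char buf[128];
    // snprintf is not formally async-signal-safe; it is tolerated here as a
    // last-gasp diagnostic. si_addr is only meaningful when the kernel raised
    // the signal for a faulting instruction, not when a library routine
    // called raise(SIGFPE) itself.
    int n = std::snprintf(buf, sizeof buf, "signal %d: si_code=%d si_addr=%p\n",
                          signo, info->si_code, info->si_addr);
    if (n > 0) {
        size_t len = (n < (int)sizeof buf) ? (size_t)n : sizeof buf - 1;
        ssize_t r = write(STDERR_FILENO, buf, len);
        (void)r;
    }
    // A production handler would typically log more context and then _exit(),
    // because returning after a hardware-raised SIGFPE is undefined. This
    // sketch returns so that it can be exercised with raise(SIGFPE).
}

// Install the handler; returns true on success.
bool install_fpe_handler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, NULL) == 0;
}
```

Call install_fpe_handler() early in main(). If the signal turns out to be sent by a library divide routine rather than by the CPU, si_addr will not help and you would have to print the return address or a stack trace instead.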

I'm not sure how you're getting your information that a divide by zero is occurring, but if it's because the runtime is spitting out a message to that effect, you always have the option of finding out where that is handled in the runtime, and replacing it with your own more informative message. However, I'd guess that there's a more 'architected' way to get your code to run (a signal handler or ARM's technique).

Finding all of the divisions shouldn't be hard with a custom grep search. You can easily distinguish that usage from other usages of the / and % characters in C++.

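Also, if you know what you are dividing and the operands are of a class type, you could globally overload the / and % operators with an assertion that reports __FILE__ and __LINE__. (C++ does not allow overloading an operator when both operands are built-in types, so plain int divisions would need a wrapper type or a checked helper function instead.) If using a makefile, it shouldn't be hard to include the custom operator code in all the linked files without touching the code.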
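For plain integer operands, one way to get the same file-and-line report is a checked helper wrapped in a macro. This is only a sketch: checked_div, CHECKED_DIV and g_div_zero_count are hypothetical names, and returning 0 on failure is an arbitrary choice so the device keeps running and logs further hits.

```cpp
#include <cstdio>

// Hypothetical counter so that hits can be inspected after a run.
static int g_div_zero_count = 0;

// Divide num by den, reporting the call site when den is zero.
// Both operands must have the same type for template deduction to succeed.
template <typename T>
T checked_div(T num, T den, const char* file, int line) {
    if (den == T(0)) {
        ++g_div_zero_count;
        std::fprintf(stderr, "division by zero at %s:%d\n", file, line);
        return T(0);  // placeholder result; a real build might abort here
    }
    return num / den;
}

// The macro captures the location of the division, not of the helper.
#define CHECKED_DIV(a, b) checked_div((a), (b), __FILE__, __LINE__)
```

Suspicious divisions such as `x / y` would then be rewritten as `CHECKED_DIV(x, y)`, which does require touching the call sites found by grep.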

You should use this as an excuse to invest in improving the debuggability of your device, for both this problem and future issues. Even if you can't get live debugging, you should be able to find a way to generate and save off core dumps for post-mortem debugging (pinpointing the source of any unhandled exception immediately).
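On Linux, a process can raise its own core-file size limit at startup, roughly as sketched below. This assumes the kernel on the device was built with core dump support and that there is a writable location with enough room for the file. enable_core_dumps is a hypothetical helper name.

```cpp
#include <sys/resource.h>

// Raise the soft core-file size limit to the hard limit for this process.
// Returns true if both the query and the update succeeded.
bool enable_core_dumps() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    lim.rlim_cur = lim.rlim_max;  // the soft limit may not exceed the hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

The resulting core file can be copied off the device and examined on the host with a cross gdb against an unstripped copy of the binary, so the debug symbols never have to fit on the target.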

PC-Lint might help; it's like FindBugs for C++. It is a commercial product, but there is a 30-day money-back guarantee.

Handle the exception.

Usually the exception will be handed a structure that contains the address that caused the exception and other information. You will probably have to become familiar with the microcontroller's datasheet or RTOS manual.

Use the -save-temps option for gcc and find the relevant assembly for division in the generated .s file. If you're lucky it will be something fairly distinctive, possibly even a function call. If it's a function call, you can use weak linking to override it with your own checked version. Otherwise, locating the divisions in the assembly should give you a very good idea where they are in the C/C++ code, and you can instrument them directly.
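As a sketch of the override idea, assuming a GCC ARM EABI toolchain whose integer divide helpers branch to a weak __aeabi_idiv0 when the divisor is zero (older toolchains may use a different symbol such as __div0, so check the .s output first), a replacement could record where it was reached from. g_last_div0_caller and g_div0_hits are hypothetical names.

```cpp
#include <cstdio>

// Hypothetical bookkeeping for illustration.
static void* g_last_div0_caller = 0;
static int g_div0_hits = 0;

// Assumption: the runtime's divide helper reaches this symbol on a zero
// divisor, and a definition in the application replaces the weak library one.
// Whether the return address points at the original call site or inside the
// divide helper depends on how the helper gets here.
extern "C" int __aeabi_idiv0(int return_value) {
    g_last_div0_caller = __builtin_return_address(0);  // GCC/Clang builtin
    ++g_div0_hits;
    std::fprintf(stderr, "divide by zero, return address %p\n",
                 g_last_div0_caller);
    return return_value;  // hand back the value the helper passed in
}
```

The printed address can be mapped to a source line on the host with addr2line against an unstripped copy of the binary.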

Usually you can modify or override the divide-by-zero exception handler if you have access to the exception handler routines. In the case of ARM, the division is done by a library routine, and there are mechanisms to inform the user code when a divide by zero occurs.

see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4061.html

I would suggest providing a __rt_raise() as described on the page above.

__rt_raise(2,2) will get called when the divide routine detects a divide by zero, so you can print the LR register and then use addr2line to cross-reference it against the source line.

The only way to find those conditions is the usual:

1. Try to reproduce the problem without looking at the source (as the bug already happened, you should have info on the part of the program that is affected).
2. If found, check the source for this point and fix it; otherwise:
2.1. grep for each / not followed by a * or / (grep "/[^/*]", I think)
2.2. find the conditions for which the code is executed and reproduce it

The exception already has the address of the offending divide-by-zero code. The CPU saves register contents when an exception occurs, including the PC (program counter). Your OS should pass this information along (I assume that is how you know it is a divide by zero). Print the address and go look in your code. If you can print a stack trace, it would be even easier to solve.

Another option would be to check the differences in your version control software between the last known working version and the first non-working version. This should give you a limited change set within which to search for the problem.