I'm working with a single-threaded native C++ application. There is a very hard to reproduce bug that I cannot reproduce locally. I enabled full page heap and debug information in the release executable, and obtained dumps from a client (who has to use the application for many days before the bug appears).
What the client reports: the application hangs and never recovers. It has to be killed from the task manager. What I see from the dumps: the application is stuck in an infinite loop.
The loop is from walking a doubly linked list which has become cyclic. There are signs of memory corruption, in that many data members have strange values, like no matching enumerant, values under 0000FFFF, or the linked list itself being reported as 300 million+ in size, which is not normal.
The only other information I can get from the dumps is that a socket read operation failed with 0 data read. This causes the walking of the (now cyclic) list.
I have several dumps all hanging in the same infinite loop. I've tried to get the allocation stack trace, but !heap -p -a gives me "ReadMemory error for address eeddccee Use `!address eeddccee' to check validity of the address." for all addresses I try.
Currently I'm looking into fixing the level 4 warnings (except I don't know which can be related to this; I have a bunch of C4100, C4511, C4512 which I don't know how to fix, so I'm mostly fixing no-brainers like C4244). DebugDiag did not find anything, except give me a "This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required." on the single thread.
From what I see, my options are fixing more warnings, re-reading the code until something jumps at me or learning something new from here.
Is this really a memory corruption? Why does it hang in the same structure every time? How can I find the cause?
Fixing the warning errors is a good idea - it may help you feel better and will certainly reduce confusion in the build - but it's unlikely to resolve the present issue, so may be better left as an out-of-band task for the future.
Socket read failure with 0 data may imply the socket got closed down. Perhaps you have a timing problem here where socket closedown logic is resulting in concurrent access to some shared data structure that is not properly locked. Take a good look at the socket code to make sure locking is correct and watertight. Make sure that all possible error codes are handled correctly in your sockets API calls (Winsock, presumably?). You can be sure that even the slightest window for concurrent access on a container or "that can't happen" error paths will eventually be hit in your production environment. I know you said the app is single-threaded but Windows has a funny habit of giving you extra threads that you did not start up yourself, for example if you are using DLL services that themselves kick off new threads.
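As a rough illustration of handling every outcome of a socket read, the sketch below classifies the return value of a recv-style call. It relies only on the documented convention that a positive value is a byte count, 0 means the peer closed the connection gracefully, and a negative value (SOCKET_ERROR in Winsock) is an error. The names RecvOutcome and classifyRecv are hypothetical and not taken from the application in question.

```cpp
// Hypothetical helper: names are illustrative, not from the original code.
enum class RecvOutcome { Data, PeerClosed, Error };

// rc is the value returned by a recv-style call on a connected socket.
inline RecvOutcome classifyRecv(int rc) {
    if (rc > 0) {
        return RecvOutcome::Data;        // rc bytes were read
    }
    if (rc == 0) {
        return RecvOutcome::PeerClosed;  // graceful shutdown by the peer
    }
    // Negative: an error. The caller should fetch the error code right away;
    // on a non-blocking socket a "would block" code needs its own branch.
    return RecvOutcome::Error;
}
```

Switching over the returned enum at the call site, with no default branch, lets the compiler warn if a case is left unhandled, which makes the "that can't happen" paths explicit.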
It's hard when you cannot get good production diagnostics, but if you can narrow down the problem to a particular area, try to isolate the failing code in a unit test application that mimics the usage in real life, and stress the heck out of it on your desktop. I have had intermittent bugs like this that took hours to reproduce even under heavy load in a specialized test app. Running in this mode (release build of course) in the debugger may expose the issue more quickly than you would think.
Another option may be to install the Process Dumper on the failing machine and instruct it to dump a full memory image (debuggable as per standard Windbg DMP file) on access violation and process exit. This may provide better information than a minidump postmortem debug. If your client is cooperative they can instruct the dump to be generated when the problem next occurs. This is the closest you can get to a live debug without being on the machine or having remote access to it.
You may want to consider generating extra diagnostics in the socket closedown logic as well to verify whether or not this is the proximate cause of the error condition.
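One cheap form of extra diagnostics is an in-memory trace ring: the closedown code records its last few hundred steps into a fixed array, and because a full memory dump contains all process memory, the array can be inspected post mortem. The sketch below is a minimal single-threaded version. The event names, the capacity of 256, and the TraceRing class itself are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical event codes; the real closedown steps are not known.
enum class TraceEvent : std::uint32_t {
    None = 0, RecvReturnedZero, CloseRequested, NodeUnlinked, NodeFreed
};

struct TraceEntry {
    std::uint64_t sequence;  // running number of the event
    TraceEvent event;        // what happened
    const void* subject;     // object involved, e.g. a list node address
};

// Fixed-size ring that overwrites its oldest entries. Not thread-safe.
class TraceRing {
public:
    void record(TraceEvent e, const void* subject) {
        entries_[next_ % entries_.size()] = TraceEntry{next_, e, subject};
        ++next_;
    }

    // Number of events recorded so far, including overwritten ones.
    std::uint64_t total() const { return next_; }

    // recent(0) is the newest entry. Valid for i < min(total(), capacity).
    const TraceEntry& recent(std::size_t i) const {
        return entries_[(next_ - 1 - i) % entries_.size()];
    }

private:
    std::array<TraceEntry, 256> entries_{};
    std::uint64_t next_ = 0;
};
```

A single global instance that is written to at each step of the socket teardown and each list modification would show, in the next dump, which node was freed or unlinked last before the hang.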
Make sure your client's OS and other system software is up-to-date with all required patches. Maybe this is not even your fault (though it seems likely that you have a bug, to be sure).
If it is some kind of heap corruption, then Application Verifier could help detect that in your own environment.
Set full page heap validation. If your application has any heap overrun or underrun, it will be caught immediately.
If Application Verifier or some other tool does not easily uncover the problem, then it may come down to deducing what could have led to the problem. Focus on a specific issue such as the circular list. What could cause that? The obvious places to look are at all pieces of code that touch the list (it is possible that some random bad memory write could cause it but more often the culprit is closer to the scene of the crime).
If the list is only accessed through well-defined methods, then your job is easier. If it is through a global pointer that everyone can touch, then it is harder but still possible to examine if you search through all references (any good editor can do that). If you find, for example, an error case that maybe doesn't clean up nicely and fill in a back link correctly, then you might be half way there. You then work backwards from there. What could cause that specific error? And so on. Deducing a "possible" chain of events that can lead to a certain situation can often resolve a problem like this (and can make you feel like a magician in the process especially if it is someone else's bug that you find).
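To catch a missing back link close to the code that caused it, a small invariant check can be run after every operation that touches the list. The sketch below assumes a null-terminated doubly linked list with plain prev/next pointers. The Node layout, the ListStatus values, and the checkList function are hypothetical, since the real list type is not shown in the question.

```cpp
#include <cstddef>

// Hypothetical node layout; the application's real node type is unknown.
struct Node {
    Node* prev = nullptr;
    Node* next = nullptr;
};

enum class ListStatus { Ok, HeadHasPrev, BrokenBackLink, TooLongOrCyclic };

// Walks at most maxNodes nodes from head and verifies that every forward
// link is mirrored by the matching back link.
inline ListStatus checkList(const Node* head, std::size_t maxNodes) {
    if (head == nullptr) {
        return ListStatus::Ok;  // an empty list is trivially consistent
    }
    if (head->prev != nullptr) {
        return ListStatus::HeadHasPrev;
    }
    std::size_t count = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (++count > maxNodes) {
            return ListStatus::TooLongOrCyclic;  // bounded, so it cannot hang
        }
        if (n->next != nullptr && n->next->prev != n) {
            return ListStatus::BrokenBackLink;
        }
    }
    return ListStatus::Ok;
}
```

In a list of this shape, a cycle means some node is reachable from two different predecessors, so its prev pointer cannot match both and the walk reports BrokenBackLink. The maxNodes bound is only a safety net that keeps the check itself from looping forever on corrupted data.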
This can be pretty much anything.
If it is heap corruption, try inserting heap checks into the code at strategic places. Make sure your binaries are compiled with the run-time checks that the Visual C++ compiler offers. If possible, obtain a test case from your users. If that is not possible, try to get them to run a debugging binary, or debug the live application yourself. Fixing the warnings is a good idea, though I find most of VC's level 4 warnings less than useful. Sprinkle your code liberally with assert-like checks. Make sure all your pre-conditions and post-conditions are checked. Make sure you are really handling each and every return value of all function calls. Also avoid any questionable practices in code, like using C-style casts and type punning.
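Since the bug only shows up in the release build at the customer site, assert-like checks are only useful if they survive NDEBUG. A minimal sketch of such a check follows. The macro name VERIFY_ALWAYS and the helper verifyFailed are made up for this example, and calling std::abort on failure is just one possible policy (in practice the handler might write a dump first).

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical failure handler: reports the failed expression and stops.
[[noreturn]] inline void verifyFailed(const char* expr, const char* file, int line) {
    std::fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
    std::abort();
}

// Hypothetical macro. Unlike assert, it is not compiled out by NDEBUG,
// and the condition is evaluated exactly once.
#define VERIFY_ALWAYS(cond)                               \
    do {                                                  \
        if (!(cond)) {                                    \
            verifyFailed(#cond, __FILE__, __LINE__);      \
        }                                                 \
    } while (0)
```

Placed at the entry and exit of every function that modifies the list, a check like this turns silent corruption into an immediate stop at the first point where a pre-condition or post-condition no longer holds.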
For some closure to anyone interested: it was a dangling pointer. One year or so after posting the question, the customer changed the server hardware and kindly lent the server to us. I could easily reproduce it with live debugging on that machine and find the issue.