I am currently writing a Windows service that runs entirely in the background and does something every day. The service should be very stable: if something goes wrong, it should not stop, but log the exception and try again the next day. Can you suggest any best practices for making truly stable Windows services?
I have read Scott Hanselman's article on exception handling best practices, where he writes that there are only a few cases in which you should swallow an exception. I suspect a Windows service is one of those few cases, but I would be happy to get some confirmation of that.
'Swallowing' an exception is different to 'abandoning a specific task without stopping the entire process'. In our windows service, we catch exceptions, log their details, then gracefully degrade that task and wait for the next task. We can then use the log to troubleshoot the error while the server is still running.
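As a rough illustration of "abandon the task, keep the process", here is a minimal sketch in Python (chosen only for brevity; the same shape applies to a .NET service). The names `run_tasks` and `daily_service` are hypothetical and not taken from our actual service.

```python
import logging

logger = logging.getLogger("daily_service")  # hypothetical logger name


def run_tasks(tasks):
    """Run each (name, callable) pair; a failing task is logged and abandoned.

    Returns the names of the tasks that failed so the caller can report them.
    """
    failed = []
    for name, task in tasks:
        try:
            task()
        except Exception:
            # Not swallowed silently: the full traceback goes to the log,
            # then we move on to the next task instead of stopping.
            logger.exception("Task %r failed; abandoning it until the next run", name)
            failed.append(name)
    return failed
```

The key point is that the `try` wraps one task, not the whole loop, so a failure costs you that task only.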
The question you should be asking is whether your Windows service should be fault tolerant. Remember that any unhandled exception will bring the service down, which makes it immediately unavailable. How do you think your service should behave? Should it try to continue servicing whatever it needs to? Should it be terminated?
Actually, if you have an unexpected exception that is passed all the way to the top level of your service, you should not continue processing; log it and propagate it. If you truly need a "reliable" service, then you'll need a "watchdog" that restarts the original service when it exits.
Note that modern operating systems act as a watchdog, so you don't need a watchdog service in most cases (check out the "Recovery" tab under your Service properties). Historically, critical services would have a second "watchdog" service whose sole purpose is to restart the real service if it fails.
It sounds like your design may be able to make use of the scheduler; just let Windows take care of the "once a day" part and just have your service do the task a single time. If it fails, fine; Windows is responsible for starting it again the next day.
One final note: this level of reliability in a service is rarely needed. In commercial code, I've only seen it used in a couple of antivirus programs and a network filtering program (that had to be running or else all network communication would fail). I've done a couple "watchdog" programs myself, but these were for customers like auto companies who would lose tons of money when their assembly line systems went down. In addition to the software watchdog, these systems also had redundant power supplies, RAIDed hot-swappable hard drives, and a complete duplicate of the entire system for use as an automatic failover.
Just saying: you may want to reconsider how much you really need to increase reliability (keeping in mind that 100% reliability is impossible; it can only be approached, at exponential cost).
In my opinion, you should establish a strong distinction between unrecoverable and recoverable exceptions, i.e., exceptions that prevent the continuation of your service (for example, if your "static" data structures are corrupted) and exceptions that only cause the current operation to fail. To make the distinction clear, you may want separate exception class hierarchies.
This distinction should go along with a strong distinction between the structures of the "supervisor" part of the service (the one that schedules the periodic action) and the part of the service that actually does such periodic action. In case of a recoverable exception, you could abort the running operation and completely reset this last part, obviously logging all the details of the exception to the system event log; on the other hand, if you got an unrecoverable error (supervisor's structures in an inconsistent state and SEH exceptions, of course) you should just log your error and exit, since continuing running in an inconsistent state is much more dangerous than not running at all.
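To make the idea concrete, here is a minimal sketch of the two hierarchies and a supervisor step, written in Python purely for illustration. All class and function names (`ServiceError`, `RecoverableError`, `UnrecoverableError`, `supervise_once`) are hypothetical, and treating every unknown exception as unrecoverable is an assumption of this sketch.

```python
import logging

logger = logging.getLogger("daily_service")  # hypothetical logger name


class ServiceError(Exception):
    """Base class for errors raised by this hypothetical service."""


class RecoverableError(ServiceError):
    """Only the current operation failed; the supervisor may carry on."""


class UnrecoverableError(ServiceError):
    """Supervisor state can no longer be trusted; the process should exit."""


def supervise_once(operation):
    """Run one periodic operation; return True if the service may keep running."""
    try:
        operation()
    except RecoverableError:
        # Abort this run only; the worker part is rebuilt on the next run.
        logger.exception("Operation failed; will try again at the next scheduled run")
        return True
    except Exception:
        # UnrecoverableError and anything unexpected: log, then let it
        # propagate so the process exits instead of running inconsistently.
        logger.critical("Unrecoverable error; exiting", exc_info=True)
        raise
    return True
```

The supervisor never inspects the worker's internals; it only decides, from the exception type, whether the next scheduled run is still safe.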
Swallowing exceptions is rarely a good idea and as Scott says in his article, there really are only a few valid cases where it might be the best option.
My advice would be, first, to know which exceptions you expect and to catch those specifically. It will be more useful to you in the future if you know what you're catching, rather than catching the generic (Exception e).
Once you've caught the exception, handle it as you stated above: write it to a logging service, perhaps email the details to the maintainer of the code, or even fire off another event that sets up a retry of the code, with a limit on the number of attempts before a new message is sent to the code maintainer.
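A bounded retry with a final notification could look roughly like the sketch below (Python, for illustration only). The function name `run_with_retries`, the `notify` callback, and the default of three attempts are all assumptions made for the example; `notify` stands in for whatever email or alerting mechanism you use.

```python
import logging
import time

logger = logging.getLogger("daily_service")  # hypothetical logger name


def run_with_retries(task, notify, max_attempts=3, delay_seconds=0.0):
    """Try `task` up to `max_attempts` times; call `notify` once if all fail.

    Returns True on success, False after the final failed attempt.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True
        except Exception as exc:
            logger.exception("Attempt %d of %d failed", attempt, max_attempts)
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    # Every attempt failed: tell the maintainer once, not once per attempt.
    notify(f"Task failed after {max_attempts} attempts: {last_error!r}")
    return False
```

Every failure is still logged, but the maintainer only hears about it once the retry budget is used up.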
By catching specific exceptions you can do specific things about them. You can also catch the general exception to ensure that exceptions you really didn't expect don't cause a complete system failure.
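For example, a file-reading step might handle the failures it expects individually and keep a general handler as a last line of defence. This is a sketch in Python under assumed names (`process_file` is hypothetical); the ordering matters, since the specific handlers must come before the general one.

```python
import logging

logger = logging.getLogger("daily_service")  # hypothetical logger name


def process_file(path):
    """Return the file's text, or None if it could not be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        # Expected case: nothing was delivered today.
        logger.warning("Input %s is missing; skipping this run", path)
    except PermissionError:
        # Expected case with a specific remedy for the operator.
        logger.error("No permission to read %s; check the service account", path)
    except Exception:
        # Unexpected case: log everything so it can be handled properly later.
        logger.exception("Unexpected failure while processing %s", path)
    return None
```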
Once you know about exceptions you weren't aware of before, these can then be refactored into the next release with a more ideal way of handling them.
Like so many things in software development, rarely does "one size fit all". If you deem it appropriate to swallow the exception with the intention of retrying at a later date, then that's perfectly reasonable. What really matters is that you clean up after yourself, log the failure, and settle on a reasonable retry policy before notifying someone.
The Exception Handling Block of the Enterprise Library may prove useful as you can modify your exception policy within config without changing the code.
A service should never stop. There are two classes of errors: errors in the service itself, and errors in data provided to the service. Data errors should be reported, not ignored. These two goals can be accomplished by having the service log errors, by providing a way to transmit error information to the user, and by having the service retry the failed operation after the user (or the programmer, in the case of an error in the service) has corrected what caused the failure. Obviously, the service will have to be stopped, re-installed, and re-started if a program error is corrected.