Windows Services -- High availability scenarios and design approach

Let's say I have a standalone Windows service running on a Windows Server machine. How do I make sure it is highly available?

1). What are all the design level guidelines that you can propose?

2). How can I make it highly available in a primary/secondary arrangement, e.g. with the clustering solutions currently available on the market?

3). How should cross-cutting concerns be handled in fail-over scenarios?

If you can think of anything else, please add it here.

Note: The question is only about Windows and Windows services, so please keep answers within that scope :)

To keep the service at least running, you can have the Windows Service Control Manager restart the service automatically if it crashes (see the Recovery tab in the service properties). More details, including a batch script to set these properties, are available here: Restart a windows service if it crashes
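The Recovery tab settings can also be applied from a script with the built-in `sc.exe failure` command. Here is a rough sketch in Python that builds that command line. The service name `MyService`, the one-day reset period and the one-minute restart delays are just placeholder assumptions, not values from the linked article.

```python
import subprocess


def build_sc_failure_args(service_name, reset_seconds=86400,
                          restart_delays_ms=(60000, 60000, 60000)):
    """Build the argument list for `sc.exe failure`.

    reset_seconds: how long without failures before the failure count resets.
    restart_delays_ms: delay before each successive restart attempt.
    """
    if not restart_delays_ms:
        raise ValueError("at least one restart delay is required")
    # sc.exe expects actions as action/delay pairs joined by '/'.
    actions = "/".join(f"restart/{delay}" for delay in restart_delays_ms)
    # sc.exe wants "reset=" and "actions=" followed by the value as a
    # separate token, which is why they are separate list items here.
    return ["sc.exe", "failure", service_name,
            "reset=", str(reset_seconds),
            "actions=", actions]


def configure_recovery(service_name):
    """Apply the recovery settings (Windows only, needs an elevated prompt)."""
    subprocess.run(build_sc_failure_args(service_name), check=True)


if __name__ == "__main__":
    # "MyService" is a hypothetical service name; this only prints the command.
    print(" ".join(build_sc_failure_args("MyService")))
```

Calling `configure_recovery("MyService")` from an elevated session should give the same result as filling in the Recovery tab by hand.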

High availability is more than just keeping the service up from the outside. The service itself needs to be built with high availability in mind (i.e. use good programming practices throughout, choose appropriate data structures, pair every resource acquire with a release), and the whole thing should be stress-tested to ensure that it stays up under expected loads.

For idempotent commands, intermittent failures (such as locked resources) can be tolerated by re-invoking the command a limited number of times. This allows the service to shield the client from the failure (up to a point). The client should also be coded to anticipate failure. It can handle service failure in several ways: logging, prompting the user, retrying X times, or logging a fatal error and exiting are all possible handlers. Which one is right for you depends on your requirements. If the service has "conversation state" and the service fails hard (i.e. the process is restarted), the client should detect and handle this situation, as it usually means the current conversation state has been lost.
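As an illustration of the retry idea, here is a minimal sketch in Python. `TransientError`, `retry_idempotent` and the default attempt count and delay are names and values made up for this example; in a real service you would catch whatever exception your locked-resource case actually raises.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a locked resource)."""


def retry_idempotent(command, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call an idempotent command, retrying on TransientError.

    Only safe for idempotent commands, because a failed attempt may have
    partially executed. The last failure is re-raised so the caller can
    still log it, prompt the user, or exit.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return command()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: let the caller decide what to do
            sleep(delay_seconds)
```

The same helper works on either side: the service can wrap its own resource access with it, and the client can wrap its calls to the service.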

A single machine is going to be vulnerable to hardware failure, so if you are going to use a single machine, ensure it has redundant components. HDDs are particularly prone to failure, so have at least mirrored drives or a RAID array. PSUs are the next weak point, so a redundant PSU is also worthwhile, as is a UPS.

As to clustering, Windows supports service clustering, and manages services using a Network Name rather than individual computer names. This allows your client to connect to whichever machine is running the service instead of a hard-coded name. But unless you take additional measures, this is only resource failover: requests are directed from one instance of the service to another, and conversation state is usually lost. If your services write to a database, that should also be clustered, both for reliability and so that changes are available to the entire cluster and not just the local node.

This is really just the tip of the iceberg, but I hope it gives you ideas to get started on further research.

Microsoft Cluster Service (MSCS)

If you break down the problems you are trying to solve, I think you'll probably come up with a few answers yourself. As Justin mentioned in the comment, there is no one answer. It completely depends on what your service does and how clients use it. You also don't specify any details about the client-server interactivity. HTTP? TCP? UDP? Other?

Here are some things to think about to get you started.

1) What do you do if the service or server goes down?

How about running more than one instance of your service on separate servers?

2) Ok, but now how do the clients know about the multiple services?

You can hard-code the list into each client (not recommended).
You can use DNS round-robin to bounce requests across all of them.
You can use a load-balancing device.
You can have a separate service that knows about all of the other services and can direct clients to available services.

3) So what if one service goes down?

Do the client applications know what to do if the service they are connected to goes down? If not, then they need to be updated to handle that situation.
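To make the client side concrete, here is a small sketch of a client that walks a list of candidate endpoints and uses the first one that answers. The function name `call_with_failover`, the `send_request` callable and the endpoint strings are all assumptions for illustration. How the list is obtained (DNS, a directory service, config) is a separate decision.

```python
def call_with_failover(endpoints, send_request):
    """Try each endpoint in order; return (endpoint, result) from the first success.

    send_request is any callable that takes an endpoint and either returns a
    result or raises ConnectionError/TimeoutError when the service is down.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, send_request(endpoint)
        except (ConnectionError, TimeoutError) as exc:
            # Remember why this instance failed, then move on to the next one.
            errors.append((endpoint, exc))
    raise ConnectionError(f"all endpoints failed: {errors}")
```

Note that this only covers stateless requests. If the service keeps conversation state, the client also has to re-establish that state after switching instances.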

That should give you the basic idea of how to approach high availability. If you provide specific details about your architecture, you will probably get a much better response.

If the service doesn’t expose any interface for client connectivity you could:

Broadcast or expose an “I’m alive” message or signal a database/registry/tcp/whatever that you are alive
Have a second service (monitor) that checks for these “I’m alive” signals and tries to restart the service if it is down
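The monitor half of that could look something like the sketch below. It is only an outline under some assumptions: `HeartbeatMonitor` is a made-up class, the heartbeats are recorded by a direct method call (in practice they would arrive via a database row, registry value, socket, etc.), and `restart` is whatever callable you supply to bring the service back.

```python
import time


class HeartbeatMonitor:
    """Tracks "I'm alive" signals and restarts services that go quiet."""

    def __init__(self, timeout_seconds, restart, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.restart = restart      # callable taking a service name
        self.clock = clock          # injectable so it can be tested
        self.last_seen = {}         # service name -> time of last heartbeat

    def record_heartbeat(self, service_name):
        """Called whenever an "I'm alive" signal is observed."""
        self.last_seen[service_name] = self.clock()

    def check(self):
        """Restart every service whose last heartbeat is older than the timeout."""
        now = self.clock()
        restarted = []
        for name, seen in self.last_seen.items():
            if now - seen > self.timeout_seconds:
                self.restart(name)
                # Give the restarted service a fresh window before re-checking.
                self.last_seen[name] = now
                restarted.append(name)
        return restarted
```

The monitor service would call `check()` on a timer. On Windows the `restart` callable could, for example, shell out to `sc.exe start <name>`. Keep in mind the monitor itself then becomes something that needs to stay up.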

But if you have a client connecting to this service through named pipes/TCP/etc., the client would have to look up, in a database, the address of the machine currently running the service, or you would need something fancier like an intelligent switch to redirect traffic.