WCF timeouts are a nightmare

We have a bunch of WCF services that work almost all of the time, using various bindings, ports, max sizes, etc. The super-frustrating thing about WCF is that when it (rarely) fails, we are powerless to find out why it failed. Sometimes you will get a message that looks like this:

System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '01:00:00'. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

The problem is that the local socket timeout it reports is merely an attempt to be convenient. It may or may not be the cause of the problem. But OK, sometimes networks have issues. No big deal. We can retry or something. But here's the huge problem. On top of failing to tell you precisely which timeout (if any) resulted in the failure ("your server-side receive timeout was exceeded," or something, would be helpful), WCF seems to have two types of timeouts.

Timeout Type #1) A timeout that, if increased, would increase the chance of your operation's success. Say the pertinent timeout is an hour and you are uploading a huge file that will take an hour and twenty minutes. It fails. You increase the timeout, it succeeds. I have no problem with this type of timeout.

Timeout Type #2) A timeout which merely defines how long you have to wait for the service to actually fail and give you an error, but modifying the value of this timeout has no impact on the chance of success. Basically, something happens during the first second of the service request which mucks things up. It will never recover. WCF doesn't magically retry the network connection for you. Fine, sometimes establishing a network connection doesn't go well. But, if your timeout is 2 hours, you have to wait 2 whole hours with no chance of it ever working before it finally acknowledges that it didn't work and gives you the error.

But the error you see in both cases looks the same. With timeout Type #2, it still looks like you are running into a timeout. But, you could increase all of your timeouts to 4 years, and all it would do is make it take 4 years to get an error message. I know that Type #2 exists because I can do an operation that is known to complete in less than a minute when successful, and have it take 2 hours to fail. But, if I kill it and retry, it succeeds quickly. (If you are wondering why there might be a 2 hour timeout on an operation that takes less than a minute, there are times I run the operation with a much larger file and it could take over an hour.)

So, to combat the problem with Type #2, you'd want your timeout to be really quick so you immediately know if there is a problem. Then you can retry. But the insurmountable problem is that because I don't know which timeouts are the cause of failure, I don't know what timeouts are Type #1 and which ones are Type #2. There may be one timeout (let's say the client-side send timeout) that acts like Type #1 in some cases and Type #2 in others. I have no idea, and I have no way of finding out.

Does anyone know how to track down Type #2 timeouts so I can set them to low values without having to shorten actual (read: Type #1) timeouts and lower the chance of success?

Thank you.

Clarification of Type #2 timeouts in response to Andrew Anderson's comment:

My belief is that something goes wrong between the client request and the code starting to execute on the server. In all cases where we have the server code report partial progress, it has never completed part of the operation without completing the whole thing. So, the server code never gets to execute, and how long it would take to execute is irrelevant (other than that it affects what we set our timeout values to in the first place in order to accommodate it).

I always put a "heartbeat" message in my long-running WCF services. Then you can set Type #1 timeouts to a low value (2-3 times the heartbeat call frequency), and Type #2 timeouts become obvious.
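As a rough illustration of "2-3 times the heartbeat call frequency", here is what the binding timeouts might look like if the client sent a heartbeat every 20 seconds. This is only a sketch: the use of netTcpBinding is an assumption based on the socket error quoted in the question, and the binding name and every value are hypothetical. As far as I know, sendTimeout on the client bounds a whole request/reply call, while receiveTimeout acts as the idle timeout for the session, so both can be short once no single call is expected to run long.

```xml
<!-- Sketch only: binding name and all values are hypothetical.
     Assumes a heartbeat call roughly every 20 seconds. -->
<system.serviceModel>
  <bindings>
    <netTcpBinding>
      <binding name="heartbeatBinding"
               openTimeout="00:00:30"
               closeTimeout="00:00:30"
               sendTimeout="00:01:00"
               receiveTimeout="00:01:00" />
    </netTcpBinding>
  </bindings>
</system.serviceModel>
```

With values like these, a connection that went bad in the first second surfaces as an error within about a minute instead of after the full two hours.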

To learn which particular timeout has caused a timeout or other error, configure and use tracing.
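As a minimal sketch, the fragment below turns on WCF tracing from app.config or web.config. The listener name and log path are placeholders rather than values from this thread. The resulting .svclog file can be opened with the Service Trace Viewer (SvcTraceViewer.exe).

```xml
<!-- Sketch only: listener name and log path are placeholders. -->
<configuration>
  <system.diagnostics>
    <sources>
      <!-- System.ServiceModel is the trace source for WCF itself -->
      <source name="System.ServiceModel"
              switchValue="Information, ActivityTracing"
              propagateActivity="true">
        <listeners>
          <add name="traceListener"
               type="System.Diagnostics.XmlWriterTraceListener"
               initializeData="c:\logs\Traces.svclog" />
        </listeners>
      </source>
    </sources>
  </system.diagnostics>
</configuration>
```

Enabling this on both the client and the service makes it easier to see which side gave up first and whether the request ever reached the server code.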

I had the same problem, and it turned out to be caused by bad hardware. It was really difficult to debug: even with Wireshark (a packet sniffer), the packets didn't show any particular errors. We found some TCP retries, which could have been a symptom, but the packets were actually just getting stuck somewhere inside the modem/router, a telecom modem (Pirelli Gate 2 Plus). After we replaced the modem/router, the problem disappeared completely.

We also found that wsHttpBinding over HTTP is more reliable for an internet connection you don't control, where you can't be sure what hardware is installed on site.

Hope this helps someone else too :)

Make sure you are correctly handling service exceptions. If exceptions are not handled correctly, you will often see connections drop for no apparent reason. When exceptions do occur and are handled correctly, you normally get more useful information:

https://msdn.microsoft.com/en-us/library/ms733721(v=vs.110).aspx

Also, use a "Heartbeat" or regular ping method that you can call from the client. I have found that clients' routers have an automatic timeout built in for TCP connections, which they use to end idle connections. Without the heartbeat method, the client's router might prematurely end a connection, and the WCF service settings have no effect on that.