This is a baffling one. My ASP.NET 3.5 app that was working fine suddenly started getting timeout errors...
System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding
...but only for requests with the correct username/password in the connection string. If we deliberately messed up the connection string in the config, SQL Server would properly reject the request, but instantly, not after a 15 second wait (configured timeout), clearly indicating it had no problem communicating with the server.
We bounced the SQL Server box (SQL Server 2005, fully patched) and the IIS box, touched the web.config to force app restart, etc. but no luck. All requests would hang for 15 seconds then report that error. At all times we were able to reach the database server directly (management studio, monitoring tools), AND I was able to configure my local copy of the site inside Visual Studio to hit this same database server without incident. And this issue that came on suddenly by itself and lasted most of the day (errors started getting logged at 6:17am) suddenly resolved itself around 4:30pm.
It seemed to be acting as if there was a network routing issue between this one web server and this one database server, but only with a particular set of SQL credentials. I know this doesn't make any sense, but no does any other scenario we can imagine. I'm a very experienced developer and neither I nor our very experienced DBA nor sysadmins can find anything in event logs, monitoring irregularities, etc. that could give a sensible explanation to this bizarre and self-resolving set of symptoms.
This happened on our client's test environment, which was not ideal, but because we don't understand what the hell happened our concern is that this may suddenly pop up in production and we'll be beating our collective heads against the wall, so any ideas or wild theories are very much welcome.
The LINQ-to-SQL generated line of code that's blowing up is ExecuteMethodCall:
[Function(Name="dbo.spSetModelingNodeState")]
public int spSetModelingNodeState([Parameter(Name="NodeIdentifier", DbType="VarChar(60)")] string nodeIdentifier, [Parameter(Name="NodeStatus", DbType="Int")] System.Nullable<int> nodeStatus, [Parameter(Name="PoolWeighting", DbType="Float")] System.Nullable<double> poolWeighting)
{
IExecuteResult result = this.ExecuteMethodCall(this, ((MethodInfo)(MethodInfo.GetCurrentMethod())), nodeIdentifier, nodeStatus, poolWeighting);
return ((int)(result.ReturnValue));
}
and my code that calls that code is (abbreviated):
using (MyDataContext myDataContext = new MyDataContext(_connectionString))
{
myDataContext.spSetModelingNodeState(
Adapter.Identifier, // string
(int)newHealthValue, // enum, cast to int
PoolWeighting); // float
}
EDIT: Took a couple days to track down, but I've added log/Profiler info below, per commentor requests, and the body of the stored procedure in question
The SQL Server logs simply show logins and logouts and no indication of problem. The Profiler trace also has no smoking gun, but the following seems to show a statement taking 18 seconds. Follow spid 82. It begins at 4:37:50, does some things, then spid 83 records a login at 4:38:07. Spid 82 records its completion next, with the same 4:38:07 start time, but because these are logged sequentially, it must have actually occurred somewhere between 4:38:07 (previous trace statement) and 4:40:10 (following statement).
Spid Starttime Event Query 82 4:37:50 PM Login -- network protocol: LPC set quoted_identifier on set arithabort off set numeric_roundabort off set ansi_warnings on set ansi_padding on set ansi_nulls on set concat_null_yields_null on set cursor_close_on_commit off set implicit_transactions off 82 4:37:50 PM SP:StmtStarting EXEC @RETURN_VALUE = [dbo].[spSetModelingNodeState] @NodeIdentifier = @p0, @NodeStatus = @p1, @PoolWeighting = @p2 82 4:37:50 PM SP:Starting EXEC @RETURN_VALUE = [dbo].[spSetModelingNodeState] @NodeIdentifier = @p0, @NodeStatus = @p1, @PoolWeighting = @p2 82 4:37:50 PM SP:StmtStarting UPDATE ModelingNodeState SET NodeStatus = @NodeStatus, PoolWeighting = ISNULL(@PoolWeighting, PoolWeighting) WHERE NodeIdentifier = @NodeIdentifier 83 4:38:07 PM Login -- network protocol: LPC set quoted_identifier on set arithabort off set numeric_roundabort off set ansi_warnings on set ansi_padding on set ansi_nulls on set concat_null_yields_null on set cursor_close_on_commit off set implicit_transactions off 82 4:37:50 PM SP:Completed EXEC @RETURN_VALUE = [dbo].[spSetModelingNodeState] @NodeIdentifier = @p0, @NodeStatus = @p1, @PoolWeighting = @p2 80 4:40:10 PM SP:StmtStarting SELECT 'Server[@Name=' + quotename(CAST(serverproperty(N'Servername') AS sysname),'''') + ']' + '/Database[@Name=' + quotename(db_name(),'''') + ']' + '/Table[@Name=' + quotename(tbl.name,'''') + ' and @Schema=' + quotename(SCHEMA_NAME(tbl.schema_id),'''') + ']' AS [Urn], tbl.name AS [Name], SCHEMA_NAME(tbl.schema_id) AS [Schema], CAST( case when tbl.is_ms_shipped = 1 then 1 when ( select major_id from sys.extended_properties where major_id = tbl.object_id and minor_id = 0 and class = 1 and name = N'microsoft_database_tools_support') is not null then 1 else 0 end AS bit) AS [IsSystemObject], tbl.create_date AS [CreateDate], stbl.name AS [Owner] FROM sys.tables AS tbl INNER JOIN sys.database_principals AS stbl ON stbl.principal_id = ISNULL(tbl.principal_id, (OBJE开发者_JAVA百科CTPROPERTY(tbl.object_id, 'OwnerId'))) WHERE (CAST( case when tbl.is_ms_shipped = 1 then 1 when ( select major_id from sys.extended_properties where major_id = tbl.object_id and minor_id = 0 and class = 1 and name = N'microsoft_database_tools_support') is not null then 1 else 0 end AS bit)=@_msparam_0) ORDER BY [Schema] ASC,[Name] ASC 80 4:40:10 PM SQL:BatchStarting use [master] 62 4:20:10 PM Logout NULL 55 4:20:02 PM Logout NULL 74 4:13:37 PM Logout NULL 59 4:20:10 PM Logout NULL 55 4:40:29 PM Login -- network protocol: TCP/IP set quoted_identifier on set arithabort off set numeric_roundabort off set ansi_warnings on set ansi_padding on set ansi_nulls on set concat_null_yields_null on set cursor_close_on_commit off set implicit_transactions 55 4:40:29 PM SQL:BatchStarting set transaction isolation level read committed set implicit_transactions off 82 4:37:50 PM Logout NULL
This is definitely a deeper level of SQL debugging than I'm accustomed to, so please let me know if I'm misreading this, but it certainly appears like this statement was running without incident, just incredibly slowly. And again, it was running fine beforehand, started timing out, then resumed normal operation.
The only thing I can think of now that I've ruminated in public is that if there were some long-running locks jamming up the tables called by the sproc, that might explain it. Because LINQ-to-SQL somewhat hides the login/logout process, it's possible that the login was always working fine, and it's just this one sproc call that timed out due to a block. Whether those tables were locked, and if so, why, is impossible to say at this point. Does that sound like the most likely explanation for what I saw, or does anyone have another theory?
For completeness, here's the body of the sproc:
/****** Object: StoredProcedure [dbo].[spSetModelingNodeState] Script Date: 10/29/2010 14:37:46 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[spSetModelingNodeState]
(
@NodeIdentifier varchar(60),
@NodeStatus int,
@PoolWeighting float = NULL
)
AS
/*try to update existing row to new state*/
UPDATE
ModelingNodeState
SET
NodeStatus = @NodeStatus,
PoolWeighting = ISNULL(@PoolWeighting, PoolWeighting)
WHERE
NodeIdentifier = @NodeIdentifier
IF @@ROWCOUNT = 0
/*not found, so insert new one*/
INSERT
ModelingNodeState(
NodeIdentifier,
PoolWeighting,
NodeStatus,
LastModelingResult)
VALUES(
@NodeIdentifier,
ISNULL(@PoolWeighting, 1),
0,
NULL)
DECLARE @timestamp datetime
SET @timestamp = CURRENT_TIMESTAMP
/*fill endtime of previous node state*/
UPDATE
ModelingNodeStateLog
SET
EndTime = @timestamp
WHERE
EndTime IS NULL AND
NodeIdentifier = @NodeIdentifier AND
NodeStatus <> @NodeStatus
/*start a new entry in the log (yes, I saw I should remove the IF check and always insert, but that's how it currently is in the db)*/
IF @@ROWCOUNT <> 0
INSERT
ModelingNodeStateLog
(
NodeIdentifier,
NodeStatus,
StartTime
)
VALUES
(
@NodeIdentifier,
@NodeStatus,
@timestamp
)
RETURN
Since nobody else has chimed in, I'm going to assume that my evolved theory of what happened was correct. Specifically, this was due to a lock of as-yet undiagnosed origin on the ModelingNodeState or ModelingNodeStateLog table causing a statement timeout, not a connection timeout. This distinction was obfuscated by LINQ-to-SQL's usually handy management of connections.
This agrees with all observed symptoms:
- Sproc calls with deliberately bad credentials in the connection string were rejected immediately at the connection level
- Sproc calls with correct credentials in the connection string were allowed to login (SQL Server logs showed normal login/logout), but timed out executing the statement agains the locked table
- LINQ's hiding of the underlying SQL connection management meant the captured stack trace could not be used to distinguish between a connection and a statement timeout, and I erroneously assumed it was a connection timeout.
- We have no idea how/why/if there was an all-day lock on either or both of these tables, but an uncommitted statement somewhere could have done it, as this began after a night of deploys involving DBA work, migration scripts, etc. This would explain things starting suddenly (after the script began), lasting all day, then suddenly resolving when the offending script committed or was rolled back, perhaps when the app running it was shut down at the end of the work day.
Lessons learned:
- Don't assume you know what kind of timeout it is just because you saw the word "timeout" in an exception or stack trace
- As I thought about it, I realized that this whole aspect of the app could be allowed to fail utterly without impacting critical app functionality. I'm going to rewrite it so that a recurrence of this issue will not bring the whole app down as it does today.
Thanks for the community feedback that steered me to the (let's hope) correct answer!
I have seen this behavior after restoring a database backup from a different computer.
This can be fixed by a SQL command in management studio. See How to fix orphaned SQL Server users
精彩评论