The usual suspect in these cases is blocking, so that's the first thing I looked for. No luck though, the process shows as Suspended, but not blocked by any other process. The wait type for the process is PAGEIOLATCH_SH, but the waittime is always just a few milliseconds. The CPU and IO for the process are off the charts however, and increasing with every refresh. So it's doing a whole lot of reading, but there is no clear indication of what. Even more strange, once I was hung up on a certain point in the processing, other users could get past that point, but would hang up in different spots.
So I trace to find the last query executed by the application and run that in the query analyzer. Sure enough, it returns immediately. So it's reading a huge amount of data, but I only asked for a few rows, what is the server doing? And why is it different from the application versus the query analyzer? So I turned on all SP and query-related events in the profiler, and now the culprit is revealed. An SP:Statement event shows that what is happening is this:
SELECT StatMan([SC0], [SB0000]) FROM (SELECT TOP 100 PERCENT [SC0], step_direction([SC0] over (order by NULL) AS [SB0000] FROM (SELECT SUBSTRING([Image_Back], 1, 100)++substring([Image_Back], case when datalength([Image_Back])<=200 then 101 else datalength[Image_Back])-99 end, datalength([Image_Back])) AS [SC0] FROM [dbo].[CheckItemImage] WITH (READUNCOMMITTED, SAMPLE 6.712329e+001 PERCENT) ) AS _MS_UPDSTATS_TBL_HELPER ORDER BY [SC0], [SB0000]) AS _MS_UPDSTATS_TBL OPTION (MAXDOP 1)
This is happening during a recompile of a view involved in the select. The table in question only contains a few hundred thousand rows, but they are check images, so there is a huge amount of binary data involved, and this statistics query is basically reading the whole thing.
Updating statistics on the database did not resolve the issue, the server still tries to build the statistics to optimize the query. So I turned off Auto Create Statistics, and now the query runs instantly (turning off Auto Update Statistics did not work).
Of course, I don't want to turn off these features, they are great features. And I'm not faulting SQL Server in the slightest on this one. Query optimization is a difficult problem, and statistics are a great way to help prune the set of potential plans, but I will from this point on immediately suspect this scenario whenever I see a process suspended with a waittype of PAGEIOLATCH_SH.
To re-enable these features, I'll need to determine why the stats query is taking so long. I may be able to build those particular statistics manually, or it may involve actually archiving some of the older data in the table.
So why didn't it hang in the query analyzer? Turns out it did if the application wasn't already hanging on the same query, I just never tried that until after I understood what the problem was. I suspect this is because if the server is already computing a particular statistic for another plan, then it will just use the existing statistic or compile the query without statistics. In this case, that resulted in almost instant execution times.
In a production database, this would have meant several hours of downtime. I hope this post helps others that may be seeing the same behavior find the solution more quickly than I did.