Remote Bugs


As part of my job, I get to work with customers when stuff isn’t going right. For MeetingPlace, I am the last line of defense. You can escalate no further. So usually when a case gets to me, it’s not good. Today, we had a classic case. Over the years, we have had trouble when our application runs on a box with other applications. Server applications tend not to interact well with each other. Most often you run server applications on their own box to prevent this type of stuff, and we have moved to that in the past few years. Even so, customers sometimes want to install other software on our box for monitoring, administration or some other standard they have implemented. We usually say no because it leads to trouble.

The case I got involved with today was a major cable company having issues. Their servers kept hanging after a few minutes of runtime. Reboot the system and all was fine. Restarting just our services would not correct the problem. I noticed that IIS was still serving simple files like images and HTML files, but any ISAPI extension, ASP page, or other non-static request would hang IIS. Your browser would just sit there with the progress bar half loaded. Strange. I spent a lot of time verifying IIS and rebuilding parts of the configuration. Nothing. Then we noticed that the hung requests were not being logged in the IIS logs. Weird. After hours of verifying everything we could, we told the customer that it’s not our app but we don’t know what it is. Now what?

Since IIS was the issue, we opened a case with Microsoft. We got them a dump of IIS while it was hung. They cross-referenced the stack trace with their magic database and found a match with a fix that they had previously published under KB #834010. We applied the fix and it worked. Huh?

The customer noticed that the article referenced SiteScope as a product that causes Windows to exhibit this behavior. Turns out they were running SiteScope against our machine. When we initially started troubleshooting, we looked for any extra software on the box. We never thought to ask what software was accessing the box without being installed directly on it! SiteScope hits the remote registry APIs every 10-15 minutes, and doing that causes this deadlock. A perfect explanation for the problem.

So 8 hours later, we had it solved. My favorite part of these cases is that we use MeetingPlace to fix MeetingPlace. For 8 hours, we had a virtual war room going where people could come and go and access the console of the server as if they were sitting right in front of it. Painful and challenging, but another problem solved, and one that I didn’t create!