Pro-Exchange,Lync & Office 365
Belgian Microsoft Unified Communications Professionals
Microsoft Exchange Server, Microsoft Lync Server & Office 365
Troubleshooting a busy Exchange server that is not so busy

Hi,

 

A strange title I know but reality is not much different.. today I looked into a Exchange 2003 issue that had the following symptoms:

  • Users where not able to logon to the server
  • Users fortunate enough to be logged on to the server had a very slow experience in mail processing
  • The mail queue on the server increased slowly
  • 2 Exchange 2003 servers out of 15 where experiencing this issue
  • an Exchange 2007 cluster was running slow but users could still logon

 

So first thing we did was opening task manager to see if the server was running at 100% CPU, to our surprise it wasn’t and to make it stranger the server was completely IDLE.. no cpu activity at all to be seen !.

My first thought was that there could perhaps be a deadlock (wikipedia) somewhere and I opened up Perfmon to check the MSExchangeIS\RPC Operations /Sec and MSExchangeIS\RPC Requests counters. These counters are quite usefull when troubleshooting Exchange performance as they give you a good indication of how busy the server actually is. When a user performs an activity in Outlook, like reading an email in Online mode, it generates a series of RPC operations that are sent to the Exchange server. These RPC operations will be allocated to single Store “thread” and the number of outstanding store threads can be measured as “MSExchangeIS\RPC Requests”. Exchange 2003 can only allocate as much as 100 Store threads while Exchange 2007 can allocate up to 500. Whenever the store process has reached it limit of number of threads it’s unable to process the user’s RPC operations and therefore the user will be rejected and have a bad experience (disconnect, item cannot be found, mail stuck in outbox etc..).

 

After reviewing these counters in our environment I could see that the RPC Requests was around 98 which is close to the maximum of 100 and was the reason why users couldn’t logon anymore on Exchange 2003. However since the RPC Requests where in fact going up and down it did mean that there was no deadlock. To be sure that no third-party add-on was causing any problem with the store we stopped all services such as Anti-Virus and Backup agents but without any success. Another thing we tried was to clean out the SMTP queue and restart the SMTP and Information Store services but again without much success. At the same time we also looked at the RPC Requests counter on the Exchange 2007 cluster and noticed it was running at 137 which is higher than the normal value but as the limit hasn’t been reached it explained why users where still able to logon to Exchange 2007.

 

So after that we went after the typical causes:

  • Storage components where looked at to see if any issues where reported or errors to be found, in our case we didn’t immediately look at the perf counters (as we already had a hunch). But in your case make sure you are not seeing excessive IO Read and Write latencies on your physical disks. When the store process cannot write information fast enough to disk you will get more and more RPC Requests and eventually hit the maximum. A good troubleshooting technique is to open up your perfmon and use the RPC Averaged Latency counter and the Disk Counters to see if they correlate.
  • Network compponents where investigated like we did for the storage components but we got no reports of any complaints. In any other situation you should check your network card for any errors like CRC checksum errors, Duplex settings, latencies etc..
  • Processor was in our case idle but obviously if the processor is running at 100% all the time you will not be able to process RPC requests fast enough
  • Active Directory was looked to see if any of the domain controllers was producing any errors or was running at 100% etc.. None of the servers in the forest where Exchange resided was experiencing a problem.

 

In our case it got stranger by the minute but as I indicated earlier we already had a hunch where the root cause was heading. You have to know that this is a Exchange Resource Forest environment and a mailbox server in this environment hosts mailboxes from various “customers” and are linked up to their active directory forests (associated external account) or also referred to as “Account Forest”. In this particular case these 2 Exchange 2003 servers and the Exchange 2007 cluster where having lot’s of mailboxes of a particular Account Forest and that drew our attention. For this particular Account Forest we had established specific “Secure Channels” to force authentication traffic to go through a pre-defined path to ensure it remains within the datacenter making it more reliable, performing and we needed less firewall ports to be opened. So knowing this we already started looking in that direction and found that these secure channels where reset and pointing to a different set of domain controllers that where not reachable for Exchange. So every RPC request containing an authentication request for a user in that particular Account Forest resulted in a connection timeout which resulted in the increase of RPC Requests.

 

Here is some information on the counters and the relevant tresholds…

MSExchangeIS\RPC Requests

Indicates the number of MAPI RPC requests currently being serviced by the Microsoft Exchange Information Store service. Once the server reaches the maximum value it rejects new client requests.

An RPC Request could be for example a new mail that arrives in the store and has to be written into the database. the RPC request is finished when that information is actually written on the disk, so if the disk write latency is very high then the RPC Request remains in the “queue” a bit longer. Now if you have many simultaneous RPC Requests like that you can imagine that you can reach 100 quite fast.

If this counter is very high most of the time you need to investigate the following components for high latencies:

  • Storage Subsystem (IO Read/Write Latency)
  • Network Subsystem
  • Processor Subsystem
  • LDAP Provider (ex. connections to AD)

Exchange 2003 Maximum is 100

Exchange 2007 Maximum is 500

 

It should always be lower than 70.

MSExchangeIS\RPC Averaged Latency

Indicates the latency in milliseconds, averaged across all RPC operations in the past 1024 RPC packets.

It should always be less than 25 milliseconds.

MSExchangeIS\RPC Operations/sec

Indicates the current number of RPC operations that are submitted to the Information Store each second.

This is a good indicator of client activity and is used for information only.

 

Other usefull troubleshooting tools are:

Exchange User Monitor: This tool allows you to see all client RPC Requests comming from your users. If you see that your server is running high on RPC operations (deviating from your normal baseline values) you can use this tool to identify the user who is causing it.

Exchange Performance Wizard: This easy to use tool allows you to quickly setup a performance counter log using a predefined set of counters for Exchange

Exchange Troubleshooting Assistant

And don’t forget the classics like EventVwr and Perfmon :-)

 

And here is some more information on RPC Requests –> Troubleshooting Slow RPC Request Processing Issues http://technet.microsoft.com/en-us/library/cc540464.aspx

 

I hope this has been usefull for you and as always feel free to share your experiences or questions in the forum…

 

Sincerely,

Tonino Bruno | ICT Consultant  | PRO-Exchange Community


Posted 07-08-2009 11:43 by Tonino

Comments

pdtit wrote re: Troubleshooting a busy Exchange server that is not so busy
on 07-15-2009 11:48

Hi TB,

Another tool I often use for troubleshooting EX2007 performance issues is a freeware called "ExInsight", available from http://www.bitrunes.com

Cheers, PDTIT

http://trycatch.be/blogs/pdtit

Marc Lognoul wrote re: Troubleshooting a busy Exchange server that is not so busy
on 07-23-2009 3:15

Hi Toni

Additionally, the process of the server component (in this case store.exe) can be deeper analyzed in order to identify the outgoing connections that are in WAITING or the threads that are waiting/stuck. It often give a good indication of where the process misbehave or seems hung while it is not really. Process Explorer being the tool of choice to do that.

The ultimate solution being the use of kerberos with PAC validation disabled since it would prevent that roundtrip to a DC of the user's original domain from happening...

Marc