Looking at the past and the present of Exchange, we can definitely state that Microsoft has made great improvements when it comes to storage.
Although configuring storage for Exchange Server 2010 might seem an easy task (and up to a certain degree it is), there's definitely more to it than just creating a RAID array and adding enough spindles to match the performance from your sizing calculations. I'm not going into detail about how to make those storage calculations (using the Mailbox Server Role Requirements Calculator). Instead, I've written this article to shed a broader light on what is happening under the hood in Exchange 2010.
Testing for this article was performed using Exchange Jetstress 2010, a tool that lets you run different tests against your storage configuration using a simulated Exchange 2010 load. Jetstress uses the same block sizes and I/O profile as a real-life Exchange server, making it the ideal companion to validate storage configurations.
The basics & the theory
Before diving into the internals of Exchange, we need to grasp the basics of how hard disks actually work.
First, let's take a look at the hardware. Below is an image of a hard disk in which different parts have been indicated. The ones that are the most important (for the further course of this article) are:
- Platter(s): circular disks on whose surface the data is written.
- Spindle: the motor that is responsible for spinning the platters up to a certain speed, expressed in RPM (rotations per minute).
- Head: small piece of hardware attached to the actuator arm that performs the actual reads and writes on the platters by sensing or changing the magnetic field on the platter surface.
- (Actuator) Arm: mechanical moving arm, responsible for moving the head (in an arc-like movement) across the disk so that it can reach all parts of the platter.
Internals of an HDD (source: Wikipedia)
Let's just drill a bit further down into how platters are built up. This is a rather important piece of information if you want to know how each of the components individually (or together) can affect a disk's performance...
- A hard disk is usually made out of one or more "platters".
- Each of these platters contains multiple tracks (on both sides of the disk). Tracks can be seen as concentric circles on the platter. You could actually compare them to the different lanes on a running track. (A)
- Each track is divided into multiple sectors. A sector is typically 512 bytes. (C)
- Clusters are a set of sectors. The cluster size can be configured and often varies from 4KB to 64KB. The cluster size is also referred to as the allocation unit size in Windows. (D) For example, a cluster size of 4KB contains 8 sectors.
Graphical representation of a disk's layout (source unknown)
Whenever data is written to the disk, the minimum space it will use is equal to the cluster size. If you have a lot of small files, configuring a large cluster size results in lots of lost space. For example, if you typically write small files to disk (<=1KB) and you formatted an NTFS volume with an 8KB cluster size, each file would take up at least 8KB, which wastes a lot of disk space.
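To get a feeling for how quickly that lost space adds up, here is a minimal Python sketch. It is a simplification that only rounds every file up to a whole number of clusters and ignores all other file system details; the function names and the sample of 1,000 files of 1KB are made up for illustration.

```python
KB = 1024
SECTOR_SIZE = 512  # bytes per sector, the typical value mentioned above


def sectors_per_cluster(cluster_size_bytes):
    """Number of sectors that make up one cluster."""
    return cluster_size_bytes // SECTOR_SIZE


def allocated_size(file_size_bytes, cluster_size_bytes):
    """Space a file occupies when it is rounded up to whole clusters."""
    # Ceiling division; a file always takes at least one cluster here.
    clusters = max(1, -(-file_size_bytes // cluster_size_bytes))
    return clusters * cluster_size_bytes


def wasted_space(file_sizes, cluster_size_bytes):
    """Total space allocated but not used by the files' actual data."""
    return sum(allocated_size(size, cluster_size_bytes) - size
               for size in file_sizes)


# Hypothetical workload: 1000 small files of 1KB each.
small_files = [1 * KB] * 1000
for cluster_kb in (4, 8, 64):
    wasted = wasted_space(small_files, cluster_kb * KB)
    print(f"{cluster_kb:>2}KB clusters: {wasted // KB}KB wasted")
```

With large files, such as an Exchange database, the rounding only affects the last cluster of the file, so the wasted space becomes negligible.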
Now that we know how data is actually stored on the disk, let's talk performance. Logically, when thinking about disk performance you automatically think about throughput (usually expressed in MB/s). However, this is only a single aspect of performance.
The actual performance of a disk depends on different characteristics of the disk:
- Rotational speed (RPM)
- (Avg.) Seek time*
- The seek time measures the time it takes to move the head (on the arm) to the track where data will be read or written. This time is usually expressed as an average, in milliseconds.
- (Rotational) Latency
- The time that the head (on the arm) has to wait for the disk to bring the required sector under the head.
- Data Density: this metric measures how many bits can be recorded per inch of track (BPI)
Each of the characteristics above has a different impact on a disk's performance, very much depending on the type of I/O you are generating. There are two "common" types of I/O: random and sequential.
Note: the above only applies to rotational disks. SSDs don't have a moving arm; their seek time is the time the electronic circuits need to prepare reading from a particular location in the disk's memory. This average seek time is significantly lower than for regular (rotational) disks.
Your IO type is sequential when the data you are trying to read is grouped together on the disk (contiguous blocks). Random I/Os tend to read/write data that is spread all over the disk.
Disk Transfer Rates (throughput) - Sequential I/Os
Once past the initial positioning of the head (thus past the initial seek time), the actuator arm only needs to move a fraction with each rotation in order to move between tracks as data is being transferred (assuming the data is contiguous, of course). Therefore the average seek time has little or no impact on the performance at this point. Rather, it's the disk's rotational speed that is the limiting factor.
To determine the throughput of the disk, we first need to calculate how much data a single track can hold. To do so, we multiply the track's circumference by the data density. This information is usually available from the technical spec sheet of the hard drive. Let's take the fictitious example of a 2.5" hard disk with a data density of 1,000,000 bits per inch.
e.g. 2.5 * 3.14 * 1,000,000 bits per inch = 7,850,000 bits = +/- 0.93 MB
This means that for every full rotation of the disk, 0.93 MB can be written to the disk. A disk rotating at 10,000 RPM rotates roughly 166 times per second, which means that at most 166 * 0.93 MB can be transferred to/from disk per second: about 154.38 MB/s.
Note that these calculations only yield rough values, but they generally tend to be quite truthful. Why are they not always correct? Because the calculation is only valid for the outermost track of the platter. Tracks that are closer to the middle of the disk have a smaller circumference and therefore hold less data, which in turn results in a lower throughput.
Let's put this to a simple test: if we take the example of the 2.5" disk above, the outermost track's circumference will be 7.85 inches (2.5 x pi), whereas the innermost track might only have a circumference of 3 inches. That results in 3 * 1,000,000 bits per inch = 3,000,000 bits = +/- 0.36 MB per rotation, or about 60 MB/s (166 * 0.36 MB), which is less than 40% of the maximum throughput!
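The back-of-the-envelope calculation above can be sketched in a few lines of Python. It uses the same fictitious figures (2.5" outer track diameter, 3 inch inner track circumference, 1,000,000 bits per inch, 10,000 RPM); the function name is my own. Because the script uses the full value of pi and does not round intermediate results, the outer-track figure comes out slightly higher (about 156 MB/s) than the rounded value in the text.

```python
import math

BITS_PER_BYTE = 8
BYTES_PER_MB = 1024 * 1024


def throughput_mb_per_s(circumference_inches, bits_per_inch, rpm):
    """Rough sequential throughput for a single track.

    One full rotation transfers one full track, so throughput is
    (data per track) * (rotations per second).
    """
    bits_per_track = circumference_inches * bits_per_inch
    mb_per_track = bits_per_track / BITS_PER_BYTE / BYTES_PER_MB
    rotations_per_second = rpm / 60
    return mb_per_track * rotations_per_second


DENSITY = 1_000_000   # bits per inch (fictitious example value)
RPM = 10_000

outer = throughput_mb_per_s(2.5 * math.pi, DENSITY, RPM)  # outermost track
inner = throughput_mb_per_s(3.0, DENSITY, RPM)            # innermost track

print(f"outer track: {outer:.1f} MB/s")
print(f"inner track: {inner:.1f} MB/s")
print(f"inner/outer: {inner / outer:.0%}")
```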
That is also the reason why some manufacturers limit the tracks that are used on a platter (only the outermost tracks are used to read/write data). Doing so, the performance is artificially "boosted" by preventing usage of the inner tracks that have a lower throughput rate.
Disk performance optimizations in Exchange 2010
The nature of I/O's of a database is seldom sequential. Mostly they will be random (data is accessed from across different places on the platter(s)).
Exchange has traditionally been one of the applications that mostly used non-sequential (aka random) I/Os. Back in the early days, Exchange was designed to minimize the disk space it needed, at the expense of performance. Over the years, storage became cheaper and mailboxes tended to grow, placing a bigger load onto Exchange. This caused Microsoft to focus more on Exchange performance rather than disk space optimization.
Exchange 2007 greatly improved on the storage requirements of Exchange 2003, and Exchange 2010 again improves storage utilization dramatically over Exchange 2007. Microsoft was able to achieve this by changing the I/O behavior of Exchange from being totally random to being more sequential. In order to achieve this, they had to rewrite the database layout in Exchange 2010 and therefore needed to make some hard choices (e.g. removing single instance storage).
Previous versions of Exchange used a per-database table structure. This layout allowed for single instance storage, but required a lot of (random) I/O's. Whenever a user would access his mailbox, the data needed to be fetched from the different tables in the database where not only the user's data was located but also the data for all other users with a mailbox in that database.
(Table layout of an Exchange 2007 mailbox database)
Exchange 2010 introduced a brand new database schema (together with some other changes to the ESE engine). The database layout is now designed on a per-mailbox basis. By doing so, the possibility to use single instance storage was removed, but the number of transactions (I/Os) needed to transfer data from/to a mailbox was drastically reduced because all the data for a single mailbox is now stored together (contiguously) on the disk.
Exchange 2010 Database Layout
Does this mean that there are no random I/Os anymore? Unfortunately not. The data within a mailbox is transferred sequentially to/from the disk, but different mailboxes can be held on different parts of the disk (still requiring random I/Os for each different mailbox). Nonetheless, Microsoft was able to reduce I/Os by up to a blazing 70%!
In order to keep the data contiguous, the ESE engine has undergone some changes as well. For instance, it is able to re-use or skip freed database pages within a database to ensure that relevant data is always (or at least as much as possible) grouped together.
If you'd like to know more about the new structure and the ESE engine, I suggest you start reading here: http://technet.microsoft.com/en-us/library/bb125040.aspx
Random reads (Non-sequential I/O’s)
Random reads are handled a bit differently compared to sequential reads. Because data is written all over the disk, the actuator arm has to reposition the head over the disk to reach the desired track between each I/O operation, adding the duration of the seek time to each I/O operation. As you can see, the lower the seek time, the higher the IOPS.
One thing to note is that the seek time is usually split into a read seek time and a write seek time. The latter usually takes a fraction longer, allowing the head to settle a bit longer before writing to disk. So usually the number of write IOPS is a tad lower than the number of read IOPS.
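To make the relationship between seek time and IOPS concrete, here is a rough estimate in Python. It assumes the common rule of thumb that each random I/O costs one average seek plus an average rotational latency of half a rotation, and it ignores transfer time and any caching. The seek times used (4.5 ms for reads, 5.0 ms for writes) are made-up example values, not taken from a real spec sheet.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector to pass under the head: half a rotation."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2


def estimate_iops(avg_seek_ms, rpm):
    """Rule-of-thumb random IOPS for a single rotational disk."""
    service_time_ms = avg_seek_ms + avg_rotational_latency_ms(rpm)
    return 1000 / service_time_ms


RPM = 10_000
READ_SEEK_MS = 4.5    # hypothetical average read seek time
WRITE_SEEK_MS = 5.0   # hypothetical average write seek time

print(f"rotational latency: {avg_rotational_latency_ms(RPM):.1f} ms")
print(f"read IOPS:  {estimate_iops(READ_SEEK_MS, RPM):.0f}")
print(f"write IOPS: {estimate_iops(WRITE_SEEK_MS, RPM):.0f}")
```

Even with these optimistic numbers, a single disk ends up in the range of little more than a hundred random I/Os per second, which is why the amount of random I/O matters so much for a database workload.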
How about RAID arrays?
So far, we've talked about a single drive. Although Exchange 2010 now allows you to store databases on a single disk because of the built-in high availability of a DAG, you'd mostly still see enterprise deployments based on disk arrays, whether they are locally attached or in a SAN.
Disk arrays handle data a bit differently than single disks do. Although in the end data is still physically written to one or more disks as it would be on a single disk, many additional "tricks" come into play that influence the performance of the storage subsystem.
Without any doubt, the array's controller cache will be the biggest influencer. Besides that, it will be your RAID level that will determine what kind of performance you'll get from your array. There are many different RAID levels, which I'm not going to discuss here.
If you want to know more about RAID levels, please read here: http://en.wikipedia.org/wiki/Standard_RAID_levels
Usually a RAID controller (whether it is a local controller in the server or one in an external array) contains a certain amount of cache. Cache sizes vary, usually between 256MB and 1GB. There are basically two usage scenarios for that cache memory: either for reading or for writing (or a combination of both).
When using the cache for reading, the controller will actually perform a so called "read-ahead" operation. In fact, the controller will also read the blocks located after the ones that are requested and store them in cache. It does that because it "predicts" (guesses) that you might need those blocks in the same or subsequent I/O's - even though you might never actually need them... Because these blocks are now stored in cache, the controller can "respond" immediately, without having to go to disk first, whenever an I/O comes in for these blocks; therefore drastically reducing the time needed to provide the data.
Using the cache for read operations is definitely a must in situations where a lot of data is read. The positive performance impact on a sequential read will be higher than for a random read, but it will still prove its value since I/O requests are usually bigger than a single block.
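The difference between sequential and random reads can be illustrated with a toy simulation in Python. This is only a sketch of the read-ahead idea, not how any particular controller works: the class, the cache size of 1,024 blocks, the read-ahead depth of 7 blocks and the LRU eviction are all assumptions, and every request reads exactly one block (so the random case is a worst case).

```python
import random
from collections import OrderedDict


class ReadAheadCache:
    """Toy controller cache: on a miss, fetch the block plus the next few."""

    def __init__(self, capacity_blocks, read_ahead_blocks):
        self.capacity = capacity_blocks
        self.read_ahead = read_ahead_blocks
        self.cache = OrderedDict()  # block number -> True, in LRU order
        self.hits = 0
        self.disk_reads = 0         # number of trips to the disks

    def read(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)  # mark as recently used
            self.hits += 1
            return
        # Miss: go to disk once and also load the blocks that follow.
        self.disk_reads += 1
        for b in range(block, block + self.read_ahead + 1):
            self.cache[b] = True
            self.cache.move_to_end(b)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used


def hit_ratio(pattern, capacity_blocks=1024, read_ahead_blocks=7):
    cache = ReadAheadCache(capacity_blocks, read_ahead_blocks)
    for block in pattern:
        cache.read(block)
    return cache.hits / len(pattern)


sequential = list(range(10_000))
rng = random.Random(42)  # fixed seed so the run is repeatable
scattered = [rng.randrange(1_000_000) for _ in range(10_000)]

print(f"sequential hit ratio: {hit_ratio(sequential):.1%}")
print(f"random hit ratio:     {hit_ratio(scattered):.1%}")
```

In the sequential case, every trip to disk is followed by seven requests that are served straight from cache; in the random case, the blocks that were read ahead are almost never asked for.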
Using the cache for write operations has an even bigger impact on performance because the write I/Os will be stored in cache first, before being written to disk. Writing to cache (memory) is much, much faster than writing directly to disk. The data will afterwards be physically committed to disk, whenever there is some "idle" time available. The biggest downside here is that you definitely need to have some sort of battery backup for the controller's cache to avoid data loss whenever power is lost to the disks or server. If you fail to have a battery backup, all the data that was written to cache but not yet committed to disk would be lost, mostly resulting in corrupt data (or a corrupt database in the case of Exchange).
There are actually two types of write-caching: write-back and write-through. The first (write-back) is what I explained above: the I/O is acknowledged to the OS as soon as it is stored in cache and is committed to disk afterwards. When using cache in write-through mode, the I/Os coming from the OS will only be acknowledged AFTER they have physically been written to disk. In such a case, the cache no longer speeds up writes; in return, no acknowledged write can be lost when the cache loses power.
I admit that the information I have provided so far does not cover everything there is to know about storage systems. Not by far. However, it was my purpose to explain the basic principles to you, so you'd understand how all the pieces come together.
Not all manufacturers handle storage the same way. Therefore it is impossible for me to describe all of these different cases. I suggest you have a sit-down with your storage-guy to get to know your own storage subsystem.
Exchange storage recommendations
So if we take all the information from above and take a look at how Exchange handles storage, we can now explain/understand Microsoft's recommendations towards storage:
- 64KB Cluster Size
- 256KB Strip Size (or greater) when using a disk array
- 25% read - 75% write cache (write-back, battery-backed)
Microsoft fundamentally changed the way Exchange handles storage. As I explained before, moving from random I/Os to more sequential I/Os was one of the reasons why I/O requirements dropped by up to 70%. We saw that the more sequentially (contiguously) data is written, the higher the performance is, certainly for drives with a higher rotational speed.
By increasing the database page size from 8KB to 32KB, I/O sizes become bigger, which reinforces the sequential behavior because more data is written contiguously per I/O. To measure the impact of the allocation unit size on Exchange's performance, I used Jetstress 2010 with the various possible strip sizes and file allocation unit sizes. I re-used the same test configuration for each scenario in order to be able to compare the results afterwards.
(Table: test results without BDM)
Testing was performed on an HP DL380 with 12GB RAM, 2x 72GB SAS disks for the OS and 2x 146GB SAS disks for the logs and databases. Note that placing the logs and databases on the same disks does not improve performance and is NOT recommended for any production environment.
As you can deduce from the table above, although the difference in performance is sometimes quite minimal, the highest performance is achieved using a 256K strip size and a 64K NTFS cluster size. It's also clear that the strip size of the array plays a far more important role than the cluster size. However, other factors like the disk types, array cache and cache settings play an even bigger role. Every Exchange implementation is different and you really need to assess your requirements and compare them to the available storage systems (whether DAS or SAN).
Remember, determining your needs starts with using the Exchange Mailbox Server Role Requirements Calculator!
Michael Van Horenbeeck