Managing I/O Performance in Microsoft® Environments

I/O slow-downs can cripple any business application.  These slow-downs can be caused by any number of issues, but the culprit, particularly for sporadic slow-downs, typically falls into one of two categories:

  1. A spike in I/O demand
  2. A device malfunction while under a more typical I/O demand profile

Both of these scenarios result in more work required of the underlying I/O subsystem.  While the system is typically designed to handle peaks in demand, the nature of queuing can quickly produce high wait times.  Most often the system struggles to complete the “in-flight” I/O’s while new I/O demand continues to enter the system.

“Life of an I/O”

The figure below illustrates the “Life of an I/O”, using SQL Server® as an example I/O demand generator.  The chart also references the SQL 833 Error, which is SQL Server’s way of letting you know that a very long I/O (>15 seconds) has occurred. This chart by no means captures all the intricacies of the I/O path.  The intent is simply to demonstrate a representative I/O journey from request to completion, highlighting the potential for I/O slow-downs along the path.   Below the figure is a discussion of the various layers and I/O processing along the way, followed by a discussion about how Profor’s PA-Storage can help.

[Figure: SQL 833 I/O Map (“Life of an I/O”)]

The Application Layer

The Application Layer is where the I/O journey begins.  In the example above, the database, in response to an application request, performs some sort of command or query.  In many cases this is a SELECT query that requires some level of interaction with one or more database tables.  The number of I/O’s for the database command varies considerably.  Depending on the size of the table, the use (or lack) of indexes, and several other factors such as index fragmentation, a single database command can turn into thousands of I/O requests.  Database optimization, for instance, often looks at ways to lower the I/O load to improve performance.  Conversely, the high I/O loads that cause I/O slow-downs can sometimes be traced to database design changes that increased the I/O load.
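
A rough way to watch the I/O demand a database command generates is to sample SQL Server’s own Buffer Manager counters while the command runs.  The sketch below is illustrative only: the server name is a placeholder, and named instances expose these counters under MSSQL$<InstanceName> rather than SQLServer.

    # Illustrative sketch: watch the physical page I/O SQL Server issues over a
    # one-minute window. 'SQL01' is a placeholder server name; named instances
    # expose the counters as '\MSSQL$<InstanceName>:Buffer Manager\...'.
    $sqlCounters = @(
        '\SQLServer:Buffer Manager\Page reads/sec',
        '\SQLServer:Buffer Manager\Page writes/sec'
    )
    Get-Counter -ComputerName 'SQL01' -Counter $sqlCounters -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object {
            $_.CounterSamples | Select-Object Timestamp, Path, CookedValue
        }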

The most important thing to understand is that this layer is where the application will track and report any I/O delays, including SQL Server’s 833 error.  
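
If you want to confirm whether the 833 message has been firing, it can be checked at this layer.  The hedged sketch below assumes the instance also surfaces the 833 message in the Windows Application event log (it is always recorded in the SQL Server error log); the server name is a placeholder.

    # Hedged sketch: look for SQL Server error 833 ("I/O requests taking longer
    # than 15 seconds") in the Application event log of a remote server.
    # 'SQL01' is a placeholder; named instances log under 'MSSQL$<InstanceName>'.
    Get-WinEvent -ComputerName 'SQL01' -FilterHashtable @{
        LogName      = 'Application'
        ProviderName = 'MSSQLSERVER'
        Id           = 833
    } -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Message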

I/O Subsystem Components

This is the layer that takes the application read/write request and translates it into a storage driver request.  These intermediate layers are critical for abstracting the physical network and devices from the logical storage entities (file systems, volumes, etc.), but they typically do not add slow-downs to the system.  The mapping is deterministic and cascades down to the partition level, with no queuing.

What is important to note here is that tools like Perfmon and the PowerShell™ Get-Counter scripts measure I/O performance indicators at this layer.  PA-Storage, because it utilizes PowerShell Get-Counter scripts, also measures performance here when monitoring Windows®-based servers and hypervisors.  Because the mapping functions do not add delays to the I/O, the performance monitored here is a good approximation of what SQL Server or other I/O-generating applications are seeing.
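
For reference, a stripped-down version of this kind of Get-Counter sampling might look like the sketch below.  It is not the PA-Storage implementation, just a minimal illustration of remote, agentless counter sampling; the server name is a placeholder.

    # Minimal sketch of remote disk sampling via Get-Counter.
    # The wildcard (*) returns one sample per physical disk instance.
    $diskCounters = @(
        '\PhysicalDisk(*)\Avg. Disk sec/Read',
        '\PhysicalDisk(*)\Avg. Disk sec/Write',
        '\PhysicalDisk(*)\Current Disk Queue Length'
    )
    Get-Counter -ComputerName 'SQL01' -Counter $diskCounters |
        Select-Object -ExpandProperty CounterSamples |
        Select-Object Path, CookedValue |
        Format-Table -AutoSize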

Protocol and Hardware Abstraction Layer

This layer isolates the I/O subsystem above from the different storage protocols and transports (FC, iSCSI, SMB, etc.).  It is also where the operating system queue depths are managed.  Going back to the two primary causes of I/O slow-downs above, when the cause is a spike in I/O demand, this is the layer where I/O delays are most likely to happen.

What happens is that the application layer pours down I/O requests that quickly end up in the storage port driver.  When handing the I/O down to the lower-level drivers, the storage port driver has to honor the queue depth limits placed on each device.  It cannot send the I/O down to the lower-level drivers unless there is a free “place in line”, so to speak.  The I/O burst can therefore get stuck in the upper-level driver while the lower-level driver awaits I/O completions.  When the I/O subsystem is falling behind at the same time the application layer continues to send I/O requests, the number of I/O’s stuck in the upper-level driver grows.  As a result, I/O slow-downs can last for many minutes, or even hours, after the application-layer demand slows down, with the extended delays coming from the upper-level driver working off the backlog.

From a performance management perspective, what is important is that Perfmon/Get-Counter has visibility into the slow-downs at this layer.  This allows these mechanisms, and the tools that use them, to provide ready insight for understanding this critical performance layer: what level of I/O volume causes slow-downs; what is the read/write mix at the onset of the slow-down; is it reads or writes that are slowing down; how long does it take the slow-down to subside; how well is the I/O load spread across servers; etc.  Good I/O performance management tools can answer all these questions, answers that become critical not only for root-cause analysis, but also for forward-looking architectural and technology decisions.
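
Several of these questions can be approached with the same counter set.  The hedged sketch below samples read and write IOPS alongside per-direction latency so the read/write mix at the onset of a slow-down is visible.  The counter paths are standard PhysicalDisk counters; the server name and sample counts are placeholders.

    # Sketch: capture the read/write mix and per-direction latency together,
    # so a slow-down can be attributed to reads, writes, or both.
    # Samples come back in the order the counters were requested.
    $mixCounters = @(
        '\PhysicalDisk(_Total)\Disk Reads/sec',
        '\PhysicalDisk(_Total)\Disk Writes/sec',
        '\PhysicalDisk(_Total)\Avg. Disk sec/Read',
        '\PhysicalDisk(_Total)\Avg. Disk sec/Write'
    )
    Get-Counter -ComputerName 'SQL01' -Counter $mixCounters -SampleInterval 5 -MaxSamples 60 |
        ForEach-Object {
            [pscustomobject]@{
                Time           = $_.Timestamp
                ReadsPerSec    = [math]::Round($_.CounterSamples[0].CookedValue, 1)
                WritesPerSec   = [math]::Round($_.CounterSamples[1].CookedValue, 1)
                ReadLatencyMs  = [math]::Round($_.CounterSamples[2].CookedValue * 1000, 1)
                WriteLatencyMs = [math]::Round($_.CounterSamples[3].CookedValue * 1000, 1)
            }
        }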

Hardware Adapter

The hardware layer, particularly with established protocols and vendors, is typically a “pass-through” from a performance perspective.  These adapters have their own queueing mechanisms and resource constraints (CPU, buffer memory), but they have largely capitalized on smaller silicon geometries to stay ahead of the curve in I/O processing power.

Where hardware adapters most often come into play during I/O slow-downs is when there are hardware failures.  While this is rare, such issues are very difficult to diagnose.  A hardware failure can, for example, inadvertently spill “frames” or “packets” onto the network.  This activity is largely invisible (it is the work of network taps to pick it up), and it causes the network to exhaust its resources processing all the frames/packets.  This creates a backup on the hardware adapter, which pushes back on the host drivers, eventually creating a backup in the upper-level drivers.  The impact of such a hardware failure therefore shows up as I/O backups and high latency, the same impacts seen when there is high I/O demand.  With proper monitoring, one of the key pieces of data that helps distinguish between the two scenarios is how well, or not, the amount of I/O (both IOPS and throughput) correlates with the magnitude of the slow-down.  If the amount of I/O correlates well with the magnitude of the slow-down, the cause is more likely too much I/O demand stressing the system.  If the amount of I/O does not correlate well with the slow-down, the cause is more likely an equipment issue.
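
One rough way to check that correlation is to compute it directly from a set of samples.  The sketch below collects IOPS and latency pairs and computes a Pearson correlation coefficient: a high value suggests a demand-driven slow-down, a low value hints at something else such as an equipment issue.  It is illustrative only; the server name and sample counts are placeholders.

    # Sketch: correlate total IOPS against average transfer latency. A strong
    # positive correlation points at demand; a weak one points at a malfunction.
    $samples = Get-Counter -ComputerName 'SQL01' -Counter @(
        '\PhysicalDisk(_Total)\Disk Transfers/sec',
        '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer'
    ) -SampleInterval 5 -MaxSamples 120

    $iops    = $samples | ForEach-Object { $_.CounterSamples[0].CookedValue }
    $latency = $samples | ForEach-Object { $_.CounterSamples[1].CookedValue }

    # Pearson correlation: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
    $mx  = ($iops    | Measure-Object -Average).Average
    $my  = ($latency | Measure-Object -Average).Average
    $cov = 0; $vx = 0; $vy = 0
    for ($i = 0; $i -lt $iops.Count; $i++) {
        $dx   = $iops[$i] - $mx
        $dy   = $latency[$i] - $my
        $cov += $dx * $dy
        $vx  += $dx * $dx
        $vy  += $dy * $dy
    }
    $r = if ($vx -and $vy) { $cov / [math]::Sqrt($vx * $vy) } else { 0 }
    "Correlation (IOPS vs latency): {0:N2}" -f $r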

Network

The impact of the network layer on I/O slow-downs varies by network type and architecture.  For dedicated storage networks (Fibre Channel, SAS, etc.), any contribution to slow-downs, much like with hardware adapters, is typically due to hardware or firmware issues.  Such issues can create resource shortages or, in some cases, cause lost I/O’s.  These issues are rare, but difficult to diagnose.  As seen in the diagram above, network taps (protocol analyzers) play a role here, typically between the network switch egress port and the storage device.

For shared networks (IP and/or Ethernet infrastructure shared between storage and data traffic), oversubscription of the network can also create network slow-downs (including dropped packets), which eventually bubble up to the storage driver and database layers.

As with hardware adapter issues, monitoring the I/O volume (IOPS and throughput) against the magnitude of the slow-down, along with solid historical data for the same, can be a big help in understanding whether the issue is volume-related or an equipment malfunction.  This information also helps in knowing when and where to apply more invasive technology such as protocol analyzers.

Virtualization

This layer appears as an optional layer between the network layer and the storage device, something becoming more and more common today, particularly with hybrid on-premises/cloud storage architectures.  Needless to say, the virtualization layer can add considerable complexity to the I/O path, essentially inserting most of the above diagram inside the virtualization layer, one or more times.  There is not enough room here to go into all the possibilities and all their impacts on I/O slow-downs.

Storage Controllers and Media

Storage controllers have come a long way.  Not too long ago, controller front-ends created many of their own I/O issues due to slower processing and resource starvation when asked to handle peak I/O loads.  But today’s front-ends have scaled up considerably, including large amounts of RAM- and Flash-based cache to better manage peak I/O needs.  What is important to understand is that many storage performance monitoring tools monitor the back-end of the storage controller, which is rarely where a SQL 833 error is caused or seen.  Many a storage engineer has started looking for “long I/O’s” with such tools and quickly concluded that the “storage is performing fine”.  While that statement is true of the storage controller itself, the “storage” is a much bigger system that includes all the layers above.

How PA-Storage Can Help to Manage I/O Performance

Profor’s largest PA-Storage customer is one of the largest SQL Server users in the world.  This is not by accident.  PA-Storage has unique capabilities that make it ideally suited for helping manage I/O performance in Windows Server and Hyper-V environments.  One of the biggest advantages of PA-Storage in these environments is the ability to monitor performance on remote servers in real time.  Windows servers configured as “Critical Servers” can be sampled as often as once every 5 seconds.  This is important when trying to catch the occurrence of I/O delays, because all such tools report the average performance of many I/O’s over the sample time period.  If the sample time period is on the order of 2 or 5 minutes, I/O delays can get hidden in the averages.  For this reason, PA-Storage customers leverage the real-time sampling benefit by configuring the servers under investigation as “Critical Servers”.  In addition, PA-Storage is one of the few tools on the market that provides a single view of file (SMB/CIFS) and block (Physical and Logical Disks) performance.  This comes in quite handy for environments that are considering, or running, both file and block back-ends.
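
For comparison, the manual equivalent of a 5-second sample interval with plain Get-Counter looks roughly like the sketch below (the server name is a placeholder).  The point is simply that a short interval keeps brief latency spikes from being averaged away.

    # Sketch: sample transfer latency every 5 seconds so short-lived spikes
    # are not hidden inside long averaging windows.
    Get-Counter -ComputerName 'SQL01' `
        -Counter '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer' `
        -SampleInterval 5 -Continuous |
        ForEach-Object {
            $ms = $_.CounterSamples[0].CookedValue * 1000
            '{0:HH:mm:ss}  latency: {1:N1} ms' -f $_.Timestamp, $ms
        }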

The simplest way to get started is with PA-Storage FREE Edition.  If you have never had the time to write the PowerShell scripts to monitor remote servers, with or without credentials, PA-Storage FREE Edition gets you there immediately.  We spent years honing the scripts, along with the reports included in the FREE Edition.  All you need to do is download the software, install it, and point it at the servers that appear to be having issues.  The license allows monitoring a single “Critical Server” (5 sec to 1 min sampling) and two “Standard Servers” (1 minute and higher sampling).  Microsoft’s own advice for grappling with I/O slow-downs is to monitor the following counters:

Average Disk Sec/Transfer
Average Disk Queue Length
Current Disk Queue Length

PA-Storage monitors these counters (for both Read and Write I/O’s where applicable), as well as scores of others, and does so remotely and agentlessly, making it simple to reach the servers.  The FREE Edition captures all the data into a “flat file”, a CSV file.  This format makes it simple to perform further analysis on the data, and it allows the data capture to grow as large as a file can grow (pretty large these days).  The FREE Edition provides a long list of features to help monitor storage performance, including alerts for I/O delays, allowing you to manage any I/O issues before they become business issues.
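
If you want to reproduce a simple flat-file capture yourself, Get-Counter output can be piped to Export-Counter with a CSV format, as in the hedged sketch below.  The server name, output path, and sample counts are placeholders, and this is not the PA-Storage file format, just the same general idea.

    # Sketch: capture the Microsoft-recommended counters to a CSV file for
    # later analysis. Export-Counter ships with Windows PowerShell and
    # supports csv, tsv, and blg output formats.
    $counters = @(
        '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer',
        '\PhysicalDisk(_Total)\Avg. Disk Queue Length',
        '\PhysicalDisk(_Total)\Current Disk Queue Length'
    )
    Get-Counter -ComputerName 'SQL01' -Counter $counters -SampleInterval 60 -MaxSamples 1440 |
        Export-Counter -Path 'C:\PerfLogs\disk-io.csv' -FileFormat csv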

With the PA-Storage Professional (PRO) version, more servers can be monitored.  The PRO version also provides optional connectivity to a SQLEXPRESS (10GB of database storage) or higher-end SQL Server back-end.  This provides a considerable performance boost, as required by larger environments, and also adds reporting such as performance tracking by block size.  Flexible pricing options provide a quick ROI.  See this use-case link for a discussion of some of the helpful use-cases for proactively managing I/O performance with PA-Storage.  Or download now and start creating your own use-cases.

 

Microsoft®, Windows®, SQL Server® and PowerShell™ are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.