Get Pernix’d

The sudden explosion in the number of solutions built on flash-based storage surprises me. I remember researchers and industrial community discussing the reliability and longevity of Solid State Disks (SSDs) in FAST conference not too long ago. Fast forward today, these no longer seem to be something that worries the solution providers or the consumers. I now work for PernixData, a company that is aiming to carry forward the virtualization journey from where hypervisors left off (post CPU and memory virtualization).  Flash Virtualization Platform (FVP), the flagship product developed by PernixData is a clustered flash tier created by virtualizing server-side flash storage for accelerating virtual machines’ (VM) I/O access to the block-based storage devices. In this blog post, I intend to discuss the motivation behind developing FVP and the key benefits it offers.

Rise of high-performing, expensive storage tier (SANs, NASs)

Over the years, storage technology has taken an interesting course. Although, computing platforms (desktops, servers, laptops) provide persistent storage to the computing units, most IT users don’t trust this layer to be robust enough to provide either the performance that meet their SLAs or the technologies that let the data rest in peace (dedupe, compression, encryption, snapshot). As a result, a new storage layer with dedicated expertise to accomplish both, but  is external to the computing platform, has emerged. Almost all research efforts in the storage area continues in this external tier.

Bi-dimensional Problem

However, this external storage tier is plagued by a problem – improving a single layer in two orthogonal dimensions (performance and capacity) is extremely complicated.

The problem is illustrated better in the following graph:

graph_1

Storage for most data centers are sized in two dimensions – capacity and performance. Most often, storage sized for capacity doesn’t meet performance needs (Application SLAs). In that case, additional storage media have to be added to meet the performance needs (this is mostly true for transaction based applications). With advances in media capacity out-racing advances in media performance, this almost always lead to over-provisioning as adding storage media means adding extra giga and terabytes of unused storage capacity. There is a significant ‘Capex’ implication of sizing the storage this way.

Another complication may arise, when almost all processing cycles of the storage has to be dedicated to process application I/Os to meet their SLA. This leads to postponing all non-application traffic (mostly administrative in nature such as snapshots, backups, storage cloning, migration etc.,) to idle periods. This means that either storage admins have to depend on heavy automation and scheduling of these tasks OR burn mid-night oil to ensure the success of these operations. This has significant ‘Opex’ implications.

In summary, meeting performance requirements requires capacity to be over-provisioned, sticking to capacity needs compromises performance. Hence, it is very hard to expect the convergence of both.

Where is the local storage?

What happened to the local storage? Applications that effectively use local storage can be counted with fingers – Hadoops, High Performance Computing applications, the Googles, the Facebooks to name a few. But, they take a different approach to utilize the local storage capacity. They implement all the afore-mentioned capacity and resiliency features in software using commodity hardware. To obtain the performance their application need, they use a ridiculous amount of hardware that average business can’t even imagine. Don’t forget – they can throw many engineers with specialized expertise to solve the bi-dimensional problem. Finally, there are industry-standard benchmarks such as TPCs that can use local storage for reducing the cost of performance. Here, data protection at the hardware level is not given a high priority.

The lack of interest and demand have largely limited innovations in the local storage tier – except improvements to the media types. Server vendors are now supporting SAS/SATA/PCI-e based Solid State Disks (SSDs) along with traditional SATA/SAS magnetic disks. But the concern remains – who will use them?

Flash Virtualization Platform

Meanwhile, another revolution happened in the IT industry. VMware, with its flagship product vSphere, fork-lifted compute layer from the storage layer. This opens up interesting opportunities. One such opportunity is to solve the problem I have been discussing all along – to split  storage tier into two separate dimensions – Performance and Capacity. PernixData is one among the early few who recognized this opportunity. Result of their tireless effort is what you see today – Flash Virtualization Platform, an ethereal storage layer that uses local fast storage (SAS/SATA/PCI-e SSDs) to accelerate transient data and SAN storage to rest persistent data.

FVP aims to glue the orthogonal challenges that plague the storage layer by intelligently using the two storage tiers. This independent usage of the two tiers leads to plethora of opportunities for the server and storage vendors. Server vendors can focus on providing high-speed, local storage for servicing transient data while not worrying about the complicated data-resting technologies. While, storage vendors can focus on jazzing up their storage devices with attractive capacity-saving, data-protection technologies while not worrying about the performance impact of these technologies. Essentially, FVP lets you use storage solutions from your preferred vendor while speeding up data access by utilizing the best flash technology out there.

Let us revisit the problem.

graph_2When the persistent tier (external storage) is combined with the transient tier (Flash media in the local storage), a new solution emerges that can allow sizing of persistent tier to meet the current capacity requirements (and room for growth), while allowing the transient tier to use the latest technologies in flash to adequately meet the I/O performance demands by absorbing any burst of I/O requests from the applications at the moment it occurs. Even if an emergency administrative task has to be scheduled, application users wouldn’t experience any noticeable impact as the I/O request is serviced by the transient tier. How does FVP achieve this? Check out the videos here.

This solution has significant capex savings as the persistent storage doesn’t have to be over-sized. The only additional investment will be for procuring flash media (cost keeps reducing every day) and the license cost of FVP ;-) . There is noticeable opex savings as well – no need to maintain the extra storage (power, space and cooling savings). Storage admins can breathe easy as the admin tasks they have to execute on the external storage is hidden from the application users and the impact is mostly not felt.

Linguistic Lesson

‘Pernix’ means agile, active. The name is very apt for what FVP can do to your IT environments; it can activate your virtual machines. Think of it as the magical spinach that gives Popeye his awesome power. “Protect your investment, Pernix your data”.

As Satyam (CTO, PernixData) likes to ask – do you want to get ‘Pernix’d?. I do. That’s why I decided to join the team. Questions is – Do you? If the answer is yes, join the beta program today.

Stay tuned as more is yet to come …

Missing in Action

Feels nice to be back after a long hiatus.. Wow! so many things happened in life – travel, injuries, vacation (to attend brother’s marriage), longer term projects etc., But the biggest of all was the birth of an angel. Yes we had our first child – a beautiful girl last year. She kept her daddy busy for the most part of the last year and first half of this year. Now, she understands that daddy has other things to do and has been kind enough to let me do what I love most (well after her) – share my thoughts and findings.

Although not finding a mention here, couple of interesting, but long projects kept me busy all this time. I have blogged/published/presented them elsewhere. I will just provide links here so that you know where to find them.

  1. Achieving 1 million IOps from a single vSphere host - http://blogs.vmware.com/performance/2012/03/a-conversation-about-1-million-iops.html
  2. Storage vMotioning a virtualized SQL Database - http://blogs.vmware.com/performance/2011/11/svmotion-sqlserver.html
  3. Storage vMotioning on a EMC VNX storage using VAAI – Presentation# USD.40 @ EMC World 2012

Collaborated with my friends Y.P.Chien and Eddie @ Kingston to publish several studies on vSphere Memory management. One of them is here:

  1. The Yin and Yang of Memory over commitment  in Virtualization http://media.kingston.com/images/usb/pdf/MKP_339_VMware_vSphere4.0_whitepaper.pdf

You may have already seen these. If not, roll your eyes over them. You may find these interesting to keep your eyes glued to them.

Mem.MinFreePct and Memory Reclamation in vSphere 5

It feels good to be back..

Recently, Frank published a blog about new sliding scale based estimation for minimum free memory % in vSphere 5. An interesting read for anyone looking to estimate memory capacity for his/her vSphere based virtual infrastructure. My good friend YP Chien from Kingston ran some tests to understand the memory reclamation techniques [ballooning, compression and host-swapping] in vSphere 5. But, he noticed that the host free memory levels at which various memory reclamations kicked-in were quite different from what it should have been based on sliding scale logic mentioned in Frank’s blog. YP immediately brought this to my attention (Thanks YP!). I dug a bit on this and found the issue. Instead of commenting on Frank’s blog, I thought of offering a deeper explanation here:

I will use the same example that was used in Frank’s blog. Consider a server configured with 96GB of RAM. The MinFreePct threshold will be set at 1597.36MB based on a sliding scale shown in the following table:

Threshold       Range (MB)                    Reserved Free Memory (MB)

6%                  0 – 4091                 245.76

4%                  4092 – 12287           327.68

2%                  12288 – 28671         327.68

1%                  Remaining                696.32 (in this case)

Total Free Mem                                     1597.36

For the host considered in the above example, various memory reclamation techniques kick-in at different thresholds as explained below:

Free Memory State    Threshold (% of MinFree)    Threshold in MB    Reclamation  Type               

Soft to High                        64 to 100                  1022.31 – 1597.36         Balloon

Low to Hard                       16 to 64                    255.57 – 1022.31          Balloon, Compression and/or Swap

Please note:

  1. There is no separate reclamation target for Memory Compression. It uses ‘Swap Target’ to reclaim memory.
  2. The choice of using memory compression [when enabled] or host-swap is dynamic. vSphere tries to use memory compression, but if it cannot reclaim enough memory soon it will resort to host swapping.
  3. Decrease in memory pressure doesn’t mean that the respective reclamation targets are set to zero immediately. vSphere constantly monitors the memory pressure in the host and gradually reduces a reclamation target if it finds memory pressure to have reduced. On the other hand, memory states could change as soon as the memory pressure in the host changes. Hence, it is possible for you to see some memory reclamation (balloon or swap) for extended time till the respective reclamation targets become zero even after the memory states indicate no or reduced memory pressure.

Hope this helps you understand when a specific type of memory reclamation kicks-in and why you would see it even when you don’t expect to see it. Feel free to throw in your comments or questions :-)

Storage IO Control and Storage vMotion?

My colleague Duncan posted an article on yellow-bricks regarding storage vMotion (sVMotion) of a virtual disk placed in a storage IO control (SIOC) enabled datastore. I thought of providing some more information on this topic..

Yes, sVMotion will be treated as a regular stream of I/O requests coming from a particular VM to a vmdk that is placed on a SIOC enabled datastore. If the datastore wide I/O latency exceeds the congestion threshold of the datastore, SIOC kicks in and adjusts the device queue in the host according to the aggregate disk shares of all the VMs on the host that share the datastore. Within a particular host, I/O requests of each VM is given preferential priority based on VM’s disk shares. The I/O requests can be from an application needing data or from ESX requesting  sVMotion of the vmdk on a non-VAAI compatible storage.

How does SIOC treats sVMotion’s I/O traffic? When SIOC is active, a VM is allowed to have a certain number of concurrent I/O requests queued in the host for the SIOC enabled datastore. If sVMotion is initiated on the VM when it was actively issuing I/O requests, the VM’s quota of concurrent I/O requests will be shared by both sVMotion traffic and the other I/O traffic from the VM to the datastore. If the VM is sparsely issuing I/O requests to the datastore, then its quota of concurrent requests will be dominated by sVMotion traffic.

Note in both cases, the total concurrent I/O requests (sVMotion + other I/O traffic) is limited to a value proportional to the disk shares of the VM on the datastore.

What happens when the storage is VAAI compatible? ESX issues the sVMotion command to the storage. The storage initiates sVMotion on behalf of ESX. ESX doesn’t even see the sVMotion traffic. In this case, the VM is free to use its full quota of concurrent I/O requests.

You will not be able to see the exact number of each I/O request types in the device queue. Good news is, that you don’t have to worry about them. SIOC is capable of  handling these varying traffic conditions for you. If you feel geeky, and really want to get into this, your best bet will be monitoring the difference in sVMotion’s completion time under different load conditions of your VM. But know this – irrespective of the VM’s load situation, SIOC will not let sVMotion affect the I/O traffic on the datastore from any other VM. Though, the response time of I/O operations in the VM on which sVMotion was initiated will be affected by sVMotion.

What if the datastore is not congested? SIOC lets sVMotion use its full quota of bandwidth until datastore becomes congested (datastore wide latency > congestion threshold). Then SIOC does what it is designed to do.

Here is a question for you – If you have to sVMotion a vmdk on a SIOC enabled datastore when do you do it? ;-)

How cool is vscsiStats? Part-II

I enabled vscistats collection in my vSphere host before starting the purge2 operation (check my white paper for more details) in the vCenter database . While the operation ran, I collected 20 samples of vscsiStats output at equal intervals (each interval was 7.5 seconds). vscsiStats output consists of histograms of various metrics – outstanding IOs, seek distance, length of a request, arrival time, all split between reads and writes. To obtain the histogram of a given metric at a particular time instant, I divided the difference in histogram values of the metric collected at successive time intervals by the sampling interval.

Example:

Outstanding Read IOs (=1) at time t(x) = (Outstanding Read IO (=1) until time t(x) – Outstanding Read IO (=1) until time t(x-1))/Sampling interval

Outstanding Read IOs (=2) at time t(x) = (Outstanding Read IO (=2) until time t(x) – Outstanding Read IO (=2) until time t(x-1))/Sampling interval

:

:

Outstanding Read IOs (>64) at time t(x) = (Outstanding Read IO (>64) until time t(x) – Outstanding Read IO (>64) until time t(x-1))/Sampling interval

NOTE: If you are thinking that the above steps are very cumbersome, I agree with you. I have an excel template which does it for me. All I need is vscsiStats output of 20 consecutive samples all saved in a single excel file. Irfan, in his virtual scoop blog, has provided few links to some neat blogs on visualizing vscsiStats. Check his blog.

I followed the above steps for the following histograms in vcscsiStats output.

Outstanding IOs:

Figure 1. Outstanding IOs during purge2 operation.


The graphs in figure 1 show the outstanding IOs during purge2 operation.  The number of read outstanding IOs was 64 (tidbit: pvscsi driver installed in the guest operating system has a default queue length of 64. During purge2 operation, the I/O queue in the pvscsi driver was full with read requests. Hence the number of outstanding read IO requests coming from the VM was 64) whereas the number of write outstanding IOs was zero.

Request Type: Graphs in figure 1 also show that the purge2 operation consisted of only read requests.

NOTE: Since purge2 operation is completely dominated by reads, for the remaining vscsistats histograms I only considered the respective read histograms.

Randomness: To identify the randomness of the purge2 operation I looked at the ‘seek’ histogram in the vscsiStats output.

Figure 2. Seek distance between read requests


The seek read distance histogram shows the distance between consecutive read requests in terms of logical blocks. A seek distance of 1 logical block between consecutive requests indicate a purely sequential workload. A seek distance of < 10 logical blocks indicate a quasi-sequential workload. A seek distance of 10+ logical blocks indicate a random workload. In this case, the seek distance between successive read requests was 500,000+ logical blocks, indicating a pure random read access pattern.

Size of an I/O Read: The last parameter I needed was the size of an I/O read request during the purge2 operation. This was provided by the ‘ioLengthReads’ histogram.

Figure 3. Size of Read Requests during purge2 operation

The size of the read requests seen during purge2 operation varied from 16KB to 64KB with some requests  as large as 128KB. The variation in I/O size indicates some kind of optimization employed during reads to fetch as much data as possible in one read operation.

Arrival Time for Reads: Another interesting histogram provided by vscsiStats (that is not required to create an IOmeter workload profile, but interesting) is the arrival time for the I/O requests (in this case for reads).

Figure 4. Arrival Time for Reads during purge2 operation

An arrival time of ≤100 microseconds  indicates that purge2 operation was very I/O intensive (also evidenced by 64 outstanding read requests throughout the operation).

With the information I collected from vscsiStats, I created a workload in IOmeter with the following paramters:

  • Outstanding IOs: 64
  • Access Type: 100%Read, 100%Random
  • Request size: 48KB (median of 16KB, 32KB, 48KB, 64KB, 128KB)

Rest, you will know when you read the white paper ;-)

Next time, you get into troubleshooting I/O problems or planning storage resource for your vSphere environment, remember that you have the secret sauce at your finger tips. Surprise your storage admins by speaking in a language they understand – outstanding IOs, request size, access pattern and more..

Isn’t vscsiStats cool?

How cool is vscsiStats? Part-I

It has been few days since I published a white paper on the performance characterization of SQL server-based vCenter database. Few people have asked me questions about the vscsiStats graphs in the appendix of the paper. Instead of answering the questions individually I decided to blog here for the benefit of all the readers.

As mentioned in the paper, I observed this rather unusual (yes, I say unusual because I didn’t expect performance of virtual I/O stack to be better than that of native) during some of the experiments. Using the vCenter application to reproduce this behavior was rather complex and involved too many variables. Hence I decided to use a simple I/O benchmark, which most of you are familiar – IOmeter (http://www.iometer.org/). But to reproduce the issue, I needed to use the exact same I/O load as that produced by the stored procedures of the vCenter database. To create a custom workload profile in IOmeter I was required to configure outstanding IOs, I/O request size, read percentage and percentage of randomness (at least). Question was – how to get these parameters from the workload?

vscsiStats provided the answer. Scott Drummonds (during his VMware days) wrote a great blog on vscsiStats. I highly encourage the readers to go through the article and understand the basics of vscsiStats (if you are not already familiar with the tool).  This will help you appreciate the content of this multi-part blog series. Instead of dwelling on the details of vscsiStats, I will illustrate the usefulness of vscsiStats here.

First, a quick description of a sample histogram output from vscsiStats. All the histograms have similar format and should be straightforward to understand.

If you are curious about this tool and want to learn more, check out these technical literatures:

  1. Storage Workload Characterization and Consolidation in Virtualized Enviornments” - Ajay Gulati, Chethan Kumar, and Irfan Ahmad presented at VPACT 09 (Yes, I was one of the authors)
  2. vscsiStats: Fast and Easy Disk Workload Characterization on VMware ESX Server” – Presentation by Irfan Ahmad at VMworld 2007 (excellent presentation by one of the creators of this tool)

Up next: The histograms I collected …

Running Virtual Center Database in a Virtual Machine

I just completed an interesting project. For years, we at VMware believed that SQL server databases run well when virtualized. We have illustrated this through several benchmark studies published as white papers. It was time for us to look at real applications. One such application that can be found in most of the vSphere based virtual environments is the database component of the vCenter server (the brain behind a vSphere environment). Using the vCenter database as the application and the resource intensive tasks of the vCenter databases (implemented as stored procedures in SQL server-based databases) as the load generator, I compared the performance of these resource intensive tasks in a virtual machine (in a vSphere 4.1 host) to that in a native server.

From a sheer size perspective, the database doesn’t pop out many eyes. But from the criticality standpoint it can give sleepless nights. The database though ~93GB in size, included an inventory that in reality can represent a very large vSphere based virtual datacenter. With 500 hosts and 8000 virtual machines, performance of this vCenter database becomes extremely important. The tasks whose performance I studied are among the most common operations that can affect the performance of a vCenter database. These operations included:

  1. Rolling up performance statistics
  2. Calculating top resource consumers
  3. Purging old performance statistics

I won’t go into the details of these operations. Refer to the white paper for an explanation. Though I expected the performance of the virtualized database to be very close that of the native database, I didn’t expect it to be this close. Both in terms of CPU utilization and I/O performance, the virtual machine was very close to native (in some cases better than native!). I also observed an unusual behavior of native I/O stack during my experiments (I will blog about it next time, but for now check the appendix of the paper).

Here, this is for you VI admins – if you are thinking of virtualizing vCenter database (why not? you can virtualize anything and everything these days ;-) ) here is a study that should give you the confidence to take the next step. If you have any comments to share with me or rest of the community, interesting facts, or tough performance issues, feel free to drop a comment or two here.

Oh!, BTW here is the link to the white paper.

Previous studies:

  1. http://www.vmware.com/files/pdf/perf_vsphere_sql_scalability.pdf
  2. http://www.vmware.com/pdf/SQL_Server_consolidation.pdf
  3. http://blogs.vmware.com/performance/2009/07/summary——–vmware-distributed-resource-scheduler-drs-dynamically–allocates-and-balances-computing-resources-in-a-clust.html
  4. http://communities.vmware.com/blogs/chethank/tags/performance
Follow

Get every new post delivered to your Inbox.

%d bloggers like this: