Diaxion Blog

Implementation of ITIL

January 11th, 2011

Most people consider implementing ITIL difficult and quite daunting; however, this depends on the level of maturity the organisation is seeking, and a staged pathway towards an optimised state is generally recommended. Smaller organisations should cherry-pick the processes, and the components within those processes, to avoid overcomplicating the implementation.

Implementations have been successful where they have had support or sponsorship from senior management, project managers, people dedicated to the role of process owner, and people filling the individual roles within ITIL. Other critical success factors include training and communication. Unfortunately, organisations that have implemented ITIL as a service management framework have often been unable to justify the operational costs against the so-called benefits the framework offers. As a result, in most cases it has drifted to the sidelines: remnants of ITIL are still present, but the commitment and management drive have disappeared.

Most organisations have implemented: Incident Management, Service Desk, Change Management, Release Management, Service Level Management and Configuration Management (Service Support).

Some organisations have implemented the entire framework, that is, both Service Support and Service Delivery. These belong to ITIL v2, and not many organisations have upgraded to v3.

Considerations

  • Roles and Responsibilities (people wearing two hats, and the argument that a person can only serve one master unless they are allowed a 50/50 split, depending on their commitments to a particular process).
  • Cost / Benefit Analysis to be performed prior to implementation.
  • Exercise to determine costs before implementation and after to understand operational cost savings.
  • Engage all stakeholders and communicate effectively throughout the implementation to ensure success.
  • Engage existing suppliers in the implementation for the development of workable SLAs
  • Consider tools that will complement the process early in the planning phase
  • Understand what the ITIL organisation will look like 1/3/5 years down the track

On a side note, most organisations actually perform the processes that are described in the ITIL framework but are reluctant to implement it formally because its so-called benefits remain controversial. ITIL provides a common framework and terminology in which organisations can operate, but the question I have is this: as long as the organisation has standards, policies and internally developed frameworks pertinent to its business model, why implement ITIL?

Transitioning from intuitive to data-driven approaches to capacity management

November 17th, 2010

What does capacity planning look like within your organisation?

In most organisations, existing capacity planning methodologies are rooted in decades of traditional physical server deployments and dedicated infrastructure. If an application server is at 50% load today, and load has historically doubled every 24 months, chances are your capacity planning methodology predicts that you have two years before you must add further capacity. While such an approach may work acceptably when dealing with dedicated, physical server instances, the now widespread use of production server virtualisation limits how accurate, and therefore how worthwhile, such predictions can be.
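The traditional projection described above can be written down in a few lines. This is a minimal sketch, assuming load grows exponentially with a fixed doubling period; the function name and the 100% ceiling are illustrative choices, not part of any particular tool.

```python
import math


def months_until_full(current_load, doubling_months, ceiling=1.0):
    """Months until an exponentially growing load reaches the ceiling.

    current_load and ceiling are fractions of total capacity (0.5 = 50%).
    Assumes load doubles every `doubling_months` months, indefinitely.
    """
    if not 0 < current_load <= ceiling:
        raise ValueError("current_load must be positive and not above the ceiling")
    return doubling_months * math.log2(ceiling / current_load)


# The example from the text: 50% load, doubling every 24 months.
print(months_until_full(0.5, 24))
```

The simplicity is the weakness: the calculation knows nothing about other workloads sharing the same hardware, which is exactly where it breaks down on virtualised hosts.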

Further, when this approach fails, IT managers and administrators typically fall back on intuitive approaches to capacity planning –  responding to reports of application slowness, or to changes in headcount, in a linear manner that does not account for the complex relationships between application performance and each layer of the infrastructure upon which the applications are hosted.

These intuitive capacity planning methodologies are at best inefficient, resulting in needless or poorly targeted infrastructure investment. At worst, they can be completely ineffective, resulting in highly reactive approaches to infrastructure management with significant operational costs.

Virtually Unknowable

Virtualisation – along with the adoption of other shared systems, such as clustered database and web servers, hardware load balancing appliances and storage area networks – necessitates a holistic approach to capacity planning. It is no longer enough to simply understand resource utilisation on an application-by-application basis. Instead, IT managers must consider the inter-relationships between applications: when the peak periods for individual applications occur, which applications have overlapping peak periods, how applications map to line-of-business functions, and so on. Each additional piece of data that must be included in capacity planning calculations sharply increases the complexity of the forecasting, raising the likelihood of error and therefore decreasing the value of the capacity planning exercise itself.
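To illustrate why overlapping peaks matter on shared infrastructure, here is a minimal sketch that adds up per-application load profiles bucket by bucket. The application names, the four time buckets and the load figures are made up for illustration only.

```python
def combined_peak(profiles):
    """Return (bucket index, total load) for the busiest time bucket.

    `profiles` maps an application name to a list of load figures,
    one per time bucket, all measured in the same unit.
    """
    lengths = {len(p) for p in profiles.values()}
    if len(lengths) != 1:
        raise ValueError("all profiles must cover the same time buckets")
    buckets = lengths.pop()
    totals = [sum(p[i] for p in profiles.values()) for i in range(buckets)]
    peak = max(range(buckets), key=lambda i: totals[i])
    return peak, totals[peak]


# Hypothetical load per bucket for three applications on one shared host.
profiles = {
    "payroll": [10, 60, 20, 10],
    "web": [20, 50, 30, 20],
    "batch": [70, 10, 10, 60],
}

peak_bucket, peak_load = combined_peak(profiles)
sum_of_peaks = sum(max(p) for p in profiles.values())
print(peak_bucket, peak_load, sum_of_peaks)
```

In this made-up data the shared host has to carry a combined peak of 120, while sizing each application for its own peak would suggest 180. Neither figure can be derived from looking at one application at a time.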

Given this, it is no wonder that in most organisations capacity planning for virtualised environments remains an ad hoc process, with virtual infrastructure administrators applying the traditional physical server capacity planning methodology to ESX hosts and simply trying to manage around its shortcomings via “agility” in infrastructure procurement and deployment.

A New Approach

A new generation of tools is beginning to emerge that seeks to resolve these problems. Approaches vary across vendors, but we can see common themes between them:

  • the ability to automate application mapping, allowing analysis to incorporate relationships between servers
  • the ability to rationalise performance and capacity metrics from multiple infrastructure layers – typically application, database, operating system, hypervisor, network and storage
  • the ability to perform scenario-based modelling of growth

By automating discovery and data collection, and by operating across all layers of the application/infrastructure stack, these tools help drive a transition from the old, intuitive capacity planning methodologies to one that is based on hard data, and therefore much better able to accurately predict capacity demands within your unique environment. And, as we will discuss in a forthcoming post, such a data-driven approach is critical to managing not just capacity forecasts but application performance as well.

Data-Driven Approaches to Performance Management

November 17th, 2010

Does your organisation have a true performance management methodology? For the majority of organisations the answer is simply “no” – performance management amounts to a variety of disparate, ad hoc and predominantly reactive  processes. Examples of such approaches may include:

  • Server utilisation monitoring – perfmon statistics – CPU/memory/disk utilisation etc.
  • CMDB
  • User-feedback – “It seems slow”
  • Transaction response time monitoring (“stopwatch testing”)

Common limitations of these legacy approaches include:

  • Not data-driven
  • Reactive – bottlenecks are typically only identified after they cause performance problems
  • Don’t take into account shared systems – virtualisation/SAN/network
  • Obtaining more useful data requires significantly greater operational investment
  • Tools tend to focus on individual infrastructure layers, making it difficult to build processes that are useful across the entire enterprise infrastructure
  • Baseline performance benchmarking only useful for before/after analysis – cannot be used for accurate “what if” scenario planning

The largest challenge faced by infrastructure administrators in responding to performance problems is a lack of data. When users complain that “it’s slow” administrators lack the critical information needed to effectively respond – how did the application perform before the issue arose; what utilisation metrics correlated to the previous, acceptable, performance level; to what degree is performance now degraded; what has changed between then and now?

As a result, administrators tend to fall back on intuitive approaches to performance troubleshooting – looking for errant performance metrics, reviewing code release schedules and recent infrastructure changes – and frequently resort to crude techniques such as increasing available computational resources in the hope of resolving performance bottlenecks. Such approaches are inefficient, are not cost effective, and do not scale to large, complex environments.

A new wave of tools is emerging that seeks to resolve these problems. These tools are generally “cross domain”, referring to their ability to collect and analyse data from multiple infrastructure layers. Typically, they include the ability to determine whether performance variations are due to increased load, code changes, infrastructure changes, or the performance of shared system components (i.e. where an application’s performance is degraded due to increased load on a shared component such as a virtualisation farm).

An additional benefit of a data-driven performance management approach is the ability to “right-size” infrastructure. Particularly in virtualised environments, resources are often over-allocated to individual servers and are therefore wasted. Once administrators fully understand the true performance and resource requirements of applications, these wasted resources can be reclaimed and reallocated.
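The right-sizing arithmetic itself is simple once the measurements exist. The following is a minimal sketch, assuming you already have an observed peak figure per server; the function name and the 25% headroom default are illustrative assumptions rather than a recommendation from any vendor.

```python
def reclaimable(allocated, observed_peak, headroom=0.25):
    """Amount of a resource that could be reclaimed from one server.

    allocated and observed_peak use the same unit (for example GB of RAM).
    headroom is the safety margin kept above the observed peak.
    """
    needed = observed_peak * (1 + headroom)
    return max(0.0, allocated - needed)


# Hypothetical virtual machine: 16 GB allocated, 6 GB observed peak.
print(reclaimable(16, 6))
```

The hard part is not the subtraction but trusting the observed peak, which is why this only works once performance data is collected consistently.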

In addition, these tools are capable of complex scenario modelling. This allows administrators to forecast the impacts associated with, for example, a successful web marketing campaign, the hiring of 100 new office staff, or the opening of a new branch office. As a result, administrators can proactively identify future performance bottlenecks, and IT infrastructure spending can be targeted to where it will deliver the most benefit. Further, by understanding application utilisation trends and knowing where bottlenecks reside in their infrastructure, administrators are able to resolve performance issues before their users even notice them.

Can your organisation say that communication occurs frequently and clearly between your technology teams?

November 2nd, 2010

Identifiable communication interfaces and planned communication are among the critical success factors for delivering a successful project. Communication interfaces need to be understood at the outset of a project, as this will assist with the transition through the development phase to production. The critical points of contact should be identifiable to all who use them, as they provide clear delineation of required communication across technology silos. Transparency of information will build effective relationships between the cross-functional teams, and build trust and strong collaboration.

Adopting clarity amongst existing roles and responsibilities within technology silos will improve communication between technology teams while they prepare to restructure their IT operation to introduce the new function. Representatives of the different support teams should engage in regular “open forum” meetings to ensure each team has visibility of upcoming project issues and changes in scope. Developing open lines of communications between teams will also help prepare the organisation to move the IT project into a production state, drawing on personnel across different support areas to provide a single holistic support team for the IT project.

The engagement between cross-functional teams is a consistent theme in the Diaxion roundtable series. It is clear from the research gathered across these events that organisations of all sizes constantly struggle with effective team engagement in IT projects. However, representatives of the SMB market have indicated that, whilst engaging the team is still an issue, it is much less prevalent due to composite teams and the proximity of staff.

For instance, a virtualisation service is complex and therefore has multiple components that constitute the service and equally, there are multiple specialist teams that provide expert knowledge and support to manage the components. Building a successful virtualisation platform requires input from a large number of specialised functional teams, each working on a different component or subsystem of the platform. Of course, these teams cannot work in isolation; in addition to designing their assigned components, they must also integrate their designs with those of the other components to ensure that the entire platform functions as a whole. It is critical, therefore, in planning a complex platform that project managers specify just which resources and information different teams will need from each other at particular stages of the project.

To help manage the communications aspect of such projects, we propose the following approach:

1. Identify unattended interfaces, areas where communication should be occurring but is not.

2. Look for unidentified interfaces, areas where communication is occurring but has not been planned.

To assist in implementing this approach, an alignment matrix can reveal mismatches between the communications and exchanges that are supposed to occur and those that actually do. It also demonstrates how well the project has been planned and executed. Another method of identifying the participation of roles in completing tasks or deliverables for a project is to use a RACI matrix. Overall, the stated methods of understanding how communication should flow within your organisation will break down barriers and promote healthy, frequent and clear communication between your teams.
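The comparison at the heart of an alignment matrix can be sketched as two set differences. This is a minimal illustration, assuming the planned and the observed interfaces have each been recorded as pairs of team names; the team names and function names are hypothetical.

```python
def normalise(pairs):
    """Treat (a, b) and (b, a) as the same interface."""
    return {frozenset(pair) for pair in pairs}


def compare_interfaces(planned, actual):
    """Return (unattended, unidentified) interfaces.

    unattended:   planned communication that is not occurring
    unidentified: communication that is occurring but was never planned
    """
    planned_set, actual_set = normalise(planned), normalise(actual)
    return planned_set - actual_set, actual_set - planned_set


# Hypothetical project data.
planned = [("storage", "virtualisation"), ("network", "virtualisation"), ("backup", "storage")]
actual = [("virtualisation", "storage"), ("network", "security")]

unattended, unidentified = compare_interfaces(planned, actual)
print(sorted(sorted(pair) for pair in unattended))
print(sorted(sorted(pair) for pair in unidentified))
```

Both lists are worth a conversation: the first shows where planned exchanges need chasing, the second shows where the plan is missing something the teams have already discovered they need.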

Check out our IT Governance Practice’s services

September 16th, 2010

vCIO

The vCIO program is a residency-based engagement that provides the Client with a skilled resource positioned on-site, on either a permanent basis or a part-time basis in order to manage the daily operation or execute the transformation of a roadmap developed by the organisation/Diaxion.

• Short / Medium term • Part time / Full time • Transformation leadership / Execution of Roadmap

IT OPERATIONS ASSESSMENT

The IT Operations Assessment is a pragmatic review/improvement exercise of IT operations at a company and/or technical silo level. Each of the following assessment areas can be encompassed in a holistic review or alternatively can be reviewed as a single component within your IT organisation.

• Skill set • Process / IT Governance • Toolset • People structure • Operation Transformation

This would typically include an audit of the aforementioned areas followed by the development of a strategy/roadmap that is practical in nature with real implementation steps for the organisation. This process can also be seen as a transformation program assisting the organisation to move up the levels of maturity within their IT operation.

IT INFRASTRUCTURE AUDIT

The IT Infrastructure audit encompasses a broad and narrow audit to assist in identifying the scope/areas within an IT organisation that require further analysis.  Each of these assessment areas can be encompassed in a holistic audit or can be audited as a single component within your own IT organisation.

• Key services and applications review • High Level IT Architecture audit • Operation Transformation • Process maturity assessment • Process / procedure development / training / implementation • Run book creation / review.

Understanding IT Governance – Part Two

May 4th, 2010

This is the final part of a two-part series discussing the ten key principles of IT Governance. Part two covers the last five of the ten principles.

Key Principles 6-10

6.  Provide the right incentives

One of the most common problems with incentives and reward systems is their misalignment with the behaviours the IT governance arrangements were designed to encourage. The most prevalent question is how the organisation can expect governance to work when the incentive and reward systems are driving a different behaviour. We believe this is bigger than IT governance; nonetheless, it does contribute to ineffectiveness when incentives are not aligned to organisational goals.

Avoiding financial disincentives to desirable behaviour is as important as offering financial incentives.  For example, some organisations don’t charge for architectural assistance to encourage project teams to consult with architects.  It’s one of the common problems with charge back when business units make their own decisions and don’t consult internal specialists to avoid paying for internal services.

It’s hard to overestimate the importance of aligning incentives to governance arrangements. If a well-designed IT governance system is not effective, one of the first places to investigate is the incentives.

7.  Assign ownership and accountability for IT Governance

It’s the old chestnut of management commitment. Like any major organisational initiative, IT governance must have an owner and accountabilities. More often than not, the board is responsible for all governance, but sometimes it will delegate accountability to an individual or group within the organisation. When selecting the right person or group, the board should consider some issues:

Firstly, IT governance cannot be designed in isolation from other key assets of the firm.  For this reason, the individual or group must have a holistic view of the enterprise and good relationships and credibility with the business leaders in the organisation.

Secondly, IT governance cannot be implemented alone.  It should be made clear by the executive team that all managers are expected to contribute in the same way they would to other governance such as financial or other key assets.

Thirdly, IT assets are becoming more and more important to the successful performance of most enterprises. The individual or group that owns IT governance must understand what the technology is and isn’t capable of. Technical details are not critical, but an understanding of the two-way connection between strategy and IT is.

It is generally the board or CEO that holds a CIO accountable for IT governance performance, with some clear measures of success. It is the responsibility of the board or CEO to announce the CIO as being accountable for IT governance, and this is essential for the success of IT governance. Without this, it is very hard for the CIO to engage the senior management team in the process.

8.  Design governance at multiple organisational levels

For large organisations that have multiple business units, it’s beneficial to consider designing IT governance at multiple levels. A good starting point is enterprise-wide IT governance driven by a small number of strategies and goals. Separate layers of IT governance exist in divisions, business units or geographies and should be part of a holistic level of IT governance. At the lower levels of an organisation there is a demand for synergy, whereas the need for autonomy between business units is greatest at the higher levels in an organisation.

As the lower levels of governance are more often than not influenced by mechanisms designed at the higher levels, it is recommended that enterprise wide IT governance is the starting point when designing governance.  As starting at the higher level is sometimes not possible, starting at the business unit level can be more practical.

9.  Provide transparency and education

It’s quite impossible to have too much transparency or education when it comes to IT governance.  Transparency and education go hand in hand as the more education you have, the more transparency and vice versa.  This can be facilitated using portals, intranets, workshop breakfasts and many other ways of marketing IT governance.

Portals should include tools and resources such as a glossary of IT terms and acronyms.  Some portals provide templates for proposing IT investments complete with cost model calculators.

The less transparent the governance processes are, the less people follow them.  The more special deals are made, the less confidence there is in the process and the more workarounds are used. The less confidence there is in the governance, the less willingness there is to play by rules designed to lead to increased firm-wide performance.

Communication and support of IT governance from the executive team is the most important role they play.

10.  Implement common mechanisms across the six key assets

These six key assets are how enterprises accomplish their strategies and generate business value.

  • Human assets:  people, skills, career paths, training, reporting, mentoring, competencies and so on
  • Financial assets:  Cash, investments, liabilities, cash flow, receivables and so on
  • Physical assets: Buildings, plant, equipment, maintenance, security, utilisation and so on
  • IP assets:  Intellectual property (IP), including product, services, and process know how formally patented, copyrighted or embedded in people and systems.
  • Information and IT assets:  Data, information, knowledge about customers, processes performance, finance, information systems and so on
  • Relationship assets:  Relationships within the enterprise  as well as relationships, brand, and reputation  with customers, suppliers, business units, regulators, competitors, channel partners and so on

For example, an organisation that decides to implement a single point of customer contact strategy must coordinate assets to deliver that uniform experience.  Just having good customer loyalty (that is, relationship assets) without the products to sell (IP assets) will drain value. Not having well-trained people (human assets) to work with customers supported by good data and technology (information and IT assets) will drain value. Not having the right buildings and shop fronts to work from or in which to make the goods (physical assets) will drain value. Finally, not coordinating the investments needed (financial assets) will drain value.

Put this way, the coordination of the six assets seems blindingly obvious.

Understanding IT Governance – Part One

March 15th, 2010

In this two part series, I will provide an insight into the ten key principles of IT Governance. Part one will cover the first five principles and part two will cover the remaining five.

Let’s get started!

It seems there are many definitions for IT Governance. I would like to start by describing the definition as I understand it from my experience.

Definition

A framework for the leadership, organisational structures and business processes, standards and compliance to these standards, which ensures that the organisation’s IT supports and enables the achievement of its strategies and objectives.

So what does that mean? I have produced a concise version of what I believe the above statement represents, which is: “enable your organisation to operate in unity with people, process and technology to achieve business and IT alignment.”

Key Principles 1-5

1. Create / design the mechanisms for IT Governance.

Most mechanisms are designed as a result of a problem such as not gaining sufficient Return On Investment (ROI) on hardware / software or duplication of activities across your IT organisation – this is a tactical way of doing things and limits IT organisations from focusing on implementing mechanisms in a more strategic manner. Creating mechanisms using a strategic approach will help meet the company’s objectives and goals.

As with the implementation of anything, whether it be a process or a new service, gaining the support of senior management is imperative for success. In many cases, organisations won’t have specific IT Governance in place; however, it is possible to leverage existing mechanisms that are used in the business, such as project review, baselining and cost recovery models. It is important that the mechanisms that are built follow a constant improvement cycle to ensure they remain agile and don’t impede the operation of your IT organisation. It is considered beneficial to have as few effective mechanisms as possible.

2. Redesigning mechanisms for IT Governance

The redesign process for IT Governance structure in your organisation will often require that staff learn new roles and build new relationships with different parts of the business and IT. As learning takes a long time, the redesign process should only be completed on an infrequent basis. Some companies change governance to encourage a certain behaviour resulting from changes in strategy.

Transformation of the way an IT organisation operates can promote many other issues and can often take months to implement. Having said that, IT Governance can be used in the transformation process as a lever to encourage change within your organisation. An example of this might be changing how IT budgets are defined / managed from a single business unit to a holistic company perspective.

3. Get Senior Management involved

Firstly, it is essential that CIOs and other senior managers are involved in IT Governance to ensure its success. Participation in committees, approvals and performance reviews is a must.

The participation of senior managers facilitates improving the synergy across the organisation and also creates the awareness to the business that IT should be viewed in the context of the entire company and not just a support function.

Senior managers are generally willing to be involved but, more often than not, are unaware of what value they can add. In light of this, it’s the responsibility of the CIO / key stakeholders in IT Governance to communicate effectively using a Governance Arrangements Matrix. The Arrangements Matrix is similar to a Roles and Responsibilities matrix in that it describes who is empowered to make decisions about specific aspects of IT Projects.

4. Making Choices

Good governance can be compared with having good strategy – both require choices. As with most processes, it is not possible to meet every goal, but the process should be able to identify conflicting goals and have a sub-process that brings these conflicts to the table for debate.

Ineffective governance that has conflicting goals is more common in organisations that have directives from different places. This results in confusion, complexity and mixed messages, which can lead to staff ignoring policies and processes. In some cases this can be attributed to a number of unmanageable goals resulting from poor strategic business choices that have nothing to do with the IT organisation. Staff who are responsible for delivering these goals can often become frustrated and inefficient.

5. How to handle exceptions

Exceptions to the rule are how most organisations learn. In particular IT architecture and infrastructure can receive requests for exceptions that are thoughtless with regards to meeting the true business needs. An example of an exception procedure is:-

  • The process is clearly defined and understood by all. Clear criteria and fast escalation encourage only business units with a strong case to pursue an exception.
  • The process has a few stages that quickly move the issue up to senior management. Thus, the process minimises the chance that architecture standards will delay project implementation.
  • Successful exceptions are adopted into the enterprise architecture, completing the organisational learning process.

Having a formal exception process benefits the organisation by helping it learn about technology. Exceptions can also relieve the build-up of pressure, as managers can become frustrated if they are told they cannot do something to help the business.

Metadevices, Striping and You

March 5th, 2010

Storage metadevices are a technology that has been around now for many years.  As a technology it’s great, but the implementation of it should be carefully considered.

Metadevices as a method for storage allocation were envisioned for a number of reasons, but primarily the aim is speed of service. Metadevices allow for what is normally called “Plaid” striping purely at the storage array layer, rather than as a combination of striping at the host and at the array.

 Plaid

Plaid is the pseudo-technical term for a disk stripe of a disk stripe. It’s quite catchy and has stuck for multiple generations of technology staff. It entails multiple virtual devices, be they LUNs from a storage array or devices from a local RAID adaptor, being striped together again. These new stripes are created in a coarse pattern: 128 KB, 256 KB or as much as 1024 KB wide. Hence the name “plaid”, so called for the pattern it produces if drawn in a diagram.

The prime advantage of a plaid is that it spreads the IO load across multiple virtual disks, and hence multiple physical disks in the eventual backend. Using more drives for a particular application reduces the latency and response time associated with IOs, with few penalties. If the second layer of stripes is coarse enough, say 512 KB or 1024 KB in width, the sequential read and write speed of the metadevice can be preserved.
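The spreading effect can be seen by mapping logical offsets on the metadevice to its member devices. This is a minimal sketch, assuming a plain round-robin stripe across equally sized members; real arrays may lay data out differently, and the function name and figures are illustrative.

```python
def locate(offset_kb, stripe_kb, members):
    """Map a logical offset on a striped metadevice to (member, member offset).

    offset_kb: logical offset into the metadevice, in KB
    stripe_kb: width of one stripe element, in KB
    members:   number of member devices, striped round-robin
    """
    stripe_index = offset_kb // stripe_kb   # which stripe element overall
    member = stripe_index % members         # which member device holds it
    row = stripe_index // members           # how many full rows precede it
    member_offset = row * stripe_kb + offset_kb % stripe_kb
    return member, member_offset


# Four member devices with a 512 KB stripe element.
for offset in (0, 512, 1024, 1536, 2048, 2660):
    print(offset, locate(offset, 512, 4))
```

With a coarse element such as 512 KB, a sequential run of IO stays on one member for a long stretch before moving to the next, which is why sequential throughput is preserved while random IO still spreads across all members.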

 Pitfalls of Plaid

Plaid is, of course, a mitigation strategy for the latency impact of the storage devices. If metadevices are created without care or consideration of the underlying disk RAID types or RAID group properties, the plaid pattern can result in hot spots being created on a single drive in the backend of a storage array.

To avoid this situation it is best to consider the stripe width used for the metadevice. Using a coarse stripe width is safest as it avoids the hot spot problem and mitigates the loss of sequential read-write performance, while retaining the spread of IO across drives.

 It’s not all pitfalls

 Plaid striping through the use of metavolumes is a useful tool.  It allows for the best use of all drives for associated applications, resulting in better performance.  Implementation of more complex RAID types is also possible through the use of metavolumes, such as RAID 50 (a striped volume made up of RAID 5 volumes). 

 What does this mean for VMware

VMware is a very disk performance sensitive application; the less latency incurred at the storage array layer, the less total disk latency experienced by the overlying virtual machines. Metadevices are definitely a useful tool when virtualising heavy IO applications such as Oracle or SQL Server workloads.

So long as metadevices are used with careful planning, they can give fantastic results. Failure to plan effectively can result in large amounts of load on single drives, resulting in drive failures and poor performance.

Top 4 Storage mistakes with ESX server

February 25th, 2010

1. Sizing

With the advent of thin VMDKs (virtual machine disks that grow in size as they are written to, saving space in the process), many pitfalls await the inexperienced. Generally speaking, creating datastores that can house about 15 virtual machines and their VMDKs is considered best practice. It is especially important to balance your workloads where possible between the different datastores, and to consider the growth patterns of the virtual machines in a datastore as a whole rather than those of individual virtual machines. By allowing at least 25% free space for growth, you can avoid potential problems later.
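The 25% guideline is easy to check per datastore. The following is a minimal sketch, assuming you can obtain the space currently consumed by each virtual machine; the function name and the figures are hypothetical and not tied to any VMware API.

```python
def free_space_check(capacity_gb, used_gb_per_vm, min_free=0.25):
    """Return (free fraction, meets guideline) for one datastore.

    capacity_gb:    total datastore capacity
    used_gb_per_vm: space currently consumed by each VM's disks
    min_free:       minimum fraction of capacity to keep free for growth
    """
    used = sum(used_gb_per_vm)
    free_fraction = (capacity_gb - used) / capacity_gb
    return free_fraction, free_fraction >= min_free


# Hypothetical 1000 GB datastore holding 15 thin-provisioned VMs.
print(free_space_check(1000, [40] * 15))   # comfortably within the guideline
print(free_space_check(1000, [55] * 15))   # below 25% free
```

Because thin disks grow over time, the check is only meaningful if it is repeated regularly against the datastore as a whole.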

2. Multipathing

With the multitude of supported storage arrays currently on the market, it is easy to go astray with the multipathing configuration for your particular storage array. VMware ESX server defaults to MRU (Most Recently Used) as its multipathing regime. This means that the most recently used path to a particular disk is used as the active path until such time as that path becomes unavailable.

One of the most reliable ways to catch a misconfiguration is to look at the vmkernel logs in the service console of the affected ESX host for frequent entries like this:

Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.614 cpu2:1034)SCSI: 8021: vmhba1:0:7:1 status = 8/0 0x0 0x0 0x0
Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.614 cpu2:1034)SCSI: 8040: vmhba1:0:7:1 Retry (busy)
Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.814 cpu3:1027)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK
Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.814 cpu3:1027)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

To resolve these issues:

Check that your different ESX hosts are not set to fixed pathing to different storage processors on your storage array.

Check that the path policy is set to Fixed if your storage array is active-active, or MRU if it is active-passive.

Check that your physical cabling and SFP (fibre-channel modules) are sound and are working with low levels of loss – this can be checked at the fibre-channel switch.
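The log check described above can be partly automated. The following is a minimal sketch that counts "Retry (busy)" entries per device path; the regular expression is based only on the sample lines shown earlier, so treat the exact log format as an assumption and adjust it for your ESX version.

```python
import re
from collections import Counter

# Captures the device path (e.g. vmhba1:0:7:1) on lines reporting a busy retry.
# The pattern is derived from the sample log lines only (an assumption).
BUSY_RETRY = re.compile(r"SCSI: \d+: (vmhba\d+:\d+:\d+:\d+) Retry \(busy\)")


def count_busy_retries(lines):
    """Return a Counter mapping device path -> number of 'Retry (busy)' entries."""
    counts = Counter()
    for line in lines:
        match = BUSY_RETRY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.614 cpu2:1034)SCSI: 8021: vmhba1:0:7:1 status = 8/0 0x0 0x0 0x0",
    "Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.614 cpu2:1034)SCSI: 8040: vmhba1:0:7:1 Retry (busy)",
    "Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.814 cpu3:1027)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK",
    "Apr 21 11:52:23 esx01 vmkernel: 09:03:06:57.814 cpu3:1027)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY",
]

for device, retries in count_busy_retries(sample).items():
    print(device, retries)
```

To use it against a real host, pass the lines of the vmkernel log file instead of `sample`. A device that shows up with a high and steadily growing count is a good candidate for the checks listed above.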

3. SCSI Bus reset

Setting your HBAs not to use bus resets avoids potential problems where all SCSI devices on a particular HBA are reset rather than just the particular devices required.

In the configuration of each ESX host, select Advanced Settings, then select Disk.UseDeviceReset and set the field to 0.

4. SCSI locking

Some storage arrays require that SCSI locking is specifically enabled for a particular LUN for it to be available for use with ESX server. VMware ESX server uses SCSI locking to help manage the connections from different ESX hosts to one particular LUN that forms a datastore. For some storage arrays, this is a specific setting and must be manually enabled.

Different storage vendors' solutions handle this locking traffic in different ways, and some specific best practice rules apply to each vendor. In relation to SCSI locking, some vendors lock all devices that make up a metadevice, while others lock only the head metadevice. Those that lock only the head metadevice generally have much better performance with ESX server than those that lock all devices.

Most of these issues can be avoided with the use of NFS, although it brings its own problems with the NFS lock manager. In the past this has caused some problems with locking and “split brain” type clustering issues. However, these have been resolved by the ESX350-200808401-BG patch. NFS lock manager performance varies from vendor to vendor, but most enterprise NAS filers perform this function well. The patch listed is nonetheless an important inclusion for correct operation.

Capacity and Performance – What to look for

February 25th, 2010

Performance and capacity are topics often discussed in a virtual environment, but understanding of what can and should be measured, and of how capacity planning takes place, is often limited. There is an obvious inter-relationship between capacity and performance (and availability). It is quite common to see clients with alleged performance issues who use the basic metrics of CPU and memory usage as the (only) key indicators.

Ultimately, what is important is application delivery, productivity and usability for the customers, as this is essentially the reason computer systems exist. Applications are typically managed by application teams and systems by systems teams. It follows that service requests (around performance) are primarily initiated by the end users, who have visibility of application responsiveness and usability in general. The reality is that capacity shortages cause a large share of all outages, and often have a direct relationship with performance.

One non-IT example I’ve always remembered comes from my engineering days, visiting an aluminium smelter. The electricity supply (capacity) was so crucial to the business that a lack of supply for more than a few minutes meant the potlines would solidify, causing an outage lasting anywhere from months to years.

In the virtual environment, the upside is that visibility of the key metrics is much simpler because of the shared nature of the technology, and visibility is of course a key to management. The VMware client readily exposes fundamental items such as datastore used/free space and cluster/host/VM memory and CPU utilisation, which we are all familiar with. Under the hood, the performance counters (and hence the APIs) expose many more metrics (around 150 in total).

Some key ones are as follows:

CPU ready (cpu.ready.summation) – The amount of time spent waiting for a CPU (core) to become available. VMware presents ready time in milliseconds, whilst esxtop displays it as a percentage. This sometimes causes confusion, but the conversion is straightforward: divide the value (say 3,500) by the number of milliseconds in the interval (20 seconds at 1000 ms per second) and multiply by 100: (3500 / (20 * 1000)) * 100 = 17.5%.
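The conversion above is easy to wrap in a small helper. This is only a sketch of the arithmetic in the text; the function name is made up, and the 20 second default is the interval used in the example, so pass a different value if your samples cover a different interval.

```python
def ready_ms_to_percent(ready_ms: float, interval_seconds: float = 20.0) -> float:
    """Convert a CPU ready summation (milliseconds) to a percentage of the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    interval_ms = interval_seconds * 1000.0
    return 100.0 * ready_ms / interval_ms


# The worked example from the text: 3,500 ms of ready time in a 20 second interval.
print(ready_ms_to_percent(3500))  # 17.5
```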

CPU usage (cpu.usage.average) – CPU utilisation expressed as a percentage of the total presented resources (i.e. for a 2 vCPU machine, 100% would represent full utilisation of both vCPUs, but not necessarily the same 2 physical host cores). This is what is visible in the VI client.

Memory swap-in (mem.swapin.average) – The rate at which VM memory is read back in from physical disk.

Memory swap-out (mem.swapout.average) – The rate at which VM memory is put to disk. Both swap in and swap out are excellent indicators of insufficient host memory, more so than just swap utilisation.

Memory usage (mem.usage.average) – This is what is displayed in the VI client, and is expressed as a percentage of granted (assigned) memory.

Disk read latency (disk.totalreadlatency.average) – The round trip time (in milliseconds) from ESX to the platter for a read request to be serviced.

Disk write latency (disk.totalwritelatency.average) – The round trip time (in milliseconds) from ESX to the platter for a write request to be serviced.

Both read and write latency are good indicators of storage health, but should never be used as the sole indicator, and this holds true for all performance metrics.

One important point when looking at performance concerns clusters: the VI client and API both present CPU and memory objects for clusters as well as hosts. Cluster performance is reported simply as an aggregate of the hosts currently in the cluster, so the figures will skew depending on which hosts are present in the cluster at the time. This will have a drastic impact on historical reporting on cluster performance if the cluster nodes are changed significantly or frequently.
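To see why membership changes skew cluster figures, consider a simplified illustration. It assumes the cluster figure is total used CPU divided by total CPU capacity across the current hosts; the exact aggregation the VI client uses is not described here, and the host numbers are invented for the example.

```python
def cluster_cpu_usage_percent(hosts):
    """Aggregate CPU usage for a cluster.

    hosts: list of (used_mhz, total_mhz) tuples, one per host currently
    in the cluster. Assumed aggregation: sum of used over sum of total.
    """
    used = sum(u for u, _ in hosts)
    total = sum(t for _, t in hosts)
    return 100.0 * used / total


# Invented figures: one busy host and two lightly loaded ones.
host_a = (8000, 10000)
host_b = (2000, 10000)
host_c = (1000, 10000)

before = cluster_cpu_usage_percent([host_a, host_b, host_c])
after = cluster_cpu_usage_percent([host_a, host_b])  # host_c moved out

print(f"three hosts: {before:.1f}%")
print(f"two hosts:   {after:.1f}%")
```

No workload changed between the two calculations, yet the cluster figure jumps, which is exactly the effect that distorts historical cluster reports.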

The VMware acquisition of B-Hive in 2008 was no doubt intended to provide a higher-level and more orchestrated management approach to application performance, rather than simply systems performance, and to align those performance characteristics with SLAs. The big picture portrays a virtual world where we have increased visibility and understanding of performance and its relationship to the physical hosting infrastructure, to help us plan, manage, integrate and report better.