The concept of "Production Readiness" (or "Production Suitability") is at the heart of the Dennis Adams Associates Limited mission statement. Whilst the basic concepts have remained the same, the terminology, wording and approach have evolved over time, as experience has grown in assessing systems for IT Production suitability. I am particularly grateful for input from Lawrence Yarham, who helped me to define some of these criteria in a more rigorous form.
The main criteria which determine whether an application or solution is suitable for IT Production use are as follows:
- Scalability
- Reliability and Stability
- Resilience
- Backup and Recovery
- Security
- Monitoring and Management
- Supportability
The following are the detailed definitions and supporting descriptions which we use when doing an IT Production Assessment.
Scalability
One of the key benchmarks for deciding if a product is "Production-ready" is whether or not it can scale to the number of users / application instances etc. which may be required. More importantly, as the number of end-users or application instances increases, how much (proportionally) additional hardware etc. is required in order to deliver the extra capacity?
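The question of how much additional hardware is needed for each increment of capacity can be expressed as a single "scaling efficiency" figure. The following is a minimal sketch only, not part of any formal assessment method; the function name `scaling_efficiency` and the sample measurements are hypothetical, chosen purely for illustration.

```python
def scaling_efficiency(base_units, base_throughput, units, throughput):
    """Return achieved speed-up divided by ideal (linear) speed-up.

    1.0 means capacity grew exactly in step with hardware; lower values
    mean each extra unit of hardware delivers proportionally less.
    """
    if base_units <= 0 or base_throughput <= 0 or units <= 0:
        raise ValueError("units and throughput must be positive")
    ideal_speedup = units / base_units
    actual_speedup = throughput / base_throughput
    return actual_speedup / ideal_speedup


# Hypothetical measurements: (CPU count, requests per second).
measurements = [(1, 100.0), (2, 190.0), (4, 340.0), (8, 520.0)]
base_units, base_throughput = measurements[0]

for units, throughput in measurements:
    eff = scaling_efficiency(base_units, base_throughput, units, throughput)
    print(f"{units} CPUs: {throughput:.0f} req/s, efficiency {eff:.2f}")
```

An efficiency that falls steadily as hardware is added suggests that something other than raw capacity is limiting the application.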
"Linear Scalability" by throwing hardware at an application is very unlikely. In practice, some near-linear scalability could be achievable. Sometimes the application architecture (or threading model) means that it works better on a single-CPU environment, rather than being able to take advantage of a multi-CPU system. In addition, sometimes applications are written with subtle in-built limitations that prevent them scaling. As well as Scalability, we are also interested in "expandability", i.e. to what extent can the application be enhanced and expanded to adapt to possible future requirements.
Reliability and Stability
For the purposes of our assessment, Reliability is concerned with the extent to which the application will deliver the expected results in a consistent and repeatable fashion, irrespective of changed load and/or changed environmental circumstances.
In addition, we define "Stability" as the ability to run unattended for long periods of time without operational intervention. Reliability therefore has to do with predictable, repeatable behaviour, whereas Stability has to do with repeatable behaviour over time.
For example, some applications may be initially reliable, but their performance and/or reliability degrades over time, due to memory leaks, resulting in the necessity for the application to be restarted. This is a classic example of an application that is "Reliable" (i.e. behaves properly when it works), but at the same time it is not "Stable" (i.e. it degrades over time).
Resilience
"Resilience" can be defined as the ability to recover quickly from a failure of one or more components that make up an overall system.
Resilience therefore differs from Reliability and Stability. Reliability and Stability are concerned about behaviour under load and behaviour over time, assuming that all the components are in place. Resilience is concerned about how the system behaves if a component is lost.
A Resilience assessment takes the view that "if something could go wrong, it will do so". At a very basic level, Resilience of a Server is implemented by using dual power supplies, RAID storage controllers, Redundant Parity RAM etc. In general, it is expected that these low-level resilience issues are addressed adequately in modern high-performance production systems. In this context, Resilience assessment is concerned with how to implement Clustering mechanisms to guard against the possibility of failure of an Operating System, and how to ensure that there is no single point of failure within the architecture. Resilience should also include an assessment of how to implement Disaster Recovery mechanisms, and how to implement off-site recovery.
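As a first pass at the "no single point of failure" check, an architecture can be described as a set of tiers and the hosts that serve each one. The sketch below assumes such a description is available; the `ARCHITECTURE` mapping, its tier names and its host names are all hypothetical.

```python
# Hypothetical architecture: tier name -> hosts currently providing that tier.
ARCHITECTURE = {
    "load_balancer": ["lb1", "lb2"],
    "web": ["web1", "web2", "web3"],
    "database": ["db1"],
    "message_queue": ["mq1", "mq1"],  # listed twice but still one host
}


def single_points_of_failure(architecture):
    """Return the tiers, sorted by name, served by fewer than two distinct hosts."""
    return sorted(
        tier for tier, hosts in architecture.items() if len(set(hosts)) < 2
    )


print(single_points_of_failure(ARCHITECTURE))
```

Two distinct hosts are necessary but not sufficient: hosts that share a power feed, a network switch or a data centre can still fail together, so a real assessment has to look at those shared dependencies as well.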
Backup and Recovery
Backup and Recovery should be supported by applications and systems for two distinctly different reasons.
Firstly, Backup extends the idea of Resilience - i.e. how to respond to failure of a component - to look at how to respond to the failure of all components. This is typically implemented by using backup & recovery techniques. For example, failure of an entire data centre may be addressed by implementing a live standby data centre (resilience), or by taking off-site backups which can be used to re-create the application on a replacement machine elsewhere (backup & recovery). In some cases, an application may be so resilient that there is no necessity for backup or recovery for the purpose of guarding against failure.
Secondly, Backup can be used in order to recover the system to a known state at a specific point in time. There may be a number of reasons for this. One reason might be that some business logic (or dependent application) has resulted in corruption and it is necessary to go back in time to recover. A second reason may be to build an "archive" or "historical" copy of the application for the purposes of analysing historical trends, or setting up a test or development environment.
"Backup and Recovery" can often have subtle implications for the design of systems. Simply backing up the disk contents at Operating System level is not adequate if the application is expected to perform "ACID" transactions, because a disk-level copy may capture a transaction that has only been partly written. (The basic properties of a database transaction are: Atomicity - the entire sequence of actions must be either completed or aborted, and the transaction cannot be partially successful; Consistency - the transaction takes the resources from one consistent state to another; Isolation - a transaction's effect is not visible to other transactions until the transaction is committed; and Durability - changes made by the committed transaction are permanent and must survive system failure.) Therefore, it is necessary to have the ability to identify the begin-end point of business transactions so that data is consistent.
Security
"Security" is one of those aspects of IT development and deployment that can sometimes be ignored in the early stages of software design. In this context, we are concerned not only with the security of the application as presented to the end user (e.g. the ability to implement IP firewalls, packet filters etc.), but also with isolation of the Production Application from any development / test versions. For example, some applications have a documented API that enables any developer to call the business logic. Under those circumstances, what is to prevent a developer (either deliberately or by mistake) from calling the Live Production business logic from within an application sub-net?
Security is concerned with the following basic principles:
- Authentication: is the person or object attempting access who they say they are?
- Authorisation: is the capability of this person/object clearly defined & appropriately restricted?
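The two principles above can be sketched as separate checks in front of the business logic. This is an illustration under stated assumptions, not a recommended security design: the `CALLERS` registry, the shared secrets and the environment names are all hypothetical, and a real deployment would keep secrets out of source code.

```python
import hashlib
import hmac

# Hypothetical registry: caller id -> (shared secret, environments it may call).
CALLERS = {
    "prod-web": (b"prod-secret", {"production"}),
    "dev-laptop": (b"dev-secret", {"development", "test"}),
}


def sign(secret, message):
    """Return the hex HMAC-SHA256 signature of message under secret."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def authenticate(caller_id, message, signature):
    """Authentication: is the caller who it says it is?"""
    entry = CALLERS.get(caller_id)
    if entry is None:
        return False
    return hmac.compare_digest(sign(entry[0], message), signature)


def authorise(caller_id, environment):
    """Authorisation: is this caller permitted to act in this environment?"""
    entry = CALLERS.get(caller_id)
    return entry is not None and environment in entry[1]


def call_business_logic(caller_id, message, signature, environment):
    if not authenticate(caller_id, message, signature):
        return "rejected: authentication failed"
    if not authorise(caller_id, environment):
        return "rejected: not authorised for " + environment
    return "accepted"


request = b"post-trade"
dev_signature = sign(b"dev-secret", request)
print(call_business_logic("dev-laptop", request, dev_signature, "test"))
print(call_business_logic("dev-laptop", request, dev_signature, "production"))
```

In this sketch the developer's call is genuine, so it passes authentication, but it is still refused against production because the caller is not authorised for that environment.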
Monitoring and Management
Monitoring and Management is a key part of the day-to-day function of any IT Production Team. One of the purposes of monitoring is to proactively identify any adverse changes in the behaviour of the system and/or its environment, in order to take appropriate corrective action before the change impacts the business client. For this reason, "Monitoring by exception" is most appropriate. Implementing asynchronous SNMP traps is one way of achieving this.
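The idea of monitoring by exception can be sketched independently of any particular transport. The example below does not use SNMP; it is a plain Python illustration in which the `THRESHOLDS` table and the metric names are hypothetical. It reports only those readings that fall outside their acceptable range, and stays silent otherwise.

```python
# Hypothetical thresholds: metric name -> (lowest, highest) acceptable value.
THRESHOLDS = {
    "cpu_percent": (0.0, 85.0),
    "free_disk_gb": (20.0, float("inf")),
    "queue_depth": (0.0, 1000.0),
}


def exceptions_only(sample, thresholds=THRESHOLDS):
    """Return alert messages for out-of-range metrics; an empty list means all is well."""
    alerts = []
    for metric, (low, high) in sorted(thresholds.items()):
        if metric not in sample:
            # A missing reading is itself an exception worth reporting.
            alerts.append(f"{metric}: no reading received")
            continue
        value = sample[metric]
        if not low <= value <= high:
            alerts.append(f"{metric}: {value} outside {low}..{high}")
    return alerts


healthy = {"cpu_percent": 40.0, "free_disk_gb": 120.0, "queue_depth": 12.0}
busy = {"cpu_percent": 97.5, "free_disk_gb": 120.0, "queue_depth": 12.0}

print(exceptions_only(healthy))
print(exceptions_only(busy))
```

Treating a missing reading as an alert is a deliberate choice in this sketch, since a silent monitor is otherwise indistinguishable from a healthy system.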
A second form of monitoring is "trend analysis", the purpose of which is to extract time-series data in order to model the long-term behaviour of the system and to collate it against business trends for Capacity Planning purposes.
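As a simple illustration of trend analysis for Capacity Planning, the sketch below fits a straight line to a time series by ordinary least squares and estimates when a capacity limit would be reached. The function names and the sample figures are hypothetical, and a straight line is assumed only for simplicity; real usage may well follow a different growth curve.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all observations share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_capacity(days, usage, capacity):
    """Estimate days after the last observation until usage reaches capacity.

    Returns None when usage is flat or falling, since no limit is approached.
    """
    slope, intercept = linear_fit(days, usage)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - days[-1]


# Hypothetical daily disk usage in GB over four days.
days = [0, 1, 2, 3]
usage_gb = [10.0, 12.0, 14.0, 16.0]
print(days_until_capacity(days, usage_gb, capacity=30.0))
```

Setting the resulting estimate against expected business growth is what turns the raw time series into a capacity plan.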
Management is another key role in IT Production. In this case, we are concerned with how easy it is to amend or adjust the configuration of the application, and adjust its environmental behaviour. The important point is that such configuration should be as automated (and intuitive) as possible, in order to minimise IT Production costs for supporting the running application.
Supportability
"Supportability" can be defined as the features which make the application or system able to be supported by a "Business as Usual" IT team. This is a general extension of the concepts of Monitoring and Management, above.
The significant issue with the "Supportability" assessment is whether the application can be supported at a reasonable cost. In practice, this means ensuring that we can minimise the amount of manual intervention required to keep the application at its appropriate level of activity.