End-to-End Delivery

In today's marketplace, because the Internet has become an integral part of the distributed enterprise, the effect of QoS on the mission of the enterprise and on the applications that drive it is more critical than ever. QoS can affect every server, client, network, and application in every sector of the IT industry. The problems inherent in this area are neither trivial nor short term, yet at the same time the industry is demanding immediate solutions with guarantees for QoS levels.

One of the major challenges in this regard relates to the distributed IT paradigm and the requirement to provide QoS guarantees for end-to-end delivery of information. Performance management has to be viewed from an application perspective, and application transactions need to be tracked end to end through the entire infrastructure, including client devices, routing protocols, network configurations, server architectures, and other components. Integrated solutions that address this end-to-end requirement will have a big impact. End-to-end management is the process of coordinating resources through the use of tools, such that the following goals can be achieved from the desktop through the network to the server:
Figure 9-3 shows an end-to-end system, a consolidated view of the entire conversational process.

Figure 9-3. Logical End-to-End System
Performance monitoring software systems have many capabilities that are fundamental for IT managers. Finding problems after they occur is difficult, expensive, and time-consuming. Users are rarely tolerant of slowdowns or failures when they impact their productivity.

Traditionally, network managers have approached this issue in a piecemeal fashion. One management tool would outline how a desktop computer was functioning, a second would measure the efficiency of the router, a third would determine various parameters of the server, and a fourth would outline how a database is managing queries. Although this approach produces useful insights, it does not provide a comprehensive picture of how end-to-end performance is affected. The variance in equipment standards from different vendors obscures the process, making it more complicated and time-consuming. Tools that incorporate these disparate sources of information into a single report with an inclusive picture of the network are very valuable.

It is important to have an end-to-end view. If you cannot easily determine where the delay is (server, client, or network), you cannot expect to find the root cause of the performance issue. Consequently, without the ability to isolate performance problems within the end-to-end path, troubleshooting and optimization efforts are just guesswork. To assign responsibility for a particular performance issue to the relevant group, it is imperative that performance management initiatives present a cooperative and generalized focus on the entire picture. This can be achieved by providing a unified end-to-end view. To provide the best optimization possible to your delivery model, you must be in a position to view the whole picture; application performance on the client and server must be examined together with the network performance to control the quality of performance delivered to the end user.
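The idea of isolating where the delay sits (client, network, or server) can be illustrated with a minimal sketch. It assumes that a monitoring tool has already split one transaction's response time into three per-segment figures; the `TransactionTiming` type and the `dominant_segment` function are hypothetical names introduced here for illustration and do not belong to any particular product.

```python
from dataclasses import dataclass


@dataclass
class TransactionTiming:
    """Time spent in each segment of one transaction, in milliseconds.

    Assumption: the three figures were produced by separate monitors
    and already refer to the same transaction.
    """
    client_ms: float
    network_ms: float
    server_ms: float

    def total_ms(self) -> float:
        return self.client_ms + self.network_ms + self.server_ms


def dominant_segment(timing: TransactionTiming) -> tuple[str, float]:
    """Return the segment with the largest delay and its share of the total."""
    segments = {
        "client": timing.client_ms,
        "network": timing.network_ms,
        "server": timing.server_ms,
    }
    name = max(segments, key=segments.get)
    total = timing.total_ms()
    share = segments[name] / total if total > 0 else 0.0
    return name, share


if __name__ == "__main__":
    sample = TransactionTiming(client_ms=40.0, network_ms=310.0, server_ms=150.0)
    segment, share = dominant_segment(sample)
    print(f"{segment} accounts for {share:.0%} of {sample.total_ms():.0f} ms")
```

With a breakdown of this kind, a slow transaction can be routed to the group that owns the dominant segment rather than investigated by every team at once.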
Starting Points to Building an End-to-End View

Traditionally, performance management vendors have approached the end-to-end process from the following three distinct functional areas:
The advantage of the programmed breakpoint is that the original programmers know precisely what the right events for accurate and comprehensive instrumentation are, and there should be no doubt that the measurement is correct. Furthermore, the measurement can be as granular as needed. Applications that are not amenable to meaningful performance measurements based on generic measurements can be measured and controlled only with the custom-programmed management approach.

The disadvantage of this approach is that very few client/server or n-tier applications have been instrumented in this fashion. Even enterprises prepared to devote skilled programming resources to develop custom management solutions are out of luck if the application was not created with management hooks and application programming interfaces (APIs) in the first place. The other primary disadvantage of a custom-programmed performance solution is that skilled programming resources are rare, costly, hard to control, and usually slower than expected.

The traffic-inference approach has the big advantage of not requiring programmatic intervention in the target application. Probes or agents deployed on the local network close to the user may also be the simplest approach in terms of the number of computers and connections that must be modified. Although traffic inference provides a relatively accurate view in isolation, its biggest disadvantage is that it is limited to the least common denominators of application performance. Specific application behaviors may not be detected (for example, multitiered server activity). Furthermore, performance bottlenecks in the client itself may not be detected.

Continuous monitoring of network delay is also essential for optimization and planning activities. There are several common methods for measuring network delay. Active methods include scheduling Internet Control Message Protocol (ICMP) pings or simulating TCP session connects.
Passive methods include measuring TCP session connects or more general application packets. Of these methods, network delay measurements based on observing general application packets provide the most accurate representation of performance. It is important to understand the network delay components to appreciate the merits and limitations of each approach. The network delay consists of the following five components:

Serialization (or transmission) delay
Queuing delay
Propagation (or distance) delay
Processing delay
Protocol delay
Serialization or transmission delay is the time required to put all the bits in the packet on the transfer medium. Table 9-1 shows serialization delay's dependence on both the packet size and the link access rate.
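The dependence on packet size and link access rate follows from the definition: the delay is the number of bits in the packet divided by the link rate. The sketch below illustrates that arithmetic only; the packet sizes and link rates are example values chosen here and are not taken from Table 9-1, and `serialization_delay_ms` is a hypothetical helper name.

```python
def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time to place every bit of a packet on the link, in milliseconds."""
    if link_bps <= 0:
        raise ValueError("link rate must be positive")
    bits = packet_bytes * 8
    return bits / link_bps * 1000.0


if __name__ == "__main__":
    # Example values only (assumptions for illustration).
    packet_sizes = [64, 512, 1500]  # bytes
    link_rates = {
        "56 kbps": 56_000,
        "1.544 Mbps": 1_544_000,
        "100 Mbps": 100_000_000,
    }
    for label, bps in link_rates.items():
        row = ", ".join(
            f"{size} B: {serialization_delay_ms(size, bps):.3f} ms"
            for size in packet_sizes
        )
        print(f"{label:>10} -> {row}")
```

On slow access links the difference between a 64-byte packet and a 1500-byte packet amounts to a couple of hundred milliseconds, whereas on fast links it is negligible.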
TCP session connects involve 64-byte packets. As a result, measurements based on TCP session packets will generally underestimate the network delay experienced by the rest of the application. ICMP pings can be configured to assume any size, but the packet size is always the same in both directions. Most applications do not have this symmetry, which makes it difficult for ICMP to accurately capture the serialization delay experienced by the application. Note also that the default ICMP packet size is 64 bytes.

Queuing delay is the time the packet waits in a buffer for its turn to be transmitted. It depends on the serialization delay for the packets served ahead, the size of the buffers, the amount of congestion, and the configuration of the router or switch scheduling policies. Congestion can change dramatically in microseconds, but a TCP session may be open for seconds or hours or even days. Therefore, the queuing delay experienced by the TCP session connects can be significantly different from that of the main application. The same is true for any scheduled probe such as ICMP; the queuing delay even 60 seconds earlier may bear little resemblance to that experienced by the application.

In addition, the router or switch may place ICMP packets in a special queue for preferential (either better or worse) handling. During periods of congestion, ICMP packets may be preferentially dropped while the application packets wait, and therefore ICMP never measures the true delays. ICMP packets may be preferentially moved to the head of the queue and experience shorter delays, or they may be selectively moved to the rear of the queue and experience longer delays (unless dropped).

Propagation or distance delay refers to the time it takes the packet to travel along the physical path. It is dependent only on distance and the type of medium.
If the TCP session connects and ICMP packets travel the same physical path as the main application packets, the propagation delays will be identical. However, it is not guaranteed that the same paths will be traversed; if they are not, the ICMP measurement is rendered irrelevant.

Processing delay refers to the time it takes the router or switch to prepare the packet for delivery. Processing delay depends on a wide variety of factors, but it is normally insignificant. Note that TCP session connects may require more processing than the remaining packets in the flow, and ICMP packets may require less processing.

Protocol delay refers to the time the packet waits because of underlying protocols. In a shared medium, for example, the packet must wait for its node to acquire access. The effect of this delay varies greatly with protocol.

In summary, network delay measurements based on ICMP pings only reveal the delay experienced by ICMP pings at that moment in time. Network delay measurements based on TCP session connects only reveal the delay experienced by 64-byte packets at the time the session was established (seconds, hours, or even days ago). Of these methods, passively observing general application packets is the most effective means for measuring network delay because it reflects what users are actually seeing.

The network agent approach is the best starting option, because it is the most likely to provide indicative performance metrics that can be tracked and represented relative to business requirements. The ultimate target, however, is to cover each of the three key areas (client, network, and server) and to correlate the collected data so that it presents a unified view of performance on an end-to-end basis.

Monitoring the End-to-End System

Following on from that approach, the logical first step (assuming network monitoring is in place) is the client station.
Desktop monitoring provides a diagnostic starting point for latency troubleshooting. The moment an unmet service level threshold is triggered, IT organizations can execute a health check of the business infrastructure to look for specific network and application failures.

There are two ways to measure the round-trip transaction response time experienced by end users. Synthetic transaction (active) measurement involves simulating transactions from a designated machine at defined intervals. Desktop (passive) monitoring involves deploying agents on desktops at the end user's location. These agents watch events or packets without slowing down the transaction. In both cases, because monitoring is conducted from a user location, the response-time measurement accounts for all the intervening components of the infrastructure. Server, database, and network monitors effectively make assumptions for the client wait and processor times.

Active tools generate synthetic transactions from dedicated robot PCs, so service level measurements are taken from designated computers of known and constant configuration. The advantage is you can capture diagnostic data at the time a service level exception occurs, without having to re-create the problem to identify the cause. With passive measurements, independent agents are deployed on many representative desktops in a corporate network to validate the end-user experience for Windows, web, and other applications such as SAP. Figure 9-4 illustrates a combination of active and passive agents interacting with a central console.

Figure 9-4. Passive and Active Agents Reporting to Control Console
In Figure 9-4, the passive agent is placed on a user terminal (1). This agent will monitor the application performance at the user's desktop as the user makes requests and receives information from the server. The data recorded from this agent will also include things such as the user wait time. This is the time between requests that the user takes to digest the information returned on the screen before moving on. In the case of a robot agent (2), these wait times will be eliminated, because the agent has no need to understand the response from the server. It only needs to know it received one before moving on to the next request. Both systems will return common statistics regarding the application's performance to a centralized console (3) for reporting and correlation.

Both approaches have pros and cons. The active agent gives an uninterrupted view of the transaction and can potentially identify issues out of hours and, as a result, alert support staff to take action before the general population is affected. However, it has the disadvantage of creating additional load on the system resources, and it takes additional hardware to run.

If synthetic agents are used to repeatedly run the same transactions, the results may be cached either by the client or the server. This caching effectively invalidates the results because it is not representative of the real user experience. If the server is caching the information, it cannot be selectively disabled. If the transactions are randomized, the main benefit of synthetic agents (their determinism) is lost. This selective caching effect can render synthetic agents inaccurate for measuring server response time. Passively monitoring server performance for all transactions and all system users eliminates these problems and can provide a useful baseline for future performance.

The passive agent has the disadvantage of being affected by human intervention; it also reports only after an event (after the user has an issue).
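The behavior of a robot agent can be sketched as a loop that runs the same transaction several times, records each response time, and flags any run that exceeds a service level threshold. This is a minimal illustration under stated assumptions: `run_synthetic_probe` and `fake_transaction` are hypothetical names, and the transaction is a stand-in callable rather than a real application request.

```python
import time
from typing import Callable, List, Tuple


def run_synthetic_probe(
    transaction: Callable[[], None],
    runs: int,
    threshold_s: float,
    interval_s: float = 0.0,
) -> List[Tuple[float, bool]]:
    """Run a transaction repeatedly with no user wait time between steps.

    Returns one (elapsed_seconds, breached) pair per run, where breached
    is True when the elapsed time exceeds the threshold.
    """
    results = []
    for i in range(runs):
        start = time.perf_counter()
        transaction()
        elapsed = time.perf_counter() - start
        results.append((elapsed, elapsed > threshold_s))
        # Pause between runs, but not after the last one.
        if interval_s > 0 and i < runs - 1:
            time.sleep(interval_s)
    return results


def fake_transaction() -> None:
    """Stand-in for a real request (assumption for illustration)."""
    time.sleep(0.02)


if __name__ == "__main__":
    for elapsed, breached in run_synthetic_probe(fake_transaction, 3, 0.5):
        status = "EXCEPTION" if breached else "ok"
        print(f"{elapsed * 1000:.1f} ms {status}")
```

Because every run issues an identical request, a probe of this kind is exposed to the caching effect described above: after the first run, the measured times may reflect a cache hit rather than the work a real user would trigger.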
Most organizations would benefit from a mix of agent technologies to monitor the user base. A mix also makes it possible to correlate performance between the robot and real users and, in doing so, to assess whether the transaction monitoring is accurate and really reflects the end-user experience.

The other important thing to consider is the physical deployment of such technology. It is impractical to deploy agent technology to each and every one of your desktops. After all, the logistics of deploying the relevant agent and keeping it current (OS and so forth) and running may prove overly burdensome. (Remember, you never want to implement a management system that is in place for the sake of management; ultimately the aim is to improve application delivery.) More importantly, even if you could deal with the logistics and support issues involved, the sheer amount of data created would be excessive and would once again render the management process impractical.

The server side is very well covered, and visibility into specific application counters on each server is an equally important part of the overall process. These counters provide insight into how the business-critical applications are performing on the relevant server.

Having monitored the conversational journey, you can view and understand the various aspects of the application's journey and, if required, present a single unified view of the business transaction's performance.

Figure 9-5 shows a simplified view of a network system. Clients exchange data across the network with the three servers (web server, application server, and database server). Performance statistics are collected at specific points on the application's journey, such as the client station response time (at the client), the WAN utilization (at the network), and server statistics such as delivery statistics, database utilization, and server performance (at the server).
Although this data represents parts of the complete journey, it effectively is taken from a single-point perspective (such as client, network, or server). The data is then amalgamated in a single centralized database. Individual functional areas (for example, network or server teams) can view data relevant to their requirements (such as network performance or server load) from this centralized database; alternatively, this information can be passed to associated management systems such as a policy manager or business service management system for inclusion in additional reports or analysis.

Figure 9-5. Combined Monitoring Systems Cover the End-to-End View
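The amalgamation step can be sketched with an in-memory structure standing in for the centralized database. The sample format (transaction ID, source, metric, value), the metric names, and the functions `amalgamate` and `view_for` are all assumptions made for illustration; a real deployment would define its own schema and storage.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (transaction_id, source, metric, value); a hypothetical sample format.
Sample = Tuple[str, str, str, float]


def amalgamate(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Merge single-point samples into one record per transaction."""
    unified: Dict[str, Dict[str, float]] = defaultdict(dict)
    for txn_id, source, metric, value in samples:
        unified[txn_id][f"{source}.{metric}"] = value
    return dict(unified)


def view_for(
    unified: Dict[str, Dict[str, float]], source: str
) -> Dict[str, Dict[str, float]]:
    """Return only the metrics that one functional area cares about."""
    prefix = source + "."
    return {
        txn: {k: v for k, v in metrics.items() if k.startswith(prefix)}
        for txn, metrics in unified.items()
    }


if __name__ == "__main__":
    samples = [
        ("txn-1", "client", "response_ms", 820.0),
        ("txn-1", "network", "wan_utilization_pct", 71.0),
        ("txn-1", "server", "db_utilization_pct", 35.0),
    ]
    unified = amalgamate(samples)
    print(unified["txn-1"])              # end-to-end record for one transaction
    print(view_for(unified, "network"))  # what the network team would see
```

Keying every sample by transaction is the design choice that allows the same store to serve both the per-team views and the unified end-to-end record.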