APM: Application Performance Monitoring Tools in the Theatre of War tells the story of how application performance management tools were deployed to analyze and report on key intelligence biometrics applications throughout the US CENTCOM Area of Responsibility, and back to the FBI IAFIS and Biometric Fusion Centers in CONUS.
Transcript for this Video
Hi and Good Afternoon. This is Bill Alderson, thank you for joining us. I’m going to do a presentation on Network Application Monitoring Tools in the Theater of War. This is a presentation that I did at the MilCIS conference last year in Australia. It was a lot of fun going Down Under and talking to some of our military counterparts down there about network management. Had a good time! Anyway, I appreciate you joining us today.
So a little bit about me: I consider myself an application and network performance analysis advocate. I know that end users are the most important folks in my career, and I focus on working toward their benefit in everything that I do. I started out at Lockheed in Sunnyvale, California, the heart of Silicon Valley, and was there from 1978 to 1984 as a communications analyst. I was a young guy who loved communications, and I was playing with every kind of computer that was coming out at that time: IBM, Apple, and others.
I was right in the middle of it. It was a lot of fun. A little later on, I got to working with Network General and the Network General Sniffer, at the start-up. Then I took my experience with the Sniffer and founded Pine Mountain Group, and created Network General Sniffer Training, which Network General licensed from me in perpetuity. At Pine Mountain Group we trained about 50,000 people in 22 countries and certified over 3,000 network forensic professionals.
I’ve done a lot of work with 75 of the Fortune 100 and with federal and state government agencies. I did a large event for NetWorld+Interop called Network Forensics Day. I sold Pine Mountain Group to NetQoS in 2005 and continued on as the Technology Consulting Officer. CA then acquired NetQoS, and I hung around as principal services architect on CA Application Performance Management services products. Then just this last year I founded Apalytics Corporation. I’ve got several customers at Apalytics. We did a subcontract to U.S. CENTCOM, where we went and did analysis of pretty much the entire war area.
That’s what I’m talking about today, along with some of the things I learned through that experience of looking through every network control center throughout CENTCOM’s AOR, in all those various countries, then working with those systems and helping to upgrade, architect, and document them so we could perform better network management. I’m also a consultant to the OSD CIO at the Pentagon in the area of communications, and I provide services to the Department of Justice. So I’ve got a pretty wide variety of experience. The talk today is about network management, but in order to understand network management, one must understand everything from end to end.
From the client, where you’ve got the user, the business transactions, the security access, the application interface, the operating system, the computing platform, and the OSI model: the seven-layer model that takes all of your data across disparate networks and systems, so it arrives at the server and gets error checked and that sort of thing.
We’re basically looking at the full spectrum of everything end to end, from client to server. Well, one of the things in the war area is that there are a lot of different people and a lot of different organizations in charge of those packets at different points along the path, from where the war fighter puts the packet on the wire until it gets to the server.
So consequently, in traversing all of these different network management zones, different countries and continents, going up to space and back, these packets are really mistreated. Everyone along the path has their own SLAs and their own network management systems, and they’re all different. By golly, if you go through and you talk to anybody, all the contractors and everybody involved, everybody did a wonderful job and everybody met their SLAs. Everybody inside those silos is very happy, with one exception: the end user.
The end user is going to cross all these dissimilar systems, and when he’s got a problem, where do you go to capture the packet if it doesn’t get there? Who do you blame? Who do you call up, who do you talk to? The end to end analysis that I’ve done from all those areas back to CONUS, for different applications and that sort of thing, basically said, “Hey man. If you’re an end user, you have no advocate. You have no centralized advocate.” It’s very, very difficult. You can call your local guy and he’ll do local things, but he always comes back and says, “Must be somewhere else.” Now your end user is in trouble. So those end users are usually the ones that contact me. Powerful, high visibility, high stakes end users usually call me up and say, “Hey.” That’s exactly what the military did. They called me up to have me help with a big application, and we went out and looked at all their network management and said, “Well, you don’t know where any of your packets are going. You need some NetFlow information.”
We implemented NetFlow across the entire AOR so we could see what applications were running and where. One biometric application in particular was very interesting; it was awesome to analyze. I helped the programmers actually recode several parts of their application to improve performance and that sort of thing. It was a really great thing. So network management does not just involve the purchase of some network management tool, then installing it and turning it on. It involves understanding the clients that are accessing the applications that are going through it, so you can fine-tune it, nurture it, and find the signatures, the vital signs, of those applications. When somebody tells me, “Oh yeah, I’m a network management guy,” I ask: do you know applications? Do you know the issues on the client? Do you know the issues on the server?
Do you see the virtualization issues? Do you see all the men-in-the-middle, all the firewalls, all the WAN optimizers? Do you see all the load balancers and that sort of thing? Do you understand how those things come into play? Do you understand the quality of service across that end-to-end network? “Well, I know everything there is to know about this network management package. I can install it, and I’ve installed it.” Yeah, but in order to install and use these systems, you must intrinsically understand what it is you’re trying to help the clients achieve, and how to measure the performance of those systems all along the way. You’ve got to document your systems.
After we get off this slide, I’m going to talk to you about a number of things. We’ve got a lot of slides to cover today, but we’re going to get through them very quickly and give you a cursory overview. Here’s your basic client to server, so the client is separated from the server; we started doing that back in the late 80s. You take your client and your applications, you run them across various stacks, using HTTP or Java. You go across your transport network, you come up the other side, and you go into your server. You pop back up the stack and into your respective upper layer system. Then you go into your applications and processes on your server.
You have to draw this out. You have to know where your connection points are. You have to know where you are in the universe. If you’re having a problem somewhere, you need to know exactly how you’re going to hook in, where you’re going to hook in, what kind of tool you’re going to use, so that you can deconstruct the problem.
Now, people are going into the cloud, and that’ll be another topic that we talk about. Cloud computing can be very problematic, because with the client and the server you own all of the infrastructure; but when you put that server over in the cloud, you have no access to packets, network management information, statistics, etc., other than what that vendor gives you. It’s going to be a cloud all right. So basically, I’m trying to help you understand that when you go in to analyze your environment, you need to know it from end to end.
Here’s a client talking to an app server or an HTTP server. The back end is talking SQL or some other type of middleware or back end process. Well, you have to know where to go to connect your network management systems for packet capture, for metrics, for analysis, so that you can understand it. If anything goes wrong, which it will at some point or another, you’ll need to get macro information about the use of the network, the use of the platform, the CPU, etc., of all the different systems. You need to see the entire system from end to end and understand all the various components, so that you can basically design the vital signs for the applications and systems. It’s not as simple as saying, okay, I’m monitoring layer three of the network. You have to be a little bit higher than that; you have to help your customers. That’s my definition of network management.
Now typically at one time or another, whether during implementation or later on, you’re going to have a problem. You’re going to be looking for that proverbial needle in a haystack. Here’s your needle in a haystack. It’s out there, it’s somewhere, let’s take a look where it might be. Hmm. It could be anywhere in Afghanistan, Iraq or some point in between, or over somewhere in the United States, or down at Fort Huachuca. Your packets are traversing all of these various environments. Where is it? What’s slowing it down, who’s holding it up? Who’s routing it? Who’s misrouting it? Who’s redirecting it? Who’s changed its path because of a WAN optimizer or load balancer that you were unaware of? Those are your men-in-the-middle, and they’re sitting out there impacting your environment. So you need to know your environment.
Then every once in a while you’ve got to gather some packet traces. You’ve got to gather them from somewhere between the user and their resultant service, so it could be anywhere. We usually gather packet traces based upon the macro information that we get from our network management system, which helps us triangulate where the problem might be. That’s what we use network management for, and then we’ll go out and capture some traces and get down to the bottom line. I’ll just tell you right here, and you can all argue with me, you can fight with me: if you are a network management person implementing large scale systems, and you have no ability to capture packets across your path, you are impotent. You can’t do the job, and you never will be able to. I’m telling you, if you don’t have bottom line, root cause, deep packet analysis skills and capabilities, and the ability to capture those packets, you are impotent and you cannot do your job. Sooner or later you’re going to come up against a problem that absolutely requires deep packet captures. You design your network management systems to help you find the area, and then zero in on it. Can you solve many problems with network management systems and capacity? Absolutely. Can you solve every one? No. It’s not a very pretty picture when you’re standing around with several million dollars’ worth of network management tools, giving you charts and graphs and all sorts of wonderful views of stuff, and you can’t solve the problem.
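To make that concrete, here’s a minimal sketch of a targeted capture of that kind, assuming Python with the scapy library on a machine sitting at the chosen test point. The addresses and port are hypothetical values you would get by triangulating with your network management system; this is an illustration, not any particular product’s method.

```python
# Hedged sketch: a targeted packet capture at a suspected trouble spot.
# The BPF filter values are hypothetical, triangulated from macro data
# (NetFlow, response time monitors). Requires capture privileges.
from scapy.all import sniff, wrpcap

# Hypothetical misbehaving client/server pair and application port.
bpf = "host 10.1.1.5 and host 10.9.9.9 and tcp port 8443"

# Stop at 2000 packets or 120 seconds, whichever comes first.
packets = sniff(filter=bpf, count=2000, timeout=120)
wrpcap("suspect-path.pcap", packets)  # hand off for deep packet analysis
```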
It’s because you don’t have deep packet capture capability. You don’t have the root cause analysis capabilities. That’s what I focused my whole career on: I trained 50,000 people and executed for 75 of the Fortune 100, to help them solve these types of problems. Then some people say, “I like packets so much I’m going to capture every single one and I’m going to store them forever.” Then I ask, “Okay. You’re going to store every packet everywhere? You’ve got unlimited amounts of money and resources and storage? Who’s going to analyze those packets? Who’s qualified within your organization? How are you going to filter in, and have the worst offending problems bubble up? What’s your strategy for using these systems?” I know a lot of folks have a lot of this sort of stuff up in their environment, and they have all these tools and capabilities, but nobody ever goes and looks at them. “Oh, but they’re there in case we need to do retrospectives.”
That’s good, and I’m not saying you shouldn’t have any of it, but I’m saying that if you’re going to buy this stuff, turn it on and start caching packets all over the world, you’d better make certain you at least have somebody around who can help you take a look at them and know when and why. So when you finally get down to the stack with the problem, now you go to work in deep packet analysis, to deconstruct it and ultimately, boom, find the needle in the haystack. That’s what it’s all about: getting to the bottom line. Definitive results. Performance orientation. Getting the job done. No ifs, ands, wells or buts about the situation. In order to get all that done, we use ITIL, we use people and processes, paradigms, tools, systems, platforms. We use all of these various systems. It’s not just network management, it’s not just deep packet analysis, it’s not just application optimization. We have a whole portfolio of capabilities. I have a list of the ones I believe are the most important, and we’ll talk about those at some point; actually, on the 30th of the month we’re going to talk about that.
It requires a seamless integration of people, processes, knowledge and technology. It’s a well-rounded organization. I’m going to talk to you a little bit about what I call my network management servo-loop. Over on the left, you’ve got a website, let’s say. You’ve got an application, and you want it to perform. Let’s just say that you want that webpage to come up within 15 seconds. So over on the left hand side, you say: I want the maximum possible performance to accomplish this business objective, and I’m going to set my command, my desired response time, at 15 seconds. That goes into the network management servo-loop. This is a feedback servo-loop mechanism. If you haven’t seen this before, you can look it up on the internet and see some others, but I have taken it and adapted it. It’s a feedback loop, a servo-loop, a closed-loop system, where you say: this is what I want and expect from the system. Then you have a resulting actual user experience that you compare to your objective, and you come up with a signal deviation.
At the top left there, you have your desire: 15 seconds. You then have your feedback signal, which, let’s say, is 25 seconds. You have a signal deviation of ten seconds: you are missing your SLA, your desired performance, by ten seconds. So what do you do? In most environments, executive management thinks all the technologists have it figured out, and the technologists think executive management has it figured out. I’ve got news for you. I’ve been working in this industry for twenty-some-odd years. Neither one of them really knows, nor has a systematic approach, until you get them into the servo-loop. Let’s get the executives in there. Let’s show them some metrics, show them some business metrics.
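Here’s a minimal sketch of that servo-loop bookkeeping in Python. The 15-second command and the feedback samples are the illustrative numbers from the talk, not measurements, and each pass stands in for one round of executive-and-technologist tuning.

```python
# Minimal sketch of the network management servo-loop described above.
# Names and values are illustrative, not from any specific product.

TARGET_RESPONSE_S = 15.0  # the "command": desired page load time

def signal_deviation(measured_s: float, target_s: float = TARGET_RESPONSE_S) -> float:
    """Feedback signal minus command; positive means we miss the objective."""
    return measured_s - target_s

# Feedback samples from the response time monitors (hypothetical values);
# each tuning pass should narrow the gap toward the objective.
samples = [25.0, 22.5, 18.0, 14.2]

for i, measured in enumerate(samples, start=1):
    dev = signal_deviation(measured)
    status = "MISSING objective" if dev > 0 else "meeting objective"
    print(f"pass {i}: measured {measured:.1f}s, deviation {dev:+.1f}s -> {status}")
```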
When you talk to an executive about their business metrics and how they coincide with the network and application metrics, they start listening. You don’t lose them in presentations; you have their full, undivided attention, because you’re talking about deliverables for the business. I always like to get metrics, feedback sensors. Metrics are a vital part of the business. Performance indicators. End user reports. Right? You get your executives involved. Why? Because executives control the policy, the procedures, the resources, and the entire organization. They need to be involved in this. Okay?
You take a guy like Steve Jobs. He was involved in the process. He understood his company’s technology. Then you take another guy like John Sculley, who had been selling sugar water for years, and you put him in there. He’s a bottom line dollars guy: how much can we make? But in technology, you have to have someone involved in the technology. If your executives are not involved in the technology, it’s your job as the CIO, or the executive technologist, to get them involved in a meaningful way. The way that you do that is you pop in some business metrics, compare them to some of your technical metrics, and you’ll bring them right in. Then they’ll get involved and help you solve the problems that you have by allocating resources, priorities, focus, policy and budget.
Then you’ve got the actions. The actions are carried out by the technologists and the technology folks. They go out and they apply. They’re the tinkerers; they’re the guys behind the curtain, changing this knob, changing that knob, optimizing, installing and improving the environment and the systems. After executive management puts the controls on there, the technologists apply those priorities. Then we measure the actual user experience again. It comes back, okay. Then little by little, between the executive function and the technology function, we start to move the servo-loop toward the maximum possible performance. We keep tweaking it, but it does involve executives and management, and it does involve the technologists. If the executives are not involved, you won’t have the prioritization and resources necessary.
Okay. Now let’s talk performance indicators. These are the three main performance indicators that I advocate. First, you’ve got network flows. From the left, you’ve got clients, each with an IP address and a socket; those are the application sockets. You know every packet that goes across your network, because every router in the network is reporting on these network flows. You don’t need another gizmo to pop in there. Those routers perform the actual task of looking at all those TCP connections, from the IP address on the left at the client, to the IP address on the right at the server, and the application ports they’re using between the two. Then they tell you and record the rate and flow volume information. I can tell you if an application is or isn’t working well, and how many people are using it. Why? Because I have rate and flow information, and it’s coming from my routers and going into a database, and I can go back and tell you what applications are on your network. That was very interesting in the war environment, because there were certain applications that were consuming large amounts of bandwidth and others that were consuming hardly any. Some were being accused of consuming bandwidth that they weren’t. So when you really go out and get the facts of who’s talking to whom through whom, that’s bottom line. That’s what NetFlow does. Who is talking to whom through whom?
Then you’ve got rate and volume information, you’ve got capacity planning, you’ve got all of those sorts of things available at your fingertips. NetFlow and network flows are a very important thing. They tell you where those applications are running around the world, and who is running those applications by location. The volume and the rate can be compared to tell you what type of performance users are getting. If you understand the theory, you understand the network management. So again, this is not about turning up a product. This is not about clicking and installing a product. This is not about buying it, installing it and having it there. It’s about interpreting it, looking at it, and architecting it. It’s about understanding the applications and their vital signs, and it’s about figuring out what’s going wrong and where, being able to triangulate that, and being able to solve those typical problems.
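As a concrete illustration, here is a minimal sketch of a NetFlow v5 collector in Python that answers “who is talking to whom, through whom.” The UDP port 2055 and the v5 header and record layouts are the standard conventions; the aggregation key, names, and the idea of periodically dumping to a database are my own illustrative choices, not any particular product’s design.

```python
# Hedged sketch: a tiny NetFlow v5 collector keyed on
# (exporting router, source, destination, destination port).
import socket
import struct
from collections import defaultdict

HEADER_FMT = "!HHIIIIBBH"                 # NetFlow v5 header, 24 bytes
RECORD_FMT = "!4s4s4sHHIIIIHHBBBBHHBBH"   # NetFlow v5 flow record, 48 bytes
HEADER_LEN = struct.calcsize(HEADER_FMT)
RECORD_LEN = struct.calcsize(RECORD_FMT)

volume = defaultdict(int)  # (router, src, dst, dstport) -> octets

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 2055))  # standard NetFlow export port

while True:  # runs until interrupted; a sketch, not a daemon
    data, (router, _) = sock.recvfrom(65535)
    version, count = struct.unpack_from("!HH", data, 0)
    if version != 5:
        continue
    for i in range(count):
        rec = struct.unpack_from(RECORD_FMT, data, HEADER_LEN + i * RECORD_LEN)
        src = socket.inet_ntoa(rec[0])
        dst = socket.inet_ntoa(rec[1])
        octets = rec[6]    # dOctets: bytes reported for this flow
        dstport = rec[10]  # usually the server-side application port
        volume[(router, src, dst, dstport)] += octets
        # Periodically dump `volume` to a database to get the
        # rate-and-volume reporting discussed above.
```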
Okay. So then you’ve got standard SNMP; that’s device status. Standard SNMP just tells you a router is overloaded on CPU, or a circuit is overloaded, or an interface is overloaded. It’s very good information. But what if you don’t even know the path that your packets are taking? You’ve got to know the path your packets are taking. If you know your path, then you can exploit device status to find out which devices on your particular path are impacted by memory, by CPU, or by other capacity-limiting characteristics. The first thing is, you’ve got to know your path. Well, there are not very many tools or capabilities out there for mapping paths at layers two and three. You can get a layer three path, but it’s hard to get a layer two path as well.
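Here’s a hedged sketch of exploiting device status once the path is documented. It shells out to the Net-SNMP snmpget tool with standard MIB-2 OIDs; the device names, community string, and the particular OIDs polled are illustrative assumptions.

```python
# Hedged sketch: poll device status for each documented hop on a path.
# Uses the Net-SNMP `snmpget` CLI (-Ovq prints the value only).
import subprocess

PATH_DEVICES = ["rtr-site-1", "rtr-site-2", "rtr-site-3"]  # hypothetical hops
OIDS = {
    "sysUpTime":  "1.3.6.1.2.1.1.3.0",       # MIB-2 system uptime
    "ifInErrors": "1.3.6.1.2.1.2.2.1.14.1",  # input errors, interface index 1
}

def snmp_get(host: str, oid: str, community: str = "public") -> str:
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Ovq", host, oid],
        capture_output=True, text=True, timeout=5,
    )
    return out.stdout.strip()

for device in PATH_DEVICES:
    for name, oid in OIDS.items():
        print(f"{device} {name} = {snmp_get(device, oid)}")
```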
You’ve basically got the network flows, you’ve got device status, and then third, response time. Response time monitors come in several different flavors. AuthNet has some really nice stuff. NetQoS, which was acquired by CA, has some nice stuff. There are a couple of other players in the field that I don’t have a lot of confidence in quite yet, but it’s growing. The performance marketplace is no longer about red, green, up, down. It’s about ‘it’s slow’, ‘something’s wrong’, ‘we don’t know where; it’s working, but not very well’ and ‘it’s impacting our end users’. So you put in these response time monitors, which basically capture or listen to the packets as they go into the server. They timestamp them at the server, and they say: this packet response time took 100 milliseconds.
You can put timers and thresholds on these things, so that if your response time is slow, you can automatically trigger a trace file. You don’t have to wait until 2:00 in the morning. You can actually have your response time monitor watching a bank of servers, and when the response time gets slow, it automatically triggers a packet capture, so your technologists can go back in the next day and do retrospective analysis on the packets that were slow. So do we care if everything is fast? Do we want to capture all those packets? No. What do we want? We want to trigger on the events that are poor, and that’s what these systems do. They record the response time; it’s an awesome system. It also gives you the network round trip, because the [inaudible 26:54] come back and it records that so you can see. One of the problems that we had was satellite delay, with multiple router hops adding satellite delay. It’s about 700 milliseconds round trip across a satellite. So when I see a response time of 2.1 seconds, I know that it’s because they’ve gone across three satellite hops, at three times 700 milliseconds. You can start to surmise what’s going on, and then you can look at optimizing your routing.
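A minimal sketch of that trigger logic, assuming tcpdump is available on the capture host: when a measured response time crosses the threshold, kick off a short capture for next-day retrospective analysis. The measurement source, threshold, server address, and file naming are hypothetical.

```python
# Hedged sketch: threshold-triggered packet capture. tcpdump's -G/-W
# options rotate once and exit, giving a fixed-length capture window.
# Requires capture privileges on the host.
import subprocess
import time

THRESHOLD_S = 2.0          # e.g. slower than ~3 satellite hops (3 x 700 ms)
CAPTURE_SECONDS = 60
SERVER = "10.20.30.40"     # hypothetical monitored server

def on_response_time(measured_s: float) -> None:
    """Called by the response time monitor for each measured transaction."""
    if measured_s <= THRESHOLD_S:
        return  # fast transactions are not worth storing
    stamp = time.strftime("%Y%m%d-%H%M%S")
    subprocess.Popen([
        "tcpdump", "-i", "any",
        "-w", f"/var/captures/slow-{stamp}.pcap",
        "-G", str(CAPTURE_SECONDS), "-W", "1",  # one rotation, then exit
        "host", SERVER,
    ])
```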
These are your tools and your key issues. You’ve got your performance management system. You’re getting your performance metrics, the feedback, and your business metrics. You’re getting your trouble tickets; you’re getting your information back. Some of the things you need to look at are route analytics, to identify instability in routes; server response time; and NetFlow-based rate and volume information. It gives you hints on conditions limiting user experience. Also look at SNMP device status. Well, you’ve got to know the path in order to know what devices are in that path, because you may have hundreds or thousands of different devices. Knowing what your particular path is and exploiting that path, to see if anything in it is failing, those are value-added steps. Then of course packet capture at key locations, so you can get down to root cause analysis.
Here’s a network performance management architecture diagram. There’s a portfolio of tools. You’ve got Remedy trouble tickets; you’ve got Opsware, ArcSight. You’ve got a large number of different systems: Packet Design’s route analytics, CA Application Performance Management (NetQoS) response time monitoring and NetFlow tools, OPNET’s network diagramming capabilities. There’s a portfolio; it’s a Swiss army knife, all working together. You can’t use just one company or one suite of products. You can reduce that set as much as you possibly can, but you always want to make certain that you’ve got best of breed in every aspect of your system. This is where we went out and installed systems at 15 different sites across Afghanistan, Iraq and areas in the United States, so that we could have NetFlow and response time monitoring. It took us one year to install. Getting this system up and running, and actually doing analysis on major applications in fewer than 12 months, inside a war environment, was pretty much unheard of. We had an incredible team down at CENTCOM out there. The war fighters, man, they were working their buns off to help us get all of this stuff in, so we could figure out what was wrong with this biometric application. This biometric application, by the way, takes fingerprints and iris scans, and it looks at latent prints of the guys who were building bombs and that sort of thing, those IEDs. That application was helping put those guys on an alert list.
A war fighter would have a bad guy walk through, or people would be walking through their security station going into Fallujah or into Baghdad or wherever, and they would immediately pick up on the fact that, of the thousands of people coming through, we just found the guy who had left a latent print on an IED. This guy is now caught. That’s how the surge actually worked, in my opinion: we had electronic means to zero in on who’s a bad guy, who’s in proximity and who’s not. This application, and the analysis and fixing of it, really helped the effort over there. We found route changes due to packet loss, and slow server response time. We found problems with the TCP offload engine. TCP offload engine is where the TCP stack is moved off the server’s CPU and memory and onto the network interface card. There were some problems, and I’ll show you a couple of slides here with some examples of this sort of stuff. Of course, this took many weeks and months of analysis after we had all this put together. But we found all these problems and started mitigating them, and things improved.
And that application I was telling you about became even more effective.
Across the network there are quality of service incongruities, and network and application issues requiring packet-level capture; we analyzed a lot of stuff. Here are a few screenshots of some of it. Here’s some route change analytics going on. Here’s a satellite retransmission; it took three and a half seconds to retransmit across many of those satellite circuits, and this is a detailed analysis of that sort of thing. Here’s a TCP offload engine recovery issue. These are all commercial, off-the-shelf systems. In the U.S. or other parts of the world, in a low-latency environment, these particular systems are not found wanting. But you put them in a satellite environment with high latency, and manifestations of severe problems start to show up that wouldn’t show up otherwise. We found TCP offload engine problems. We also found WAN optimizers and load balancers and other things doing weird and funny things. I called them our own men-in-the-middle. I developed a method by which we could identify that man-in-the-middle, and this is some of the analysis associated with that.
Then processing analysis: how long does it take? Over on the left, you see processing, processing, processing. That was before we optimized the application. Over on the right, the same transaction took under two seconds instead of 18 seconds. You add those types of optimizations up and you’ve got a lot of optimization. Then there was data duplication. What does data duplication mean? Data duplication is when you have a problem, like that TCP offload engine issue, where you’re sending the same data multiple times. This is an example of the same data being sent across the network repeatedly. So not only did you have clogged pipes, not only did you have networks that were saturated, but now you’re sending multiple copies of the same data. Here is an example of that packet-loss-induced TCP offload problem, and the wasted bandwidth associated with it. You’ll see a red line and a green line. The red line is the wasted bandwidth; the green line is what was normally required. But in this particular environment, the packet loss was the trigger, the catalyst, for that TCP offload engine problem.
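Here’s a hedged sketch of how you might quantify that duplication from a trace file, assuming the Python scapy library and a capture taken on the affected path. It tallies retransmitted TCP payload (the red line) against first-time payload (the green line); the file name is hypothetical.

```python
# Hedged sketch: count duplicated TCP payload bytes in a capture.
# A segment is a duplicate if the same (flow, sequence number, length)
# has already been seen on the wire.
from collections import Counter
from scapy.all import rdpcap, IP, TCP

seen = Counter()
useful, wasted = 0, 0

for pkt in rdpcap("trace.pcap"):  # hypothetical capture from the path
    if IP not in pkt or TCP not in pkt:
        continue
    payload = bytes(pkt[TCP].payload)
    if not payload:
        continue
    key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport,
           pkt[TCP].seq, len(payload))
    if seen[key]:
        wasted += len(payload)   # the "red line": retransmitted bytes
    else:
        useful += len(payload)   # the "green line": first-time bytes
    seen[key] += 1

print(f"useful bytes: {useful}, duplicated bytes: {wasted}")
```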
See, if you have no packet loss, a lot of these problems don’t manifest. When you do have packet loss, you end up with three and a half second response times. You end up with data duplication. A lot of things are exacerbated, because in a war environment you are not dealing with perfect commercial communication systems. So, bottom line: multi-tier identification of your environment. You have to know and document your network, and then you set your test points up. Your front end tier, your middleware tier, your SQL tiers; you have to know how to place and design your network management, your response time monitoring, and your NetFlow systems. In order to do that, you have to document very effectively. Then you can instrument your front end, your middleware tier, your back end tier, and your mainframe tiers.
This is basically showing test point two, where we pulled packets over to a SuperAgent to do response time monitoring, with the ability to capture those packets like I talked to you about: when response times get slow, it automatically captures packets at that location. Pull it all together, and what do you have? A user clicks on their screen at the top left; you see where it says ‘user click’? Then it comes down. Over on the right, the different colors, red, green, blue, represent processing times, network serialization, transport and switching queue times. Then, of course, security authentication: how long does it take to authenticate the user to be able to access that data? If you take a gander up at the top left, he clicked, and at the bottom right the user finally gets the information. Did that take 15 seconds or 25 seconds or 71 seconds, what have you? Then you take it the next step and bring it down to the various tiers. The process, where is it? We fixed the process problem in the application, reducing the response time. This is an example of what some of those response time monitors look like over there. You’ve got your web tier monitored at the same time as your app tier, your SQL tier, your mainframe tier, and we found the process at the app tier that was causing a lot of issues, and we helped resolve those.
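As an illustration of that decomposition, here’s a minimal sketch that totals per-tier timings for one user click and ranks where the time goes. The tier names and numbers are made up for the example, not measured values from the slide.

```python
# Hedged sketch of the tier-by-tier breakdown of one user click.
# All values are illustrative stand-ins for monitor output.
click_breakdown_s = {
    "network (serialization/queuing)": 2.1,
    "security authentication":         1.4,
    "web tier":                        0.9,
    "app tier process":               18.0,   # the kind of culprit we fixed
    "SQL tier":                        1.2,
    "mainframe tier":                  0.8,
}

total = sum(click_breakdown_s.values())
print(f"user click to data on screen: {total:.1f}s")
# Rank tiers by time consumed so the worst offender bubbles up first.
for tier, secs in sorted(click_breakdown_s.items(), key=lambda kv: -kv[1]):
    print(f"  {tier:35s} {secs:5.1f}s  ({100 * secs / total:4.1f}%)")
```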
I know I’m going just a couple minutes over today, but I just wanted to finish this out. The bottom line is, in order to find that needle in a haystack, in order to make certain that your users are productive, you have to design your systems not necessarily to find problems, but to nurture and build a system whereby you are obviating the problem. In other words, you are fixing the problem because you have such good intelligence on the information. In ‘The Art of War,’ the first thing that’s said is: know yourself. Know yourself. Document. Know yourself, monitor yourself. Know where you are, so that when you do have a problem, you can go in and find that needle in a haystack using scientific, automated capabilities. The future of network-centric warfare is dependent upon having these types of capabilities.
Just buying a bunch of stuff and installing it is not what is needed. You have to have architects who are looking at this, who understand the big picture, and who can solve problems in all manners, from client to server and all points between. Anyway, if there’s anything we can do to help you, we know how to do it, and we’re at your service. It was good being with you today. This is Bill Alderson, signing off. Appreciate it.